POC: Notification manager#920
Conversation
Signed-off-by: Kostiantyn Kostiuk <kkostiuk@redhat.com>
Signed-off-by: Kostiantyn Kostiuk <kkostiuk@redhat.com>
Signed-off-by: Kostiantyn Kostiuk <kkostiuk@redhat.com>
There was a problem hiding this comment.
Code Review
This pull request introduces a comprehensive notification system to AutoHCK, enabling external hooks into project, engine, and test lifecycles via a new NotificationManager and ExternalScript provider. The review feedback highlights several areas for improvement: addressing encapsulation violations when accessing internal object states, ensuring proper resource cleanup by implementing a close method in the NotificationManager, and enhancing the portability and maintainability of the script execution and serialization logic.
|
|
||
| sig { params(studio: HCKStudio).returns(T::Hash[String, T.untyped]) } | ||
| def hck_studio_payload(studio) | ||
| { 'class' => 'HCKStudio', 'engine_tag' => studio.instance_variable_get(:@tag) } |
There was a problem hiding this comment.
Accessing @tag directly using studio.instance_variable_get(:@tag) violates encapsulation. If engine_tag is intended to be a public attribute, it should be exposed via a public reader method (e.g., studio.engine_tag). If it's meant to be private, then HCKStudio should provide a dedicated public method to generate this specific payload, or the ExternalScript class should not directly access internal state.
{ 'class' => 'HCKStudio', 'engine_tag' => studio.engine_tag }|
|
||
| def pre_tests_handle_finished_test_result(test, result) | ||
| @providers.each_value { |p| p.pre_tests_handle_finished_test_result(test, result) } | ||
| end |
There was a problem hiding this comment.
The NotificationManager class is missing a close method that dispatches to its registered providers. The NotificationProviderInterface defines a close method, indicating that providers might need to perform cleanup (e.g., close files, release resources). Without this dispatch, resources held by providers might not be properly released, leading to potential leaks or ungraceful shutdowns.
def pre_tests_handle_finished_test_result(test, result)
@providers.each_value { |p| p.pre_tests_handle_finished_test_result(test, result) }
end
def close
@providers.each_value { |p| p.close }
end| log_path="${AUTOHCK_NOTIFICATION_LOG_PATH:-$log_default}" | ||
|
|
||
| { | ||
| printf '%s\n' "=== $(date -Iseconds 2>/dev/null || date) ===" |
There was a problem hiding this comment.
The date -Iseconds command is a GNU coreutils extension and might not be available on all Unix-like systems (e.g., macOS, BSD). While there's a fallback to date, relying on specific date formats can lead to portability issues. Consider using a more universally available method for generating ISO 8601 timestamps if this script is expected to run in diverse environments.
| include Helper | ||
| include NotificationProviderInterface | ||
|
|
||
| CONFIG_JSON = 'lib/notificationmanagers/external_script/external_script.json' |
There was a problem hiding this comment.
The CONFIG_JSON path is hardcoded within the class. This limits flexibility if the configuration file needs to be located elsewhere or named differently in various deployment scenarios. Consider making this path configurable, perhaps through the Project object or an initializer parameter, to enhance adaptability.
| def to_json_safe(value) | ||
| case value | ||
| when nil then nil | ||
| when String, Numeric, TrueClass, FalseClass then value | ||
| when Symbol then value.to_s | ||
| when Hash then value.transform_values { |v| to_json_safe(v) } | ||
| when Array then value.map { |v| to_json_safe(v) } | ||
| when DateTime, Time then value.iso8601 | ||
| when Models::HLK::Test, Models::Driver then value.serialize | ||
| else | ||
| to_json_safe_unknown(value) | ||
| end |
There was a problem hiding this comment.
The to_json_safe method has a high cyclomatic complexity due to the extensive case statement. While the rubocop disable comment acknowledges this, it indicates a method that could be challenging to maintain and test. Consider refactoring this by pushing serialization logic into the respective classes (e.g., DateTime, Time, Models::HLK::Test, Models::Driver) by defining a serialize method on them. Then, to_json_safe could simply call value.serialize if the object responds to it, significantly reducing its complexity.
| def to_json_safe_unknown(value) | ||
| return value.serialize if value.respond_to?(:serialize) | ||
|
|
||
| { 'class' => value.class.name, 'inspect' => value.inspect.byteslice(0, 16_384) } |
There was a problem hiding this comment.
Using value.inspect.byteslice(0, 16_384) for unknown types can still be problematic. The inspect method can produce very verbose output, potentially leading to large environment variables which have size limits. More critically, inspect might expose sensitive information or internal object states that are not intended for external scripts. Consider a more controlled serialization for unknown types, perhaps just the class name, or enforce explicit serialization methods for any object that needs to be passed to external scripts.
|
@kostyanf14 Thanks, this is relevant to my CI use case. I’ll review it and do some validation on my side, then get back to you with feedback. Probably not in the next couple of days, but soon. |
zjmletang
left a comment
There was a problem hiding this comment.
Hi @kostyanf14, thanks for putting this together — the plugin layout and env-var dispatch are clean and easy to extend. I'd like to adopt this in our CI; sharing our two use cases and a bit of feedback around failure classification that I hope is useful for the final design.
Our two use cases
- On test completion → trigger downstream automation (picking up results for driver signing and other post-processing).
- On failure → automatically re-run the whole project (not individual tests). Total wall-time is not critical for us, but minimizing manual intervention is. In practice a large portion of "failures" pass on a plain retry, so a CI-level auto-retry removes a lot of toil.
The granularity we need is project-level re-run (much simpler to wire into CI).
Three failure classes & current hook coverage
From the CI-retry angle, AutoHCK failures really split into three classes:
| # | Class | Typical trigger | Coverage in this PR |
|---|---|---|---|
| 1 | Runtime error (engine / environment layer) | install timeout, studio/client VM failure, engine/tools raising, etc. | ❌ no dedicated event |
| 2 | Test-level failure (HLK test ran, status=Failed) | HLK returns Failed result | ✅ already covered by pre_tests_handle_finished_test_result(s) |
| 3 | Hang-on | test stuck in queued/running, HLK never returns | ❌ AutoHCK detects internally but doesn't surface it |
Class 2 is already actionable with the hooks in this PR (scripts can just read tests_stats.failed > 0) — no issue there. The gaps are class 1 and class 3.
Gap 1 (nice to have) — Class 1: runtime errors have no dedicated event
When the engine aborts before reaching the test-handling loop, pre_tests_handle_finished_test_result(s) is never reached. post_project_run does fire via ensure, but:
- the exception is neither passed as an argument nor surfaced in the env;
run_terminated(exposed viabound_project_args) only flips when the tracked PR is closed/merged ingithub_handling— it doesn't reflect engine run status.
So scripts can't distinguish "normal run completion" from "engine crashed mid-way". On our side a CI wrapper can work around this by watching the process exit code, so this is only nice to have.
Suggestion (non-blocking): if convenient, consider adding a dedicated failure event:
sig { abstract.params(project: Project, exception: Exception).void }
def on_project_run_failed(project, exception); endWired from Project#run:
def run
@notification_manager.pre_project_run(self)
@engine.run
rescue => e
@notification_manager.on_project_run_failed(self, e)
raise
ensure
@notification_manager.post_project_run(self)
endOn the ExternalScript side, the exception can be exposed via AUTOHCK_NOTIFICATION_EXCEPTION_CLASS / _MESSAGE / _BACKTRACE. The change is purely additive — no existing signature is touched.
Gap 2 (important) — Class 3: "hang-on" is detected but not surfaced
AutoHCK already detects suspected hangs in Tests#check_test_queued_time and Tests#check_test_duration_time — it logs a warning and sets ex_status = 'Hangs on at queued state?' / 'Hangs on at running state?', but:
- it never raises or aborts, so
ensure/rescue-based notifications never fire; - none of the existing hooks gets called because of a hang;
- external automation has no way to know the run is stuck. We routinely see runs sit in this state for hours until someone kills them.
This matters a lot for CI: you can't auto-retry something that never returns.
Suggestion: emit a dedicated notification event when a hang is detected, so operators/CI can decide how to act.
sig { abstract.params(test: Models::HLK::Test, state: String, elapsed: Integer).void }
def on_test_hang_detected(test, state, elapsed); endInvoked where set_test_status(..., 'Hangs on at ...') is called today; no existing behavior is changed.
A small caveat: honestly, I'm not yet sure myself whether the existing hang detection (QUEUE_TEST_TIMEOUT = 15min, RUNNING_TEST_TIMEOUT = 2 × duration + 15min) covers all the "test getting stuck" cases we care about — for example, situations where HLK keeps reporting Running while the test is actually spinning inside some internal loop may not be caught by the current thresholds. I'll confirm once we start running it in anger.
That doesn't change the ask here though: exposing what's already detected is the first step; if additional hang types or new detection logic show up later (progress-stall based, log silence, etc.), they can be layered onto this hook incrementally — either by adding new trigger points or by splitting out another hook. No need to solve everything in one go.
Summary
- Class 1 (runtime errors): suggest
on_project_run_failed— nice to have, we can work around this with a CI wrapper. - Class 2 (test-level failures): already covered by
pre_tests_handle_finished_test_result(s)✅. - Class 3 (hang-on): suggest
on_test_hang_detected—
@zjmletang, This is a draft PR, currently, I am not sure how we in RH will use it, so I can create only a general POC. Please review, test, and say what is missing for your idea of usage.