POC: Notification manager by kostyanf14 · Pull Request #920 · HCK-CI/AutoHCK

kostyanf14 · 2026-04-03T16:49:59Z

@zjmletang, This is a draft PR, currently, I am not sure how we in RH will use it, so I can create only a general POC. Please review, test, and say what is missing for your idea of usage.

Signed-off-by: Kostiantyn Kostiuk <kkostiuk@redhat.com>

gemini-code-assist

Code Review

This pull request introduces a comprehensive notification system to AutoHCK, enabling external hooks into project, engine, and test lifecycles via a new NotificationManager and ExternalScript provider. The review feedback highlights several areas for improvement: addressing encapsulation violations when accessing internal object states, ensuring proper resource cleanup by implementing a close method in the NotificationManager, and enhancing the portability and maintainability of the script execution and serialization logic.

gemini-code-assist · 2026-04-03T16:57:45Z

+
+    sig { params(studio: HCKStudio).returns(T::Hash[String, T.untyped]) }
+    def hck_studio_payload(studio)
+      { 'class' => 'HCKStudio', 'engine_tag' => studio.instance_variable_get(:@tag) }


Accessing @tag directly using studio.instance_variable_get(:@tag) violates encapsulation. If engine_tag is intended to be a public attribute, it should be exposed via a public reader method (e.g., studio.engine_tag). If it's meant to be private, then HCKStudio should provide a dedicated public method to generate this specific payload, or the ExternalScript class should not directly access internal state.

{ 'class' => 'HCKStudio', 'engine_tag' => studio.engine_tag }

gemini-code-assist · 2026-04-03T16:57:45Z

+
+    def pre_tests_handle_finished_test_result(test, result)
+      @providers.each_value { |p| p.pre_tests_handle_finished_test_result(test, result) }
+    end


The NotificationManager class is missing a close method that dispatches to its registered providers. The NotificationProviderInterface defines a close method, indicating that providers might need to perform cleanup (e.g., close files, release resources). Without this dispatch, resources held by providers might not be properly released, leading to potential leaks or ungraceful shutdowns.

def pre_tests_handle_finished_test_result(test, result) @providers.each_value { |p| p.pre_tests_handle_finished_test_result(test, result) } end def close @providers.each_value { |p| p.close } end

gemini-code-assist · 2026-04-03T16:57:45Z

+log_path="${AUTOHCK_NOTIFICATION_LOG_PATH:-$log_default}"
+
+{
+  printf '%s\n' "=== $(date -Iseconds 2>/dev/null || date) ==="


The date -Iseconds command is a GNU coreutils extension and might not be available on all Unix-like systems (e.g., macOS, BSD). While there's a fallback to date, relying on specific date formats can lead to portability issues. Consider using a more universally available method for generating ISO 8601 timestamps if this script is expected to run in diverse environments.

gemini-code-assist · 2026-04-03T16:57:45Z

+    include Helper
+    include NotificationProviderInterface
+
+    CONFIG_JSON = 'lib/notificationmanagers/external_script/external_script.json'


The CONFIG_JSON path is hardcoded within the class. This limits flexibility if the configuration file needs to be located elsewhere or named differently in various deployment scenarios. Consider making this path configurable, perhaps through the Project object or an initializer parameter, to enhance adaptability.

gemini-code-assist · 2026-04-03T16:57:45Z

+    def to_json_safe(value)
+      case value
+      when nil then nil
+      when String, Numeric, TrueClass, FalseClass then value
+      when Symbol then value.to_s
+      when Hash then value.transform_values { |v| to_json_safe(v) }
+      when Array then value.map { |v| to_json_safe(v) }
+      when DateTime, Time then value.iso8601
+      when Models::HLK::Test, Models::Driver then value.serialize
+      else
+        to_json_safe_unknown(value)
+      end


The to_json_safe method has a high cyclomatic complexity due to the extensive case statement. While the rubocop disable comment acknowledges this, it indicates a method that could be challenging to maintain and test. Consider refactoring this by pushing serialization logic into the respective classes (e.g., DateTime, Time, Models::HLK::Test, Models::Driver) by defining a serialize method on them. Then, to_json_safe could simply call value.serialize if the object responds to it, significantly reducing its complexity.

gemini-code-assist · 2026-04-03T16:57:45Z

+    def to_json_safe_unknown(value)
+      return value.serialize if value.respond_to?(:serialize)
+
+      { 'class' => value.class.name, 'inspect' => value.inspect.byteslice(0, 16_384) }


Using value.inspect.byteslice(0, 16_384) for unknown types can still be problematic. The inspect method can produce very verbose output, potentially leading to large environment variables which have size limits. More critically, inspect might expose sensitive information or internal object states that are not intended for external scripts. Consider a more controlled serialization for unknown types, perhaps just the class name, or enforce explicit serialization methods for any object that needs to be passed to external scripts.

zjmletang · 2026-04-04T08:11:35Z

@kostyanf14 Thanks, this is relevant to my CI use case. I’ll review it and do some validation on my side, then get back to you with feedback. Probably not in the next couple of days, but soon.

zjmletang

Hi @kostyanf14, thanks for putting this together — the plugin layout and env-var dispatch are clean and easy to extend. I'd like to adopt this in our CI; sharing our two use cases and a bit of feedback around failure classification that I hope is useful for the final design.

Our two use cases

On test completion → trigger downstream automation (picking up results for driver signing and other post-processing).
On failure → automatically re-run the whole project (not individual tests). Total wall-time is not critical for us, but minimizing manual intervention is. In practice a large portion of "failures" pass on a plain retry, so a CI-level auto-retry removes a lot of toil.

The granularity we need is project-level re-run (much simpler to wire into CI).

Three failure classes & current hook coverage

From the CI-retry angle, AutoHCK failures really split into three classes:

#	Class	Typical trigger	Coverage in this PR
1	Runtime error (engine / environment layer)	install timeout, studio/client VM failure, engine/tools raising, etc.	❌ no dedicated event
2	Test-level failure (HLK test ran, status=Failed)	HLK returns Failed result	✅ already covered by `pre_tests_handle_finished_test_result(s)`
3	Hang-on	test stuck in queued/running, HLK never returns	❌ AutoHCK detects internally but doesn't surface it

Class 2 is already actionable with the hooks in this PR (scripts can just read tests_stats.failed > 0) — no issue there. The gaps are class 1 and class 3.

Gap 1 (nice to have) — Class 1: runtime errors have no dedicated event

When the engine aborts before reaching the test-handling loop, pre_tests_handle_finished_test_result(s) is never reached. post_project_run does fire via ensure, but:

the exception is neither passed as an argument nor surfaced in the env;
run_terminated (exposed via bound_project_args) only flips when the tracked PR is closed/merged in github_handling — it doesn't reflect engine run status.

So scripts can't distinguish "normal run completion" from "engine crashed mid-way". On our side a CI wrapper can work around this by watching the process exit code, so this is only nice to have.

Suggestion (non-blocking): if convenient, consider adding a dedicated failure event:

sig { abstract.params(project: Project, exception: Exception).void }
def on_project_run_failed(project, exception); end

Wired from Project#run:

def run
  @notification_manager.pre_project_run(self)
  @engine.run
rescue => e
  @notification_manager.on_project_run_failed(self, e)
  raise
ensure
  @notification_manager.post_project_run(self)
end

On the ExternalScript side, the exception can be exposed via AUTOHCK_NOTIFICATION_EXCEPTION_CLASS / _MESSAGE / _BACKTRACE. The change is purely additive — no existing signature is touched.

Gap 2 (important) — Class 3: "hang-on" is detected but not surfaced

AutoHCK already detects suspected hangs in Tests#check_test_queued_time and Tests#check_test_duration_time — it logs a warning and sets ex_status = 'Hangs on at queued state?' / 'Hangs on at running state?', but:

it never raises or aborts, so ensure/rescue-based notifications never fire;
none of the existing hooks gets called because of a hang;
external automation has no way to know the run is stuck. We routinely see runs sit in this state for hours until someone kills them.

This matters a lot for CI: you can't auto-retry something that never returns.

Suggestion: emit a dedicated notification event when a hang is detected, so operators/CI can decide how to act.

sig { abstract.params(test: Models::HLK::Test, state: String, elapsed: Integer).void }
def on_test_hang_detected(test, state, elapsed); end

Invoked where set_test_status(..., 'Hangs on at ...') is called today; no existing behavior is changed.

A small caveat: honestly, I'm not yet sure myself whether the existing hang detection (QUEUE_TEST_TIMEOUT = 15min, RUNNING_TEST_TIMEOUT = 2 × duration + 15min) covers all the "test getting stuck" cases we care about — for example, situations where HLK keeps reporting Running while the test is actually spinning inside some internal loop may not be caught by the current thresholds. I'll confirm once we start running it in anger.

That doesn't change the ask here though: exposing what's already detected is the first step; if additional hang types or new detection logic show up later (progress-stall based, log silence, etc.), they can be layered onto this hook incrementally — either by adding new trigger points or by splitting out another hook. No need to solve everything in one go.

Summary

Class 1 (runtime errors): suggest on_project_run_failed — nice to have, we can work around this with a CI wrapper.
Class 2 (test-level failures): already covered by pre_tests_handle_finished_test_result(s) ✅.
Class 3 (hang-on): suggest on_test_hang_detected —

kostyanf14 added 3 commits April 3, 2026 19:43

Define new NotificationProvider and NotificationManager

96091f3

Signed-off-by: Kostiantyn Kostiuk <kkostiuk@redhat.com>

Implement NotificationManager

ab0f281

Signed-off-by: Kostiantyn Kostiuk <kkostiuk@redhat.com>

Implement ExternalScript notification provider

ec59f1b

Signed-off-by: Kostiantyn Kostiuk <kkostiuk@redhat.com>

gemini-code-assist Bot reviewed Apr 3, 2026

View reviewed changes

zjmletang reviewed Apr 28, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

POC: Notification manager#920

POC: Notification manager#920
kostyanf14 wants to merge 3 commits into
HCK-CI:masterfrom
kostyanf14:notification-manager

kostyanf14 commented Apr 3, 2026

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

gemini-code-assist Bot Apr 3, 2026

Uh oh!

gemini-code-assist Bot Apr 3, 2026

Uh oh!

gemini-code-assist Bot Apr 3, 2026

Uh oh!

gemini-code-assist Bot Apr 3, 2026

Uh oh!

gemini-code-assist Bot Apr 3, 2026

Uh oh!

gemini-code-assist Bot Apr 3, 2026

Uh oh!

zjmletang commented Apr 4, 2026

Uh oh!

zjmletang left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

kostyanf14 commented Apr 3, 2026

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot Apr 3, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Apr 3, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Apr 3, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Apr 3, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Apr 3, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Apr 3, 2026

Choose a reason for hiding this comment

Uh oh!

zjmletang commented Apr 4, 2026

Uh oh!

zjmletang left a comment

Choose a reason for hiding this comment

Our two use cases

Three failure classes & current hook coverage

Gap 1 (nice to have) — Class 1: runtime errors have no dedicated event

Gap 2 (important) — Class 3: "hang-on" is detected but not surfaced

Summary

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants