Skip to content

POC: Notification manager#920

Draft
kostyanf14 wants to merge 3 commits into
HCK-CI:masterfrom
kostyanf14:notification-manager
Draft

POC: Notification manager#920
kostyanf14 wants to merge 3 commits into
HCK-CI:masterfrom
kostyanf14:notification-manager

Conversation

@kostyanf14
Copy link
Copy Markdown
Contributor

@zjmletang, This is a draft PR, currently, I am not sure how we in RH will use it, so I can create only a general POC. Please review, test, and say what is missing for your idea of usage.

Signed-off-by: Kostiantyn Kostiuk <kkostiuk@redhat.com>
Signed-off-by: Kostiantyn Kostiuk <kkostiuk@redhat.com>
Signed-off-by: Kostiantyn Kostiuk <kkostiuk@redhat.com>
Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a comprehensive notification system to AutoHCK, enabling external hooks into project, engine, and test lifecycles via a new NotificationManager and ExternalScript provider. The review feedback highlights several areas for improvement: addressing encapsulation violations when accessing internal object states, ensuring proper resource cleanup by implementing a close method in the NotificationManager, and enhancing the portability and maintainability of the script execution and serialization logic.


sig { params(studio: HCKStudio).returns(T::Hash[String, T.untyped]) }
def hck_studio_payload(studio)
{ 'class' => 'HCKStudio', 'engine_tag' => studio.instance_variable_get(:@tag) }
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Accessing @tag directly using studio.instance_variable_get(:@tag) violates encapsulation. If engine_tag is intended to be a public attribute, it should be exposed via a public reader method (e.g., studio.engine_tag). If it's meant to be private, then HCKStudio should provide a dedicated public method to generate this specific payload, or the ExternalScript class should not directly access internal state.

      { 'class' => 'HCKStudio', 'engine_tag' => studio.engine_tag }


def pre_tests_handle_finished_test_result(test, result)
@providers.each_value { |p| p.pre_tests_handle_finished_test_result(test, result) }
end
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The NotificationManager class is missing a close method that dispatches to its registered providers. The NotificationProviderInterface defines a close method, indicating that providers might need to perform cleanup (e.g., close files, release resources). Without this dispatch, resources held by providers might not be properly released, leading to potential leaks or ungraceful shutdowns.

    def pre_tests_handle_finished_test_result(test, result)
      @providers.each_value { |p| p.pre_tests_handle_finished_test_result(test, result) }
    end

    def close
      @providers.each_value { |p| p.close }
    end

log_path="${AUTOHCK_NOTIFICATION_LOG_PATH:-$log_default}"

{
printf '%s\n' "=== $(date -Iseconds 2>/dev/null || date) ==="
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The date -Iseconds command is a GNU coreutils extension and might not be available on all Unix-like systems (e.g., macOS, BSD). While there's a fallback to date, relying on specific date formats can lead to portability issues. Consider using a more universally available method for generating ISO 8601 timestamps if this script is expected to run in diverse environments.

include Helper
include NotificationProviderInterface

CONFIG_JSON = 'lib/notificationmanagers/external_script/external_script.json'
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The CONFIG_JSON path is hardcoded within the class. This limits flexibility if the configuration file needs to be located elsewhere or named differently in various deployment scenarios. Consider making this path configurable, perhaps through the Project object or an initializer parameter, to enhance adaptability.

Comment on lines +237 to +248
def to_json_safe(value)
case value
when nil then nil
when String, Numeric, TrueClass, FalseClass then value
when Symbol then value.to_s
when Hash then value.transform_values { |v| to_json_safe(v) }
when Array then value.map { |v| to_json_safe(v) }
when DateTime, Time then value.iso8601
when Models::HLK::Test, Models::Driver then value.serialize
else
to_json_safe_unknown(value)
end
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The to_json_safe method has a high cyclomatic complexity due to the extensive case statement. While the rubocop disable comment acknowledges this, it indicates a method that could be challenging to maintain and test. Consider refactoring this by pushing serialization logic into the respective classes (e.g., DateTime, Time, Models::HLK::Test, Models::Driver) by defining a serialize method on them. Then, to_json_safe could simply call value.serialize if the object responds to it, significantly reducing its complexity.

def to_json_safe_unknown(value)
return value.serialize if value.respond_to?(:serialize)

{ 'class' => value.class.name, 'inspect' => value.inspect.byteslice(0, 16_384) }
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Using value.inspect.byteslice(0, 16_384) for unknown types can still be problematic. The inspect method can produce very verbose output, potentially leading to large environment variables which have size limits. More critically, inspect might expose sensitive information or internal object states that are not intended for external scripts. Consider a more controlled serialization for unknown types, perhaps just the class name, or enforce explicit serialization methods for any object that needs to be passed to external scripts.

@zjmletang
Copy link
Copy Markdown
Contributor

@kostyanf14 Thanks, this is relevant to my CI use case. I’ll review it and do some validation on my side, then get back to you with feedback. Probably not in the next couple of days, but soon.

Copy link
Copy Markdown
Contributor

@zjmletang zjmletang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @kostyanf14, thanks for putting this together — the plugin layout and env-var dispatch are clean and easy to extend. I'd like to adopt this in our CI; sharing our two use cases and a bit of feedback around failure classification that I hope is useful for the final design.

Our two use cases

  1. On test completion → trigger downstream automation (picking up results for driver signing and other post-processing).
  2. On failure → automatically re-run the whole project (not individual tests). Total wall-time is not critical for us, but minimizing manual intervention is. In practice a large portion of "failures" pass on a plain retry, so a CI-level auto-retry removes a lot of toil.

The granularity we need is project-level re-run (much simpler to wire into CI).

Three failure classes & current hook coverage

From the CI-retry angle, AutoHCK failures really split into three classes:

# Class Typical trigger Coverage in this PR
1 Runtime error (engine / environment layer) install timeout, studio/client VM failure, engine/tools raising, etc. ❌ no dedicated event
2 Test-level failure (HLK test ran, status=Failed) HLK returns Failed result ✅ already covered by pre_tests_handle_finished_test_result(s)
3 Hang-on test stuck in queued/running, HLK never returns ❌ AutoHCK detects internally but doesn't surface it

Class 2 is already actionable with the hooks in this PR (scripts can just read tests_stats.failed > 0) — no issue there. The gaps are class 1 and class 3.

Gap 1 (nice to have) — Class 1: runtime errors have no dedicated event

When the engine aborts before reaching the test-handling loop, pre_tests_handle_finished_test_result(s) is never reached. post_project_run does fire via ensure, but:

  • the exception is neither passed as an argument nor surfaced in the env;
  • run_terminated (exposed via bound_project_args) only flips when the tracked PR is closed/merged in github_handling — it doesn't reflect engine run status.

So scripts can't distinguish "normal run completion" from "engine crashed mid-way". On our side a CI wrapper can work around this by watching the process exit code, so this is only nice to have.

Suggestion (non-blocking): if convenient, consider adding a dedicated failure event:

sig { abstract.params(project: Project, exception: Exception).void }
def on_project_run_failed(project, exception); end

Wired from Project#run:

def run
  @notification_manager.pre_project_run(self)
  @engine.run
rescue => e
  @notification_manager.on_project_run_failed(self, e)
  raise
ensure
  @notification_manager.post_project_run(self)
end

On the ExternalScript side, the exception can be exposed via AUTOHCK_NOTIFICATION_EXCEPTION_CLASS / _MESSAGE / _BACKTRACE. The change is purely additive — no existing signature is touched.

Gap 2 (important) — Class 3: "hang-on" is detected but not surfaced

AutoHCK already detects suspected hangs in Tests#check_test_queued_time and Tests#check_test_duration_time — it logs a warning and sets ex_status = 'Hangs on at queued state?' / 'Hangs on at running state?', but:

  • it never raises or aborts, so ensure/rescue-based notifications never fire;
  • none of the existing hooks gets called because of a hang;
  • external automation has no way to know the run is stuck. We routinely see runs sit in this state for hours until someone kills them.

This matters a lot for CI: you can't auto-retry something that never returns.

Suggestion: emit a dedicated notification event when a hang is detected, so operators/CI can decide how to act.

sig { abstract.params(test: Models::HLK::Test, state: String, elapsed: Integer).void }
def on_test_hang_detected(test, state, elapsed); end

Invoked where set_test_status(..., 'Hangs on at ...') is called today; no existing behavior is changed.

A small caveat: honestly, I'm not yet sure myself whether the existing hang detection (QUEUE_TEST_TIMEOUT = 15min, RUNNING_TEST_TIMEOUT = 2 × duration + 15min) covers all the "test getting stuck" cases we care about — for example, situations where HLK keeps reporting Running while the test is actually spinning inside some internal loop may not be caught by the current thresholds. I'll confirm once we start running it in anger.

That doesn't change the ask here though: exposing what's already detected is the first step; if additional hang types or new detection logic show up later (progress-stall based, log silence, etc.), they can be layered onto this hook incrementally — either by adding new trigger points or by splitting out another hook. No need to solve everything in one go.

Summary

  • Class 1 (runtime errors): suggest on_project_run_failednice to have, we can work around this with a CI wrapper.
  • Class 2 (test-level failures): already covered by pre_tests_handle_finished_test_result(s) ✅.
  • Class 3 (hang-on): suggest on_test_hang_detected

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants