test: widen cluster shutdownAll await for aeron-udp drain on JDK 25 nightly #3017
Open
He-Pin wants to merge 1 commit into
…ightly

Motivation:
MixedProtocolClusterSpec "join a cluster with a node using the pekko protocol (udp)" still fails on virtualized runs after apache#2997 reordered shutdownAll to stop joining nodes first:

    [WARN] CoordinatedShutdown(pekko://MixedProtocolClusterSpec) Coordinated shutdown phase [actor-system-terminate] timed out after 30000 milliseconds
    java.lang.RuntimeException: Failed to stop [MixedProtocolClusterSpec] within [1 minute]
    ... StreamSupervisor ... remote-6-0-unnamed ActorGraphInterpreter

The "within [1 minute]" outer await is 30s base dilated by pekko.test.timefactor=2, i.e. this lane runs at tf=2 (the JDK 25 nightly runs at tf=4 -> 120s and passes). The actor-system-terminate phase only calls system.finalTerminate() and recovers on its own (non-dilated) phase timeout while termination keeps draining in the background (CoordinatedShutdown.scala:264-269), so the inner phase WARN is non-binding noise -- ClusterTestUtil.shutdownAll's dilated await on whenTerminated is the real deadline. The aeron-udp transport is the slowest to drain (embedded media driver + stacked Aeron liveness timeouts), so 60s was simply too tight at tf=2.

Modification:
- ClusterTestUtil.shutdownAll: raise the outer await base from 30s to 60s (the binding, timefactor-dilated deadline), so a tf=2 lane gets ~120s -- the same headroom the tf=4 nightly already passes with. Document why this await, not the inner phase, governs pass/fail.
- MixedProtocolClusterSpec baseConfig: raise the (non-dilated, non-binding) actor-system-terminate phase timeout 30s -> 60s to suppress the spurious WARN on the slow path and align it with the new await base.

Result:
aeron-udp cluster systems get enough wall-clock to terminate cleanly on lower-timefactor virtualized lanes without the shutdown-phase abort. Healthy shutdowns still complete in well under a second, so local and normal CI runs are unaffected. Test-only change; no production behaviour or binary-compatibility impact.

Tests:
- sbt "cluster/Test/compile" - success (cluster test-classes compiled)
- scalafmt 3.10.7 on both changed files - no reformatting needed
- git diff --check - clean
- aeron-udp shutdown timing is timefactor/environment dependent and does not reproduce on local runs (shutdown completes <1s); change is a timeout widening verified by compile + format.

References:
nightly-builds.yml MixedProtocolClusterSpec (udp) shutdown timeout; follow-up to apache#2997
Motivation
`MixedProtocolClusterSpec` "be allowed to join a cluster with a node using the pekko protocol (udp)" still fails on virtualized JDK 25 runs after #2997 reordered `shutdownAll` to stop joining nodes first. The run logs `Coordinated shutdown phase [actor-system-terminate] timed out after 30000 milliseconds` and then fails with `java.lang.RuntimeException: Failed to stop [MixedProtocolClusterSpec] within [1 minute]`.

The `"within [1 minute]"` outer await is a `30s` base dilated by `pekko.test.timefactor`, i.e. this lane runs at tf=2 (30s × 2 = 60s). The JDK 25 nightly runs at tf=4 → `120s` and passes.

The `actor-system-terminate` phase only calls `system.finalTerminate()` and recovers on its own (non-dilated) phase timeout while termination keeps draining in the background (`CoordinatedShutdown.scala:264-269`). So the inner-phase WARN is non-binding noise: `ClusterTestUtil.shutdownAll`'s dilated await on `whenTerminated` is the real pass/fail deadline. The aeron-udp transport is the slowest to drain (embedded media driver + stacked Aeron liveness timeouts), so `60s` was simply too tight at tf=2.

Modification

- `ClusterTestUtil.shutdownAll`: raise the outer await base `30s → 60s` (the binding, timefactor-dilated deadline), so a tf=2 lane gets ~`120s`, the same headroom the tf=4 nightly already passes with. Added a comment explaining why this await, not the inner phase, governs pass/fail.
- `MixedProtocolClusterSpec` `baseConfig`: raise the (non-dilated, non-binding) `actor-system-terminate` phase timeout `30s → 60s` to suppress the spurious WARN on the slow path and align it with the new await base.

This is a follow-up to #2997: that PR pulled the shutdown-ordering lever; this one fixes the binding outer-await deadline.
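The timeout arithmetic above can be checked with a small standalone sketch. It is not the code from `ClusterTestUtil`: `DilationSketch` and `dilate` are hypothetical names, and the sketch assumes dilation simply multiplies the base duration by the time factor, as the description states.

```scala
import scala.concurrent.duration._

object DilationSketch {
  // Hypothetical stand-in for the testkit's dilation: base timeout times the time factor.
  def dilate(base: FiniteDuration, timeFactor: Double): FiniteDuration =
    Duration.fromNanos(math.round(base.toNanos * timeFactor))

  def main(args: Array[String]): Unit = {
    val oldBase = 30.seconds // outer await base before this change
    val newBase = 60.seconds // outer await base after this change

    // The outer await is dilated, so its effective length depends on the lane.
    for (tf <- Seq(2.0, 4.0)) {
      val before = dilate(oldBase, tf).toSeconds
      val after = dilate(newBase, tf).toSeconds
      println(s"tf=$tf: outer await was ${before}s, is now ${after}s")
    }
  }
}
```

Under that assumption, the old base gives 60s at tf=2 (the "1 minute" in the failure message) and 120s at tf=4, while the new base gives 120s at tf=2. The phase timeout is deliberately left out of `dilate` because the description says it is not dilated.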
Result
aeron-udp cluster systems get enough wall-clock to terminate cleanly on lower-timefactor virtualized lanes without the shutdown-phase abort. Healthy shutdowns still complete in well under a second, so local and normal CI runs are unaffected. Test-only change — no production behaviour or binary-compatibility impact.
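The reason a wider deadline does not slow healthy runs is that an await returns as soon as the awaited future completes, not when the deadline expires. A minimal sketch using only the Scala standard library follows; `AwaitSketch` and `timedAwait` are hypothetical names, and the already-completed future stands in for a system that terminated promptly.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.duration._

object AwaitSketch {
  // Waits for `terminated` for at most `deadline` and returns how long the wait took.
  def timedAwait(terminated: Future[Unit], deadline: FiniteDuration): FiniteDuration = {
    val start = System.nanoTime()
    Await.ready(terminated, deadline) // throws TimeoutException only if the deadline passes
    (System.nanoTime() - start).nanos
  }

  def main(args: Array[String]): Unit = {
    // Stand-in for a healthy shutdown: the termination future is already complete.
    val alreadyTerminated: Future[Unit] = Future.successful(())
    val elapsed = timedAwait(alreadyTerminated, 120.seconds)
    println(s"await returned after ${elapsed.toMillis} ms despite a 120 s deadline")
  }
}
```

The deadline only matters on the slow path, which is the case this change targets.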
Tests
- `sbt "cluster/Test/compile"`: success (cluster test-classes compiled)
- `scalafmt 3.10.7` on both changed files: no reformatting needed
- `git diff --check`: clean

References

- `nightly-builds.yml` `MixedProtocolClusterSpec` (udp) shutdown timeout