Fix DDTraceId/DD64bTraceId class-initialization deadlock#11509
Conversation
This comment has been minimized.
This comment has been minimized.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 56ea720eb8
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
🟢 Java Benchmark SLOs — All performance SLOs passed
PR vs. master results
Commit: Load and DaCapo benchmarks can be triggered manually in the GitLab pipeline. Results will appear in the Benchmarking Platform UI after completion. |
dougqh
left a comment
There was a problem hiding this comment.
Overall, it looks good to me.
But before merging, double check that the Codex comment isn't a problem
DD64bTraceId is a subclass of DDTraceId, so the JVM must initialize DDTraceId before DD64bTraceId. DDTraceId.<clinit> in turn initialized DD64bTraceId by building its ZERO/ONE constants via DD64bTraceId.from(), a circular initialization dependency. When the two classes were first touched concurrently from opposite ends -- the service-discovery task (muteTracing() -> blackholeSpan() -> DDTraceId.ZERO) racing the application's first span (IdGenerationStrategy.generateTraceId() -> DD64bTraceId.from()) -- each thread held one class-init lock and waited for the other, hanging trace creation. This surfaced as recurring 30s LogInjectionSmokeTest timeouts in CI (latent since #9705 added the scheduled muteTracing task). Break the cycle at its source while keeping DDTraceId.ZERO/ONE as public fields (preserving the API restored in #5021): ZERO/ONE are now instances of a private DDTraceId subtype (a sibling of DD64bTraceId), so DDTraceId.<clinit> no longer references the subclass. Replace the fragile "== DDTraceId.ZERO" identity checks with a value-based DDTraceId.isZero(). Those identity checks relied on every zero id being the single ZERO instance; isZero() recognizes a zero id of any concrete type, so the factories need not route 0 to the singleton and the propagation codecs no longer mishandle a zero parsed via the direct 64-bit factories. Add a forked regression test that initializes the two classes concurrently from opposite ends (deadlocks without the fix), plus isZero() coverage across the DDTraceId subtypes. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
b04a0d1 to
0e15d6c
Compare
What Does This Do
Fixes a class-initialization deadlock between
DDTraceIdandDD64bTraceIdthat can hang trace creation at startup.DDTraceId.ZERO/ONEare now backed by a private sibling type instead ofDD64bTraceId, soDDTraceId.<clinit>no longer initializes its own subclass. The publicDDTraceId.ZERO/ONEfields are unchanged (no binary-incompatible change). A new value-basedDDTraceId.isZero()replaces the== DDTraceId.ZEROsentinel checks.Motivation
DD64bTraceIdextendsDDTraceId, so the JVM initializesDDTraceIdfirst. ButDDTraceId.<clinit>built itsZERO/ONEconstants viaDD64bTraceId.from(...), which initializes the subclass while theDDTraceIdinit lock is held. When the two classes are first touched concurrently from opposite ends, each thread ends up holding one class-init lock and waiting for the other:dd-task-scheduler: the service-discovery task added in Add support for service discovery using JNA #9705 runsmuteTracing()->blackholeSpan()->DDTraceId.ZEROmain: the application's first span runsIdGenerationStrategy.generateTraceId()->DD64bTraceId.from()Trace creation then hangs. This surfaced as recurring ~30s
LogInjectionSmokeTesttimeouts on master (traceCount=0, process.alive=true, RC polls received: ~135). The forked-process thread dumps added in #11400 confirmed the cycle, and it reproduces deterministically.Additional Notes
ZERO/ONEstaypublic static final DDTraceIdfields (the surface deliberately restored in [6to7] Restore public DDTraceId class API #5021), but are now instances of a privateDDTraceIdsubtype,ConstantId, that is a sibling ofDD64bTraceId. BecauseDDTraceId.<clinit>no longer references the subclass, the deadlock cannot happen regardless of timing.DDTraceId.isZero()instead of== DDTraceId.ZERO. The identity checks assumed every zero id was the singleZEROinstance;isZero()recognizes a zero id of any concrete type, so the factories no longer special-case 0 and a zero parsed via the direct 64-bit factories (DD64bTraceId.fromHexin the XRay/Haystack codecs) is handled correctly. It also recognizes an all-zero 128-bit id, which== ZEROsilently missed.DDTraceIdClinitDeadlockForkedTestruns in a fresh JVM and initializes the two classes concurrently from opposite ends; it deadlocks without the fix and passes with it.TraceIdIsZeroTestandDDTraceIdConstantsTestcoverisZero()and the constants across theDDTraceIdsubtypes.muteTracing()task; it began manifesting recently as startup timing shifted.Contributor Checklist
type:andcomp:labels assignedJira ticket: N/A