
Fix DDTraceId/DD64bTraceId class-initialization deadlock #11509

Open
bm1549 wants to merge 2 commits into master from brian.marks/fix-ddtraceid-clinit-deadlock

@bm1549
Contributor

@bm1549 bm1549 commented May 29, 2026

What Does This Do

Fixes a class-initialization deadlock between DDTraceId and DD64bTraceId that can hang trace creation at startup. DDTraceId.ZERO/ONE are now backed by a private sibling type instead of DD64bTraceId, so DDTraceId.<clinit> no longer initializes its own subclass. The public DDTraceId.ZERO/ONE fields are unchanged (no binary-incompatible change). A new value-based DDTraceId.isZero() replaces the == DDTraceId.ZERO sentinel checks.
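The difference between the identity check and a value-based check can be sketched as follows. This is a minimal illustration, not the dd-trace-java sources: the classes `TraceId`, `Constant`, `Id64`, and `Id128`, the accessors `highBits()` and `lowBits()`, and the helper `isZeroByIdentity` are hypothetical stand-ins for `DDTraceId` and its subtypes.

```java
// Hypothetical stand-ins for DDTraceId and its subtypes, not the dd-trace-java sources.
final class IsZeroSketch {
  abstract static class TraceId {
    // Shared constant, backed by a private sibling type rather than by Id64.
    static final TraceId ZERO = new Constant(0L);

    abstract long highBits();

    abstract long lowBits();

    // Value-based check: true for a zero id of any concrete type.
    boolean isZero() {
      return highBits() == 0L && lowBits() == 0L;
    }

    private static final class Constant extends TraceId {
      private final long value;

      Constant(long value) {
        this.value = value;
      }

      @Override
      long highBits() {
        return 0L;
      }

      @Override
      long lowBits() {
        return value;
      }
    }
  }

  static final class Id64 extends TraceId {
    private final long value;

    Id64(long value) {
      this.value = value;
    }

    @Override
    long highBits() {
      return 0L;
    }

    @Override
    long lowBits() {
      return value;
    }
  }

  static final class Id128 extends TraceId {
    private final long high;
    private final long low;

    Id128(long high, long low) {
      this.high = high;
      this.low = low;
    }

    @Override
    long highBits() {
      return high;
    }

    @Override
    long lowBits() {
      return low;
    }
  }

  // The old style of check: only true for the ZERO singleton itself.
  static boolean isZeroByIdentity(TraceId id) {
    return id == TraceId.ZERO;
  }

  public static void main(String[] args) {
    TraceId parsed = new Id64(0L); // e.g. a zero id produced by a direct factory
    System.out.println(isZeroByIdentity(parsed)); // false: not the singleton
    System.out.println(parsed.isZero()); // true: the value is zero
    System.out.println(new Id128(0L, 0L).isZero()); // true: all-zero 128-bit id
  }
}
```

In this sketch a zero id that was not routed through the singleton fails the identity check but passes `isZero()`, which is the case the description says the old checks mishandled.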

Motivation

DD64bTraceId extends DDTraceId, so the JVM initializes DDTraceId first. But DDTraceId.<clinit> built its ZERO/ONE constants via DD64bTraceId.from(...), which initializes the subclass while the DDTraceId init lock is held. When the two classes are first touched concurrently from opposite ends, each thread ends up holding one class-init lock and waiting for the other:

  • dd-task-scheduler: the service-discovery task added in #9705 (Add support for service discovery using JNA) runs muteTracing() -> blackholeSpan() -> DDTraceId.ZERO
  • main: the application's first span runs IdGenerationStrategy.generateTraceId() -> DD64bTraceId.from()

Trace creation then hangs. This surfaced as recurring ~30s LogInjectionSmokeTest timeouts on master (traceCount=0, process.alive=true, RC polls received: ~135). The forked-process thread dumps added in #11400 confirmed the cycle, and it reproduces deterministically.
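The shape of the cycle can be illustrated with a small single-threaded sketch. The classes `BaseId`, `SubId`, and `InitLog` are hypothetical stand-ins, not the real `DDTraceId` and `DD64bTraceId`. On one thread the nested initialization is harmless, because the JVM lets the initializing thread re-enter a class it is already initializing. The log only shows that the subclass is initialized while the base class initializer is still running. With two threads entering from opposite ends, that same nesting is what leaves each thread waiting on the class the other is initializing.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins for the pre-fix DDTraceId / DD64bTraceId relationship.
final class ClinitCycleSketch {
  // Kept in its own holder class so it is ready before either id class logs to it.
  static final class InitLog {
    static final List<String> EVENTS = new ArrayList<>();
  }

  static class BaseId {
    static final BaseId ZERO;

    static {
      InitLog.EVENTS.add("BaseId.<clinit> begin");
      // Forces SubId initialization while BaseId initialization is still in progress.
      ZERO = SubId.from(0L);
      InitLog.EVENTS.add("BaseId.<clinit> end");
    }

    final long value;

    BaseId(long value) {
      this.value = value;
    }
  }

  static final class SubId extends BaseId {
    static {
      InitLog.EVENTS.add("SubId.<clinit>");
    }

    private SubId(long value) {
      super(value);
    }

    static SubId from(long value) {
      return new SubId(value);
    }
  }

  // Touches the base class first and returns the recorded initialization order.
  static List<String> touchBaseFirst() {
    BaseId zero = BaseId.ZERO; // first use triggers BaseId initialization
    return InitLog.EVENTS;
  }

  public static void main(String[] args) {
    // Prints: [BaseId.<clinit> begin, SubId.<clinit>, BaseId.<clinit> end]
    System.out.println(touchBaseFirst());
  }
}
```

If a second thread had called `SubId.from(...)` first, it would have started initializing `SubId` and then blocked waiting for `BaseId`, while the first thread, inside `BaseId`'s initializer, blocked waiting for `SubId`.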

Additional Notes

  • Approach: break the cycle at its source. ZERO/ONE stay public static final DDTraceId fields (the surface deliberately restored in #5021, [6to7] Restore public DDTraceId class API), but are now instances of a private DDTraceId subtype, ConstantId, that is a sibling of DD64bTraceId. Because DDTraceId.<clinit> no longer references the subclass, the deadlock cannot happen regardless of timing.
  • Zero checks now use a value-based DDTraceId.isZero() instead of == DDTraceId.ZERO. The identity checks assumed every zero id was the single ZERO instance; isZero() recognizes a zero id of any concrete type, so the factories no longer special-case 0 and a zero parsed via the direct 64-bit factories (DD64bTraceId.fromHex in the XRay/Haystack codecs) is handled correctly. It also recognizes an all-zero 128-bit id, which == ZERO silently missed.
  • DDTraceIdClinitDeadlockForkedTest runs in a fresh JVM and initializes the two classes concurrently from opposite ends; it deadlocks without the fix and passes with it. TraceIdIsZeroTest and DDTraceIdConstantsTest cover isZero() and the constants across the DDTraceId subtypes.
  • The deadlock has been latent since #9705 (Add support for service discovery using JNA, Oct 2025) added the scheduled muteTracing() task; it began manifesting recently as startup timing shifted.
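The approach in the notes above can be sketched roughly as follows. This is an illustration under assumptions, not the PR's code: `BaseId` and `SubId` are hypothetical stand-ins for `DDTraceId` and `DD64bTraceId`, the nested `ConstantId` only mirrors the name used in the description, and `initializeConcurrently` is a hypothetical helper that loosely imitates what a concurrent-initialization test might do.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-ins for the post-fix layout; not the dd-trace-java sources.
final class SiblingConstantSketch {
  abstract static class BaseId {
    // The constants no longer touch SubId, so BaseId.<clinit> never waits on it.
    static final BaseId ZERO = new ConstantId(0L);
    static final BaseId ONE = new ConstantId(1L);

    abstract long toLong();

    // Private sibling of SubId; only reachable through BaseId itself.
    private static final class ConstantId extends BaseId {
      private final long value;

      ConstantId(long value) {
        this.value = value;
      }

      @Override
      long toLong() {
        return value;
      }
    }
  }

  static final class SubId extends BaseId {
    private final long value;

    private SubId(long value) {
      this.value = value;
    }

    static SubId from(long value) {
      return new SubId(value);
    }

    @Override
    long toLong() {
      return value;
    }
  }

  // Touches the two classes from opposite ends on two threads.
  // Returns true if both threads finish within the timeout.
  static boolean initializeConcurrently(long timeoutMillis) {
    CountDownLatch start = new CountDownLatch(1);
    CountDownLatch done = new CountDownLatch(2);
    Runnable viaBase = () -> {
      await(start);
      BaseId.ZERO.toLong(); // enters through the base class
      done.countDown();
    };
    Runnable viaSub = () -> {
      await(start);
      SubId.from(42L); // enters through the subclass
      done.countDown();
    };
    for (Runnable task : new Runnable[] {viaBase, viaSub}) {
      Thread thread = new Thread(task);
      thread.setDaemon(true); // a stuck thread must not keep the JVM alive
      thread.start();
    }
    start.countDown();
    try {
      return done.await(timeoutMillis, TimeUnit.MILLISECONDS);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return false;
    }
  }

  private static void await(CountDownLatch latch) {
    try {
      latch.await();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
  }

  public static void main(String[] args) {
    System.out.println(initializeConcurrently(5_000L));
  }
}
```

Class initialization happens once per JVM, so a check like this only exercises the race on its first call. That is presumably why the PR's regression test runs in a fresh, forked JVM.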

Contributor Checklist

  • Title formatted per the contribution guidelines
  • type: and comp: labels assigned
  • No issue-linking keywords used
  • CODEOWNERS update not required (no file addition/migration/deletion)
  • Public documentation update not required (no new configuration or behavior)

Jira ticket: N/A

@bm1549 bm1549 added the labels type: bug (Bug report and fix), comp: api (Tracer public API), and tag: ai generated (Largely based on code generated by an AI or LLM) May 29, 2026
@bm1549 bm1549 marked this pull request as ready for review May 29, 2026 19:13
@bm1549 bm1549 requested a review from a team as a code owner May 29, 2026 19:13
@bm1549 bm1549 requested a review from dougqh May 29, 2026 19:13

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 56ea720eb8

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread dd-trace-api/src/main/java/datadog/trace/api/DD64bTraceId.java
@dd-octo-sts
Contributor

dd-octo-sts Bot commented May 29, 2026

🟢 Java Benchmark SLOs — All performance SLOs passed

| Suite | Status |
| --- | --- |
| Startup | 🟢 pass |

SLO thresholds are defined here based on automatically generated metrics. A warning is raised when results are within 5% of the threshold.

PR vs. master results
| Scenario | Candidate | master | Δ (95% CI of mean) |
| --- | --- | --- | --- |
| startup:insecure-bank:iast:Agent | 14.08 s | 14.01 s | [-0.7%; +1.7%] (no difference) |
| startup:insecure-bank:tracing:Agent | 12.90 s | 12.93 s | [-1.1%; +0.6%] (no difference) |
| startup:petclinic:appsec:Agent | 16.57 s | 15.53 s | [-2.4%; +15.7%] (unstable) |
| startup:petclinic:iast:Agent | 16.41 s | 16.46 s | [-1.5%; +0.9%] (no difference) |
| startup:petclinic:profiling:Agent | 16.40 s | 15.62 s | [-3.8%; +13.7%] (unstable) |
| startup:petclinic:tracing:Agent | 15.65 s | 14.87 s | [-3.3%; +13.7%] (unstable) |

Commit: ded2c7c7 · CI Pipeline · Benchmarking Platform UI


Load and DaCapo benchmarks can be triggered manually in the GitLab pipeline. Results will appear in the Benchmarking Platform UI after completion.

@bm1549 bm1549 requested a review from a team as a code owner May 29, 2026 19:29
@bm1549 bm1549 requested review from mcculls and removed request for a team May 29, 2026 19:29
Contributor

@dougqh dougqh left a comment

Overall, it looks good to me.
But before merging, double-check that the Codex comment isn't a problem.

DD64bTraceId is a subclass of DDTraceId, so the JVM must initialize
DDTraceId before DD64bTraceId. DDTraceId.<clinit> in turn initialized
DD64bTraceId by building its ZERO/ONE constants via DD64bTraceId.from(),
a circular initialization dependency. When the two classes were first
touched concurrently from opposite ends -- the service-discovery task
(muteTracing() -> blackholeSpan() -> DDTraceId.ZERO) racing the
application's first span (IdGenerationStrategy.generateTraceId() ->
DD64bTraceId.from()) -- each thread held one class-init lock and waited
for the other, hanging trace creation. This surfaced as recurring 30s
LogInjectionSmokeTest timeouts in CI (latent since #9705 added the
scheduled muteTracing task).

Break the cycle at its source while keeping DDTraceId.ZERO/ONE as public
fields (preserving the API restored in #5021): ZERO/ONE are now instances
of a private DDTraceId subtype (a sibling of DD64bTraceId), so
DDTraceId.<clinit> no longer references the subclass.

Replace the fragile "== DDTraceId.ZERO" identity checks with a
value-based DDTraceId.isZero(). Those identity checks relied on every
zero id being the single ZERO instance; isZero() recognizes a zero id of
any concrete type, so the factories need not route 0 to the singleton and
the propagation codecs no longer mishandle a zero parsed via the direct
64-bit factories.

Add a forked regression test that initializes the two classes
concurrently from opposite ends (deadlocks without the fix), plus
isZero() coverage across the DDTraceId subtypes.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@bm1549 bm1549 force-pushed the brian.marks/fix-ddtraceid-clinit-deadlock branch from b04a0d1 to 0e15d6c Compare May 30, 2026 02:46
@bm1549 bm1549 requested a review from a team as a code owner May 30, 2026 02:46

Labels

comp: api (Tracer public API), tag: ai generated (Largely based on code generated by an AI or LLM), type: bug (Bug report and fix)


2 participants