Tighten tracer.metrics defaults to protect tight-heap JVMs #11500

Open
dougqh wants to merge 6 commits into master from dougqh/css-tight-heap-defaults
Conversation

@dougqh
Contributor

@dougqh dougqh commented May 29, 2026

What Does This Do

Cut the implicit tracer.metrics.max.pending default from 2048 (logical) to 128 on normal heap and to 64 at Xmx < 128 MB, and the implicit tracer.metrics.max.aggregates default from 2048 to 256 at tight heap. Customers who explicitly configured either property keep their value.
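
As a rough illustration of the rule above, here is a minimal sketch of how an implicit default could be chosen while leaving explicit settings untouched. The class name, method names, and the use of Runtime.maxMemory() to approximate Xmx are assumptions for illustration only, not the tracer's actual code. Only the numeric defaults and the 128 MB boundary come from this description.

```java
/** Hypothetical sketch of heap-dependent defaults; not the tracer's real Config code. */
final class MetricsDefaultsSketch {
  // Assumed threshold, taken from the "Xmx < 128 MB" wording in the description.
  static final long TIGHT_HEAP_THRESHOLD_BYTES = 128L * 1024 * 1024;

  static boolean isTightHeap(long maxHeapBytes) {
    return maxHeapBytes < TIGHT_HEAP_THRESHOLD_BYTES;
  }

  /** Returns the explicit value when one was configured, else the heap-dependent default. */
  static int resolveMaxPending(Integer explicitValue, long maxHeapBytes) {
    if (explicitValue != null) {
      return explicitValue; // an explicit setting always wins
    }
    return isTightHeap(maxHeapBytes) ? 64 : 128;
  }

  /** Same rule for maxAggregates: only the tight-heap branch differs from the old 2048. */
  static int resolveMaxAggregates(Integer explicitValue, long maxHeapBytes) {
    if (explicitValue != null) {
      return explicitValue;
    }
    return isTightHeap(maxHeapBytes) ? 256 : 2048;
  }

  public static void main(String[] args) {
    // Assumption: Runtime.maxMemory() stands in for however the tracer reads the heap limit.
    long maxHeap = Runtime.getRuntime().maxMemory();
    System.out.println("maxPending=" + resolveMaxPending(null, maxHeap));
    System.out.println("maxAggregates=" + resolveMaxAggregates(null, maxHeap));
  }
}
```

Passing the heap size as a parameter keeps the rule testable without starting a JVM at each Xmx value.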

Motivation

The metrics inbox is an MpscArrayQueue<SpanSnapshot> sized to maxPending * LEGACY_BATCH_SIZE (64). With one ~120 B SpanSnapshot per slot, the prior 131,072-slot default pinned ~15 MB worst-case in-flight when the aggregator stalled.

At Xmx ≤ ~128 MB the G1 survivor region is too small to absorb that footprint. Observed catastrophically at Xmx 64m on spring-petclinic — SpanSnapshots overflow young gen and trigger To-space Exhausted → Full GC storms (0 r/s in the worst case).

JFR allocation profile at Xmx 64m attributes this to SpanSnapshot being the #2 Datadog allocator (~280 MB sampled bytes over 90 s) since #11381 introduced the producer/consumer split. The inbox amplifies the per-publish allocation into a heap-pressure problem only at tight heap.

New defaults

| Heap | maxAggregates | maxPending (logical) | Inbox slots | Worst-case in-flight |
|---|---|---|---|---|
| Normal (≥ 128 MB) | 2048 (unchanged) | 128 | 8,192 | ~1 MB |
| Tight (< 128 MB) | 256 | 64 | 4,096 | ~500 KB |

Both are large enough to absorb the sub-second aggregator stalls we observe in practice (~0.8 s of buffer at 10 K spans/s on the normal-heap default).
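
The figures in the table follow from simple arithmetic, sketched below. The class and method names are hypothetical. The 64-entry batch size and the ~120 B per-snapshot estimate are taken from the description above, and the byte count is an approximation rather than a measured object size.

```java
/** Hypothetical sketch reproducing the footprint arithmetic from the table. */
final class InboxFootprintSketch {
  static final int LEGACY_BATCH_SIZE = 64;        // multiplier named in the description
  static final long APPROX_SNAPSHOT_BYTES = 120;  // "~120 B" estimate, not a measured size

  /** Physical queue slots for a logical maxPending value. */
  static long inboxSlots(int maxPending) {
    return (long) maxPending * LEGACY_BATCH_SIZE;
  }

  /** Approximate bytes retained if every slot holds a live snapshot. */
  static long worstCaseBytes(int maxPending) {
    return inboxSlots(maxPending) * APPROX_SNAPSHOT_BYTES;
  }

  /** How long the inbox can absorb a stalled consumer at a given publish rate. */
  static double bufferSeconds(int maxPending, double spansPerSecond) {
    return inboxSlots(maxPending) / spansPerSecond;
  }

  public static void main(String[] args) {
    for (int maxPending : new int[] {2048, 128, 64}) {
      System.out.println(
          "maxPending=" + maxPending
              + " slots=" + inboxSlots(maxPending)
              + " worstCaseBytes=" + worstCaseBytes(maxPending)
              + " bufferSecondsAt10k=" + bufferSeconds(maxPending, 10_000.0));
    }
  }
}
```

With these inputs the old default works out to 131,072 slots and roughly 15.7 million bytes, while the new normal-heap default gives 8,192 slots, just under 1 MB, and about 0.82 s of buffer at 10 K spans/s.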

What this is not

  • Not a queue-mechanism change. The inbox stays MpscArrayQueue<SpanSnapshot>.
  • Not an allocation-profile change. Producers still allocate one SpanSnapshot per metrics-eligible span.
  • Not a feature change. Drops on overflow still flow through onStatsInboxFull.

It's purely a bound on the inbox's worst-case footprint, sized for the survivor-region constraint that #11381's per-span allocation pattern made load-bearing.
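
To make the "bound, not a mechanism change" point concrete, here is a minimal sketch of a bounded inbox that drops on overflow. It swaps in java.util.concurrent.ArrayBlockingQueue for the JCTools MpscArrayQueue so the example runs on the JDK alone. The class shape, the drop counter, and the body of onStatsInboxFull are assumptions for illustration. Only the sizing rule and the callback name come from this PR.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical sketch: a bounded inbox whose capacity caps worst-case in-flight memory. */
final class BoundedInboxSketch {
  // Stand-in for MpscArrayQueue<SpanSnapshot>; Object replaces SpanSnapshot here.
  private final ArrayBlockingQueue<Object> inbox;
  private final AtomicLong dropped = new AtomicLong();

  BoundedInboxSketch(int maxPending, int batchSize) {
    this.inbox = new ArrayBlockingQueue<>(maxPending * batchSize);
  }

  /** Producer side: never blocks; a full inbox drops the snapshot and reports it. */
  boolean publish(Object snapshot) {
    if (inbox.offer(snapshot)) {
      return true;
    }
    onStatsInboxFull();
    return false;
  }

  /** Assumed behavior: count the drop. The real callback may do more. */
  void onStatsInboxFull() {
    dropped.incrementAndGet();
  }

  /** Consumer side: removes up to limit snapshots and returns how many were removed. */
  int drain(int limit) {
    int drained = 0;
    while (drained < limit && inbox.poll() != null) {
      drained++;
    }
    return drained;
  }

  long droppedCount() {
    return dropped.get();
  }

  public static void main(String[] args) {
    BoundedInboxSketch inbox = new BoundedInboxSketch(1, 4); // tiny capacity for the demo
    for (int i = 0; i < 6; i++) {
      inbox.publish(new Object());
    }
    System.out.println("dropped=" + inbox.droppedCount() + " drained=" + inbox.drain(10));
  }
}
```

A smaller capacity changes only how soon publish starts dropping during a consumer stall. The producer path and the drop accounting stay the same.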

Validation

Petclinic load test (Java 17, 8 jmeter threads, GET /owners/3), cooled bench with 60s between runs, n=2 per heap.

vs 1.62 baseline

| Xmx | #11500 mean (r/s) | 1.62 mean (r/s) | Delta |
|---|---|---|---|
| 64m | 2,422 ± 156 | 2,490 ± 89 | −2.7% |
| 96m | 5,974 ± 406 | 5,948 ± 416 | +0.4% |
| 192m † | 9,523 ± 390 | 10,007 ± 994 | −4.8% |
| 256m † | 10,052 ± 42 | 9,153 ± 321 | +9.8% |

† At Xmx ≥ 128 MB the maxAggregates default is unchanged and the maxPending change only reduces worst-case in-flight memory (rarely binding at large heap). 192m / 256m measurements above are from same-day runs of the merged-master baseline (#11382 tip) and apply unchanged to #11500.

vs merged master baseline (the gain this PR contributes)

| Xmx | #11500 (r/s) | #11382 (master) (r/s) | Delta |
|---|---|---|---|
| 64m | 2,422 | 2,349 | +3.1% |
| 96m | 5,974 | 6,127 | −2.5% (within noise) |

The +3.1% at 64m is the targeted improvement: the smaller inbox keeps the SpanSnapshot footprint inside G1 survivor regions at tight heap. At 96m the original default never bound, so the PR is a wash there.
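
For anyone re-checking the Delta columns, the sketch below shows the relative-difference arithmetic that appears to have been used: candidate mean minus baseline mean, divided by the baseline mean. The class and method names are hypothetical, and the formula is inferred from the reported numbers rather than stated in the PR.

```java
import java.util.Locale;

/** Hypothetical sketch of the Delta column arithmetic. */
final class ThroughputDeltaSketch {
  /** Relative change of candidate versus baseline, in percent. */
  static double percentDelta(double candidate, double baseline) {
    return (candidate - baseline) / baseline * 100.0;
  }

  /** Formats a delta with an explicit sign and one decimal place. */
  static String format(double percent) {
    return String.format(Locale.ROOT, "%+.1f%%", percent);
  }

  public static void main(String[] args) {
    System.out.println("64m vs 1.62:   " + format(percentDelta(2422, 2490)));
    System.out.println("64m vs master: " + format(percentDelta(2422, 2349)));
    System.out.println("96m vs master: " + format(percentDelta(5974, 6127)));
  }
}
```

With n=2 per heap, the ± spreads in the first table are comparable to or larger than most of these deltas, which is why the 96m result is labelled as within noise.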

Variance behavior at tight heap

During the same thermal window, 1.62 hit a To-space Exhausted variance bomb at 96m (one iter collapsed to 2,496 r/s):

1.62   at 96m, iter 1: 5,583
1.62   at 96m, iter 2: 2,496   ← To-space Exhausted → Full GC storm
#11500 at 96m, iter 1: 6,262
#11500 at 96m, iter 2: 5,687

#11500 sustained throughput. This is the catastrophic failure mode the PR is designed to soften — without it, the bomb is observable at production Xmx ≤ ~128 MB.

Test plan

  • :internal-api:test ConfigTest* passes
  • :dd-trace-core:test datadog.trace.common.metrics.* passes (92/92 locally)
  • Petclinic load test sweep at Xmx 64m + 96m (direct), 192m + 256m (proxy)

🤖 Generated with Claude Code

@dougqh dougqh added labels May 29, 2026: type: enhancement (Enhancements and improvements), comp: core (Tracer core), tag: performance (Performance related changes), tag: no release notes (Changes to exclude from release notes), tag: ai generated (Largely based on code generated by an AI or LLM)
@dd-octo-sts
Contributor

dd-octo-sts Bot commented May 29, 2026

🟢 Java Benchmark SLOs — All performance SLOs passed

| Suite | Status |
|---|---|
| Startup | 🟢 pass |

SLO thresholds are defined here based on automatically generated metrics. A warning is raised when results are within 5% of the threshold.

PR vs. master results
| Scenario | Candidate | master | Δ (95% CI of mean) |
|---|---|---|---|
| startup:insecure-bank:iast:Agent | 13.96 s | 13.86 s | [-0.2%; +1.7%] (no difference) |
| startup:insecure-bank:tracing:Agent | 13.00 s | 12.94 s | [-0.7%; +1.6%] (no difference) |
| startup:petclinic:appsec:Agent | 16.57 s | 16.31 s | [+0.4%; +2.7%] (maybe worse) |
| startup:petclinic:iast:Agent | 16.61 s | 16.58 s | [-1.1%; +1.5%] (no difference) |
| startup:petclinic:profiling:Agent | 16.38 s | 16.52 s | [-2.3%; +0.6%] (no difference) |
| startup:petclinic:tracing:Agent | 14.90 s | 15.67 s | [-12.6%; +2.8%] (unstable) |

Commit: 65a53a69 · CI Pipeline · Benchmarking Platform UI


Load and DaCapo benchmarks can be triggered manually in the GitLab pipeline. Results will appear in the Benchmarking Platform UI after completion.

Cut the implicit TRACER_METRICS_MAX_PENDING default from 2048 (logical) to 128
on normal heap and to 64 at Xmx < 128 MB, and the implicit
TRACER_METRICS_MAX_AGGREGATES default from 2048 to 256 at tight heap.

Why
---
The metrics inbox is an MpscArrayQueue<SpanSnapshot> sized to
maxPending * LEGACY_BATCH_SIZE (64). With one ~120 B SpanSnapshot per slot,
the prior 131072-slot default pinned ~15 MB worst-case in-flight when the
aggregator stalled.

At Xmx <= ~128 MB the G1 survivor region is too small to absorb that
footprint -- observed catastrophically at Xmx 64 MB on spring-petclinic
where the inbox overflowed young gen and triggered To-space Exhausted →
Full GC storms (0 r/s in the worst case).

New defaults bound the worst-case in-flight footprint at ~1 MB on normal
heap and ~500 KB at tight heap, comfortably below typical survivor sizes
and large enough to absorb the sub-second consumer stalls we actually see
in practice (~0.8 s of buffer at 10 K spans/s on the normal-heap default).

Customers who explicitly configure TRACER_METRICS_MAX_PENDING are
unaffected; the LEGACY_BATCH_SIZE multiplier still applies to overrides.
Only the implicit defaults shrink.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@dougqh dougqh marked this pull request as ready for review May 29, 2026 16:51
@dougqh dougqh force-pushed the dougqh/css-tight-heap-defaults branch from 391704f to a9acd8d Compare May 29, 2026 16:51
@dougqh dougqh requested a review from a team as a code owner May 29, 2026 16:51

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a9acd8d1d5

Comment on lines +2211 to +2212
final int defaultMaxAggregates = tightHeap ? 256 : 2048;
final int defaultMaxPending = tightHeap ? 64 : 128;
P2: Sync metadata with the new tracer-metrics defaults

Changing the implicit default here leaves metadata/supported-configurations.json advertising DD_TRACE_TRACER_METRICS_MAX_PENDING and DD_TRACE_TRACER_METRICS_MAX_AGGREGATES as 2048. That file is the source used for supported-configuration metadata/docs, so users and config-inversion tooling will still see the old defaults even though normal heaps now get 128 pending and tight heaps get 64/256. Please update the metadata entry (or otherwise represent the heap-dependent default) along with this runtime change.

Contributor Author

@dougqh dougqh May 29, 2026

From Claude...

Thanks. Pushed 9ab5e58e4e updating DD_TRACE_TRACER_METRICS_MAX_PENDING from 2048 → 128 to match the new normal-heap default.

DD_TRACE_TRACER_METRICS_MAX_AGGREGATES is left at 2048: that's still the normal-heap default (only the tight-heap branch changes it to 256). The current schema is {version, type, default, aliases} with a single string default — there's no way to encode "heap-dependent default" without a schema extension (e.g. a defaultExpression field or defaults: [{when, value}] shape). So both keys' tight-heap branches stay unrepresented in metadata; the typical customer (Xmx ≥ 128 MB) sees the documented value.

If a schema extension is in scope, happy to take that as a follow-up.
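
To show what the `defaults: [{when, value}]` idea might mean in practice, here is a minimal sketch of first-match resolution over conditional defaults. Everything in it is hypothetical: no such schema field exists today, and the class names, the predicate-on-heap-size encoding of `when`, and the fallback rule are assumptions for illustration only.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.LongPredicate;

/** Hypothetical sketch of resolving a heap-dependent default from {when, value} entries. */
final class HeapDependentDefaultSketch {
  /** One hypothetical {when, value} entry; 'when' is modeled as a predicate on max heap bytes. */
  static final class ConditionalDefault {
    final LongPredicate when;
    final String value;

    ConditionalDefault(LongPredicate when, String value) {
      this.when = when;
      this.value = value;
    }
  }

  /** First matching entry wins; the plain 'default' string is the fallback. */
  static String resolve(List<ConditionalDefault> defaults, String fallback, long maxHeapBytes) {
    for (ConditionalDefault candidate : defaults) {
      if (candidate.when.test(maxHeapBytes)) {
        return candidate.value;
      }
    }
    return fallback;
  }

  public static void main(String[] args) {
    long tightThreshold = 128L * 1024 * 1024;
    List<ConditionalDefault> maxPendingDefaults =
        Arrays.asList(new ConditionalDefault(heap -> heap < tightThreshold, "64"));
    System.out.println(resolve(maxPendingDefaults, "128", 64L * 1024 * 1024));
    System.out.println(resolve(maxPendingDefaults, "128", 256L * 1024 * 1024));
  }
}
```

Keeping the existing single-string default as the fallback would let tooling that ignores the extra field continue to see the normal-heap value.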

The default cut from 2048 → 128 needs the matching entry in
metadata/supported-configurations.json so config-inversion tooling
and supported-configuration docs reflect the new value.

DD_TRACE_TRACER_METRICS_MAX_AGGREGATES is left at 2048: the normal-heap
default is unchanged. The metadata schema doesn't support heap-dependent
defaults, so the tight-heap branch (64 / 256) isn't representable; the
metadata reflects the normal-heap default that applies to the typical
customer (Xmx >= 128 MB).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@dougqh dougqh requested a review from a team as a code owner May 29, 2026 17:08
@dougqh dougqh requested review from AlexeyKuznetsov-DD, bric3 and sarahchen6 and removed request for a team May 29, 2026 17:08
Comment thread metadata/supported-configurations.json
Default changed from 2048 to 128; version field must be incremented
when the default value changes.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@DataDog DataDog deleted a comment from datadog-datadog-prod-us1 Bot May 29, 2026
@dougqh dougqh enabled auto-merge May 29, 2026 18:21
@dougqh dougqh added this pull request to the merge queue May 29, 2026
@dd-octo-sts
Contributor

dd-octo-sts Bot commented May 29, 2026

/merge

@gh-worker-devflow-routing-ef8351

gh-worker-devflow-routing-ef8351 Bot commented May 29, 2026

View all feedbacks in Devflow UI.

2026-05-29 21:04:53 UTC ℹ️ Start processing command /merge


2026-05-29 21:04:57 UTC ℹ️ MergeQueue: pull request added to the queue

The expected merge time in master is approximately 1h (p90).


2026-05-29 23:05:30 UTC MergeQueue: The build pipeline has timed out

The merge request has been interrupted because the build 6485393595883034246 took longer than expected. The current limit for the base branch 'master' is 120 minutes.

@github-merge-queue github-merge-queue Bot removed this pull request from the merge queue due to failed status checks May 29, 2026
Labels

comp: core (Tracer core), tag: ai generated (Largely based on code generated by an AI or LLM), tag: no release notes (Changes to exclude from release notes), tag: performance (Performance related changes), type: enhancement (Enhancements and improvements)

3 participants