Tighten tracer.metrics defaults to protect tight-heap JVMs by dougqh · Pull Request #11500 · DataDog/dd-trace-java

dougqh · 2026-05-29T13:07:31Z

What Does This Do

Cut the implicit tracer.metrics.max.pending default from 2048 (logical) to 128 on normal heap and to 64 at Xmx < 128 MB, and the implicit tracer.metrics.max.aggregates default from 2048 to 256 at tight heap. Customers who explicitly configured either property keep their value.
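A minimal sketch of the selection rule described above, not the actual Config code: an explicit value always wins, and the implicit default depends on whether the max heap is under 128 MB. The class and method names are hypothetical, and reading the heap limit through Runtime.maxMemory() is an assumption; the PR text does not show how tightHeap is computed.

```java
import java.util.OptionalInt;

final class TracerMetricsDefaults {
  // Threshold from the PR description: heaps under 128 MB count as "tight".
  static final long TIGHT_HEAP_THRESHOLD_BYTES = 128L * 1024 * 1024;

  static boolean isTightHeap(long maxHeapBytes) {
    return maxHeapBytes < TIGHT_HEAP_THRESHOLD_BYTES;
  }

  // An explicitly configured value is returned untouched; only the implicit default shrinks.
  static int maxPending(OptionalInt explicit, long maxHeapBytes) {
    return explicit.orElse(isTightHeap(maxHeapBytes) ? 64 : 128);
  }

  static int maxAggregates(OptionalInt explicit, long maxHeapBytes) {
    return explicit.orElse(isTightHeap(maxHeapBytes) ? 256 : 2048);
  }

  public static void main(String[] args) {
    // Assumption: the JVM's reported max heap is used as the Xmx proxy.
    long heap = Runtime.getRuntime().maxMemory();
    System.out.println("tightHeap=" + isTightHeap(heap));
    System.out.println("maxPending=" + maxPending(OptionalInt.empty(), heap));
    System.out.println("maxAggregates=" + maxAggregates(OptionalInt.empty(), heap));
  }
}
```

With no explicit setting, a 64 MB heap resolves to maxPending 64 and maxAggregates 256, while a 256 MB heap resolves to 128 and 2048. An explicit 2048 is kept at any heap size.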

Motivation

The metrics inbox is an MpscArrayQueue<SpanSnapshot> sized to maxPending * LEGACY_BATCH_SIZE (64). With one ~120 B SpanSnapshot per slot, the prior 131,072-slot default pinned ~15 MB worst-case in-flight when the aggregator stalled.

At Xmx ≤ ~128 MB the G1 survivor region is too small to absorb that footprint. This was observed catastrophically at Xmx 64m on spring-petclinic: SpanSnapshots overflow young gen and trigger To-space Exhausted → Full GC storms (0 r/s in the worst case).

A JFR allocation profile at Xmx 64m attributes this to SpanSnapshot, which has been the #2 Datadog allocator (~280 MB sampled bytes over 90 s) since #11381 introduced the producer/consumer split. The inbox amplifies the per-publish allocation into a heap-pressure problem only at tight heap.

New defaults

Heap	maxAggregates	maxPending (logical)	Inbox slots	Worst-case in-flight
Normal (≥ 128 MB)	2048 (unchanged)	128	8,192	~1 MB
Tight (< 128 MB)	256	64	4,096	~500 KB

Both are large enough to absorb the sub-second aggregator stalls we observe in practice (~0.8 s of buffer at 10 K spans/s on the normal-heap default).
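The table's numbers follow directly from the sizing rule in the Motivation section. The sketch below reproduces that arithmetic; the class name is hypothetical, and the 120-byte figure is the approximate per-SpanSnapshot size quoted above, not a measured constant.

```java
final class InboxFootprint {
  static final int LEGACY_BATCH_SIZE = 64;  // multiplier named in the PR description
  static final long SNAPSHOT_BYTES = 120;   // approximate SpanSnapshot size from the PR description

  // Physical queue slots for a logical maxPending value.
  static long slots(int maxPending) {
    return (long) maxPending * LEGACY_BATCH_SIZE;
  }

  // Bytes pinned if every slot holds a snapshot while the aggregator is stalled.
  static long worstCaseBytes(int maxPending) {
    return slots(maxPending) * SNAPSHOT_BYTES;
  }

  // How long a stalled aggregator can be absorbed before the inbox fills.
  static double bufferSeconds(int maxPending, double spansPerSecond) {
    return slots(maxPending) / spansPerSecond;
  }

  public static void main(String[] args) {
    for (int maxPending : new int[] {2048, 128, 64}) {
      System.out.printf(
          "maxPending=%d slots=%d worstCaseBytes=%d bufferSecondsAt10K=%.2f%n",
          maxPending, slots(maxPending), worstCaseBytes(maxPending),
          bufferSeconds(maxPending, 10_000.0));
    }
  }
}
```

For the prior default of 2048 this gives 131,072 slots and 15,728,640 bytes (~15 MB). For 128 it gives 8,192 slots and 983,040 bytes (~1 MB), and for 64 it gives 4,096 slots and 491,520 bytes (~500 KB). At 10 K spans/s the 8,192-slot inbox holds about 0.82 s of traffic.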

What this is not

Not a queue-mechanism change. The inbox stays MpscArrayQueue<SpanSnapshot>.
Not an allocation-profile change. Producers still allocate one SpanSnapshot per metrics-eligible span.
Not a feature change. Drops on overflow still flow through onStatsInboxFull.

It's purely a bound on the inbox's worst-case footprint, sized for the survivor-region constraint that #11381's per-span allocation pattern made load-bearing.
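To illustrate the unchanged behavior in the list above, here is a rough sketch of a bounded inbox that drops on overflow instead of blocking the producer. It uses java.util.concurrent.ArrayBlockingQueue as a stand-in for the JCTools MpscArrayQueue so the example stays self-contained; the class name and the drop counter (standing in for the onStatsInboxFull callback) are hypothetical.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.atomic.AtomicLong;

final class BoundedInboxSketch<T> {
  // Stand-in for MpscArrayQueue<SpanSnapshot>; capacity is fixed at construction.
  private final ArrayBlockingQueue<T> inbox;
  private final AtomicLong dropped = new AtomicLong();

  BoundedInboxSketch(int slots) {
    this.inbox = new ArrayBlockingQueue<>(slots);
  }

  // Non-blocking publish: a full inbox rejects the item and records a drop.
  boolean publish(T snapshot) {
    if (inbox.offer(snapshot)) {
      return true;
    }
    dropped.incrementAndGet(); // hypothetical stand-in for onStatsInboxFull
    return false;
  }

  // Consumer side: returns null when the inbox is empty.
  T poll() {
    return inbox.poll();
  }

  int size() {
    return inbox.size();
  }

  long droppedCount() {
    return dropped.get();
  }

  public static void main(String[] args) {
    BoundedInboxSketch<String> inbox = new BoundedInboxSketch<>(4);
    for (int i = 0; i < 6; i++) {
      inbox.publish("span-" + i);
    }
    System.out.println("queued=" + inbox.size() + " dropped=" + inbox.droppedCount());
  }
}
```

Shrinking the slot count only changes how soon the drop path is reached during a stall; the publish and drop logic itself is untouched, which is the point the list above makes.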

Validation

Petclinic load test (Java 17, 8 jmeter threads, GET /owners/3), cooled bench with 60s between runs, n=2 per heap.

vs 1.62 baseline

Xmx	#11500 mean	1.62 mean	Delta
64m	2,422 ± 156	2,490 ± 89	−2.7%
96m	5,974 ± 406	5,948 ± 416	+0.4%
192m †	9,523 ± 390	10,007 ± 994	−4.8%
256m †	10,052 ± 42	9,153 ± 321	+9.8%

† At Xmx ≥ 128 MB the maxAggregates default is unchanged and the maxPending change only reduces worst-case in-flight memory (rarely binding at large heap). 192m / 256m measurements above are from same-day runs of the merged-master baseline (#11382 tip) and apply unchanged to #11500.

vs merged master baseline (the gain this PR contributes)

Xmx	#11500	#11382 (master)	Delta
64m	2,422	2,349	+3.1%
96m	5,974	6,127	−2.5% (within noise)

The +3.1% at 64m is the targeted improvement: the smaller inbox keeps the SpanSnapshot footprint inside G1 survivor regions at tight heap. At 96m the original default never bound, so the PR is a wash there.

Variance behavior at tight heap

During the same thermal window, 1.62 hit a To-space Exhausted variance bomb at 96m (one iter collapsed to 2,496 r/s):

1.62   at 96m, iter 1: 5,583
1.62   at 96m, iter 2: 2,496   ← To-space Exhausted → Full GC storm
#11500 at 96m, iter 1: 6,262
#11500 at 96m, iter 2: 5,687

#11500 sustained throughput. This is the catastrophic failure mode the PR is designed to soften — without it, the bomb is observable at production Xmx ≤ ~128 MB.
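For readers checking the validation tables, the sketch below shows one way the summary figures can be reproduced from per-iteration throughput. It assumes the ± value is the sample standard deviation (n − 1 in the denominator), which matches the #11500 row at 96m, and that Delta is the relative change against the baseline mean. The class and method names are hypothetical.

```java
final class RunStats {
  static double mean(double... xs) {
    double sum = 0.0;
    for (double x : xs) {
      sum += x;
    }
    return sum / xs.length;
  }

  // Sample standard deviation; needs at least two iterations.
  static double sampleStdDev(double... xs) {
    if (xs.length < 2) {
      throw new IllegalArgumentException("need at least two iterations");
    }
    double m = mean(xs);
    double sumSquares = 0.0;
    for (double x : xs) {
      sumSquares += (x - m) * (x - m);
    }
    return Math.sqrt(sumSquares / (xs.length - 1));
  }

  // Relative change of the candidate against the baseline, in percent.
  static double percentDelta(double candidate, double baseline) {
    return (candidate - baseline) / baseline * 100.0;
  }

  public static void main(String[] args) {
    // #11500 at 96m, iterations 1 and 2 from the listing above.
    System.out.printf("mean=%.1f stddev=%.1f%n",
        mean(6262, 5687), sampleStdDev(6262, 5687));
    // #11500 vs #11382 (master) at 64m.
    System.out.printf("delta=%+.1f%%%n", percentDelta(2422, 2349));
  }
}
```

The two #11500 iterations at 96m give a mean of 5,974.5 and a sample standard deviation of about 406.6, in line with the 5,974 ± 406 row, and 2,422 against 2,349 gives about +3.1%.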

Test plan

:internal-api:test — ConfigTest* passes
:dd-trace-core:test — datadog.trace.common.metrics.* passes (92/92 locally)
Petclinic load test sweep at Xmx 64m + 96m (direct), 192m + 256m (proxy)

🤖 Generated with Claude Code

dd-octo-sts · 2026-05-29T13:26:01Z

🟢 Java Benchmark SLOs — All performance SLOs passed

Suite	Status
Startup	🟢 pass

SLO thresholds are defined here based on automatically generated metrics. A warning is raised when results are within 5% of the threshold.

PR vs. master results

Scenario	Candidate	master	Δ (95% CI of mean)
startup:insecure-bank:iast:Agent	13.96 s	13.86 s	[-0.2%; +1.7%] (no difference)
startup:insecure-bank:tracing:Agent	13.00 s	12.94 s	[-0.7%; +1.6%] (no difference)
startup:petclinic:appsec:Agent	16.57 s	16.31 s	[+0.4%; +2.7%] (maybe worse)
startup:petclinic:iast:Agent	16.61 s	16.58 s	[-1.1%; +1.5%] (no difference)
startup:petclinic:profiling:Agent	16.38 s	16.52 s	[-2.3%; +0.6%] (no difference)
startup:petclinic:tracing:Agent	14.90 s	15.67 s	[-12.6%; +2.8%] (unstable)

Commit: 65a53a69 · CI Pipeline · Benchmarking Platform UI

Load and DaCapo benchmarks can be triggered manually in the GitLab pipeline. Results will appear in the Benchmarking Platform UI after completion.

Cut the implicit TRACER_METRICS_MAX_PENDING default from 2048 (logical) to 128 on normal heap and to 64 at Xmx < 128 MB, and the implicit TRACER_METRICS_MAX_AGGREGATES default from 2048 to 256 at tight heap.

Why
---

The metrics inbox is an MpscArrayQueue<SpanSnapshot> sized to maxPending * LEGACY_BATCH_SIZE (64). With one ~120 B SpanSnapshot per slot, the prior 131072-slot default pinned ~15 MB worst-case in-flight when the aggregator stalled.

At Xmx <= ~128 MB the G1 survivor region is too small to absorb that footprint -- observed catastrophically at Xmx 64 MB on spring-petclinic where the inbox overflowed young gen and triggered To-space Exhausted → Full GC storms (0 r/s in the worst case).

New defaults bound the worst-case in-flight footprint at ~1 MB on normal heap and ~500 KB at tight heap, comfortably below typical survivor sizes and large enough to absorb the sub-second consumer stalls we actually see in practice (~0.8 s of buffer at 10 K spans/s on the normal-heap default).

Customers who explicitly configure TRACER_METRICS_MAX_PENDING are unaffected; the LEGACY_BATCH_SIZE multiplier still applies to overrides. Only the implicit defaults shrink.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a9acd8d1d5


chatgpt-codex-connector · 2026-05-29T16:53:30Z

+    final int defaultMaxAggregates = tightHeap ? 256 : 2048;
+    final int defaultMaxPending = tightHeap ? 64 : 128;


Sync metadata with the new tracer-metrics defaults

Changing the implicit default here leaves metadata/supported-configurations.json advertising DD_TRACE_TRACER_METRICS_MAX_PENDING and DD_TRACE_TRACER_METRICS_MAX_AGGREGATES as 2048. That file is the source used for supported-configuration metadata/docs, so users and config-inversion tooling will still see the old defaults even though normal heaps now get 128 pending and tight heaps get 64/256. Please update the metadata entry (or otherwise represent the heap-dependent default) along with this runtime change.


From Claude...

Thanks — pushed 9ab5e58e4e updating DD_TRACE_TRACER_METRICS_MAX_PENDING from 2048 → 128 to match the new normal-heap default.

DD_TRACE_TRACER_METRICS_MAX_AGGREGATES is left at 2048: that's still the normal-heap default (only the tight-heap branch changes it to 256). The current schema is {version, type, default, aliases} with a single string default — there's no way to encode "heap-dependent default" without a schema extension (e.g. a defaultExpression field or defaults: [{when, value}] shape). So both keys' tight-heap branches stay unrepresented in metadata; the typical customer (Xmx ≥ 128 MB) sees the documented value.

If a schema extension is in scope, happy to take that as a follow-up.

The default cut from 2048 → 128 needs the matching entry in metadata/supported-configurations.json so config-inversion tooling and supported-configuration docs reflect the new value.

DD_TRACE_TRACER_METRICS_MAX_AGGREGATES is left at 2048: the normal-heap default is unchanged. The metadata schema doesn't support heap-dependent defaults, so the tight-heap branch (64 / 256) isn't representable; the metadata reflects the normal-heap default that applies to the typical customer (Xmx >= 128 MB).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Default changed from 2048 to 128; version field must be incremented when the default value changes.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

dd-octo-sts · 2026-05-29T21:04:49Z

/merge

gh-worker-devflow-routing-ef8351 · 2026-05-29T21:04:53Z

View all feedbacks in Devflow UI.

2026-05-29 21:04:53 UTC ℹ️ Start processing command /merge

2026-05-29 21:04:57 UTC ℹ️ MergeQueue: pull request added to the queue

The expected merge time in master is approximately 1h (p90).

2026-05-29 23:05:30 UTC ❌ MergeQueue: The build pipeline has timeout

The merge request has been interrupted because the build 6485393595883034246 took longer than expected. The current limit for the base branch 'master' is 120 minutes.

dougqh added the labels type: enhancement (Enhancements and improvements), comp: core (Tracer core), tag: performance (Performance related changes), tag: no release notes (Changes to exclude from release notes), tag: ai generated (Largely based on code generated by an AI or LLM) May 29, 2026


dougqh marked this pull request as ready for review May 29, 2026 16:51

dougqh force-pushed the dougqh/css-tight-heap-defaults branch from 391704f to a9acd8d Compare May 29, 2026 16:51

dougqh requested a review from a team as a code owner May 29, 2026 16:51

dougqh requested a review from manuel-alvarez-alvarez May 29, 2026 16:51

chatgpt-codex-connector Bot reviewed May 29, 2026


dougqh requested a review from a team as a code owner May 29, 2026 17:08

dougqh requested review from AlexeyKuznetsov-DD, bric3 and sarahchen6 and removed request for a team May 29, 2026 17:08

sarahchen6 approved these changes May 29, 2026


AlexeyKuznetsov-DD reviewed May 29, 2026


Comment thread metadata/supported-configurations.json

chore(metadata): bump DD_TRACE_TRACER_METRICS_MAX_PENDING version to B

2b3c494


DataDog deleted a comment from datadog-datadog-prod-us1 Bot May 29, 2026

Merge branch 'master' into dougqh/css-tight-heap-defaults

c2d6fac

AlexeyKuznetsov-DD approved these changes May 29, 2026


dougqh enabled auto-merge May 29, 2026 18:21

dougqh added 2 commits May 29, 2026 15:36

Merge branch 'master' into dougqh/css-tight-heap-defaults

e62ede6

Merge branch 'master' into dougqh/css-tight-heap-defaults

65a53a6

dougqh added this pull request to the merge queue May 29, 2026

github-merge-queue Bot removed this pull request from the merge queue due to failed status checks May 29, 2026

dougqh wants to merge 6 commits into master from dougqh/css-tight-heap-defaults


3 participants
