Skip to content

fix(tsan): atomic reads in findCallTrace to eliminate data race#559

Merged
jbachorik merged 3 commits into
mainfrom
jb/fix-calltrace-tsan-race
May 29, 2026
Merged

fix(tsan): atomic reads in findCallTrace to eliminate data race#559
jbachorik merged 3 commits into
mainfrom
jb/fix-calltrace-tsan-race

Conversation

@jbachorik
Copy link
Copy Markdown
Collaborator

What does this PR do?:
Replaces two plain (non-atomic) reads of keys[slot] and one plain .trace field access in CallTraceHashTable::findCallTrace with proper atomic loads, and adds a regression test that exercises the race scenario.

Motivation:
TSan on aarch64 reported a data race between:

  • findCallTrace reading keys[slot] with a plain load (line 235)
  • put writing keys[slot] with an atomic CAS (line 319)

The race is real: when a table is promoted to prev after expansion, threads that loaded the old table pointer before the CAS swap continue inserting into it atomically, while threads working on the new table call findCallTrace(prev, hash) with non-atomic reads. On weakly-ordered architectures such as aarch64 this is observable.

Two fixes in findCallTrace:

  1. keys[slot] reads → __atomic_load_n(&keys[slot], __ATOMIC_ACQUIRE) (pairs with the ACQ_REL CAS in put)
  2. values()[slot].trace plain access → values()[slot].acquireTrace() (pairs with the RELEASE store in setTrace). If the slot is still PREPARING, treat it as not found so callers create a fresh entry.

Additional Notes:
The PREPARING sentinel is a non-null pointer; returning nullptr for an in-progress slot means the caller (put) creates a new CallTrace for that hash. Duplicate entries across the old and new table for the same hash are benign because collect() deduplicates by inserting into a std::unordered_set.

How to test the change?:
New regression test FindCallTraceAtomicReadRaceTest in stress_callTraceStorage.cpp:

  • Pre-fills the hash table to 74 % of the 64 K initial capacity (just below the 75 % expansion threshold)
  • Releases 16 threads simultaneously so some cross the threshold, triggering expansion while others still hold a pointer to the old (now-prev) table
  • Under TSan this repros the original race if the atomic loads are removed

Run with TSan enabled:

./gradlew :ddprof-lib:testTSan --tests "*FindCallTraceAtomicReadRaceTest*"

For Datadog employees:

  • This PR doesn't touch any of that.
  • JIRA: [JIRA-XXXX]

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes a TSan-reported race in CallTraceHashTable::findCallTrace by making reads from concurrently modified hash-table slots atomic, and adds a stress regression test intended to reproduce the expansion/prev-table race.

Changes:

  • Uses atomic acquire loads for keys[slot] in findCallTrace.
  • Uses acquireTrace() and handles PREPARING slots as not found.
  • Adds a concurrent stress test targeting the expansion race scenario.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

File Description
ddprof-lib/src/main/cpp/callTraceHashTable.cpp Updates previous-table lookup to use atomic key and trace reads.
ddprof-lib/src/test/cpp/stress_callTraceStorage.cpp Adds a TSan regression stress test for concurrent expansion and lookup.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread ddprof-lib/src/test/cpp/stress_callTraceStorage.cpp
@dd-octo-sts
Copy link
Copy Markdown
Contributor

dd-octo-sts Bot commented May 29, 2026

CI Test Results

Run: #26630635589 | Commit: bb28a2e | Duration: 14m 25s (longest job)

All 32 test jobs passed

Status Overview

JDK glibc-aarch64/debug glibc-amd64/debug musl-aarch64/debug musl-amd64/debug
8 - - -
8-ibm - - -
8-j9 - -
8-librca - -
8-orcl - - -
11 - - -
11-j9 - -
11-librca - -
17 - -
17-graal - -
17-j9 - -
17-librca - -
21 - -
21-graal - -
21-librca - -
25 - -
25-graal - -
25-librca - -

Legend: ✅ passed | ❌ failed | ⚪ skipped | 🚫 cancelled

Summary: Total: 32 | Passed: 32 | Failed: 0


Updated: 2026-05-29 10:12:29 UTC

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@jbachorik jbachorik marked this pull request as ready for review May 29, 2026 09:28
@jbachorik jbachorik requested a review from a team as a code owner May 29, 2026 09:28
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 475981b613

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread ddprof-lib/src/test/cpp/stress_callTraceStorage.cpp Outdated
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@jbachorik jbachorik merged commit 006259b into main May 29, 2026
108 checks passed
@jbachorik jbachorik deleted the jb/fix-calltrace-tsan-race branch May 29, 2026 09:53
@github-actions github-actions Bot added this to the 1.44.0 milestone May 29, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants