
cmd/compile/internal,internal,runtime: add lightweight probabilistic data race detection #77793

Open

VladSaiocUber wants to merge 40 commits into golang:master from VladSaiocUber:dev.sw-race

Conversation

@VladSaiocUber
Contributor

DO NOT REVIEW (yet)

Expands upon lightweight probabilistic data race detection (Racelite): https://go-review.googlesource.com/c/go/+/718640

thepudds and others added 30 commits February 25, 2026 14:47
…robabilistic race detection

This is a runnable but rough prototype. DO NOT REVIEW.

Earlier today, Milind Chabbi described adding micropauses to detect
concurrent memory accesses via hardware debug registers and PMUs.

This CL is inspired by those micropauses, but instead attempts cheap,
software-only probabilistic race detection via just the Go compiler & runtime.

Sample report:

  racelite: multiple concurrent writes to same object detected
  racelite: write flag was changed for addr 0x17578e4f6230 at word 2 of object at base 0x17578e4f6220
  panic: racelite: multiple concurrent writes to same object detected

This CL updates the compiler to instrument writes (and eventually
reads) with some new runtime support for software-based best-effort
race detection.

That report was generated by running this program after building it
with the prototype:

	type Foo struct{ a, b, c int }
	p := &Foo{}

	// Do two racy writes.
	var wg sync.WaitGroup
	for range 2 {
		wg.Go(func() {
			for range 10000 {
				// Without a mutex, both goroutines have a data race here.
				// We detect the race in ~1% of these short program executions
				// (with higher or lower detection percentages depending on how
				// we adjust our sampling parameters).
				p.c = 2
			}
		})
	}
	wg.Wait()

The high-level approach is that we attempt to have a global agreement
across all user goroutines about which words are being monitored at
any given time (in the prototype, currently one out of 1024 addresses at
any moment, but that is tunable based on overhead vs. detection rates),
and then have a scheme to map a word being written to a location where
we can record that it is being written. Given there is global agreement
on how to do that, two distinct goroutines can both record that they
are writing to the same word of an object and throw a best-effort
fatal error, similar to how the runtime maps can throw a best-effort
fatal error on concurrent map writes. (The maps have the advantage
that there is a convenient struct to store the 'writing' field; we do
not have that for arbitrary heap objects, hence the scheme described
in this CL.)

In addition, that global agreement is not fixed, but rather updated
over time so that a continually different subset of addresses can be
monitored over the lifetime of a program. In particular, we update the
global agreement during STW so that the agreement can change
coherently across all user goroutines.

Some additional details about the prototype:

* We have two levels of sampling to reduce overhead.
* There are two random 32-bit numbers initialized at program start.
* When a write occurs, we use a mask of the address to compare against
  one of the random numbers to decide whether to do the work of
  finding the heap object's bounds.
* If that happens, we then use the second 32-bit random number to decide
  if we are monitoring the accessed word of the object. Each heap object
  can only have one word monitored at a time. This allows the storage
  location for the monitoring state to be per object, without conflict
  between words of the same object. (Currently, this is a footer added
  to the heap object, but it could perhaps be per span instead, or a
  different scheme.)
* These two random numbers are updated every STW. This is a
  convenient moment in time for us, thanks to the GC.
* Writes XOR 1 into the monitored object's footer. The same bit is
  written regardless of which word in the object is being written.
* To help with detection, it also does a microsleep for currently 1 out
  of every 1M writes.
* The prototype currently only detects concurrent writes, but it is
  hopefully a small step to add in reads as well.

This is very much just a first cut. All the numbers were chosen
for now with the desire to see something work via short/quick tests,
and are not necessarily reflective of what real numbers should be.

Returning to our sample program, if we bracket the racy write with a
mu.Lock and mu.Unlock, no race is reported, as expected.

Because we are tracking each word of each allocated object separately,
writing to separate words of the same heap object does not report
a race either, also as desired:

	wg.Go(func() {
		for range 100000 {
			// Not a race because this goroutine is only writing to field b.
			p.b = 1
		}
	})
	wg.Go(func() {
		for range 100000 {
			// Not a race because this goroutine is only writing to field c.
			p.c = 1
		}
	})

Some possible refinements:

For simplicity, that mask check on the address happens
in the runtime in my quick prototype, but I suspect
it would not be too bad for it to instead be a small amount
of code emitted by the compiler (do the mask on the addr,
check the result, and do nothing the large majority of
the time; rarely, the mask check would cause a call
into the runtime).

Some ideas for reducing the memory overhead are that
rather than having a word per heap object (which might
be something like 5-10% memory overhead for a typical
app), it could be one word per span, perhaps stored
on the mspan struct or in the foot of the span, which
might then be more like 0.1% memory overhead
(e.g., 8 bytes for an 8KiB span).

Some ideas for reducing the CPU overhead could be that
the instrumentation in the compiler itself could be
probabilistic. For example, if each location had
a 1/8 chance of being instrumented based on some
randomization flag, then the CPU overhead could likely
drop proportionally at the cost then of needing to
deploy multiple binaries to get a mix of instrumentation
sites.

Another approach might be a simpler boolean check
in the instrumentation site, which could then be enabled
globally only some percentage of the time (say, 1% or 5%),
where checking the boolean might place it in the ballpark
of the current disabled write barrier check, which is considered
small. Or perhaps the instrumentation could be behind the
write barrier check, so it only checks when the write barrier
is enabled but has ~0 additional cost when the write barrier
is disabled.

Change-Id: Id95f5a3db1f1b56f13362f2529b0887ddcbe6971
@gopherbot
Contributor

This PR (HEAD: ec63211) has been imported to Gerrit for code review.

Please visit Gerrit at https://go-review.googlesource.com/c/go/+/748980.

Important tips:

  • Don't comment on this PR. All discussion takes place in Gerrit.
  • You need a Gmail or other Google account to log in to Gerrit.
  • To change your code in response to feedback:
    • Push a new commit to the branch used by your GitHub PR.
    • A new "patch set" will then appear in Gerrit.
    • Respond to each comment by marking as Done in Gerrit if implemented as suggested. You can alternatively write a reply.
    • Critical: you must click the blue Reply button near the top to publish your Gerrit responses.
    • Multiple commits in the PR will be squashed by GerritBot.
  • The title and description of the GitHub PR are used to construct the final commit message.
    • Edit these as needed via the GitHub web interface (not via Gerrit or git).
    • You should word wrap the PR description at ~76 characters unless you need longer lines (e.g., for tables or URLs).
  • See the Sending a change via GitHub and Reviews sections of the Contribution Guide as well as the FAQ for details.

@gopherbot
Contributor

Message from Gopher Robot:

Patch Set 1:

(1 comment)


Please don’t reply on this GitHub thread. Visit golang.org/cl/748980.
After addressing review feedback, remember to publish your drafts!

@gopherbot
Contributor

Message from Georgian-Vlad Saioc:

Patch Set 1:

(1 comment)


Please don’t reply on this GitHub thread. Visit golang.org/cl/748980.
After addressing review feedback, remember to publish your drafts!

rsc and others added 2 commits February 26, 2026 13:33
In some misguided attempt at "cleanup", Google Cloud has
decided to retire 'gsutil' in favor of 'gcloud storage' instead
of leaving an entirely backwards-compatible wrapper so
that client scripts and muscle memory keep working.

In addition to breaking customers this way, they are also
sending AI bots around "cleaning up" old usages with scary
warnings that maybe the changes will break your entire world.
This is even more misguided, of course, and resulted in us
receiving CL 748820 (originally GitHub PR golang#77787)
and then me receiving a private email asking for it to be merged.

It was easier to recreate the 3-line CL myself than to
enumerate everything that was wrong with that CL's
commit message.

I hope that only Google teams are being subjected to this.

Fixes golang#77787.

Change-Id: I624ad4aff7f488b191e195b72e564dbd78440883
Reviewed-on: https://go-review.googlesource.com/c/go/+/748900
TryBot-Bypass: Russ Cox <rsc@golang.org>
Auto-Submit: Russ Cox <rsc@golang.org>
Reviewed-by: Alan Donovan <adonovan@google.com>
Reviewed-by: Jonathan Amsterdam <jba@google.com>
@gopherbot
Contributor

This PR (HEAD: 0d44bf5) has been imported to Gerrit for code review.

Please visit Gerrit at https://go-review.googlesource.com/c/go/+/748980.

