cmd/compile/internal,internal,runtime: add lightweight probabilistic data race detection#77793
Open
VladSaiocUber wants to merge 40 commits intogolang:masterfrom
Open
cmd/compile/internal,internal,runtime: add lightweight probabilistic data race detection#77793VladSaiocUber wants to merge 40 commits intogolang:masterfrom
VladSaiocUber wants to merge 40 commits intogolang:masterfrom
Conversation
…robabilistic race detection
This is a runnable but rough prototype. DO NOT REVIEW.
Earlier today, Milind Chabbi described adding micropauses to detect
concurrent memory accesses via hardware debug registers and PMUs.
This CL is inspired by those micropauses, but instead attempts software
cheap probabilistic race detection via just the Go compiler & runtime.
Sample report:
racelite: multiple concurrent writes to same object detected
racelite: write flag was changed for addr 0x17578e4f6230 at word 2 of object at base 0x17578e4f6220
panic: racelite: multiple concurrent writes to same object detecte
This CL updates the compiler for instrument writes (and eventually
reads) with some new runtime support for software-based best effort
race detection.
That report was generated by running this program after building it
with the prototype:
type Foo struct{ a, b, c int }
p := &Foo{}
// Do two racy writes.
var wg sync.WaitGroup
for range 2 {
wg.Go(func() {
for range 10000 {
// Without a mutex, both goroutines have a data race here.
// We detect the race in ~1% of these short program executions
// (with higher or lower detection percentages depending how we
// adjust our sampling parameters).
p.c = 2
}
})
}
The high-level approach is that we attempt to have a global agreement
across all user-goroutines about which words are being monitored at
any given time (in the prototype, currently one out of 1024 addresses at
any moment, but that is tunable based on overhead vs. detection rates),
and then have a scheme to move from a word being written to a location
we can record that it is being written. Given there is global agreement
on how to do that, two distinct goroutines can both record that they
are writing to the same word of an object and throw a best effort
fatal error, similar to how the runtime maps can throw a best
effort fatal error on concurrent map writes. (The maps have an
advantage that there is a convenient struct to store the 'writing'
field; we do not have that for arbitrary heap objects, so hence
the scheme described in this CL).
In addition, that global agreement is not fixed, but rather updated
overtime so that a continually different subset of addresses can be
monitored over the lifetime of a program. In particular, we update the
global agreement during STW so that the agreement can change
coherently across all user goroutines.
Some additional details about the prototype:
* We have two levels of sampling to reduce overhead.
* There are two random 32-bit numbers initialized at program start.
* When a write occurs, we use a mask of the address to compare against
one of the random numbers to decide whether to do a the work of
finding the heap object's bounds.
* If that happens, we then use the second 32-bit random number to decide
if we are monitoring the accessed word of the object. Each heap object
can only have one word monitored at a time. This allows the storage
location for the monitoring state can be per object without conflict
between words of the same object. (Currently, this is a footer added
to the heap object, but perhaps could be per span instead, or a
different scheme).
* These two random numbers are updated every STW. This is a
convenient moment in time for us, thanks to the GC.
* Writes XOR 1 into the monitored object's footer. The same bit is
written regardless of which word in the object is being written.
* To help with detection, it also does a microsleep for currently 1 out
of every 1M writes.
* The prototype currently only detects concurrent writes, but it is
hopefully a small step to add in reads as well.
This is very much just a first cut. All the numbers were chosen
for now with the desire to see something work via short/quick tests,
and are not necessarily reflective of what real numbers should be.
Returning to our sample program, if we bracket the racy write with a
mu.Lock and mu.Unlock, no race is reported, as expected.
Because we are tracking each word of each allocated object separately,
writing to separate words of the same heap object does not report
a race either, also as desired:
wg.Go(func() {
for range 100000 {
// Not a race because this goroutine is only writing to field b.
p.b = 1
}
})
wg.Go(func() {
for range 100000 {
// Not a race because this goroutine is only writing to field c.
p.c = 1
}
})
Some possible refinements:
For simplicity, that mask check on the address happens
in the runtime in my quick prototype, but I suspect
it would not be too bad to instead be a small amount
of code emited by the runtime (do the mask on the addr,
check the result, and do nothing the large majority of
the time; rarely, the mask check would cause it to call
into the runtime.)
Some ideas for reducing the memory overhead are that
rather than having a word per heap object (which might
be something like 5-10% memory overhead for a typical
app), it could be one word per span, perhaps stored
on the mspan struct or in the foot of the span, which
then might be more something like 0.1% memory overhead
(e.g., 8 bytes for an 8KiB span).
Some ideas for reducing the CPU overhead could be that
the instrumentation in the compiler itself could be
probabalisitic. For example, if each location had
a 1/8 chance of being instrumented based on some
randomization flag, then the CPU overhead could likely
drop proportionally at the cost then of needing to
deploy multiple binaries to get a mix of instrumentation
sites.
Another approach might be a simpler boolean check
in the instrumentation site, which could then be enabled
globally only some percentage of the time (say, 1% or 5%),
where checking the boolean might place it in the ball park
of the current disabled write barrier check, which is considered
small. Or perhaps the instrumentation could be behind the
write barrier check, so it only checks when the write barrier
is enabled but has ~0 additional cost when the write barrier
is disabled.
Change-Id: Id95f5a3db1f1b56f13362f2529b0887ddcbe6971
… stack, S2 is second temporal stack)
Contributor
|
This PR (HEAD: ec63211) has been imported to Gerrit for code review. Please visit Gerrit at https://go-review.googlesource.com/c/go/+/748980. Important tips:
|
Contributor
|
Message from Gopher Robot: Patch Set 1: (1 comment) Please don’t reply on this GitHub thread. Visit golang.org/cl/748980. |
Contributor
|
Message from Georgian-Vlad Saioc: Patch Set 1: (1 comment) Please don’t reply on this GitHub thread. Visit golang.org/cl/748980. |
In some misguided attempt at "cleanup", Google Cloud has decided to retire 'gsutil' in favor of 'gcloud storage' instead of leaving an entirely backwards-compatible wrapper so that client scripts and muscle memory keep working. In addition to breaking customers this way, they are also sending AI bots around "cleaning up" old usages with scary warnings that maybe the changes will break your entire world. This is even more misguided, of course, and resulted in us receiving CL 748820 (originally GitHub PR golang#77787) and then me receiving a private email asking for it to be merged. It was easier to recreate the 3-line CL myself than to enumerate everything that was wrong with that CL's commit message. I hope that only Google teams are being subjected to this. Fixes golang#77787. Change-Id: I624ad4aff7f488b191e195b72e564dbd78440883 Reviewed-on: https://go-review.googlesource.com/c/go/+/748900 TryBot-Bypass: Russ Cox <rsc@golang.org> Auto-Submit: Russ Cox <rsc@golang.org> Reviewed-by: Alan Donovan <adonovan@google.com> Reviewed-by: Jonathan Amsterdam <jba@google.com>
Contributor
|
This PR (HEAD: 0d44bf5) has been imported to Gerrit for code review. Please visit Gerrit at https://go-review.googlesource.com/c/go/+/748980. Important tips:
|
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
DO NOT REVIEW (yet)
Expanded upon lightweight probabilistic data race detection (Racelite) https://go-review.googlesource.com/c/go/+/718640