[FIX] Fix race conditions in Auditor bookie change detection#4728
Draft
lhotari wants to merge 5 commits intoapache:masterfrom
Draft
[FIX] Fix race conditions in Auditor bookie change detection#4728lhotari wants to merge 5 commits intoapache:masterfrom
lhotari wants to merge 5 commits intoapache:masterfrom
Conversation
## Problem Investigating CI flakiness in `BookieAutoRecoveryTest` revealed three related race conditions in `Auditor` that could cause a bookie departure to go undetected until the next periodic audit run (default: 86400 s), causing tests that wait up to 60 s for underreplicated-ledger marking to time out. ### Race 1 – watcher-provided bookie set was discarded `watchBookieChanges()` passed `bookies -> submitAuditTask()` as the listener, silently ignoring the current bookie set delivered by the ZK watcher. Inside `submitAuditTask()` a fresh `getAvailableBookies()` ZK call was made from an executor lambda that could run well after the watcher fired (e.g. while `checkAllLedgers` was occupying the single-threaded executor). In that window the two views of the cluster could diverge, causing `bookiesToBeAudited` to end up empty and the departure to be silently skipped. Fix: capture `pendingWritableBookies` / `pendingReadOnlyBookies` atomically in the watcher callback and read them in `runAuditTask()`, falling back to `getAvailableBookies()` only when the watchers have not yet fired (e.g. direct test calls via `submitAuditTask()`). ### Race 2 – duplicate audit tasks could be queued per change type Rapid successive watcher callbacks (or a slow executor) could queue many identical audit tasks, each making its own ZK calls, wasting resources and risking interleaved state updates to `knownBookies` / `bookiesToBeAudited`. Fix: separate `AtomicBoolean` queued-flags (`writableAuditTaskQueued`, `readOnlyAuditTaskQueued`) ensure at most one task is *queued* per change type at a time. The flag is cleared at the *start* of the task (not at submission), so watcher callbacks that arrive while a task is already running queue exactly one follow-up task rather than being silently dropped. ### Race 3 – knownBookies initialised after watcher registration `start()` called `watchBookieChanges()` before assigning `knownBookies`. A watcher callback firing immediately (the ZK registration client fires listeners synchronously with the current bookie set when one is already cached) could race with the subsequent `knownBookies = admin.getAllBookies()` assignment. Although the existing `synchronized` on `start()` and `submitAuditTask()` prevented the *lambda* from running before `knownBookies` was set, the field itself was not `volatile`, leaving a visibility gap for the executor thread. Fix: swap the initialisation order so `knownBookies` is populated before `watchBookieChanges()` is called, and declare the field `volatile` to guarantee cross-thread visibility of the reference assignment. ## Changes - `Auditor.knownBookies`: add `volatile` - New fields `pendingWritableBookies`, `pendingReadOnlyBookies` (`AtomicReference<Set<String>>`): hold the latest bookie set from each watcher type - New fields `writableAuditTaskQueued`, `readOnlyAuditTaskQueued` (`AtomicBoolean`): deduplication flags per change type - `watchBookieChanges()`: populate the pending sets and call new `submitAuditTaskForBookieChange(boolean)` instead of `submitAuditTask()` - `submitAuditTaskForBookieChange(boolean)`: queues a task only when no task of that type is already queued; clears the flag at task start - `submitAuditTask()`: unchanged public/test API; now delegates to the extracted `runAuditTask()` - `runAuditTask()`: extracted audit logic; uses watcher-captured bookie sets when available, falls back to `getAvailableBookies()` - `start()`: initialise `knownBookies` before `watchBookieChanges()` Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…cation The previous AtomicBoolean approach (flag cleared at task *start*) had a subtle race: between the flag being cleared and `runAuditTask()` actually reading `pendingWritableBookies`, a watcher callback could see `flag=false` and submit a second concurrent task. Because the flag gives no information about a task that is already *running*, the invariant "at most one queued + one in-progress" was not cleanly enforced. Replace with an `AtomicInteger` counter (max 2) that is: - incremented synchronously at submission time (inside the existing `synchronized` block that also protects the `isShutdown()` check) - decremented in a `finally` block when the task *finishes* This accurately counts all live tasks (running or queued) for the full duration of their lifetime. When the counter reaches 2 (one in-progress + one queued), further watcher callbacks are dropped — the queued task will read the latest `pendingWritableBookies` / `pendingReadOnlyBookies` when it eventually executes, so no additional task provides any benefit. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… absent The previous fallback in runAuditTask() called getAvailableBookies() (both writable + readonly ZK reads) whenever *either* pending snapshot was null. This was wrong: writable and readonly watcher callbacks are independent and fire separately, so one snapshot can be present while the other is not yet populated (e.g. right after startup when only one type of change has occurred). Replace the single else-branch with three cases: - Both snapshots present: combine them (no ZK call, as before). - Only writable snapshot present: use it + call admin.getReadOnlyBookies(). - Only readonly snapshot present: call admin.getAvailableBookies() + use it. - Neither present (direct test calls via submitAuditTask()): call the full getAvailableBookies() as before. This ensures we always use the watcher-captured data for the half that is available and only make a targeted ZK round-trip for the missing half. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…knownBookies
submitAuditTask() was submitting runAuditTask(), which compares
pendingWritableBookies against knownBookies to detect lost bookies.
knownBookies is seeded from admin.getAllBookies() at auditor startup, so
any bookie that registers after the auditor started and dies before being
processed by runAuditTask() is never added to knownBookies and remains
invisible to the subtraction-based loss detection.
This caused testEmptyLedgerLosesQuorumEventually to fail when run after
testNoSuchLedgerExists: bookies 33105 and 36375 registered after auditor
42899 started, two watcher tasks were queued (one per registration), and
by the time they ran (after checkAllLedgers completed 112ms later) bookie
36375 had already died. pendingWritableBookies correctly reflected
{42899,33105} but since 36375 was never in knownBookies the loss went
undetected. The explicit submitAuditTask().get() call in the test suffered
the same gap.
Fix: submitAuditTask() now submits auditorBookieCheckTask.startAudit(false)
instead of runAuditTask(). The full auditBookies() scan uses ledger metadata
to find all bookies that own ledgers and checks their availability, detecting
36375 via the ledger it appears in regardless of knownBookies state.
The runAuditTask() / knownBookies mechanism is still correct and sufficient
for its intended purpose: detecting losses for bookies that were already
known. The register-and-die edge case is an inherent limitation of that
lightweight approach and is handled by the periodic auditorBookieCheckTask
and by explicit submitAuditTask() calls.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Member
Author
|
Although this fixes the flaky tests, there are multiple remaining race conditions in the auditor which could make it inoperational. For example, this doesn't look right for all cases: This is also surprising: If the first time fails, registering the watcher would be silently ignored. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Motivation
While investigating a CI flakiness in
BookieAutoRecoveryTest(PR #4711) — where tests waiting up to 60 s for ledgers to be marked as underreplicated would time out — several related race conditions inAuditorwere found that could allow a bookie departure to go undetected until the next periodic audit run (default: 86400 s).Analysis of the surefire logs for the failing CI run showed a ~70-second silence from the auditor after a bookie was killed, followed by a new auditor election that eventually caught the departure. This pointed to the auditor's bookie-change detection path rather than a test-setup issue.
Races fixed
Race 1 – watcher-provided bookie set was discarded
watchBookieChanges()registeredbookies -> submitAuditTask()as the ZK listener, ignoring the current bookie set already delivered by the callback. InsidesubmitAuditTask()a freshgetAvailableBookies()ZK call was made from an executor lambda that could execute well after the watcher fired — for example whileAuditorCheckAllLedgersTaskwas occupying the single-threaded executor. In that window the two snapshots of the cluster could diverge, causingbookiesToBeAuditedto remain empty and the departure to be silently skipped.Fix: capture
pendingWritableBookies/pendingReadOnlyBookiesatomically in each watcher callback and consume them in the extractedrunAuditTask(). Fall back to a livegetAvailableBookies()call only when one watcher type has not yet fired (to avoid an unnecessary ZK round-trip for the already-known half).Race 2 – duplicate audit tasks could be queued per change type
Rapid successive watcher callbacks (or a slow executor) could enqueue many identical audit tasks, each making its own ZK calls and risking interleaved mutations to
knownBookies/bookiesToBeAudited.Fix: replace the original
AtomicBooleanqueued-flag with anAtomicIntegercounter (writableAuditTaskCount,readOnlyAuditTaskCount). The counter tracks running + queued tasks per change type and is capped atMAX_AUDIT_TASKS_PER_TYPE = 2(one in-progress + one queued). The queued task always reads the latestpendingWritableBookies/pendingReadOnlyBookieswhen it starts, so queuing more than one additional task would only duplicate work without improving correctness.Race 3 –
knownBookiesinitialised after watcher registrationstart()calledwatchBookieChanges()before assigningknownBookies. A watcher callback firing immediately could race with the subsequentknownBookies = admin.getAllBookies()assignment.Fix: swap the initialisation order so
knownBookiesis populated beforewatchBookieChanges()is called.Race 4 –
submitAuditTask()could not detect bookies that registered and died after auditor startupsubmitAuditTask()(used by tests to explicitly trigger and wait for an audit) submittedrunAuditTask(), which detects bookie losses by comparingpendingWritableBookiesagainstknownBookies.knownBookiesis seeded fromadmin.getAllBookies()at auditor startup, so any bookie that registered after the auditor started and died before being processed byrunAuditTask()was never added toknownBookiesand was invisible to the subtraction-based detection — regardless of the task cap.This caused
testEmptyLedgerLosesQuorumEventuallyto fail when run aftertestNoSuchLedgerExists: bookies 33105 and 36375 registered after auditor 42899 started; two watcher tasks were queued and were both blocked behindcheckAllLedgers(112 ms); bookie 36375 died during that window; by the time the tasks ran,pendingWritableBookiescorrectly reflected{42899, 33105}but since 36375 was never inknownBookiesthe loss went undetected. IncreasingMAX_AUDIT_TASKS_PER_TYPEdoes not help here — all tasks readpendingWritableBookies.get()when they execute, so they always see the latest snapshot (without 36375), and no amount of additional queued tasks changes that.Fix:
submitAuditTask()now submitsauditorBookieCheckTask.startAudit(false)— the full ledger-metadata-based audit — rather thanrunAuditTask(). The full scan detects any bookie that appears in ledger ensembles but is no longer available, regardless ofknownBookiesstate.The
runAuditTask()/knownBookiesmechanism remains correct and sufficient for its intended purpose: detecting losses for bookies that were already known. The register-and-die edge case is an inherent limitation of the lightweight watcher-based approach and is handled by the periodicauditorBookieCheckTaskand by explicitsubmitAuditTask()calls.Changes summary
Auditor.knownBookiesvolatile; initialise beforewatchBookieChanges()pendingWritableBookies,pendingReadOnlyBookiesAtomicReference<Set<String>>fields holding the latest bookie set per watcher typewritableAuditTaskCount,readOnlyAuditTaskCountAtomicIntegercounters (running+queued per type, capped at 2) replacing the originalAtomicBooleanflagswatchBookieChanges()submitAuditTaskForBookieChange(boolean)submitAuditTaskForBookieChange(boolean)runAuditTaskonly when the per-type count allowssubmitAuditTask()auditorBookieCheckTask.startAudit(false)(full audit via ledger metadata) instead ofrunAuditTask()runAuditTask()start()knownBookiesbefore callingwatchBookieChanges()Testing
Existing tests in
BookieAutoRecoveryTestexercise all affected paths and continue to pass. The previously flaky test (testEmptyLedgerLosesQuorumEventually) now passes reliably when run as part of the full test class.