
[FIX] Fix race conditions in Auditor bookie change detection#4728

Draft
lhotari wants to merge 5 commits into apache:master from lhotari:lh-fix-auditor-race-condition

Conversation


@lhotari lhotari commented Mar 13, 2026

Motivation

While investigating CI flakiness in BookieAutoRecoveryTest (PR #4711), where tests waiting up to 60 s for ledgers to be marked as underreplicated would time out, several related race conditions were found in Auditor that could allow a bookie departure to go undetected until the next periodic audit run (default: 86400 s, i.e. once a day).

Analysis of the surefire logs for the failing CI run showed a ~70-second silence from the auditor after a bookie was killed, followed by a new auditor election that eventually caught the departure. This pointed to the auditor's bookie-change detection path rather than a test-setup issue.

Races fixed

Race 1 – watcher-provided bookie set was discarded

watchBookieChanges() registered `bookies -> submitAuditTask()` as the ZK listener, ignoring the current bookie set already delivered by the callback. Inside submitAuditTask(), a fresh getAvailableBookies() ZK call was made from an executor lambda that could execute well after the watcher fired, for example while AuditorCheckAllLedgersTask was occupying the single-threaded executor. In that window the two snapshots of the cluster could diverge, causing bookiesToBeAudited to remain empty and the departure to be silently skipped.

Fix: capture pendingWritableBookies / pendingReadOnlyBookies atomically in each watcher callback and consume them in the extracted runAuditTask(). Fall back to a live getAvailableBookies() call only when one watcher type has not yet fired (to avoid an unnecessary ZK round-trip for the already-known half).
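As a rough sketch of the capture-and-consume pattern (field and method names mirror the PR description; the listener wiring and the read-only half are omitted, and this is an illustration rather than the actual patch):

```java
import java.util.Set;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch: the watcher callback stores an immutable snapshot of the
// bookie set it was handed, and the audit task later reads that snapshot instead
// of making a fresh ZK call that could see a diverged view of the cluster.
class BookieChangeSnapshots {
    private final AtomicReference<Set<String>> pendingWritableBookies = new AtomicReference<>();

    // Invoked by the registration-client listener with the current writable bookies.
    void onWritableBookiesChanged(Set<String> bookies) {
        pendingWritableBookies.set(Set.copyOf(bookies)); // capture atomically
    }

    // Read by the audit task; null means this watcher type has not fired yet,
    // in which case the caller falls back to a live getAvailableBookies() call.
    Set<String> writableSnapshot() {
        return pendingWritableBookies.get();
    }
}
```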

Race 2 – duplicate audit tasks could be queued per change type

Rapid successive watcher callbacks (or a slow executor) could enqueue many identical audit tasks, each making its own ZK calls and risking interleaved mutations to knownBookies / bookiesToBeAudited.

Fix: replace the original AtomicBoolean queued-flag with an AtomicInteger counter (writableAuditTaskCount, readOnlyAuditTaskCount). The counter tracks running + queued tasks per change type and is capped at MAX_AUDIT_TASKS_PER_TYPE = 2 (one in-progress + one queued). The queued task always reads the latest pendingWritableBookies / pendingReadOnlyBookies when it starts, so queuing more than one additional task would only duplicate work without improving correctness.
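A minimal sketch of the capped-counter idea, assuming a single-threaded executor like the one the Auditor uses (the real code additionally guards shutdown inside a synchronized block; names here are illustrative):

```java
import java.util.concurrent.Executor;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of the per-change-type task cap described above.
class CappedAuditSubmitter {
    static final int MAX_AUDIT_TASKS_PER_TYPE = 2; // one in-progress + one queued
    private final AtomicInteger writableAuditTaskCount = new AtomicInteger();

    // Returns true if the task was accepted; extra watcher callbacks are dropped,
    // because the already-queued task will read the latest pending snapshot anyway.
    boolean trySubmit(Runnable auditTask, Executor executor) {
        int current;
        do {
            current = writableAuditTaskCount.get();
            if (current >= MAX_AUDIT_TASKS_PER_TYPE) {
                return false;
            }
        } while (!writableAuditTaskCount.compareAndSet(current, current + 1));
        executor.execute(() -> {
            try {
                auditTask.run();
            } finally {
                writableAuditTaskCount.decrementAndGet(); // count covers running + queued
            }
        });
        return true;
    }
}
```

Decrementing in a `finally` block when the task finishes (rather than clearing a flag when it starts) is what lets the counter track the full running-plus-queued lifetime.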

Race 3 – knownBookies initialised after watcher registration

start() called watchBookieChanges() before assigning knownBookies. A watcher callback firing immediately could race with the subsequent knownBookies = admin.getAllBookies() assignment.

Fix: swap the initialisation order so knownBookies is populated before watchBookieChanges() is called.

Race 4 – submitAuditTask() could not detect bookies that registered and died after auditor startup

submitAuditTask() (used by tests to explicitly trigger and wait for an audit) submitted runAuditTask(), which detects bookie losses by comparing pendingWritableBookies against knownBookies. knownBookies is seeded from admin.getAllBookies() at auditor startup, so any bookie that registered after the auditor started and died before being processed by runAuditTask() was never added to knownBookies and was invisible to the subtraction-based detection — regardless of the task cap.
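The subtraction-based detection and its blind spot can be illustrated with a minimal sketch (bookie IDs and names are invented for the example; this is not the real Auditor code):

```java
import java.util.HashSet;
import java.util.Set;

// Minimal illustration of subtraction-based loss detection: lost = known \ available.
class LossDetectionSketch {
    static Set<String> detectLost(Set<String> knownBookies, Set<String> availableBookies) {
        Set<String> lost = new HashSet<>(knownBookies);
        lost.removeAll(availableBookies);
        return lost;
    }
}
```

A bookie that registers after knownBookies is seeded and dies before the next audit task runs never enters knownBookies, so it can never appear in the result of the subtraction, which is exactly the gap described above.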

This caused testEmptyLedgerLosesQuorumEventually to fail when run after testNoSuchLedgerExists: bookies 33105 and 36375 registered after auditor 42899 started; two watcher tasks were queued and were both blocked behind checkAllLedgers (112 ms); bookie 36375 died during that window; by the time the tasks ran, pendingWritableBookies correctly reflected {42899, 33105} but since 36375 was never in knownBookies the loss went undetected. Increasing MAX_AUDIT_TASKS_PER_TYPE does not help here — all tasks read pendingWritableBookies.get() when they execute, so they always see the latest snapshot (without 36375), and no amount of additional queued tasks changes that.

Fix: submitAuditTask() now submits auditorBookieCheckTask.startAudit(false) — the full ledger-metadata-based audit — rather than runAuditTask(). The full scan detects any bookie that appears in ledger ensembles but is no longer available, regardless of knownBookies state.

The runAuditTask() / knownBookies mechanism remains correct and sufficient for its intended purpose: detecting losses for bookies that were already known. The register-and-die edge case is an inherent limitation of the lightweight watcher-based approach and is handled by the periodic auditorBookieCheckTask and by explicit submitAuditTask() calls.

Changes summary

| What | Change |
| --- | --- |
| `Auditor.knownBookies` | Add `volatile`; initialise before `watchBookieChanges()` |
| `pendingWritableBookies`, `pendingReadOnlyBookies` | New `AtomicReference<Set<String>>` fields holding the latest bookie set per watcher type |
| `writableAuditTaskCount`, `readOnlyAuditTaskCount` | New `AtomicInteger` counters (running + queued per type, capped at 2) replacing the original `AtomicBoolean` flags |
| `watchBookieChanges()` | Populate the pending sets and call `submitAuditTaskForBookieChange(boolean)` |
| `submitAuditTaskForBookieChange(boolean)` | Queues `runAuditTask()` only when the per-type count allows |
| `submitAuditTask()` | Now submits `auditorBookieCheckTask.startAudit(false)` (full audit via ledger metadata) instead of `runAuditTask()` |
| `runAuditTask()` | Uses watcher-captured bookie sets; falls back to live ZK only when one watcher type is absent |
| `start()` | Initialise `knownBookies` before calling `watchBookieChanges()` |

Testing

Existing tests in BookieAutoRecoveryTest exercise all affected paths and continue to pass. The previously flaky test (testEmptyLedgerLosesQuorumEventually) now passes reliably when run as part of the full test class.

lhotari and others added 5 commits March 13, 2026 11:35
## Problem

Investigating CI flakiness in `BookieAutoRecoveryTest` revealed three
related race conditions in `Auditor` that could cause a bookie departure
to go undetected until the next periodic audit run (default: 86400 s),
causing tests that wait up to 60 s for underreplicated-ledger marking to
time out.

### Race 1 – watcher-provided bookie set was discarded

`watchBookieChanges()` passed `bookies -> submitAuditTask()` as the
listener, silently ignoring the current bookie set delivered by the ZK
watcher.  Inside `submitAuditTask()` a fresh `getAvailableBookies()` ZK
call was made from an executor lambda that could run well after the
watcher fired (e.g. while `checkAllLedgers` was occupying the
single-threaded executor).  In that window the two views of the cluster
could diverge, causing `bookiesToBeAudited` to end up empty and the
departure to be silently skipped.

Fix: capture `pendingWritableBookies` / `pendingReadOnlyBookies`
atomically in the watcher callback and read them in `runAuditTask()`,
falling back to `getAvailableBookies()` only when the watchers have not
yet fired (e.g. direct test calls via `submitAuditTask()`).

### Race 2 – duplicate audit tasks could be queued per change type

Rapid successive watcher callbacks (or a slow executor) could queue many
identical audit tasks, each making its own ZK calls, wasting resources
and risking interleaved state updates to `knownBookies` /
`bookiesToBeAudited`.

Fix: separate `AtomicBoolean` queued-flags (`writableAuditTaskQueued`,
`readOnlyAuditTaskQueued`) ensure at most one task is *queued* per change
type at a time.  The flag is cleared at the *start* of the task (not at
submission), so watcher callbacks that arrive while a task is already
running queue exactly one follow-up task rather than being silently
dropped.

### Race 3 – knownBookies initialised after watcher registration

`start()` called `watchBookieChanges()` before assigning `knownBookies`.
A watcher callback firing immediately (the ZK registration client fires
listeners synchronously with the current bookie set when one is already
cached) could race with the subsequent `knownBookies = admin.getAllBookies()`
assignment.  Although the existing `synchronized` on `start()` and
`submitAuditTask()` prevented the *lambda* from running before
`knownBookies` was set, the field itself was not `volatile`, leaving a
visibility gap for the executor thread.

Fix: swap the initialisation order so `knownBookies` is populated before
`watchBookieChanges()` is called, and declare the field `volatile` to
guarantee cross-thread visibility of the reference assignment.

## Changes

- `Auditor.knownBookies`: add `volatile`
- New fields `pendingWritableBookies`, `pendingReadOnlyBookies`
  (`AtomicReference<Set<String>>`): hold the latest bookie set from each
  watcher type
- New fields `writableAuditTaskQueued`, `readOnlyAuditTaskQueued`
  (`AtomicBoolean`): deduplication flags per change type
- `watchBookieChanges()`: populate the pending sets and call new
  `submitAuditTaskForBookieChange(boolean)` instead of `submitAuditTask()`
- `submitAuditTaskForBookieChange(boolean)`: queues a task only when no
  task of that type is already queued; clears the flag at task start
- `submitAuditTask()`: unchanged public/test API; now delegates to the
  extracted `runAuditTask()`
- `runAuditTask()`: extracted audit logic; uses watcher-captured bookie
  sets when available, falls back to `getAvailableBookies()`
- `start()`: initialise `knownBookies` before `watchBookieChanges()`

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…cation

The previous AtomicBoolean approach (flag cleared at task *start*) had a
subtle race: between the flag being cleared and `runAuditTask()` actually
reading `pendingWritableBookies`, a watcher callback could see `flag=false`
and submit a second concurrent task. Because the flag gives no information
about a task that is already *running*, the invariant "at most one queued
+ one in-progress" was not cleanly enforced.

Replace with an `AtomicInteger` counter (max 2) that is:
- incremented synchronously at submission time (inside the existing
  `synchronized` block that also protects the `isShutdown()` check)
- decremented in a `finally` block when the task *finishes*

This accurately counts all live tasks (running or queued) for the full
duration of their lifetime. When the counter reaches 2 (one in-progress +
one queued), further watcher callbacks are dropped — the queued task will
read the latest `pendingWritableBookies` / `pendingReadOnlyBookies` when it
eventually executes, so no additional task provides any benefit.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… absent

The previous fallback in runAuditTask() called getAvailableBookies() (both
writable + readonly ZK reads) whenever *either* pending snapshot was null.
This was wrong: writable and readonly watcher callbacks are independent and
fire separately, so one snapshot can be present while the other is not yet
populated (e.g. right after startup when only one type of change has
occurred).

Replace the single else-branch with four explicit cases:
- Both snapshots present: combine them (no ZK call, as before).
- Only writable snapshot present: use it + call admin.getReadOnlyBookies().
- Only readonly snapshot present: call admin.getAvailableBookies() + use it.
- Neither present (direct test calls via submitAuditTask()): call the full
  getAvailableBookies() as before.

This ensures we always use the watcher-captured data for the half that is
available and only make a targeted ZK round-trip for the missing half.
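A sketch of the four-way branch, with suppliers standing in for the targeted ZK reads (names are illustrative, not the real Auditor signatures):

```java
import java.util.HashSet;
import java.util.Set;
import java.util.function.Supplier;

// Hypothetical sketch: use whichever watcher snapshot is present and make a
// targeted read only for the missing half; read everything only when neither
// watcher has fired yet.
class FallbackSketch {
    static Set<String> resolveAvailable(Set<String> writableSnapshot,
                                        Set<String> readOnlySnapshot,
                                        Supplier<Set<String>> readWritable,
                                        Supplier<Set<String>> readReadOnly) {
        Set<String> result = new HashSet<>();
        if (writableSnapshot != null && readOnlySnapshot != null) {
            result.addAll(writableSnapshot);     // both snapshots: no ZK call
            result.addAll(readOnlySnapshot);
        } else if (writableSnapshot != null) {
            result.addAll(writableSnapshot);     // targeted read for the missing half
            result.addAll(readReadOnly.get());
        } else if (readOnlySnapshot != null) {
            result.addAll(readWritable.get());
            result.addAll(readOnlySnapshot);
        } else {
            result.addAll(readWritable.get());   // neither present: full read, as before
            result.addAll(readReadOnly.get());
        }
        return result;
    }
}
```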

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…knownBookies

submitAuditTask() was submitting runAuditTask(), which compares
pendingWritableBookies against knownBookies to detect lost bookies.
knownBookies is seeded from admin.getAllBookies() at auditor startup, so
any bookie that registers after the auditor started and dies before being
processed by runAuditTask() is never added to knownBookies and remains
invisible to the subtraction-based loss detection.

This caused testEmptyLedgerLosesQuorumEventually to fail when run after
testNoSuchLedgerExists: bookies 33105 and 36375 registered after auditor
42899 started, two watcher tasks were queued (one per registration), and
by the time they ran (after checkAllLedgers completed 112ms later) bookie
36375 had already died. pendingWritableBookies correctly reflected
{42899,33105} but since 36375 was never in knownBookies the loss went
undetected. The explicit submitAuditTask().get() call in the test suffered
the same gap.

Fix: submitAuditTask() now submits auditorBookieCheckTask.startAudit(false)
instead of runAuditTask(). The full auditBookies() scan uses ledger metadata
to find all bookies that own ledgers and checks their availability, detecting
36375 via the ledger it appears in regardless of knownBookies state.

The runAuditTask() / knownBookies mechanism is still correct and sufficient
for its intended purpose: detecting losses for bookies that were already
known. The register-and-die edge case is an inherent limitation of that
lightweight approach and is handled by the periodic auditorBookieCheckTask
and by explicit submitAuditTask() calls.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@lhotari lhotari marked this pull request as draft March 13, 2026 16:10

lhotari commented Mar 13, 2026

Although this fixes the flaky tests, there are multiple remaining race conditions in the auditor that could render it inoperative. For example, this doesn't look right for all cases:

```java
if (throwable != null) {
    if (firstRunFuture.isDone()) {
        scheduleWatchTask(ZK_CONNECT_BACKOFF_MS);
    } else {
        firstRunFuture.completeExceptionally(throwable);
    }
    return;
}
```

This is also surprising:
```java
watchReadOnlyBookiesTask = new WatchTask(bookieReadonlyRegistrationPath, f);
f = f.whenComplete((value, cause) -> {
    if (null != cause) {
        unwatchReadOnlyBookies(listener);
    }
});
```

If the first registration attempt fails, the watcher registration is silently dropped.
