
[FIX] Fix race conditions in Auditor bookie change detection#4728

Draft
lhotari wants to merge 5 commits into apache:master from lhotari:lh-fix-auditor-race-condition

Conversation


@lhotari lhotari commented Mar 13, 2026

Motivation

While investigating CI flakiness in BookieAutoRecoveryTest (PR #4711), where tests waiting up to 60 s for ledgers to be marked as underreplicated would time out, several related race conditions were found in Auditor that could allow a bookie departure to go undetected until the next periodic audit run (default: 86400 s, i.e. once a day).

Analysis of the surefire logs for the failing CI run showed a ~70-second silence from the auditor after a bookie was killed, followed by a new auditor election that eventually caught the departure. This pointed to the auditor's bookie-change detection path rather than a test-setup issue.

Races fixed

Race 1 – watcher-provided bookie set was discarded

watchBookieChanges() registered `bookies -> submitAuditTask()` as the ZK listener, ignoring the current bookie set already delivered by the callback. Inside submitAuditTask(), a fresh getAvailableBookies() ZK call was made from an executor lambda that could execute well after the watcher fired, for example while AuditorCheckAllLedgersTask was occupying the single-threaded executor. In that window the two snapshots of the cluster could diverge, causing bookiesToBeAudited to remain empty and the departure to be silently skipped.

Fix: capture pendingWritableBookies / pendingReadOnlyBookies atomically in each watcher callback and consume them in the extracted runAuditTask(). Fall back to a live getAvailableBookies() call only when one watcher type has not yet fired (to avoid an unnecessary ZK round-trip for the already-known half).
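As a rough sketch of the capture-and-consume pattern (field and method names mirror the PR description; the listener wiring and the read-only half are omitted, and this is an illustration rather than the actual patch):

```java
import java.util.Set;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch: the watcher callback stores an immutable snapshot of the
// bookie set it was handed, and the audit task later reads that snapshot instead
// of making a fresh ZK call that could see a diverged view of the cluster.
class BookieChangeSnapshots {
    private final AtomicReference<Set<String>> pendingWritableBookies = new AtomicReference<>();

    // Invoked by the registration-client listener with the current writable bookies.
    void onWritableBookiesChanged(Set<String> bookies) {
        pendingWritableBookies.set(Set.copyOf(bookies)); // capture atomically
    }

    // Read by the audit task; null means this watcher type has not fired yet,
    // in which case the caller falls back to a live getAvailableBookies() call.
    Set<String> writableSnapshot() {
        return pendingWritableBookies.get();
    }
}
```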

Race 2 – duplicate audit tasks could be queued per change type

Rapid successive watcher callbacks (or a slow executor) could enqueue many identical audit tasks, each making its own ZK calls and risking interleaved mutations to knownBookies / bookiesToBeAudited.

Fix: replace the original AtomicBoolean queued-flag with an AtomicInteger counter (writableAuditTaskCount, readOnlyAuditTaskCount). The counter tracks running + queued tasks per change type and is capped at MAX_AUDIT_TASKS_PER_TYPE = 2 (one in-progress + one queued). The queued task always reads the latest pendingWritableBookies / pendingReadOnlyBookies when it starts, so queuing more than one additional task would only duplicate work without improving correctness.
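A minimal sketch of the capped-counter idea, assuming a single-threaded executor like the one the Auditor uses (the real code additionally guards shutdown inside a synchronized block; names here are illustrative):

```java
import java.util.concurrent.Executor;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of the per-change-type task cap described above.
class CappedAuditSubmitter {
    static final int MAX_AUDIT_TASKS_PER_TYPE = 2; // one in-progress + one queued
    private final AtomicInteger writableAuditTaskCount = new AtomicInteger();

    // Returns true if the task was accepted; extra watcher callbacks are dropped,
    // because the already-queued task will read the latest pending snapshot anyway.
    boolean trySubmit(Runnable auditTask, Executor executor) {
        int current;
        do {
            current = writableAuditTaskCount.get();
            if (current >= MAX_AUDIT_TASKS_PER_TYPE) {
                return false;
            }
        } while (!writableAuditTaskCount.compareAndSet(current, current + 1));
        executor.execute(() -> {
            try {
                auditTask.run();
            } finally {
                writableAuditTaskCount.decrementAndGet(); // count covers running + queued
            }
        });
        return true;
    }
}
```

Decrementing in a `finally` block when the task finishes (rather than clearing a flag when it starts) is what lets the counter track the full running-plus-queued lifetime.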

Race 3 – knownBookies initialised after watcher registration

start() called watchBookieChanges() before assigning knownBookies. A watcher callback firing immediately could race with the subsequent knownBookies = admin.getAllBookies() assignment.

Fix: swap the initialisation order so knownBookies is populated before watchBookieChanges() is called.

Race 4 – submitAuditTask() could not detect bookies that registered and died after auditor startup

submitAuditTask() (used by tests to explicitly trigger and wait for an audit) submitted runAuditTask(), which detects bookie losses by comparing pendingWritableBookies against knownBookies. knownBookies is seeded from admin.getAllBookies() at auditor startup, so any bookie that registered after the auditor started and died before being processed by runAuditTask() was never added to knownBookies and was invisible to the subtraction-based detection — regardless of the task cap.
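The subtraction-based detection and its blind spot can be illustrated with a minimal sketch (bookie IDs and names are invented for the example; this is not the real Auditor code):

```java
import java.util.HashSet;
import java.util.Set;

// Minimal illustration of subtraction-based loss detection: lost = known \ available.
class LossDetectionSketch {
    static Set<String> detectLost(Set<String> knownBookies, Set<String> availableBookies) {
        Set<String> lost = new HashSet<>(knownBookies);
        lost.removeAll(availableBookies);
        return lost;
    }
}
```

A bookie that registers after knownBookies is seeded and dies before the next audit task runs never enters knownBookies, so it can never appear in the result of the subtraction, which is exactly the gap described above.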

This caused testEmptyLedgerLosesQuorumEventually to fail when run after testNoSuchLedgerExists: bookies 33105 and 36375 registered after auditor 42899 started; two watcher tasks were queued and were both blocked behind checkAllLedgers (112 ms); bookie 36375 died during that window; by the time the tasks ran, pendingWritableBookies correctly reflected {42899, 33105} but since 36375 was never in knownBookies the loss went undetected. Increasing MAX_AUDIT_TASKS_PER_TYPE does not help here — all tasks read pendingWritableBookies.get() when they execute, so they always see the latest snapshot (without 36375), and no amount of additional queued tasks changes that.

Fix: submitAuditTask() now submits auditorBookieCheckTask.startAudit(false) — the full ledger-metadata-based audit — rather than runAuditTask(). The full scan detects any bookie that appears in ledger ensembles but is no longer available, regardless of knownBookies state.

The runAuditTask() / knownBookies mechanism remains correct and sufficient for its intended purpose: detecting losses for bookies that were already known. The register-and-die edge case is an inherent limitation of the lightweight watcher-based approach and is handled by the periodic auditorBookieCheckTask and by explicit submitAuditTask() calls.

Changes summary

| What | Change |
| --- | --- |
| `Auditor.knownBookies` | Add `volatile`; initialise before `watchBookieChanges()` |
| `pendingWritableBookies`, `pendingReadOnlyBookies` | New `AtomicReference<Set<String>>` fields holding the latest bookie set per watcher type |
| `writableAuditTaskCount`, `readOnlyAuditTaskCount` | New `AtomicInteger` counters (running + queued per type, capped at 2) replacing the original `AtomicBoolean` flags |
| `watchBookieChanges()` | Populate the pending sets and call `submitAuditTaskForBookieChange(boolean)` |
| `submitAuditTaskForBookieChange(boolean)` | Queues `runAuditTask()` only when the per-type count allows |
| `submitAuditTask()` | Now submits `auditorBookieCheckTask.startAudit(false)` (full audit via ledger metadata) instead of `runAuditTask()` |
| `runAuditTask()` | Uses watcher-captured bookie sets; falls back to live ZK only when one watcher type is absent |
| `start()` | Initialise `knownBookies` before calling `watchBookieChanges()` |

Testing

Existing tests in BookieAutoRecoveryTest exercise all affected paths and continue to pass. The previously flaky test (testEmptyLedgerLosesQuorumEventually) now passes reliably when run as part of the full test class.

lhotari and others added 5 commits March 13, 2026 11:35
## Problem

Investigating CI flakiness in `BookieAutoRecoveryTest` revealed three
related race conditions in `Auditor` that could cause a bookie departure
to go undetected until the next periodic audit run (default: 86400 s),
causing tests that wait up to 60 s for underreplicated-ledger marking to
time out.

### Race 1 – watcher-provided bookie set was discarded

`watchBookieChanges()` passed `bookies -> submitAuditTask()` as the
listener, silently ignoring the current bookie set delivered by the ZK
watcher.  Inside `submitAuditTask()` a fresh `getAvailableBookies()` ZK
call was made from an executor lambda that could run well after the
watcher fired (e.g. while `checkAllLedgers` was occupying the
single-threaded executor).  In that window the two views of the cluster
could diverge, causing `bookiesToBeAudited` to end up empty and the
departure to be silently skipped.

Fix: capture `pendingWritableBookies` / `pendingReadOnlyBookies`
atomically in the watcher callback and read them in `runAuditTask()`,
falling back to `getAvailableBookies()` only when the watchers have not
yet fired (e.g. direct test calls via `submitAuditTask()`).

### Race 2 – duplicate audit tasks could be queued per change type

Rapid successive watcher callbacks (or a slow executor) could queue many
identical audit tasks, each making its own ZK calls, wasting resources
and risking interleaved state updates to `knownBookies` /
`bookiesToBeAudited`.

Fix: separate `AtomicBoolean` queued-flags (`writableAuditTaskQueued`,
`readOnlyAuditTaskQueued`) ensure at most one task is *queued* per change
type at a time.  The flag is cleared at the *start* of the task (not at
submission), so watcher callbacks that arrive while a task is already
running queue exactly one follow-up task rather than being silently
dropped.

### Race 3 – knownBookies initialised after watcher registration

`start()` called `watchBookieChanges()` before assigning `knownBookies`.
A watcher callback firing immediately (the ZK registration client fires
listeners synchronously with the current bookie set when one is already
cached) could race with the subsequent `knownBookies = admin.getAllBookies()`
assignment.  Although the existing `synchronized` on `start()` and
`submitAuditTask()` prevented the *lambda* from running before
`knownBookies` was set, the field itself was not `volatile`, leaving a
visibility gap for the executor thread.

Fix: swap the initialisation order so `knownBookies` is populated before
`watchBookieChanges()` is called, and declare the field `volatile` to
guarantee cross-thread visibility of the reference assignment.

## Changes

- `Auditor.knownBookies`: add `volatile`
- New fields `pendingWritableBookies`, `pendingReadOnlyBookies`
  (`AtomicReference<Set<String>>`): hold the latest bookie set from each
  watcher type
- New fields `writableAuditTaskQueued`, `readOnlyAuditTaskQueued`
  (`AtomicBoolean`): deduplication flags per change type
- `watchBookieChanges()`: populate the pending sets and call new
  `submitAuditTaskForBookieChange(boolean)` instead of `submitAuditTask()`
- `submitAuditTaskForBookieChange(boolean)`: queues a task only when no
  task of that type is already queued; clears the flag at task start
- `submitAuditTask()`: unchanged public/test API; now delegates to the
  extracted `runAuditTask()`
- `runAuditTask()`: extracted audit logic; uses watcher-captured bookie
  sets when available, falls back to `getAvailableBookies()`
- `start()`: initialise `knownBookies` before `watchBookieChanges()`

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…cation

The previous AtomicBoolean approach (flag cleared at task *start*) had a
subtle race: between the flag being cleared and `runAuditTask()` actually
reading `pendingWritableBookies`, a watcher callback could see `flag=false`
and submit a second concurrent task. Because the flag gives no information
about a task that is already *running*, the invariant "at most one queued
+ one in-progress" was not cleanly enforced.

Replace with an `AtomicInteger` counter (max 2) that is:
- incremented synchronously at submission time (inside the existing
  `synchronized` block that also protects the `isShutdown()` check)
- decremented in a `finally` block when the task *finishes*

This accurately counts all live tasks (running or queued) for the full
duration of their lifetime. When the counter reaches 2 (one in-progress +
one queued), further watcher callbacks are dropped — the queued task will
read the latest `pendingWritableBookies` / `pendingReadOnlyBookies` when it
eventually executes, so no additional task provides any benefit.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… absent

The previous fallback in runAuditTask() called getAvailableBookies() (both
writable + readonly ZK reads) whenever *either* pending snapshot was null.
This was wrong: writable and readonly watcher callbacks are independent and
fire separately, so one snapshot can be present while the other is not yet
populated (e.g. right after startup when only one type of change has
occurred).

Replace the single else-branch with four explicit cases:
- Both snapshots present: combine them (no ZK call, as before).
- Only writable snapshot present: use it + call admin.getReadOnlyBookies().
- Only readonly snapshot present: call admin.getAvailableBookies() + use it.
- Neither present (direct test calls via submitAuditTask()): call the full
  getAvailableBookies() as before.

This ensures we always use the watcher-captured data for the half that is
available and only make a targeted ZK round-trip for the missing half.
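A sketch of the four-way branch, with suppliers standing in for the targeted ZK reads (names are illustrative, not the real Auditor signatures):

```java
import java.util.HashSet;
import java.util.Set;
import java.util.function.Supplier;

// Hypothetical sketch: use whichever watcher snapshot is present and make a
// targeted read only for the missing half; read everything only when neither
// watcher has fired yet.
class FallbackSketch {
    static Set<String> resolveAvailable(Set<String> writableSnapshot,
                                        Set<String> readOnlySnapshot,
                                        Supplier<Set<String>> readWritable,
                                        Supplier<Set<String>> readReadOnly) {
        Set<String> result = new HashSet<>();
        if (writableSnapshot != null && readOnlySnapshot != null) {
            result.addAll(writableSnapshot);     // both snapshots: no ZK call
            result.addAll(readOnlySnapshot);
        } else if (writableSnapshot != null) {
            result.addAll(writableSnapshot);     // targeted read for the missing half
            result.addAll(readReadOnly.get());
        } else if (readOnlySnapshot != null) {
            result.addAll(readWritable.get());
            result.addAll(readOnlySnapshot);
        } else {
            result.addAll(readWritable.get());   // neither present: full read, as before
            result.addAll(readReadOnly.get());
        }
        return result;
    }
}
```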

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…knownBookies

submitAuditTask() was submitting runAuditTask(), which compares
pendingWritableBookies against knownBookies to detect lost bookies.
knownBookies is seeded from admin.getAllBookies() at auditor startup, so
any bookie that registers after the auditor started and dies before being
processed by runAuditTask() is never added to knownBookies and remains
invisible to the subtraction-based loss detection.

This caused testEmptyLedgerLosesQuorumEventually to fail when run after
testNoSuchLedgerExists: bookies 33105 and 36375 registered after auditor
42899 started, two watcher tasks were queued (one per registration), and
by the time they ran (after checkAllLedgers completed 112ms later) bookie
36375 had already died. pendingWritableBookies correctly reflected
{42899,33105} but since 36375 was never in knownBookies the loss went
undetected. The explicit submitAuditTask().get() call in the test suffered
the same gap.

Fix: submitAuditTask() now submits auditorBookieCheckTask.startAudit(false)
instead of runAuditTask(). The full auditBookies() scan uses ledger metadata
to find all bookies that own ledgers and checks their availability, detecting
36375 via the ledger it appears in regardless of knownBookies state.

The runAuditTask() / knownBookies mechanism is still correct and sufficient
for its intended purpose: detecting losses for bookies that were already
known. The register-and-die edge case is an inherent limitation of that
lightweight approach and is handled by the periodic auditorBookieCheckTask
and by explicit submitAuditTask() calls.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@lhotari lhotari marked this pull request as draft March 13, 2026 16:10

lhotari commented Mar 13, 2026

Although this fixes the flaky tests, there are multiple remaining race conditions in the auditor that could render it inoperative. For example, this doesn't look right for all cases:

```java
if (throwable != null) {
    if (firstRunFuture.isDone()) {
        scheduleWatchTask(ZK_CONNECT_BACKOFF_MS);
    } else {
        firstRunFuture.completeExceptionally(throwable);
    }
    return;
}
```

This is also surprising:
```java
watchReadOnlyBookiesTask = new WatchTask(bookieReadonlyRegistrationPath, f);
f = f.whenComplete((value, cause) -> {
    if (null != cause) {
        unwatchReadOnlyBookies(listener);
    }
});
```

If the first registration attempt fails, the watcher registration is silently dropped.
