fix(storage): retry transient lbug WAL→checkpoint race in bulkLoad#161
Merged
Conversation
Root-causes the macOS Verify-Global-Install "flake": `codehub analyze` intermittently exited non-zero on loaded CI runners (varying legs, never on an idle box) while the graph itself built fine (query worked after). The smoke ran analyze with `>/dev/null 2>&1`, so the real error never reached the log — which is why it read as an undiagnosable flake. Reproduced under contention (1/10 concurrent analyze runs) and captured the swallowed error: IO exception: Error renaming file <db>.lbug.wal to <db>.lbug.wal.checkpoint. ErrorMessage: No such file or directory This is the native lbug binding's auto-checkpoint racing under CPU/IO pressure. The data is already durably in the WAL (a reopen recovers it), so the failure is a flaky teardown artifact — but unretried it bubbles to the CLI's top-level catch and fails analyze with exit 1. Fix (two parts): - storage: `GraphDbStore.bulkLoad` now wraps the load in `retryTransientCheckpoint` — up to 3 attempts with 25/50ms backoff, gated on `isTransientCheckpointError` (matches the stable renaming+.wal+checkpoint token trio, not the OS-specific errno). replace-mode bulkLoad is idempotent (truncate-then-insert), so retry is safe. Every non-transient error rethrows immediately. Both helpers are exported + unit-tested (recover-after-N, exhaustion-rethrow, non-transient-immediate-rethrow, matcher pos/neg cases). - verify-global-install.sh: the analyze smoke now captures combined output to a temp log and prints the tail on failure (rc included), so a future non-zero exit is diagnosable instead of silently swallowed. storage 159/159; ingestion 602/602; cli 256/256.
Merged
2 tasks
theagenticguy
added a commit
that referenced
this pull request
May 29, 2026
## Summary Fixes the intermittent **Volta macOS leg** failure in Verify Global Install — gate 2 (GHCR/postinstall fetch) + gate 4 (install > 60s budget) — that persisted on `main` even after node-pty was removed from the dependency graph. ## Root cause (pinned, not guessed) **No OpenCodeHub package depends on node-pty anymore** — the dep was removed in the graphty-Leiden vendoring (#157). Verified: - `grep` across all `packages/*/package.json` → 0 references - main's `pnpm-lock.yaml` → 0 occurrences - packed `opencodehub-ingestion-0.4.3.tgz` → graphty ABSENT, ships vendored `graphty-leiden.js`, no node-pty in deps Yet Volta's `npm install -g` still fetched `node-pty-prebuilt-multiarch` from GitHub releases. The tell: **arm64-nvm passed gate 2 on the SAME run** while Volta failed it. The script installed into whatever global prefix the node manager provided, and **Volta persists its global package dir across runs** on the hosted runner. A node-pty left behind by a pre-removal run re-ran its `prebuild-install` GHCR fetch on the next `npm install -g` — and bloated install time to 75-95s (vs 25-50s on the clean legs). It's cached cross-run runner state, not the dependency graph. ## Fix Install into a fresh `mktemp -d` prefix per cell (`npm_config_prefix` + `PATH` prepend), removed on the existing `EXIT` trap. Each cell is now **hermetic** — the gates see only what *this* run's tarballs actually pull, immune to whatever a prior run left in a manager-managed global dir. ## Verification Ran the harness locally end-to-end (`bash scripts/verify-global-install.sh local` — packs all 17 workspace tarballs, global-installs into the isolated prefix, runs all gates): ``` isolated npm global prefix: /var/folders/.../verify-global-install-prefix.XXX install exit=0 duration=12s [PASS] gate 1 [PASS] gate 2 (zero GHCR fetches) [PASS] gate 3 [PASS] gate 4 (12s) [PASS] gate 5 [PASS] smoke: analyze [PASS] smoke: query [PASS] smoke: --version [PASS] smoke: --help passed=9 failed=0 ``` ## Context Third of a small flake-elimination set, all from the same Verify-Global-Install investigation: - #161 (merged) — lbug WAL→checkpoint retry (fixed the `analyze`-smoke flake) - this PR — hermetic prefix (fixes the Volta gate-2/gate-4 cached-state flake) Together these make the macOS legs deterministic. (Verify Global Install is not yet a required check; this is the work to make it green enough to opt in.) ## Test plan - [x] harness 9/9 locally, gate 2 clean, isolated prefix created + removed - [x] bash syntax OK; EXIT-trap cleanup guarded for early-exit
theagenticguy
pushed a commit
that referenced
this pull request
May 29, 2026
🤖 Automated release via release-please --- <details><summary>analysis: 0.3.2</summary> ## [0.3.2](analysis-v0.3.1...analysis-v0.3.2) (2026-05-29) ### Bug Fixes * **deps:** downgrade write-file-atomic 8.0.0→7.0.1 to match supported node range ([#155](#155)) ([a723e53](a723e53)) ### Dependencies * The following workspace dependencies were updated * dependencies * @opencodehub/storage bumped to 0.2.2 * @opencodehub/wiki bumped to 0.2.2 </details> <details><summary>cli: 0.5.4</summary> ## [0.5.4](cli-v0.5.3...cli-v0.5.4) (2026-05-29) ### Features * **cli:** doctor checks vendored wasm grammars + scip indexers (--strict) ([#159](#159)) ([36a241e](36a241e)) ### Bug Fixes * **deps:** downgrade write-file-atomic 8.0.0→7.0.1 to match supported node range ([#155](#155)) ([a723e53](a723e53)) * **scanners:** correct scanner exit-code handling and stop duplicate skip logs ([#156](#156)) ([5d30eb4](5d30eb4)) ### Dependencies * The following workspace dependencies were updated * dependencies * @opencodehub/analysis bumped to 0.3.2 * @opencodehub/ingestion bumped to 0.4.4 * @opencodehub/mcp bumped to 0.4.3 * @opencodehub/pack bumped to 0.2.3 * @opencodehub/scanners bumped to 0.2.1 * @opencodehub/search bumped to 0.2.2 * @opencodehub/storage bumped to 0.2.2 * @opencodehub/wiki bumped to 0.2.2 </details> <details><summary>cobol-proleap: 0.1.8</summary> ## [0.1.8](cobol-proleap-v0.1.7...cobol-proleap-v0.1.8) (2026-05-29) ### Dependencies * The following workspace dependencies were updated * dependencies * @opencodehub/ingestion bumped to 0.4.4 </details> <details><summary>ingestion: 0.4.4</summary> ## [0.4.4](ingestion-v0.4.3...ingestion-v0.4.4) (2026-05-29) ### Bug Fixes * **deps:** downgrade write-file-atomic 8.0.0→7.0.1 to match supported node range ([#155](#155)) ([a723e53](a723e53)) * **ingestion:** vendor graphty Leiden to drop node-pty install fetch ([#157](#157)) ([790ca4e](790ca4e)) ### Dependencies * The following workspace dependencies were updated * dependencies * @opencodehub/analysis bumped to 0.3.2 * @opencodehub/scip-ingest bumped to 0.2.4 * @opencodehub/storage bumped to 0.2.2 </details> <details><summary>mcp: 0.4.3</summary> ## [0.4.3](mcp-v0.4.2...mcp-v0.4.3) (2026-05-29) ### Dependencies * The following workspace dependencies were updated * dependencies * @opencodehub/analysis bumped to 0.3.2 * @opencodehub/pack bumped to 0.2.3 * @opencodehub/scanners bumped to 0.2.1 * @opencodehub/search bumped to 0.2.2 * @opencodehub/storage bumped to 0.2.2 </details> <details><summary>pack: 0.2.3</summary> ## [0.2.3](pack-v0.2.2...pack-v0.2.3) (2026-05-29) ### Dependencies * The following workspace dependencies were updated * dependencies * @opencodehub/analysis bumped to 0.3.2 * @opencodehub/ingestion bumped to 0.4.4 * @opencodehub/storage bumped to 0.2.2 </details> <details><summary>scanners: 0.2.1</summary> ## [0.2.1](scanners-v0.2.0...scanners-v0.2.1) (2026-05-29) ### Bug Fixes * **scanners:** correct scanner exit-code handling and stop duplicate skip logs ([#156](#156)) ([5d30eb4](5d30eb4)) </details> <details><summary>scip-ingest: 0.2.4</summary> ## [0.2.4](scip-ingest-v0.2.3...scip-ingest-v0.2.4) (2026-05-29) ### Bug Fixes * **scanners:** correct scanner exit-code handling and stop duplicate skip logs ([#156](#156)) ([5d30eb4](5d30eb4)) * **scip-ingest:** prepend ~/.codehub/bin to indexer spawn PATH ([#160](#160)) ([4418db9](4418db9)) ### Dependencies * The following workspace dependencies were updated * dependencies * @opencodehub/analysis bumped to 0.3.2 </details> <details><summary>search: 0.2.2</summary> ## [0.2.2](search-v0.2.1...search-v0.2.2) (2026-05-29) ### Dependencies * The following workspace dependencies were updated * dependencies * @opencodehub/storage bumped to 0.2.2 </details> <details><summary>storage: 0.2.2</summary> ## [0.2.2](storage-v0.2.1...storage-v0.2.2) (2026-05-29) ### Bug Fixes * **storage:** retry transient lbug WAL→checkpoint race in bulkLoad ([#161](#161)) ([450714c](450714c)) </details> <details><summary>wiki: 0.2.2</summary> ## [0.2.2](wiki-v0.2.1...wiki-v0.2.2) (2026-05-29) ### Bug Fixes * **deps:** downgrade write-file-atomic 8.0.0→7.0.1 to match supported node range ([#155](#155)) ([a723e53](a723e53)) ### Dependencies * The following workspace dependencies were updated * dependencies * @opencodehub/storage bumped to 0.2.2 </details> <details><summary>root: 0.6.5</summary> ## [0.6.5](root-v0.6.4...root-v0.6.5) (2026-05-29) ### Features * **cli:** doctor checks vendored wasm grammars + scip indexers (--strict) ([#159](#159)) ([36a241e](36a241e)) ### Bug Fixes * **ci:** isolate verify-global-install into a per-run npm prefix ([#162](#162)) ([3b59373](3b59373)) * **deps:** bump qs 6.15.1→6.15.2 and tmp 0.2.4→0.2.6 to clear osv findings ([#151](#151)) ([2f798ec](2f798ec)) * **deps:** downgrade write-file-atomic 8.0.0→7.0.1 to match supported node range ([#155](#155)) ([a723e53](a723e53)) * **ingestion:** vendor graphty Leiden to drop node-pty install fetch ([#157](#157)) ([790ca4e](790ca4e)) * **scanners:** correct scanner exit-code handling and stop duplicate skip logs ([#156](#156)) ([5d30eb4](5d30eb4)) * **scip-ingest:** prepend ~/.codehub/bin to indexer spawn PATH ([#160](#160)) ([4418db9](4418db9)) * **storage:** retry transient lbug WAL→checkpoint race in bulkLoad ([#161](#161)) ([450714c](450714c)) </details> --- This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please). Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Root-causes and fixes the macOS
Verify Global Install"flake" —codehub analyzeintermittently exited non-zero on loaded CI runners (the failing leg varied run-to-run; never failed on an idle box) while the graph itself built fine (thequerysmoke right after always passed).Why it was undiagnosable
The analyze smoke ran
codehub analyze "$FIXTURE" >/dev/null 2>&1— so the actual error that set exit 1 never reached the CI log. That's the entire reason it read as a flake instead of a bug. Reproduced locally under contention (1/10 concurrent analyze runs failed) and captured the swallowed error:This is the native lbug binding's auto-checkpoint racing under CPU/IO pressure — the WAL→checkpoint rename fails even though the data is already durably in the WAL (a reopen recovers it). Single-process, isolated
.codehub/dir; it's timing, not cross-process contention. Unretried, it bubbles to the CLI's top-levelparseAsync().catch()and fails analyze with exit 1.Fix (two parts)
1. Retry the transient error (
packages/storage/src/graphdb-adapter.ts)GraphDbStore.bulkLoadnow wraps the load inretryTransientCheckpoint— up to 3 attempts, 25/50ms backoff, gated onisTransientCheckpointError. replace-mode bulkLoad is idempotent (truncate-then-insert fully replaces), so retry is safe. The matcher keys on the stable token trio (renaming+.wal+checkpoint), not the OS-specific errno suffix. Every non-transient error rethrows immediately — no broadening of swallowed failures.2. Make the smoke diagnosable (
scripts/verify-global-install.sh)The analyze smoke now captures combined output to a temp log and prints the tail (with
rc) on failure, so a future non-zero exit shows the real cause instead of being silently swallowed. This is the durable fix — the flake was invisible only because the harness hid it.Verification
Test plan