
fix(storage): retry transient lbug WAL→checkpoint race in bulkLoad#161

Merged
theagenticguy merged 1 commit into main from fix/lbug-wal-checkpoint-retry on May 29, 2026

Conversation

@theagenticguy
Owner

Summary

Root-causes and fixes the macOS Verify Global Install "flake": codehub analyze intermittently exited non-zero on loaded CI runners (the failing leg varied run-to-run; it never failed on an idle box) while the graph itself built fine (the query smoke right after always passed).

Why it was undiagnosable

The analyze smoke ran codehub analyze "$FIXTURE" >/dev/null 2>&1 — so the actual error that set exit 1 never reached the CI log. That's the entire reason it read as a flake instead of a bug. Reproduced locally under contention (1/10 concurrent analyze runs failed) and captured the swallowed error:

IO exception: Error renaming file <db>.lbug.wal to <db>.lbug.wal.checkpoint.
ErrorMessage: No such file or directory

This is the native lbug binding's auto-checkpoint racing under CPU/IO pressure — the WAL→checkpoint rename fails even though the data is already durably in the WAL (a reopen recovers it). Single-process, isolated .codehub/ dir; it's timing, not cross-process contention. Unretried, it bubbles to the CLI's top-level parseAsync().catch() and fails analyze with exit 1.

Fix (two parts)

1. Retry the transient error (packages/storage/src/graphdb-adapter.ts)
GraphDbStore.bulkLoad now wraps the load in retryTransientCheckpoint — up to 3 attempts, 25/50ms backoff, gated on isTransientCheckpointError. replace-mode bulkLoad is idempotent (truncate-then-insert fully replaces), so retry is safe. The matcher keys on the stable token trio (renaming + .wal + checkpoint), not the OS-specific errno suffix. Every non-transient error rethrows immediately — no broadening of swallowed failures.

2. Make the smoke diagnosable (scripts/verify-global-install.sh)
The analyze smoke now captures combined output to a temp log and prints the tail (with rc) on failure, so a future non-zero exit shows the real cause instead of being silently swallowed. This is the durable fix — the flake was invisible only because the harness hid it.
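
For readers without the diff open, part 1 can be sketched roughly as follows. This is a minimal illustration, not the code in graphdb-adapter.ts: the two function names come from the description above, but the bodies, the `RetryOptions` shape, and the injectable `sleep` are assumptions made for the example.

```typescript
// Hypothetical sketch: names mirror the PR text, bodies are assumed.

/** True when the message carries the renaming + .wal + checkpoint token trio. */
function isTransientCheckpointError(err: unknown): boolean {
  if (err === null || err === undefined) return false;
  const message = err instanceof Error ? err.message : String(err);
  const lower = message.toLowerCase();
  // Key on the stable tokens only; the errno suffix differs per OS.
  return (
    lower.includes("renaming") &&
    lower.includes(".wal") &&
    lower.includes("checkpoint")
  );
}

interface RetryOptions {
  maxAttempts?: number; // assumed default: 3
  backoffMs?: readonly number[]; // assumed default: [25, 50]
  sleep?: (ms: number) => Promise<void>; // injectable so tests need no real delay
}

const defaultSleep = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms));

async function retryTransientCheckpoint<T>(
  fn: () => Promise<T>,
  options: RetryOptions = {},
): Promise<T> {
  const maxAttempts = options.maxAttempts ?? 3;
  const backoffMs = options.backoffMs ?? [25, 50];
  const sleep = options.sleep ?? defaultSleep;

  for (let attempt = 1; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      // Non-transient errors and the final attempt rethrow unchanged.
      if (!isTransientCheckpointError(err) || attempt >= maxAttempts) throw err;
      const delay = backoffMs[Math.min(attempt - 1, backoffMs.length - 1)] ?? 0;
      await sleep(delay);
    }
  }
}
```

Because the wrapped function is passed in, the retry policy can be exercised with a stub that throws the captured lbug message a fixed number of times, which matches the "injected fn" approach described under Verification.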

Verification

  • Reproduced the race (1/10 concurrent runs) pre-fix; 48/48 concurrent runs clean post-fix.
  • storage 159/159 — 11 new tests: matcher pos/neg (real lbug string, errno variants, unrelated IO, generic checkpoint, constraint error, null/undefined) + retry policy (recover-after-N, first-try success, exhaustion-rethrow, non-transient-immediate-rethrow). Retry path proven directly via injected fn, not by provoking a native race.
  • ingestion 602/602; cli 256/256; typecheck + biome clean; bash syntax OK.

Test plan

  • transient error recovered within maxAttempts
  • non-transient error rethrows without retry
  • exhaustion rethrows after N attempts
  • smoke prints analyze tail on failure

Root-causes the macOS Verify-Global-Install "flake": `codehub analyze`
intermittently exited non-zero on loaded CI runners (varying legs, never on an
idle box) while the graph itself built fine (query worked after). The smoke
ran analyze with `>/dev/null 2>&1`, so the real error never reached the log —
which is why it read as an undiagnosable flake.

Reproduced under contention (1/10 concurrent analyze runs) and captured the
swallowed error:

  IO exception: Error renaming file <db>.lbug.wal to <db>.lbug.wal.checkpoint.
  ErrorMessage: No such file or directory

This is the native lbug binding's auto-checkpoint racing under CPU/IO
pressure. The data is already durably in the WAL (a reopen recovers it), so
the failure is a flaky teardown artifact — but unretried it bubbles to the
CLI's top-level catch and fails analyze with exit 1.

Fix (two parts):
- storage: `GraphDbStore.bulkLoad` now wraps the load in
  `retryTransientCheckpoint` — up to 3 attempts with 25/50ms backoff, gated
  on `isTransientCheckpointError` (matches the stable renaming+.wal+checkpoint
  token trio, not the OS-specific errno). replace-mode bulkLoad is idempotent
  (truncate-then-insert), so retry is safe. Every non-transient error rethrows
  immediately. Both helpers are exported + unit-tested (recover-after-N,
  exhaustion-rethrow, non-transient-immediate-rethrow, matcher pos/neg cases).
- verify-global-install.sh: the analyze smoke now captures combined output to
  a temp log and prints the tail on failure (rc included), so a future
  non-zero exit is diagnosable instead of silently swallowed.

storage 159/159; ingestion 602/602; cli 256/256.
@theagenticguy theagenticguy merged commit 450714c into main May 29, 2026
43 of 45 checks passed
@theagenticguy theagenticguy deleted the fix/lbug-wal-checkpoint-retry branch May 29, 2026 12:55
@github-actions github-actions Bot mentioned this pull request May 29, 2026
theagenticguy added a commit that referenced this pull request May 29, 2026
## Summary

Fixes the intermittent **Volta macOS leg** failure in Verify Global
Install — gate 2 (GHCR/postinstall fetch) + gate 4 (install > 60s
budget) — that persisted on `main` even after node-pty was removed from
the dependency graph.

## Root cause (pinned, not guessed)

**No OpenCodeHub package depends on node-pty anymore** — the dep was
removed in the graphty-Leiden vendoring (#157). Verified:
- `grep` across all `packages/*/package.json` → 0 references
- main's `pnpm-lock.yaml` → 0 occurrences
- packed `opencodehub-ingestion-0.4.3.tgz` → graphty ABSENT, ships
vendored `graphty-leiden.js`, no node-pty in deps

Yet Volta's `npm install -g` still fetched `node-pty-prebuilt-multiarch`
from GitHub releases. The tell: **arm64-nvm passed gate 2 on the SAME
run** while Volta failed it. The script installed into whatever global
prefix the node manager provided, and **Volta persists its global
package dir across runs** on the hosted runner. A node-pty left behind
by a pre-removal run re-ran its `prebuild-install` GHCR fetch on the
next `npm install -g` — and bloated install time to 75-95s (vs 25-50s on
the clean legs). It's cached cross-run runner state, not the dependency
graph.

## Fix

Install into a fresh `mktemp -d` prefix per cell (`npm_config_prefix` +
`PATH` prepend), removed on the existing `EXIT` trap. Each cell is now
**hermetic** — the gates see only what *this* run's tarballs actually
pull, immune to whatever a prior run left in a manager-managed global
dir.
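
The isolation idea can be restated in TypeScript as a rough sketch. The real harness is a bash script using `mktemp -d` and an `EXIT` trap; the helper names `isolatedNpmEnv` and `withIsolatedPrefix` below are hypothetical, and the `<prefix>/bin` layout assumes a POSIX runner.

```typescript
import { mkdtempSync, rmSync } from "node:fs";
import { tmpdir } from "node:os";
import { delimiter, join } from "node:path";

type Env = Record<string, string | undefined>;

/** Returns a copy of `baseEnv` that points npm's global prefix at `prefix`. */
function isolatedNpmEnv(baseEnv: Env, prefix: string): Env {
  // On POSIX, npm links global executables into <prefix>/bin.
  const binDir = join(prefix, "bin");
  const existingPath = baseEnv.PATH ?? "";
  return {
    ...baseEnv,
    npm_config_prefix: prefix,
    PATH: existingPath ? `${binDir}${delimiter}${existingPath}` : binDir,
  };
}

/** Creates a fresh prefix, runs `run` against it, and always removes it. */
function withIsolatedPrefix<T>(run: (env: Env, prefix: string) => T): T {
  // Directory name is an assumption modelled on the log line in Verification.
  const prefix = mkdtempSync(join(tmpdir(), "verify-global-install-prefix."));
  try {
    return run(isolatedNpmEnv(process.env, prefix), prefix);
  } finally {
    // Plays the role of the EXIT trap: cleanup runs on success and on throw.
    rmSync(prefix, { recursive: true, force: true });
  }
}
```

The returned environment would be handed to whatever spawns `npm install -g`, so nothing left in a manager-owned global directory by a previous run can take part in the install.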

## Verification

Ran the harness locally end-to-end (`bash
scripts/verify-global-install.sh local` — packs all 17 workspace
tarballs, global-installs into the isolated prefix, runs all gates):

```
isolated npm global prefix: /var/folders/.../verify-global-install-prefix.XXX
install exit=0 duration=12s
[PASS] gate 1  [PASS] gate 2 (zero GHCR fetches)  [PASS] gate 3  [PASS] gate 4 (12s)  [PASS] gate 5
[PASS] smoke: analyze  [PASS] smoke: query  [PASS] smoke: --version  [PASS] smoke: --help
passed=9 failed=0
```

## Context

Third of a small flake-elimination set, all from the same
Verify-Global-Install investigation:
- #161 (merged) — lbug WAL→checkpoint retry (fixed the `analyze`-smoke
flake)
- this PR — hermetic prefix (fixes the Volta gate-2/gate-4 cached-state
flake)

Together these make the macOS legs deterministic. (Verify Global Install
is not yet a required check; this is the work to make it green enough to
opt in.)

## Test plan
- [x] harness 9/9 locally, gate 2 clean, isolated prefix created +
removed
- [x] bash syntax OK; EXIT-trap cleanup guarded for early-exit
theagenticguy pushed a commit that referenced this pull request May 29, 2026
🤖 Automated release via release-please
---


<details><summary>analysis: 0.3.2</summary>

## [0.3.2](analysis-v0.3.1...analysis-v0.3.2) (2026-05-29)


### Bug Fixes

* **deps:** downgrade write-file-atomic 8.0.0→7.0.1 to match supported
node range
([#155](#155))
([a723e53](a723e53))


### Dependencies

* The following workspace dependencies were updated
  * dependencies
    * @opencodehub/storage bumped to 0.2.2
    * @opencodehub/wiki bumped to 0.2.2
</details>

<details><summary>cli: 0.5.4</summary>

## [0.5.4](cli-v0.5.3...cli-v0.5.4) (2026-05-29)


### Features

* **cli:** doctor checks vendored wasm grammars + scip indexers
(--strict)
([#159](#159))
([36a241e](36a241e))


### Bug Fixes

* **deps:** downgrade write-file-atomic 8.0.0→7.0.1 to match supported
node range
([#155](#155))
([a723e53](a723e53))
* **scanners:** correct scanner exit-code handling and stop duplicate
skip logs
([#156](#156))
([5d30eb4](5d30eb4))


### Dependencies

* The following workspace dependencies were updated
  * dependencies
    * @opencodehub/analysis bumped to 0.3.2
    * @opencodehub/ingestion bumped to 0.4.4
    * @opencodehub/mcp bumped to 0.4.3
    * @opencodehub/pack bumped to 0.2.3
    * @opencodehub/scanners bumped to 0.2.1
    * @opencodehub/search bumped to 0.2.2
    * @opencodehub/storage bumped to 0.2.2
    * @opencodehub/wiki bumped to 0.2.2
</details>

<details><summary>cobol-proleap: 0.1.8</summary>

## [0.1.8](cobol-proleap-v0.1.7...cobol-proleap-v0.1.8) (2026-05-29)


### Dependencies

* The following workspace dependencies were updated
  * dependencies
    * @opencodehub/ingestion bumped to 0.4.4
</details>

<details><summary>ingestion: 0.4.4</summary>

## [0.4.4](ingestion-v0.4.3...ingestion-v0.4.4) (2026-05-29)


### Bug Fixes

* **deps:** downgrade write-file-atomic 8.0.0→7.0.1 to match supported
node range
([#155](#155))
([a723e53](a723e53))
* **ingestion:** vendor graphty Leiden to drop node-pty install fetch
([#157](#157))
([790ca4e](790ca4e))


### Dependencies

* The following workspace dependencies were updated
  * dependencies
    * @opencodehub/analysis bumped to 0.3.2
    * @opencodehub/scip-ingest bumped to 0.2.4
    * @opencodehub/storage bumped to 0.2.2
</details>

<details><summary>mcp: 0.4.3</summary>

## [0.4.3](mcp-v0.4.2...mcp-v0.4.3) (2026-05-29)


### Dependencies

* The following workspace dependencies were updated
  * dependencies
    * @opencodehub/analysis bumped to 0.3.2
    * @opencodehub/pack bumped to 0.2.3
    * @opencodehub/scanners bumped to 0.2.1
    * @opencodehub/search bumped to 0.2.2
    * @opencodehub/storage bumped to 0.2.2
</details>

<details><summary>pack: 0.2.3</summary>

## [0.2.3](pack-v0.2.2...pack-v0.2.3) (2026-05-29)


### Dependencies

* The following workspace dependencies were updated
  * dependencies
    * @opencodehub/analysis bumped to 0.3.2
    * @opencodehub/ingestion bumped to 0.4.4
    * @opencodehub/storage bumped to 0.2.2
</details>

<details><summary>scanners: 0.2.1</summary>

## [0.2.1](scanners-v0.2.0...scanners-v0.2.1) (2026-05-29)


### Bug Fixes

* **scanners:** correct scanner exit-code handling and stop duplicate
skip logs
([#156](#156))
([5d30eb4](5d30eb4))
</details>

<details><summary>scip-ingest: 0.2.4</summary>

## [0.2.4](scip-ingest-v0.2.3...scip-ingest-v0.2.4) (2026-05-29)


### Bug Fixes

* **scanners:** correct scanner exit-code handling and stop duplicate
skip logs
([#156](#156))
([5d30eb4](5d30eb4))
* **scip-ingest:** prepend ~/.codehub/bin to indexer spawn PATH
([#160](#160))
([4418db9](4418db9))


### Dependencies

* The following workspace dependencies were updated
  * dependencies
    * @opencodehub/analysis bumped to 0.3.2
</details>

<details><summary>search: 0.2.2</summary>

## [0.2.2](search-v0.2.1...search-v0.2.2) (2026-05-29)


### Dependencies

* The following workspace dependencies were updated
  * dependencies
    * @opencodehub/storage bumped to 0.2.2
</details>

<details><summary>storage: 0.2.2</summary>

## [0.2.2](storage-v0.2.1...storage-v0.2.2) (2026-05-29)


### Bug Fixes

* **storage:** retry transient lbug WAL→checkpoint race in bulkLoad
([#161](#161))
([450714c](450714c))
</details>

<details><summary>wiki: 0.2.2</summary>

## [0.2.2](wiki-v0.2.1...wiki-v0.2.2) (2026-05-29)


### Bug Fixes

* **deps:** downgrade write-file-atomic 8.0.0→7.0.1 to match supported
node range
([#155](#155))
([a723e53](a723e53))


### Dependencies

* The following workspace dependencies were updated
  * dependencies
    * @opencodehub/storage bumped to 0.2.2
</details>

<details><summary>root: 0.6.5</summary>

## [0.6.5](root-v0.6.4...root-v0.6.5) (2026-05-29)


### Features

* **cli:** doctor checks vendored wasm grammars + scip indexers
(--strict)
([#159](#159))
([36a241e](36a241e))


### Bug Fixes

* **ci:** isolate verify-global-install into a per-run npm prefix
([#162](#162))
([3b59373](3b59373))
* **deps:** bump qs 6.15.1→6.15.2 and tmp 0.2.4→0.2.6 to clear osv
findings
([#151](#151))
([2f798ec](2f798ec))
* **deps:** downgrade write-file-atomic 8.0.0→7.0.1 to match supported
node range
([#155](#155))
([a723e53](a723e53))
* **ingestion:** vendor graphty Leiden to drop node-pty install fetch
([#157](#157))
([790ca4e](790ca4e))
* **scanners:** correct scanner exit-code handling and stop duplicate
skip logs
([#156](#156))
([5d30eb4](5d30eb4))
* **scip-ingest:** prepend ~/.codehub/bin to indexer spawn PATH
([#160](#160))
([4418db9](4418db9))
* **storage:** retry transient lbug WAL→checkpoint race in bulkLoad
([#161](#161))
([450714c](450714c))
</details>

---
This PR was generated with [Release
Please](https://github.com/googleapis/release-please). See
[documentation](https://github.com/googleapis/release-please#release-please).

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>