
Reset LSPS5 persistence_in_flight counter on persist errors #4597

Open
tnull wants to merge 2 commits into lightningdevkit:main from tnull:2026-05-lsps5-persist-counter-leak

Conversation

@tnull
Contributor

@tnull tnull commented May 5, 2026

`LSPS5ServiceHandler::persist` incremented `persistence_in_flight` at the top as a single-runner gate, but only decremented it on the success path: each interior `?` on a `kv_store` future propagated the error out of the function while leaving the counter at >= 1. After one transient I/O failure (disk full, brief unavailability of a remote `KVStore`, EPERM, etc.) every subsequent `persist()` call hit the `fetch_add > 0` short-circuit and silently returned `Ok(false)`.

The in-memory `needs_persist` flags then continued to grow without ever reaching disk, so webhook state, removals, and notification cooldowns were lost on the next process restart — including the spec-mandated webhook retention/pruning state — without any error surfaced to the operator. The counter is monotonic, so recovery required a process restart.

Adopt the LSPS1 / LSPS2 pattern: split the body into an inner `do_persist` and an outer `persist` that unconditionally clears the counter via `store(0)` after the call returns, regardless of outcome. A failed write now still propagates `Err`, but the next `persist()` attempt actually retries the write instead of no-op'ing.
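For illustration, a minimal, self-contained sketch of the two shapes (synchronous and heavily simplified; the struct, field, and function names below are stand-ins, not the actual `LSPS5ServiceHandler` code):

```rust
use std::io;
use std::sync::atomic::{AtomicUsize, Ordering};

// Minimal stand-in for the handler: one atomic gate plus a write that can fail.
struct Handler<W: Fn() -> io::Result<()>> {
	persistence_in_flight: AtomicUsize,
	write: W,
}

impl<W: Fn() -> io::Result<()>> Handler<W> {
	// Pre-fix shape: the gate is bumped up front, but `?` returns early on a
	// failed write, so the decrement never runs and every later call
	// short-circuits to `Ok(false)`.
	fn persist_leaky(&self) -> io::Result<bool> {
		if self.persistence_in_flight.fetch_add(1, Ordering::AcqRel) > 0 {
			return Ok(false);
		}
		(self.write)()?; // error escapes here, leaving the counter at >= 1
		self.persistence_in_flight.fetch_sub(1, Ordering::AcqRel);
		Ok(true)
	}

	// Post-fix shape as described above: the fallible body lives in
	// `do_persist`, and the outer method clears the gate regardless of outcome,
	// so the next call actually retries the write.
	fn persist_fixed(&self) -> io::Result<bool> {
		if self.persistence_in_flight.fetch_add(1, Ordering::AcqRel) > 0 {
			return Ok(false);
		}
		let res = self.do_persist();
		self.persistence_in_flight.store(0, Ordering::Release);
		res.map(|()| true)
	}

	fn do_persist(&self) -> io::Result<()> {
		(self.write)()
	}
}

fn failing_write() -> io::Result<()> {
	Err(io::Error::new(io::ErrorKind::Other, "disk full"))
}

fn main() {
	let h = Handler { persistence_in_flight: AtomicUsize::new(0), write: failing_write };
	assert!(h.persist_leaky().is_err()); // first call fails and leaks the counter
	assert_eq!(h.persist_leaky().unwrap(), false); // stuck: silently no-ops from now on
}
```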

Co-Authored-By: HAL 9000

@ldk-reviews-bot

ldk-reviews-bot commented May 5, 2026

👋 Thanks for assigning @TheBlueMatt as a reviewer!
I'll wait for their review and will help manage the review process.
Once they submit their review, I'll check if a second reviewer would be helpful.

@ldk-claude-review-bot
Collaborator

ldk-claude-review-bot commented May 5, 2026

I've now thoroughly reviewed the entire PR diff, comparing the LSPS5 implementation against the established LSPS1/LSPS2 patterns. The fix is functionally correct - the counter is properly reset on all paths.

No issues found.

The code correctly ensures `persistence_in_flight` is reset to 0 on error (via explicit `store(0)`) and on success (via `fetch_sub` bringing it to 0). The structural placement of the loop differs from LSPS1/LSPS2 (loop in `persist` vs `do_persist`) but the semantics are equivalent. The test properly validates the bug scenario using a `FailableKVStore` that selectively fails LSPS5 namespace writes.
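A rough sketch of that selective-failure idea, written against a toy write-only trait defined inline rather than the real LDK `KVStore` interface (all names here are illustrative):

```rust
use std::io;
use std::sync::atomic::{AtomicBool, Ordering};

// Toy write-only store interface, standing in for the real KVStore trait.
trait KvWrite {
	fn write(&self, primary_namespace: &str, key: &str, buf: &[u8]) -> io::Result<()>;
}

// Test wrapper: forwards to an inner store, but can be flipped to fail any
// write that targets one chosen namespace (e.g. the LSPS5 one), leaving all
// other persistence untouched.
struct FailableStore<S: KvWrite> {
	inner: S,
	fail_namespace: String,
	failing: AtomicBool,
}

impl<S: KvWrite> KvWrite for FailableStore<S> {
	fn write(&self, primary_namespace: &str, key: &str, buf: &[u8]) -> io::Result<()> {
		if self.failing.load(Ordering::Acquire) && primary_namespace == self.fail_namespace {
			return Err(io::Error::new(io::ErrorKind::Other, "injected write failure"));
		}
		self.inner.write(primary_namespace, key, buf)
	}
}
```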

@ldk-reviews-bot ldk-reviews-bot requested a review from joostjager May 5, 2026 20:25

```rust
let res = self.do_persist().await;
debug_assert!(res.is_err() || self.persistence_in_flight.load(Ordering::Acquire) == 0);
self.persistence_in_flight.store(0, Ordering::Release);
```
Collaborator

No, this races with a second writer that is started at the same time. We should move the loop out of the inner method and into this method to control the flag entirely in this method.
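A minimal sketch of the suggested shape, with the gate and the retry loop both owned by the outer method (same simplified, synchronous stand-in as above; not the actual code or fixup):

```rust
use std::io;
use std::sync::atomic::{AtomicUsize, Ordering};

// Same simplified stand-in as in the PR description sketch.
struct Handler<W: Fn() -> io::Result<()>> {
	persistence_in_flight: AtomicUsize,
	write: W,
}

impl<W: Fn() -> io::Result<()>> Handler<W> {
	// The flag is controlled entirely here: after each write we check whether a
	// concurrent caller bumped the counter in the meantime and, if so, persist
	// again, instead of wiping their increment with a blanket `store(0)`.
	fn persist(&self) -> io::Result<()> {
		if self.persistence_in_flight.fetch_add(1, Ordering::AcqRel) > 0 {
			// Whoever is persisting right now will observe our increment and loop.
			return Ok(());
		}
		loop {
			let res = (self.write)();
			if res.is_err() {
				// One possible error policy (the actual fixup may differ): clear the
				// gate so the next persist() call retries, and surface the error.
				self.persistence_in_flight.store(0, Ordering::Release);
				return res;
			}
			if self.persistence_in_flight.fetch_sub(1, Ordering::AcqRel) == 1 {
				// No one raced in while we were writing; all state is on disk.
				return Ok(());
			}
			// Counter was > 1: another caller requested persistence mid-write, so
			// go around and write again.
		}
	}
}
```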

Contributor Author

Added a fixup.

@codecov

codecov Bot commented May 5, 2026

Codecov Report

❌ Patch coverage is 87.50000% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 86.28%. Comparing base (1a26867) to head (fe04494).
⚠️ Report is 13 commits behind head on main.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| lightning-liquidity/src/lsps5/service.rs | 87.50% | 1 Missing ⚠️ |
Additional details and impacted files
```diff
@@            Coverage Diff             @@
##             main    #4597      +/-   ##
==========================================
- Coverage   86.84%   86.28%   -0.56%     
==========================================
  Files         161      159       -2     
  Lines      109260   109275      +15     
  Branches   109260   109275      +15     
==========================================
- Hits        94882    94292     -590     
- Misses      11797    12377     +580     
- Partials     2581     2606      +25     
```
| Flag | Coverage Δ |
| --- | --- |
| fuzzing-fake-hashes | ? |
| fuzzing-real-hashes | ? |
| tests | 86.28% <87.50%> (+0.06%) ⬆️ |

Flags with carried forward coverage won't be shown.

☔ View full report in Codecov by Sentry.

@TheBlueMatt TheBlueMatt removed the request for review from joostjager May 6, 2026 01:05
@tnull tnull requested a review from TheBlueMatt May 6, 2026 09:27