
Adding a design document for delta-based collection metadata. #485

Open

bootjp wants to merge 22 commits into main from feature/add-list-delta-design-doc

Conversation

@bootjp (Owner) commented Apr 11, 2026

No description provided.

Contributor

gemini-code-assist bot left a comment

Code Review

This pull request introduces a new design for list metadata using a Delta and Claim pattern to eliminate write-write conflicts during high-concurrency operations like RPUSH and LPOP. The design shifts from a Read-Modify-Write model on a single metadata key to appending unique delta entries and using CAS-based claim keys for item removal, supported by a background compaction worker. The review feedback highlights critical improvements needed for key layout to prevent prefix collisions, strategies for handling delta scan limits to ensure data integrity, and the necessity of bounding transactions and migration plans to ensure operational stability.

Comment on lines +51 to +62
func ListMetaDeltaKey(userKey []byte, commitTS uint64, seqInTxn uint32) []byte {
	buf := make([]byte, 0, len(ListMetaDeltaPrefix)+len(userKey)+8+4)
	buf = append(buf, ListMetaDeltaPrefix...)
	buf = append(buf, userKey...)
	var ts [8]byte
	binary.BigEndian.PutUint64(ts[:], commitTS)
	buf = append(buf, ts[:]...)
	var seq [4]byte
	binary.BigEndian.PutUint32(seq[:], seqInTxn)
	buf = append(buf, seq[:]...)
	return buf
}
Contributor

high

The Delta key layout is susceptible to prefix collisions. For example, a prefix scan for userKey = "foo" will incorrectly include keys for userKey = "foobar". To prevent this, the userKey should be followed by a separator (e.g., a null byte) or its length should be encoded before the fixed-length suffix.

Suggested change

func ListMetaDeltaKey(userKey []byte, commitTS uint64, seqInTxn uint32) []byte {
	buf := make([]byte, 0, len(ListMetaDeltaPrefix)+len(userKey)+8+4)
	buf = append(buf, ListMetaDeltaPrefix...)
	buf = append(buf, userKey...)
	var ts [8]byte
	binary.BigEndian.PutUint64(ts[:], commitTS)
	buf = append(buf, ts[:]...)
	var seq [4]byte
	binary.BigEndian.PutUint32(seq[:], seqInTxn)
	buf = append(buf, seq[:]...)
	return buf
}

func ListMetaDeltaKey(userKey []byte, commitTS uint64, seqInTxn uint32) []byte {
	buf := make([]byte, 0, len(ListMetaDeltaPrefix)+len(userKey)+1+8+4)
	buf = append(buf, ListMetaDeltaPrefix...)
	buf = append(buf, userKey...)
	buf = append(buf, 0) // Separator to prevent prefix collisions
	var ts [8]byte
	binary.BigEndian.PutUint64(ts[:], commitTS)
	buf = append(buf, ts[:]...)
	var seq [4]byte
	binary.BigEndian.PutUint32(seq[:], seqInTxn)
	buf = append(buf, seq[:]...)
	return buf
}

Comment on lines +196 to +204
func ListClaimKey(userKey []byte, seq int64) []byte {
	var raw [8]byte
	encodeSortableInt64(raw[:], seq)
	buf := make([]byte, 0, len(ListClaimPrefix)+len(userKey)+8)
	buf = append(buf, ListClaimPrefix...)
	buf = append(buf, userKey...)
	buf = append(buf, raw[:]...)
	return buf
}
Contributor

high

The Claim key layout suffers from the same prefix collision risk as the Delta keys. Disambiguate the userKey from the sequence suffix using a separator or length encoding to ensure that operations on one key do not affect another key that shares its prefix.

Suggested change

func ListClaimKey(userKey []byte, seq int64) []byte {
	var raw [8]byte
	encodeSortableInt64(raw[:], seq)
	buf := make([]byte, 0, len(ListClaimPrefix)+len(userKey)+8)
	buf = append(buf, ListClaimPrefix...)
	buf = append(buf, userKey...)
	buf = append(buf, raw[:]...)
	return buf
}

func ListClaimKey(userKey []byte, seq int64) []byte {
	var raw [8]byte
	encodeSortableInt64(raw[:], seq)
	buf := make([]byte, 0, len(ListClaimPrefix)+len(userKey)+1+8)
	buf = append(buf, ListClaimPrefix...)
	buf = append(buf, userKey...)
	buf = append(buf, 0) // Separator
	buf = append(buf, raw[:]...)
	return buf
}

@@ -0,0 +1,384 @@
# List Metadata Delta Design
Contributor

high

The design document should detail potential mitigation strategies for rolling upgrades, such as implementing a temporary 'bridge' or 'proxy' mode, to address the operational risk of transitioning to the new Delta layout.

References
  1. When a design document identifies a significant operational risk, such as the inability to perform rolling upgrades, it must also detail potential mitigation strategies, like implementing a temporary "bridge" or "proxy" mode.

@@ -0,0 +1,384 @@
# List Metadata Delta Design
Contributor

high

The plan for migrating to the new metadata layout must consider and detail strategies for live or zero-downtime cutovers (e.g., blue/green, dual-write proxy, or logical replication) to avoid service interruption during the transition.

References
  1. When designing a production-grade system migration, the plan must consider and detail strategies for live or zero-downtime cutovers (e.g., blue/green, dual-write proxy, logical replication) to avoid service interruption.


// 2. Fetch Deltas via prefix scan
prefix := ListMetaDeltaScanPrefix(userKey)
deltas, err := r.store.ScanAt(ctx, prefix, prefixEnd(prefix), maxDeltaScanLimit, readTS)
Contributor

medium

If the number of unapplied deltas exceeds maxDeltaScanLimit, resolveListMeta will return an incorrect ListMeta due to incomplete aggregation. The design should specify a strategy for this case, such as returning an error or triggering an immediate compaction, to ensure data integrity.

Comment on lines +213 to +216
3. Loop:
a. Check for Claim key at !lst|claim|<key><candidateSeq>
b. If exists: candidateSeq++ and retry (Already claimed by another POP)
c. If not exists:
Contributor

medium

The POP claim flow uses a linear search (Exists calls) to find the next available sequence. In high-churn scenarios with many uncompacted claims, this results in O(N) latency. Consider using a prefix scan with a limit to find the first unclaimed sequence more efficiently.
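The reviewer's suggestion can be sketched as follows, with hypothetical names throughout (`firstUnclaimed`, `scanWindow`, `claimedInWindow`): a single bounded prefix scan materializes the claimed sequences in the window, and the first gap is then found in memory rather than via one `Exists` round-trip per candidate.

```go
package main

import "fmt"

// firstUnclaimed picks the LPOP candidate after one bounded prefix scan.
// claimedInWindow stands in for the scan result over Claim keys in
// [head, head+scanWindow); the first gap is found in memory.
func firstUnclaimed(head, scanWindow int64, claimedInWindow map[int64]bool) (int64, bool) {
	for seq := head; seq < head+scanWindow; seq++ {
		if !claimedInWindow[seq] {
			return seq, true
		}
	}
	return 0, false // window exhausted: widen the window or trigger compaction
}

func main() {
	claimed := map[int64]bool{10: true, 11: true, 13: true}
	seq, ok := firstUnclaimed(10, 32, claimed)
	fmt.Println(seq, ok) // 12 true: the first gap after the claimed run
}
```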

2. Claim keys with a sequence less than Head are no longer needed (Head has already passed them).
3. Within the compaction transaction:
- Advance the base meta Head by the number of claimed items.
- Delete corresponding Claim and Item keys.
Contributor

medium

Deleting all Claim and Item keys in a single compaction transaction may create excessively large transactions for lists with high pop volume. It is recommended to bound the number of deletions per transaction to avoid Raft proposal timeouts or LSM performance issues.
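A minimal sketch of the suggested bounding, with hypothetical names (`planBatches`; `maxPerTx` stands in for a `maxKeysPerCompactionTx` knob): deletions are split into fixed-size batches, each committed in its own transaction.

```go
package main

import "fmt"

// planBatches splits nKeys pending deletions into batches of at most maxPerTx,
// so each compaction transaction stays well under the Raft proposal size limit.
func planBatches(nKeys, maxPerTx int) []int {
	var batches []int
	for nKeys > 0 {
		b := nKeys
		if b > maxPerTx {
			b = maxPerTx
		}
		batches = append(batches, b)
		nKeys -= b
	}
	return batches
}

func main() {
	fmt.Println(planBatches(600, 256)) // [256 256 88]: three bounded transactions
}
```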


Performing synchronous compaction during a write could cause write conflicts on the base metadata for the compaction transaction itself, introducing retries to what should be a conflict-free `PUSH` path. Read latency degradation due to Delta accumulation will be managed by tuning `scanInterval` and `maxDeltaCount` for Background Compaction.

If Delta accumulation becomes exceptionally high, a warning log will be emitted on the read side to allow operators to adjust compaction parameters.
Contributor

medium

Relying only on warning logs for high delta accumulation is risky. Following the general rule of applying fixed bounds to collections, the system should enforce a hard limit on unapplied deltas. When reached, the system should either reject new writes or force a synchronous compaction to protect read performance.

References
  1. To prevent unbounded memory growth and potential OOM issues, apply a fixed bound to collections that can grow from external requests, such as pending configuration changes. Reject new requests when the bound is reached.
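The recommended policy can be sketched as a three-tier gate; all names and thresholds here are illustrative, not the design's final values:

```go
package main

import "fmt"

// checkDeltaBudget gates a write on the number of unapplied deltas:
// below the soft limit, proceed; between soft and hard, proceed but schedule
// background compaction; at the hard limit, compact synchronously or reject.
func checkDeltaBudget(unapplied, soft, hard int) string {
	switch {
	case unapplied >= hard:
		return "force-compact-or-reject"
	case unapplied >= soft:
		return "schedule-compaction"
	default:
		return "ok"
	}
}

func main() {
	fmt.Println(checkDeltaBudget(10, 64, 256))  // ok
	fmt.Println(checkDeltaBudget(100, 64, 256)) // schedule-compaction
	fmt.Println(checkDeltaBudget(256, 64, 256)) // force-compact-or-reject
}
```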

Contributor

Copilot AI left a comment

Pull request overview

Adds a design document describing a “delta + claim key” approach for list metadata updates to reduce write conflicts in Redis list operations within the KV/MVCC storage model.

Changes:

  • Document new key layout for list metadata deltas (!lst|meta|d|...) and claim keys (!lst|claim|...).
  • Describe write/read paths using read-time delta aggregation and conflict-free delta writes.
  • Propose a background compactor phase to fold deltas (and GC claims) into base list metadata.

Comment on lines +44 to +47
- `commitTS` is an 8-byte big-endian timestamp determined at Raft apply time.
- `seqInTxn` is a 4-byte big-endian sequence number within the same transaction (needed if `LPUSH` is called multiple times for the same key in one `MULTI/EXEC`).
- Since all Delta keys for a `userKey` share the prefix `!lst|meta|d|<userKey>`, they are physically contiguous in the LSM tree, allowing for fast Prefix Scans.

Copilot AI Apr 11, 2026

The design says commitTS is determined at Raft apply time, but in the current architecture commitTS is resolved by the coordinator during dispatch (see kv.OperationGroup.CommitTS). Since the Delta key embeds commitTS, the doc should clarify how the write path obtains it before generating the key (e.g., explicitly pin CommitTS up front, or have the FSM rewrite Delta keys at apply time).

Suggested change
- `commitTS` is an 8-byte big-endian timestamp determined at Raft apply time.
- `seqInTxn` is a 4-byte big-endian sequence number within the same transaction (needed if `LPUSH` is called multiple times for the same key in one `MULTI/EXEC`).
- Since all Delta keys for a `userKey` share the prefix `!lst|meta|d|<userKey>`, they are physically contiguous in the LSM tree, allowing for fast Prefix Scans.
- `commitTS` is an 8-byte big-endian timestamp pinned by the coordinator before the Delta key is generated (for example on `kv.OperationGroup.CommitTS` during dispatch), then carried through Raft and used unchanged at apply time.
- `seqInTxn` is a 4-byte big-endian sequence number within the same transaction (needed if `LPUSH` is called multiple times for the same key in one `MULTI/EXEC`).
- Since all Delta keys for a `userKey` share the prefix `!lst|meta|d|<userKey>`, they are physically contiguous in the LSM tree, allowing for fast Prefix Scans.
Because the Delta key embeds `commitTS`, the write path must know the final timestamp before emitting the key bytes. This design therefore assumes `CommitTS` is explicitly allocated once during dispatch and reused during Raft apply; it does not rely on the FSM rewriting Delta keys at apply time.

Comment on lines +110 to +119

// 2. Fetch Deltas via prefix scan
prefix := ListMetaDeltaScanPrefix(userKey)
deltas, err := r.store.ScanAt(ctx, prefix, prefixEnd(prefix), maxDeltaScanLimit, readTS)

// 3. Aggregate
for _, d := range deltas {
	delta := UnmarshalListMetaDelta(d.Value)
	baseMeta.Head += delta.HeadDelta
	baseMeta.Len += delta.LenDelta
Copilot AI Apr 11, 2026

The resolveListMeta() code block reads like concrete Go code but currently drops error handling (err is ignored after both loadListMetaAt and ScanAt) and references prefixEnd, while the codebase helper is prefixScanEnd (e.g., adapter/prefix_scan.go). Either add the missing error checks + correct helper name, or mark the block explicitly as simplified pseudocode to avoid an implementer copying a broken version.

Suggested change

// 2. Fetch Deltas via prefix scan
prefix := ListMetaDeltaScanPrefix(userKey)
deltas, err := r.store.ScanAt(ctx, prefix, prefixEnd(prefix), maxDeltaScanLimit, readTS)
// 3. Aggregate
for _, d := range deltas {
	delta := UnmarshalListMetaDelta(d.Value)
	baseMeta.Head += delta.HeadDelta
	baseMeta.Len += delta.LenDelta

if err != nil {
	return ListMeta{}, false, err
}
// 2. Fetch Deltas via prefix scan
prefix := ListMetaDeltaScanPrefix(userKey)
deltas, err := r.store.ScanAt(ctx, prefix, prefixScanEnd(prefix), maxDeltaScanLimit, readTS)
if err != nil {
	return ListMeta{}, false, err
}
// 3. Aggregate
for _, d := range deltas {
	delta := UnmarshalListMetaDelta(d.Value)
	baseMeta.Head += delta.HeadDelta
	baseMeta.Len += delta.LenDelta

Comment on lines +245 to +246
Accumulated Claim keys do not affect read performance (they use the `!lst|claim|` prefix and are outside the metadata scan scope). GC is handled by the Background Compactor.

Copilot AI Apr 11, 2026

The statement that accumulated Claim keys “do not affect read performance” conflicts with the later plan to “Skip claimed items” by checking Claim keys during fetchListRange() (which would add extra reads/lookups). Please clarify the intended read-time strategy (e.g., range-scan Claim keys in bulk, embed claim status elsewhere, or accept the additional read amplification).

Suggested change
Accumulated Claim keys do not affect read performance (they use the `!lst|claim|` prefix and are outside the metadata scan scope). GC is handled by the Background Compactor.
Read-time strategy:
- Claim keys are outside the `!lst|meta|` namespace, so they do not affect the metadata-only read path.
- However, `fetchListRange()` must skip logically deleted items. To do that, it should **range-scan Claim keys in bulk** for the candidate sequence interval being materialized, then filter claimed sequences in memory while assembling the result.
- This means Claim keys do introduce bounded read amplification for list reads, but it is **one additional range scan per fetched window**, not one extra point lookup per item.
- Background Compaction keeps this bounded by deleting Claim keys whose sequence is below the effective Head and by collapsing old deltas.
So the intended guarantee is: accumulated Claim keys do not affect metadata scans, but they can affect `fetchListRange()` until compaction removes obsolete claims.

Comment on lines +170 to +172
- Scan the entire `!lst|meta|d|` prefix every `scanInterval`.
- If the number of Deltas for a `userKey` exceeds `maxDeltaCount`, mark it for compaction.
- Compaction is performed as a transaction (`IsTxn: true`), protecting the base metadata read via the `readSet` (using OCC to prevent concurrent compaction conflicts).
Copilot AI Apr 11, 2026

“Scan the entire !lst|meta|d| prefix every scanInterval” has unbounded cost as Delta volume grows (O(total deltas) work every interval), which could become a background read hotspot. Consider documenting a more scalable trigger mechanism (e.g., incremental cursored scan, per-shard partitioning, maintaining per-list delta counters, or sampling) so operators don’t face a sudden compactor-induced load spike at scale.
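One way to sketch the cursored alternative (`scanTick`, `budget`, and the in-memory key slice are illustrative stand-ins for a persisted cursor over the real keyspace): each tick resumes where the last one stopped and visits at most `budget` keys, so per-interval I/O stays bounded regardless of total Delta volume.

```go
package main

import "fmt"

// scanTick visits at most budget keys per compactor tick, resuming from a
// cursor and wrapping at the end of the prefix, so per-interval cost is
// O(budget) rather than O(total deltas). keys models the sorted Delta keyspace.
func scanTick(keys []string, cursor, budget int) (visited []string, next int) {
	if len(keys) == 0 {
		return nil, 0
	}
	for i := 0; i < budget; i++ {
		visited = append(visited, keys[cursor])
		cursor = (cursor + 1) % len(keys)
	}
	return visited, cursor
}

func main() {
	keys := []string{"k1", "k2", "k3", "k4", "k5"}
	v1, cur := scanTick(keys, 0, 2)
	v2, _ := scanTick(keys, cur, 2)
	fmt.Println(v1, v2) // [k1 k2] [k3 k4]: each tick does bounded work
}
```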

@bootjp (Owner, Author) commented Apr 12, 2026

@copilot apply changes based on the comments in this thread

Contributor

Copilot AI commented Apr 12, 2026

@copilot apply changes based on the comments in this thread

Applied all four review comments in commit 69f4220:

  1. commitTS (lines 44-47): Updated description to clarify it is pinned by the coordinator via kv.OperationGroup.CommitTS during dispatch and carried through Raft unchanged, with a note that the write path must know the final timestamp before generating Delta key bytes.
  2. resolveListMeta() (lines 110-119): Added missing error handling after loadListMetaAt and ScanAt, fixed prefixEnd → prefixScanEnd (matching adapter/prefix_scan.go), and added a pseudocode disclaimer comment.
  3. Claim key read performance (lines 245-246): Replaced the misleading statement with a detailed read-time strategy section explaining that Claim keys are transparent to metadata scans but add one bulk range scan per fetchListRange() call until compaction removes them.
  4. Compaction scalability (lines 170-172): Replaced the unbounded full-prefix scan with a cursor-based incremental scan (bounded per-tick I/O) plus per-list Delta counters for prioritizing hot lists.

@bootjp (Owner, Author) commented Apr 12, 2026

@copilot apply changes based on the comments in #485 (review)

Contributor

Copilot AI commented Apr 12, 2026

@copilot apply changes based on the comments in #485 (review)

Applied all changes from gemini review 4093919615 in commit cf57136:

  1. Delta key prefix collision (high): Added null-byte separator (\x00) after userKey in ListMetaDeltaKey and updated key layout diagram. ExtractListUserKeyFromDelta suffix length updated 12 → 13.
  2. Claim key prefix collision (high): Same null-byte separator in ListClaimKey and Claim key layout. ExtractListUserKeyFromClaim suffix length updated 8 → 9.
  3. Rolling upgrade strategies (high): Added Phase 6 with three options — feature flag (recommended), blue/green, and dual-write proxy.
  4. maxDeltaScanLimit overflow (medium): Documented that resolveListMeta must return an error when the scan result is truncated, triggering immediate compaction.
  5. POP claim flow efficiency (medium): Replaced O(N) point-lookup loop with bulk prefix scan + in-memory skip-set; scanWindow defined (default: 32).
  6. Bounded compaction transactions (medium): Added maxKeysPerCompactionTx limit (suggested default: 256) with multi-pass scheduling.
  7. Hard limit on unapplied deltas (medium): Clarified the three parameters — maxDeltaCount (soft limit, 64), maxDeltaHardLimit (256), and maxDeltaScanLimit (= maxDeltaHardLimit) — with hard-limit fallback to synchronous compaction.

@bootjp (Owner, Author) commented Apr 12, 2026

/gemini review

Contributor

gemini-code-assist bot left a comment

Code Review

This pull request introduces a design for conflict-free list operations using a Delta pattern for metadata updates and a Claim mechanism for concurrent pops. The review feedback highlights several critical areas for refinement: ensuring PUSH operations clear stale claim keys to prevent list corruption, ensuring the compactor only advances the list Head through contiguous claimed sequences to avoid data loss, and explicitly defining the RPOP flow and garbage collection logic for the Tail of the list to support symmetric operations and prevent resource leaks.

1. Read !lst|meta|<key> ← Necessary (for seq calculation), but NOT registered in readSet
2. Scan !lst|meta|d|<key> ← Read unapplied deltas to recalculate head/len
3. Put !lst|itm|<key><seq> ... ← Item write (unique key)
4. Put !lst|meta|d|<key><commitTS><seqInTxn> ← Delta write (unique key)
Contributor

high

The PUSH flow should explicitly include a deletion of the claim key for the sequence being occupied. Since sequences are recycled after POP operations, a stale claim key might exist for the sequence being reused. If not removed during the PUSH operation, a subsequent POP will see the stale claim and incorrectly skip the item, leading to orphaned items and list corruption. Additionally, as this design involves a production-grade system migration, the plan must detail strategies for live or zero-downtime cutovers (e.g., dual-write proxy) to avoid service interruption.

References
  1. When designing a production-grade system migration, the plan must consider and detail strategies for live or zero-downtime cutovers (e.g., blue/green, dual-write proxy, logical replication) to avoid service interruption.

3. Within the compaction transaction (bounded to at most `maxKeysPerCompactionTx` deletions
to avoid Raft proposal timeouts or LSM performance issues; suggested default: 256,
chosen to keep proposal sizes well under the typical 1 MiB Raft entry limit):
- Advance the base meta Head by the number of claimed items.
Contributor

high

When advancing the base metadata Head during compaction, the compactor must only proceed through contiguous claimed sequences starting from the current Head. Advancing the Head past an unclaimed sequence (a gap) would logically delete items that have not been popped, causing data loss. Since this identifies a significant operational risk, the design must also detail potential mitigation strategies, such as implementing a temporary "bridge" or "proxy" mode, to ensure safety during rolling upgrades.

References
  1. When a design document identifies a significant operational risk, such as the inability to perform rolling upgrades, it must also detail potential mitigation strategies, like implementing a temporary "bridge" or "proxy" mode.
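The contiguity rule the reviewer asks for fits in a few lines. The sketch is illustrative only (`advanceHead` and the `claimed` map stand in for a range scan over Claim keys):

```go
package main

import "fmt"

// advanceHead moves Head forward only through a contiguous run of claimed
// sequences and stops at the first gap, so an item that was never popped can
// never be skipped over and logically deleted by compaction.
func advanceHead(head int64, claimed map[int64]bool) int64 {
	for claimed[head] {
		head++
	}
	return head
}

func main() {
	// 5 and 6 are claimed; 7 is not; 8 was claimed out of order.
	claimed := map[int64]bool{5: true, 6: true, 8: true}
	fmt.Println(advanceHead(5, claimed)) // 7: stop at the gap, 8 stays pending
}
```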

#### 6.2. POP Claim Flow (LPOP example)

```
For LPOP:
Contributor

medium

The design should explicitly define the RPOP claim flow. Unlike LPOP which searches forward from Head, RPOP must search backward from Tail - 1. This requires a reverse range scan for claims within the window, and the logic for picking the candidate sequence differs from the LPOP example provided. Clarifying this will prevent implementation errors regarding sequence selection.
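A sketch of the symmetric RPOP selection, with hypothetical names; a real implementation would also clamp the window at Head and feed `claimedInWindow` from a reverse range scan over Claim keys:

```go
package main

import "fmt"

// lastUnclaimed is the RPOP mirror of the LPOP claim search: it walks backward
// from tail-1 within a bounded window and returns the first unclaimed sequence.
func lastUnclaimed(tail, scanWindow int64, claimedInWindow map[int64]bool) (int64, bool) {
	for seq := tail - 1; seq >= tail-scanWindow; seq-- {
		if !claimedInWindow[seq] {
			return seq, true
		}
	}
	return 0, false // window exhausted: widen the window or trigger compaction
}

func main() {
	claimed := map[int64]bool{19: true, 18: true}
	seq, ok := lastUnclaimed(20, 32, claimed)
	fmt.Println(seq, ok) // 17 true: first unclaimed sequence from the right
}
```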


```
1. Determine the base meta Head for the target userKey.
2. Claim keys with a sequence less than Head are no longer needed (Head has already passed them).
Contributor

medium

Garbage collection of claim keys should also handle the right side of the list. Items and claims with a sequence greater than or equal to the effective Tail (resulting from RPOP operations) are obsolete and should be cleaned up by the compactor, similar to how sequences less than Head are handled. Restricting GC to only sequences less than Head will lead to resource leaks for RPOP-heavy workloads.

bootjp added a commit that referenced this pull request Apr 12, 2026
…review

- Address PR #485 gemini review (4094524421):
  - Add stale claim key cleanup in PUSH write path
  - Enforce contiguous-only Head/Tail advancement in compaction
  - Add explicit RPOP claim flow (reverse scan from Tail-1)
  - Add Tail-side GC for RPOP claim keys

- Extend delta metadata pattern to Hash, Set, and ZSet:
  - Hash: wide-column decomposition + delta
  - Set: wide-column decomposition + delta
  - ZSet: delta on existing wide-column (PR #483)
  - Unified compactor, shared delta limits, transition plan per type

- Rename list_metadata_delta_design.md to collection_metadata_delta_design.md
@bootjp (Owner, Author) commented Apr 12, 2026

/gemini review

Contributor

gemini-code-assist bot left a comment

Code Review

This pull request introduces a design for a Delta-based metadata pattern for Redis collections (List, Hash, Set, and ZSet) to mitigate write-write conflicts. The design incorporates wide-column decomposition and a claim mechanism for List operations. Feedback identifies critical issues regarding binary safety in key layouts, potential race conditions during the deletion of base metadata in background compaction, and missing truncation checks in the List metadata resolution logic.

Comment on lines +77 to +81
> **Note on separator**: The null byte (`\x00`) between `userKey` and the fixed-length suffix prevents prefix-collision bugs where scanning for `userKey = "foo"` would incorrectly include keys for `userKey = "foobar"`.

- `commitTS` is an 8-byte big-endian timestamp pinned by the coordinator before the Delta key is generated (via `kv.OperationGroup.CommitTS` during dispatch), then carried through Raft and used unchanged at apply time.
- `seqInTxn` is a 4-byte big-endian sequence number within the same transaction (needed if `LPUSH` is called multiple times for the same key in one `MULTI/EXEC`).
- Since all Delta keys for a `userKey` share the prefix `!lst|meta|d|<userKey>`, they are physically contiguous in the LSM tree, allowing for fast Prefix Scans.
Contributor

high

The use of a null byte (\x00) separator for List Delta keys is inconsistent with the length-prefix approach (userKeyLen) adopted for Hash, Set, and ZSet types. While a null byte prevents collisions for standard strings, it is insufficient for binary-safe keys because the key itself can contain null bytes. Furthermore, changing the key format presents a significant operational risk for rolling upgrades and requires a clear migration strategy. The design must detail mitigation strategies (e.g., a 'bridge' mode) and a plan for live, zero-downtime cutover (e.g., dual-write proxy or logical replication) to avoid service interruption during the transition.

References
  1. When a design document identifies a significant operational risk, such as the inability to perform rolling upgrades, it must also detail potential mitigation strategies, like implementing a temporary "bridge" or "proxy" mode.
  2. When designing a production-grade system migration, the plan must consider and detail strategies for live or zero-downtime cutovers (e.g., blue/green, dual-write proxy, logical replication) to avoid service interruption.

For all collection types, empty collection deletion is deferred to Background Compaction (same reasoning as List Section 11.3):

- Immediate deletion would require writing to the base metadata, risking inconsistency with concurrent Delta writes.
- When the compactor detects `Len == 0`, it atomically deletes all keys for that collection (base metadata, deltas, data keys, and for List: claim keys).
Contributor

high

Deleting the baseMeta and data keys (items, fields, members) during background compaction when Len == 0 is dangerous in this lock-free design. Since writers do not register the baseMeta key in their readSet, a concurrent pusher might calculate a sequence number based on the existing metadata and unapplied deltas. If the compactor deletes the baseMeta before the pusher commits, the next pusher will see no metadata and start sequence numbering from zero, leading to sequence collisions or a corrupted collection structure. This represents a significant operational risk; the design must detail mitigation strategies to maintain correctness, such as updating the baseMeta to reflect Len=0 rather than deleting the key entirely, or implementing a safety bridge for concurrent operations.

References
  1. When a design document identifies a significant operational risk, such as the inability to perform rolling upgrades, it must also detail potential mitigation strategies, like implementing a temporary "bridge" or "proxy" mode.

Comment on lines +163 to +168
deltas, err := r.store.ScanAt(ctx, prefix, prefixScanEnd(prefix), maxDeltaScanLimit, readTS)
if err != nil {
return ListMeta{}, false, err
}

// 3. Aggregate
Contributor

medium

The resolveListMeta pseudocode is missing the truncation check for maxDeltaScanLimit. As noted in Section 4 and implemented in the Hash version (Section 12), the function must return an error if len(deltas) == maxDeltaScanLimit to prevent returning incorrect metadata when deltas are truncated.

Suggested change

deltas, err := r.store.ScanAt(ctx, prefix, prefixScanEnd(prefix), maxDeltaScanLimit, readTS)
if err != nil {
	return ListMeta{}, false, err
}
// 3. Aggregate

deltas, err := r.store.ScanAt(ctx, prefix, prefixScanEnd(prefix), maxDeltaScanLimit, readTS)
if err != nil {
	return ListMeta{}, false, err
}
if len(deltas) == maxDeltaScanLimit {
	return ListMeta{}, false, ErrDeltaScanTruncated
}
// 3. Aggregate

@bootjp (Owner, Author) commented Apr 12, 2026

/gemini review

Contributor

gemini-code-assist bot left a comment

Code Review

This pull request introduces a design for a Delta-based metadata management system for Redis collections, aimed at eliminating write conflicts during concurrent operations across List, Hash, Set, and ZSet types. The design utilizes wide-column decomposition and a Claim mechanism to ensure conflict-free throughput. Review feedback identifies several critical issues, including potential livelocks during synchronous compaction, the risk of Delta key collisions if commitTS uniqueness is not guaranteed, and concerns regarding Redis compatibility when using tombstones for empty collections. Additionally, the reviewer suggested optimizing performance for sequence checks via range scans and requested more detailed strategies for zero-downtime migrations.


When the number of Deltas is small (< 100), the cost of a Prefix Scan is negligible. Since Delta keys are physically contiguous in the LSM tree, I/O can be performed in a single sequential read.

**`maxDeltaScanLimit` overflow**: If the number of unapplied Deltas exceeds `maxDeltaScanLimit`, `resolveListMeta` cannot aggregate them all in a single scan pass, which would produce an incorrect `ListMeta`. To preserve correctness, `resolveListMeta` must return an error when the scan result is truncated (i.e., when `len(deltas) == maxDeltaScanLimit`). The caller should then either surface the error or trigger an immediate synchronous compaction before retrying. This behaviour is the enforcement backstop for the hard-limit policy described in Section 11.1.
Contributor

high

Triggering an immediate synchronous compaction when maxDeltaScanLimit is reached poses a risk of livelock under high concurrency. Since multiple writers hitting the limit will all attempt to compact the same metadata key, they will likely conflict with each other on the baseMeta write. As this represents a significant operational risk, the design must detail potential mitigation strategies, such as serializing these compaction attempts or implementing a backoff strategy, to ensure system progress.

References
  1. When a design document identifies a significant operational risk, such as the inability to perform rolling upgrades, it must also detail potential mitigation strategies, like implementing a temporary "bridge" or "proxy" mode.
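One way to serialize these compaction attempts is a per-key in-flight gate with jittered backoff for the losers. This is only an illustrative sketch of the mitigation the comment asks for — the names (`compactGate`, `tryAcquire`, `backoff`) are assumptions, not identifiers from the codebase:

```go
package main

import (
	"fmt"
	"math/rand"
	"sync"
	"time"
)

// compactGate serializes synchronous compaction per user key so that
// concurrent writers hitting maxDeltaScanLimit do not all race on the
// baseMeta write. Only the goroutine that acquires the per-key slot
// compacts; the rest back off with jitter and retry their operation.
type compactGate struct {
	mu       sync.Mutex
	inFlight map[string]bool
}

func newCompactGate() *compactGate {
	return &compactGate{inFlight: make(map[string]bool)}
}

// tryAcquire reports whether the caller won the right to compact userKey.
func (g *compactGate) tryAcquire(userKey string) bool {
	g.mu.Lock()
	defer g.mu.Unlock()
	if g.inFlight[userKey] {
		return false
	}
	g.inFlight[userKey] = true
	return true
}

func (g *compactGate) release(userKey string) {
	g.mu.Lock()
	defer g.mu.Unlock()
	delete(g.inFlight, userKey)
}

// backoff returns a jittered exponential delay for losers of the race.
func backoff(attempt int) time.Duration {
	base := time.Millisecond << uint(attempt) // 1ms, 2ms, 4ms, ...
	return base + time.Duration(rand.Int63n(int64(base)))
}

func main() {
	g := newCompactGate()
	fmt.Println(g.tryAcquire("mylist")) // true: this goroutine compacts
	fmt.Println(g.tryAcquire("mylist")) // false: concurrent writer backs off
	g.release("mylist")
	fmt.Println(g.tryAcquire("mylist")) // true again after release
}
```

Losers would sleep for `backoff(attempt)` and then re-run `resolveListMeta`, which usually succeeds because the winner has already shrunk the delta set.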


- Immediate deletion would require writing to the base metadata, risking inconsistency with concurrent Delta writes.
- When the compactor detects `Len == 0`, it performs the following in a single transaction:
1. **Update** the base metadata to `Len = 0` (and for List: `Head`/`Tail` reflecting the final state). The base metadata key is **not deleted** — it is retained as a tombstone to prevent concurrent writers from misinterpreting a missing metadata key as a fresh collection and restarting sequence numbering from zero.
Contributor

high

Retaining the base metadata key as a tombstone with Len = 0 deviates from standard Redis behavior. Furthermore, the 'restart from zero' approach mentioned here must be evaluated against production migration requirements. For a production-grade system migration, the plan must detail strategies for live or zero-downtime cutovers (e.g., blue/green or dual-write proxy) to avoid service interruption, rather than relying on clearing state.

References
  1. When designing a production-grade system migration, the plan must consider and detail strategies for live or zero-downtime cutovers (e.g., blue/green, dual-write proxy, logical replication) to avoid service interruption.

```

- `userKeyLen` is a 4-byte big-endian length prefix that unambiguously separates `userKey` from the fixed-length suffix, even when `userKey` contains arbitrary bytes (including null bytes). This is the same approach used by Hash, Set, and ZSet key layouts, ensuring consistency across all collection types and full binary safety.
- `commitTS` is an 8-byte big-endian timestamp pinned by the coordinator before the Delta key is generated (via `kv.OperationGroup.CommitTS` during dispatch), then carried through Raft and used unchanged at apply time.
Contributor

medium

The design relies on commitTS being part of the Delta key to avoid write conflicts. However, if multiple coordinators can concurrently process requests for the same userKey, there is a risk of Delta key collisions if they happen to allocate the same commitTS. While HLCs generally provide uniqueness, the document should explicitly state the requirement for commitTS to be globally unique per transaction (or at least per userKey) to prevent silent overwrites of Delta entries.

Comment on lines +134 to +137
3. For each target sequence, check for stale Claim keys:
- Scan !lst|claim|<keyLen><key>[seq …] for sequences being written
- If a stale Claim key exists (left over from a prior POP on a recycled sequence),
emit a Del for that Claim key in the same transaction
Contributor

medium

The "New Flow" for RPUSH mentions checking for stale Claim keys for each target sequence. To maintain high throughput, this check should be implemented as a single range scan covering all sequences in the batch (e.g., [Tail, Tail + numItems)) rather than individual point lookups or scans per item. This minimizes the number of round-trips to the storage engine.
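The single range scan suggested above amounts to computing one pair of boundary keys for the whole batch. A minimal sketch, assuming a hypothetical `claimKey` helper with the Claim key layout from the design:

```go
package main

import (
	"encoding/binary"
	"fmt"
)

const listClaimPrefix = "!lst|claim|"

// claimKey builds !lst|claim|<userKeyLen(4)><userKey><seq(8)>.
func claimKey(userKey []byte, seq uint64) []byte {
	buf := make([]byte, 0, len(listClaimPrefix)+4+len(userKey)+8)
	buf = append(buf, listClaimPrefix...)
	buf = binary.BigEndian.AppendUint32(buf, uint32(len(userKey)))
	buf = append(buf, userKey...)
	buf = binary.BigEndian.AppendUint64(buf, seq)
	return buf
}

// claimScanRange returns [start, end) covering claim keys for sequences
// [tail, tail+numItems), so an RPUSH batch can detect stale claims for
// every target sequence with one range scan instead of per-item lookups.
func claimScanRange(userKey []byte, tail, numItems uint64) (start, end []byte) {
	return claimKey(userKey, tail), claimKey(userKey, tail+numItems)
}

func main() {
	start, end := claimScanRange([]byte("q"), 100, 3)
	// Big-endian seq encoding makes byte order match numeric order:
	// sequences 100..102 fall inside the range, 103 is excluded.
	fmt.Println(string(start) <= string(claimKey([]byte("q"), 100))) // true
	fmt.Println(string(claimKey([]byte("q"), 102)) < string(end))    // true
	fmt.Println(string(claimKey([]byte("q"), 103)) < string(end))    // false
}
```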

@bootjp
Owner Author

bootjp commented Apr 12, 2026

/gemini review

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request outlines a design for implementing a Delta pattern across Redis collection types to mitigate write conflicts and improve concurrency. The design includes wide-column decomposition for Hash and Set types and a CAS-based Claim mechanism for List operations. Review feedback suggests refining the LPOP and RPOP claim loops with explicit boundary checks to ensure termination and addressing a potential race condition in the background compactor's bulk deletion logic for Hash, Set, and ZSet data keys.

- Put !lst|claim|<keyLen><key><candidateSeq> → {claimerTS} (Write Claim)
- Put !lst|meta|d|<keyLen><key><commitTS><seqInTxn(4)> → {HeadDelta: +1, LenDelta: -1}
- Commit via dispatchElems()
If no candidate found in window: advance window and repeat from step 3
Contributor

medium

The LPOP claim loop should explicitly check the candidateSeq against meta.Tail to ensure it does not scan beyond the current logical end of the list or enter an infinite loop if all items in the range are claimed.

Suggested change
If no candidate found in window: advance window and repeat from step 3
If no candidate found in window: advance window and repeat from step 3 (stop if candidateSeq >= meta.Tail)
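The suggested termination bound can be isolated into a small window-advance helper. This is a sketch with assumed names (`nextLPopWindow`); the real claim loop would also scan claim keys inside each window before advancing:

```go
package main

import "fmt"

// nextLPopWindow advances the LPOP claim-scan window [lo, lo+window) and
// reports whether the search must stop because the next candidate would
// pass the logical end of the list (meta.Tail). The symmetric RPOP check
// bounds the retreating window at meta.Head instead.
func nextLPopWindow(lo, window, tail uint64) (nextLo uint64, done bool) {
	nextLo = lo + window
	if nextLo >= tail {
		// Every remaining sequence is already claimed or out of range.
		return nextLo, true
	}
	return nextLo, false
}

func main() {
	// List occupies sequences [0, 100); scan window of 32.
	lo := uint64(0)
	for {
		next, done := nextLPopWindow(lo, 32, 100)
		fmt.Println(lo, done)
		if done {
			break // terminates after windows starting at 0, 32, 64, 96
		}
		lo = next
	}
}
```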

- Put !lst|claim|<keyLen><key><candidateSeq> → {claimerTS} (Write Claim)
- Put !lst|meta|d|<keyLen><key><commitTS><seqInTxn(4)> → {HeadDelta: 0, LenDelta: -1}
- Commit via dispatchElems()
If no candidate found in window: retreat window and repeat from step 3
Contributor

medium

Similarly to LPOP, the RPOP claim loop should check candidateSeq against meta.Head to ensure it terminates if no unclaimed items are found within the logical bounds of the list.

Suggested change
If no candidate found in window: retreat window and repeat from step 3
If no candidate found in window: retreat window and repeat from step 3 (stop if candidateSeq < meta.Head)

- When the compactor detects `Len == 0`, it performs the following in a single transaction:
1. **Update** the base metadata to `Len = 0` (and for List: `Head`/`Tail` reflecting the final state). The base metadata key is **not deleted** — it is retained as a tombstone to prevent concurrent writers from misinterpreting a missing metadata key as a fresh collection and restarting sequence numbering from zero.
2. Delete all applied Delta keys.
3. Delete all data keys (items for List, fields for Hash, members for Set, member+score keys for ZSet).
Contributor

medium

Bulk deletion of data keys for Hash, Set, and ZSet during background compaction introduces a race condition. Since writers (HSET, SADD, ZADD) do not register the base metadata in their readSet, the compactor could perform a prefix delete that removes keys added by concurrent transactions that have not yet been merged into the base metadata. For these types, it is safer to rely on the individual mutation operations (HDEL, SREM, ZREM) which already handle data key deletion.

Suggested change
3. Delete all data keys (items for List, fields for Hash, members for Set, member+score keys for ZSet).
3. For List: delete corresponding Item keys for the contiguous range only. For Hash, Set, and ZSet, data keys are already deleted by mutation operations (HDEL/SREM/ZREM); the compactor should avoid bulk prefix deletion to prevent races with concurrent writers.

@bootjp
Owner Author

bootjp commented Apr 12, 2026

/gemini review

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a comprehensive design for a Delta-based metadata pattern across all Redis collection types to eliminate write conflicts during concurrent operations. The proposal includes wide-column decomposition for Hash and Set types and a CAS-based claim mechanism for List POP operations to ensure conflict-free throughput. The review feedback identifies critical integer overflow vulnerabilities in key extraction functions and points out inconsistencies regarding data key cleanup for empty collections. Additionally, the feedback emphasizes the need for a detailed migration and zero-downtime cutover strategy to mitigate operational risks associated with changes to the base metadata key format.

Comment on lines +70 to +71
Base Metadata (Existing):
!lst|meta|<userKeyLen(4)><userKey> → [Head(8)][Tail(8)][Len(8)]
Contributor

high

The document labels the base metadata as (Existing) but describes a new key format !lst|meta|<userKeyLen(4)><userKey>. The current implementation in store/list_helpers.go uses !lst|meta|<userKey>. This change in key format introduces a significant operational risk for rolling upgrades. Per repository rules, please detail potential mitigation strategies (e.g., a temporary "bridge" or "proxy" mode) and strategies for zero-downtime cutover (e.g., blue/green, dual-write proxy, or logical replication) to avoid service interruption during the migration.

References
  1. When a design document identifies a significant operational risk, such as the inability to perform rolling upgrades, it must also detail potential mitigation strategies, like implementing a temporary 'bridge' or 'proxy' mode.
  2. When designing a production-grade system migration, the plan must consider and detail strategies for live or zero-downtime cutovers (e.g., blue/green, dual-write proxy, logical replication) to avoid service interruption.

Comment on lines +429 to +439
func ExtractListUserKeyFromDelta(key []byte) []byte {
    trimmed := bytes.TrimPrefix(key, []byte(ListMetaDeltaPrefix))
    if len(trimmed) < 4+8+4 { // 4(userKeyLen) + 8(commitTS) + 4(seqInTxn)
        return nil
    }
    ukLen := binary.BigEndian.Uint32(trimmed[:4])
    if uint32(len(trimmed)) < 4+ukLen+8+4 {
        return nil
    }
    return trimmed[4 : 4+ukLen]
}
Contributor

security-high high

The length check if uint32(len(trimmed)) < 4+ukLen+8+4 is vulnerable to an integer overflow if ukLen is a very large uint32. This can lead to a panic due to an out-of-bounds slice access. A safer way to perform this check is to avoid arithmetic that can overflow.

Suggested change
func ExtractListUserKeyFromDelta(key []byte) []byte {
    trimmed := bytes.TrimPrefix(key, []byte(ListMetaDeltaPrefix))
    if len(trimmed) < 4+8+4 { // 4(userKeyLen) + 8(commitTS) + 4(seqInTxn)
        return nil
    }
    ukLen := binary.BigEndian.Uint32(trimmed[:4])
    if uint32(len(trimmed)) < 4+ukLen+8+4 {
        return nil
    }
    return trimmed[4 : 4+ukLen]
}

func ExtractListUserKeyFromDelta(key []byte) []byte {
    trimmed := bytes.TrimPrefix(key, []byte(ListMetaDeltaPrefix))
    const suffixLen = 8 + 4 // commitTS + seqInTxn
    if len(trimmed) < 4+suffixLen { // 4(userKeyLen) + suffix
        return nil
    }
    ukLen := binary.BigEndian.Uint32(trimmed[:4])
    // Check if the declared userKey length would exceed the available buffer length.
    if ukLen > uint32(len(trimmed)-(4+suffixLen)) {
        return nil
    }
    return trimmed[4 : 4+ukLen]
}

Comment on lines +441 to +451
func ExtractListUserKeyFromClaim(key []byte) []byte {
    trimmed := bytes.TrimPrefix(key, []byte(ListClaimPrefix))
    if len(trimmed) < 4+8 { // 4(userKeyLen) + 8(seq)
        return nil
    }
    ukLen := binary.BigEndian.Uint32(trimmed[:4])
    if uint32(len(trimmed)) < 4+ukLen+8 {
        return nil
    }
    return trimmed[4 : 4+ukLen]
}
Contributor

security-high high

Similar to ExtractListUserKeyFromDelta, the length check if uint32(len(trimmed)) < 4+ukLen+8 is vulnerable to an integer overflow if ukLen is large. This can cause a panic. The check should be rewritten to be safe against overflows.

Suggested change
func ExtractListUserKeyFromClaim(key []byte) []byte {
    trimmed := bytes.TrimPrefix(key, []byte(ListClaimPrefix))
    if len(trimmed) < 4+8 { // 4(userKeyLen) + 8(seq)
        return nil
    }
    ukLen := binary.BigEndian.Uint32(trimmed[:4])
    if uint32(len(trimmed)) < 4+ukLen+8 {
        return nil
    }
    return trimmed[4 : 4+ukLen]
}

func ExtractListUserKeyFromClaim(key []byte) []byte {
    trimmed := bytes.TrimPrefix(key, []byte(ListClaimPrefix))
    const suffixLen = 8 // seq
    if len(trimmed) < 4+suffixLen { // 4(userKeyLen) + suffix
        return nil
    }
    ukLen := binary.BigEndian.Uint32(trimmed[:4])
    // Check if the declared userKey length would exceed the available buffer length.
    if ukLen > uint32(len(trimmed)-(4+suffixLen)) {
        return nil
    }
    return trimmed[4 : 4+ukLen]
}

Comment on lines +238 to +239
- Before deleting Deltas, the worker ensures their `commitTS` is older than `ActiveTimestampTracker.Oldest()` to avoid breaking in-flight reads.
- Deltas within the MVCC retention window are not deleted to guarantee consistency for historical reads.
Contributor

medium

The conditions for deleting deltas during compaction could be stated more clearly. The document lists two conditions: being older than ActiveTimestampTracker.Oldest() and being outside the MVCC retention window. It would be clearer to state that a delta is eligible for deletion only if its commitTS is older than both the oldest active read timestamp and the MVCC retention window boundary. For example: "To be eligible for deletion, a Delta's commitTS must be older than both ActiveTimestampTracker.Oldest() and the MVCC retention window boundary (now - retention_window)."
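That combined condition can be captured in a single predicate. A sketch with assumed timestamp units (`ActiveTimestampTracker` itself is not modeled; `oldestActiveTS` stands in for its `Oldest()` result):

```go
package main

import "fmt"

// deltaDeletable reports whether a delta with the given commitTS may be
// physically removed by the compactor. Both conditions must hold: the
// delta is older than the oldest in-flight read timestamp, AND it has
// fallen out of the MVCC retention window (now - retentionWindow).
func deltaDeletable(commitTS, oldestActiveTS, now, retentionWindow uint64) bool {
	retentionBoundary := uint64(0)
	if now > retentionWindow {
		retentionBoundary = now - retentionWindow
	}
	return commitTS < oldestActiveTS && commitTS < retentionBoundary
}

func main() {
	// now=1000, retention=200 → boundary=800; oldest active read at 900.
	fmt.Println(deltaDeletable(700, 900, 1000, 200)) // true: older than both bounds
	fmt.Println(deltaDeletable(850, 900, 1000, 200)) // false: inside retention window
	fmt.Println(deltaDeletable(700, 600, 1000, 200)) // false: a reader may still need it
}
```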


Immediate cleanup will not occur even if `Len == 0` after aggregating Deltas. When Background Compaction detects `Len == 0`:
- The base metadata key is **updated** to reflect `Len = 0` (not deleted). This is critical because writers do not register the base metadata in their `readSet`. If the key were deleted, a concurrent writer would see no metadata and incorrectly assume a fresh collection, restarting sequence numbering from zero and causing sequence collisions or data corruption.
- All applied Deltas, data keys (items, fields, members, score keys), and Claim keys (for List) are deleted.
Contributor

medium

There's an inconsistency in the description of empty collection cleanup. This line states that "All applied Deltas, data keys (items, fields, members, score keys), and Claim keys (for List) are deleted" when Len == 0. However, other sections (e.g., Section 13 for Hashes) correctly state that data keys for Hash, Set, and ZSet are not deleted by the compactor to avoid race conditions. Please clarify this statement to reflect that data keys are only deleted for Lists during compaction. The more detailed explanation in Section 28 is correct and should be reflected here for consistency.

@bootjp bootjp changed the title Adding a design document for delta-forming list metadata. Adding a design document for delta-forming collection metadata. Apr 12, 2026
@bootjp
Owner Author

bootjp commented Apr 12, 2026

/gemini review

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a comprehensive design for a Delta + Claim pattern to resolve write conflicts on Redis collection metadata (List, Hash, Set, and ZSet). The design includes wide-column decomposition for Hash and Set types, a background compaction mechanism, and a detailed transition plan. Feedback focuses on correcting internal section references regarding delta limits, addressing a potential race condition in the lazy tombstone cleanup strategy for empty collections, and suggesting an additional helper function for user key extraction to maintain consistency with the new metadata format.

    }
    return trimmed[4 : 4+ukLen]
}
```
Contributor

high

The design introduces a new base metadata key format with a length prefix (!lst|meta|<userKeyLen(4)>). To maintain consistency and support the migration, an updated helper function for extracting the user key from the base metadata should be included in Section 8. Furthermore, as this is a production-grade system migration, the plan must detail strategies for live or zero-downtime cutovers (e.g., blue/green, dual-write proxy, or logical replication) to avoid service interruption.

Suggested change
```
func ExtractListUserKeyFromBaseMeta(key []byte) []byte {
    trimmed := bytes.TrimPrefix(key, []byte(ListMetaPrefix))
    if len(trimmed) < 4 {
        return nil
    }
    ukLen := binary.BigEndian.Uint32(trimmed[:4])
    if uint32(len(trimmed)) < 4+ukLen {
        return nil
    }
    return trimmed[4 : 4+ukLen]
}
References
  1. When designing a production-grade system migration, the plan must consider and detail strategies for live or zero-downtime cutovers (e.g., blue/green, dual-write proxy, logical replication) to avoid service interruption.

- `DEL key` / `UNLINK key`: performs full metadata deletion (including the base metadata key) via the standard transactional `DEL` flow with proper `readSet` registration. This is the only path that removes the base metadata key.
- On first write to a `Len = 0` tombstone (e.g., `RPUSH` to an empty list): the writer sees the existing metadata with `Len = 0` and correctly resumes sequence numbering from the current `Head`/`Tail`, avoiding the sequence-collision problem that would occur if the key had been fully deleted.

**Lazy tombstone cleanup**: To prevent indefinite accumulation of `Len = 0` tombstone metadata keys, the Background Compactor tracks how long a key has remained at `Len = 0`. If a tombstone persists beyond a configurable TTL (default: 24 hours) with no new writes, the compactor may perform a full deletion as a leader-only operation with `readSet` registration, behaving like an implicit `DEL`.
Contributor

high

There is a potential race condition in the "Lazy tombstone cleanup" strategy. Since writers do not register the base metadata key in their readSet, a concurrent writer could start an operation after the compactor has decided to delete the tombstone but before the deletion is committed, leading to sequence collisions. This represents a significant operational risk. The design must detail potential mitigation strategies, such as implementing a temporary "bridge" or "proxy" mode, to ensure that no writers are active or to prevent sequence reuse if the metadata key is fully removed.

References
  1. When a design document identifies a significant operational risk, such as the inability to perform rolling upgrades, it must also detail potential mitigation strategies, like implementing a temporary "bridge" or "proxy" mode.


When the number of Deltas is small (< 100), the cost of a Prefix Scan is negligible. Since Delta keys are physically contiguous in the LSM tree, I/O can be performed in a single sequential read.

**`maxDeltaScanLimit` overflow**: If the number of unapplied Deltas exceeds `maxDeltaScanLimit`, `resolveListMeta` cannot aggregate them all in a single scan pass, which would produce an incorrect `ListMeta`. To preserve correctness, `resolveListMeta` must return an error when the scan result is truncated (i.e., when `len(deltas) == maxDeltaScanLimit`). The caller should then either surface the error or trigger an immediate synchronous compaction before retrying. This behaviour is the enforcement backstop for the hard-limit policy described in Section 11.1.
Contributor

medium

The reference to "Section 11.1" appears to be incorrect. Based on the context of limits and hard-limit policies, this should likely refer to Design Decision D1 (Line 1152).


### 27. Shared Delta Limits

The delta accumulation limits from List Section 11.1 apply uniformly to all collection types:
Contributor

medium

The reference to "List Section 11.1" is incorrect as Part I (List) ends at Section 8. This should refer to Design Decision D1 (Line 1152) which defines the shared delta limits.

@bootjp
Owner Author

bootjp commented Apr 12, 2026

/gemini review

@bootjp
Owner Author

bootjp commented Apr 12, 2026

/gemini review

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a comprehensive design for a Delta + Claim pattern to resolve write conflicts on Redis collection metadata (List, Hash, Set, and ZSet). The design transitions Hash and Set to a wide-column format and uses individual Delta keys to avoid Read-Modify-Write (RMW) bottlenecks on a single metadata key. Feedback focuses on optimizing the scanWindow for LPOP operations to handle high-throughput scenarios, ensuring commitTS uniqueness during leader transitions, and addressing potential consistency issues in MULTI/EXEC transactions when metadata is read without readSet registration. Additionally, there are suggestions to refine the migration strategy for legacy data and to implement a lazy reaping mechanism for Len = 0 tombstones to prevent long-term metadata storage leaks.

1. resolveListMeta(key, readTS) → Effective meta (Determine Head, Len)
2. candidateSeq = meta.Head
3. Bulk-scan existing Claim keys in range [candidateSeq, candidateSeq+scanWindow):
- scanWindow is a configurable constant (default: 32) that determines how many
Contributor

high

A scanWindow of 32 might be too small for high-throughput workloads, especially if background compaction lags. If thousands of items are popped but not yet compacted, LPOP could require many sequential range scans to find the next available item, leading to significant latency spikes. Consider increasing the default window size (e.g., to 128 or 256) to reduce read amplification, as Pebble range scans are highly efficient for these sizes.

Suggested change
- scanWindow is a configurable constant (default: 32) that determines how many
- scanWindow is a configurable constant (default: 128) that determines how many

a. First batch: write !hs|meta|<key> with Len, create first N field keys, keep legacy blob.
b. Subsequent batches: create next N field keys each.
c. Final batch: create remaining field keys, delete legacy !redis|hash|<key>.
- During batched migration, reads must merge wide-column fields with remaining legacy fields.
Contributor

high

During the batched migration phase for Hash and Set, the design should explicitly define the behavior for concurrent writes (HSET, SADD) and deletes (HDEL, SREM). To ensure a zero-downtime cutover and avoid service interruption, the plan must detail strategies such as a dual-write proxy or logical replication. Specifically, writes should target the new wide-column format immediately, and the "merge" logic in reads must prioritize wide-column field keys over the legacy blob. Additionally, deletes must check both formats to avoid leaving orphaned data in the legacy blob.

References
  1. When designing a production-grade system migration, the plan must consider and detail strategies for live or zero-downtime cutovers (e.g., blue/green, dual-write proxy, logical replication) to avoid service interruption.

> **Note on base metadata format**: The base metadata key retains the existing format `!lst|meta|<userKey>` without a length prefix. No migration of the base metadata key is required. Only the newly introduced Delta and Claim keys use the `userKeyLen(4)` length-prefix convention (consistent with Hash, Set, and ZSet). This avoids any ambiguity in key parsing during rolling upgrades — the base metadata prefix `!lst|meta|` is distinct from the Delta prefix `!lst|dlt|`, so there is no collision risk.

- `userKeyLen` (in Delta/Claim keys only) is a 4-byte big-endian length prefix that unambiguously separates `userKey` from the fixed-length suffix, even when `userKey` contains arbitrary bytes (including null bytes).
- `commitTS` is an 8-byte big-endian timestamp allocated by the **Raft group leader during dispatch** (before the Raft proposal is created). **`commitTS` must be globally unique per transaction** — the combination of `commitTS + seqInTxn` forms the Delta key's identity, and a collision would silently overwrite a prior Delta entry, causing metadata corruption. In this system, all write operations are routed to the leader via `proxyToLeader()` — the leader IS the coordinator for writes. The leader allocates `commitTS` from its HLC (Hybrid Logical Clock) during dispatch, embeds it in the Delta key bytes, and includes the pre-built keys in the Raft proposal. At apply time, the FSM uses these keys as-is without rewriting. Since Raft's safety property ensures at most one leader exists at a time, and the leader's HLC monotonically advances, `commitTS` uniqueness is guaranteed without requiring a node identifier.
Contributor

medium

To guarantee commitTS uniqueness across leader transitions, the design should explicitly state that a new leader must initialize its HLC to a value strictly greater than the maximum commitTS present in its Raft log. Without this initialization, a new leader with a lagging system clock could potentially issue duplicate timestamps for the same user key, leading to silent metadata corruption in the LSM tree due to key collisions.
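The requested initialization invariant can be sketched as a minimal HLC: allocation returns `max(physicalNow, last+1)`, and on leader election `last` is seeded with the maximum commitTS observed in the Raft log. This is a single-goroutine illustration under assumed names (`hlc`, `initFromLog`); a real implementation would also need a mutex and physical-clock wiring:

```go
package main

import "fmt"

// hlc is a minimal hybrid-logical-clock sketch. allocateCommitTS returns
// strictly increasing timestamps; initFromLog ensures a new leader with a
// lagging wall clock can never re-issue a commitTS already embedded in a
// Delta key by the previous leader.
type hlc struct {
	last uint64
}

// initFromLog must be called once when this node becomes leader, with the
// maximum commitTS found in its Raft log.
func (c *hlc) initFromLog(maxCommitTSInLog uint64) {
	if maxCommitTSInLog > c.last {
		c.last = maxCommitTSInLog
	}
}

func (c *hlc) allocateCommitTS(physicalNow uint64) uint64 {
	ts := physicalNow
	if ts <= c.last {
		ts = c.last + 1 // logical bump past everything already issued
	}
	c.last = ts
	return ts
}

func main() {
	var c hlc
	c.initFromLog(5000)                   // prior leader issued up to 5000
	fmt.Println(c.allocateCommitTS(4000)) // 5001: lagging clock is overridden
	fmt.Println(c.allocateCommitTS(6000)) // 6000: clock ahead, used directly
	fmt.Println(c.allocateCommitTS(6000)) // 6001: same tick, still unique
}
```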

}
```

- In `buildListElems()`, replace metadata `Put` with Delta `Put`.
Contributor

medium

The integration with MULTI/EXEC should clarify the consistency guarantees for transactions that read metadata (e.g., HLEN, LLEN) and then perform conditional logic. Since the design avoids readSet registration on the base metadata to improve throughput, such transactions may observe "phantom" updates if concurrent writers add deltas. If strict serializability is required for these cases, the design may need to support optional read-set tracking for the delta range or the base metadata key.

- `DEL key` / `UNLINK key`: performs full deletion via the standard transactional `DEL` flow with proper `readSet` registration. This must delete **all** associated keys: base metadata, all Delta keys (`!*|dlt|` / `!*|meta|d|` namespace), all Claim keys (for List), and all data keys (items, fields, members, score keys). Without cleaning up Delta keys, a subsequently recreated collection could aggregate stale deltas from a prior incarnation.
- On first write to a `Len = 0` tombstone (e.g., `RPUSH` to an empty list): the writer sees the existing metadata with `Len = 0` and correctly resumes sequence numbering from the current `Head`/`Tail`, avoiding the sequence-collision problem that would occur if the key had been fully deleted.

**No lazy tombstone cleanup**: The `Len = 0` tombstone is retained indefinitely until an explicit `DEL` command removes it. Automatic deletion of the base metadata key by the compactor is intentionally **not** performed, because it introduces an unresolvable race condition: since writers do not register the base metadata in their `readSet`, a concurrent writer could commit a new Delta between the compactor's "no deltas exist" check and its deletion of the base metadata. Registering the delta prefix scan in the compactor's `readSet` would not help because the writer's delta key is unique and was not yet visible at scan time. The storage overhead of retaining one tombstone key (24 bytes for List, 8 bytes for Hash/Set/ZSet) per empty collection is negligible compared to the data integrity risk of automatic deletion.
Contributor

medium

While retaining Len = 0 tombstones is necessary for correctness during concurrent operations, an indefinite accumulation of these keys for transient collections could lead to a slow leak in metadata storage. Consider a "lazy reaping" mechanism where tombstones older than the MVCC retention window with no associated deltas can be safely removed during a specialized compaction pass or a full DEL scan.

@bootjp
Owner Author

bootjp commented Apr 12, 2026

/gemini review

@gemini-code-assist
Contributor

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

bootjp and others added 6 commits April 12, 2026 22:44
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…review

- Address PR #485 gemini review (4094524421):
  - Add stale claim key cleanup in PUSH write path
  - Enforce contiguous-only Head/Tail advancement in compaction
  - Add explicit RPOP claim flow (reverse scan from Tail-1)
  - Add Tail-side GC for RPOP claim keys

- Extend delta metadata pattern to Hash, Set, and ZSet:
  - Hash: wide-column decomposition + delta
  - Set: wide-column decomposition + delta
  - ZSet: delta on existing wide-column (PR #483)
  - Unified compactor, shared delta limits, transition plan per type

- Rename list_metadata_delta_design.md to collection_metadata_delta_design.md
…empty collection safety, add truncation check

- Unify List delta/claim key format from null-byte separator to
  userKeyLen(4) length-prefix, consistent with Hash/Set/ZSet and
  binary-safe for keys containing null bytes
- Fix empty collection compaction: update baseMeta to Len=0 instead
  of deleting it, preventing sequence collision when concurrent
  writers see missing metadata
- Add missing maxDeltaScanLimit truncation check to resolveListMeta
  pseudocode, matching the Hash resolveHashMeta implementation
@bootjp bootjp force-pushed the feature/add-list-delta-design-doc branch from 8d42195 to ab2555b Compare April 12, 2026 13:45
bootjp added 7 commits April 12, 2026 22:47
- Increase LPOP scanWindow default from 32 to 128 to reduce read
  amplification under high-throughput POP workloads
- Document HLC leader initialization invariant ensuring commitTS
  uniqueness across leader transitions
- Clarify concurrent write behavior during Hash/Set legacy migration
  (atomic migration + OCC retry semantics)
- Document MULTI/EXEC consistency guarantees for metadata reads
  (snapshot isolation, optional strict serializability opt-in)
- Add lazy reaping mechanism for Len=0 tombstones to prevent
  metadata storage leak from transient collections
Add ListMetaDelta struct with fixed 16-byte binary marshal/unmarshal,
Delta key helpers (ListMetaDeltaKey, ListMetaDeltaScanPrefix,
IsListMetaDeltaKey, ExtractListUserKeyFromDelta), Claim key helpers
(ListClaimKey, ListClaimScanPrefix, IsListClaimKey,
ExtractListUserKeyFromClaim), and PrefixScanEnd utility.

Fix IsListMetaKey to exclude Delta keys (both share the !lst|meta|
prefix). Add comprehensive unit tests for all new helpers including
ordering, prefix scan bounds, binary safety, and edge cases.
- Add resolveListMeta() that aggregates base metadata + unapplied deltas
- Add allocateCommitTS() and dispatchElemsWithCommitTS() for Delta key gen
- Modify buildRPushOps/buildLPushOps to emit Delta keys instead of base meta
- Switch listRPush/listLPush to use resolveListMeta + pre-allocated commitTS
- Update all read paths (rangeList, isListKeyAt, listValuesAt) to resolve deltas
- Update rawKeyTypeAt to detect delta-only lists via resolveListMeta
- Update deleteLogicalKeyElems to clean up delta and claim keys
- MULTI/EXEC buildListElems still writes base metadata (delta support deferred)

Known limitation: 2 MULTI/EXEC tests fail because the txn path writes base
metadata while standalone ops write deltas, causing stale delta accumulation
after DEL+RPUSH sequences. Full MULTI/EXEC delta support requires Phase L2
transaction integration.
- Add resolveListMeta() that aggregates base metadata + unapplied deltas
- Add allocateCommitTS() and dispatchElemsWithCommitTS()
- Modify buildRPushOps/buildLPushOps to emit Delta keys
- Switch listRPush/listLPush to resolveListMeta + pre-allocated commitTS
- Switch all read paths to resolveListMeta: rangeList, listValuesAt,
  isListKeyAt, loadListState (lua context)
- Add delta-only list detection in rawKeyTypeAt (fallback after point reads)
- isListKeyAt uses fast two-stage check: point read then limit-1 scan
- Add listDeleteAllElems helper for cleanup of delta/claim keys on DEL
- MULTI/EXEC buildListElems still writes base metadata (delta deferred)
Implement a cursor-based incremental compactor that periodically scans
the list delta key space and folds accumulated deltas into base metadata
when the per-key delta count exceeds a configurable threshold.

The compactor:
- Scans !lst|meta|d| prefix incrementally (maxKeysPerTick per pass)
- Groups delta keys by user key and checks against maxDeltaCount
- Folds deltas into base metadata via a single OCC transaction
- Deletes all applied delta keys atomically
- Skips when not leader to avoid redundant work
- Configurable interval, threshold, and timeout

This enables the Delta pattern to run in production without unbounded
delta accumulation degrading resolveListMeta performance.
@claude

claude bot commented Apr 12, 2026

[Data Loss] Review findings

Target PR: #485 (List Delta Compactor + delta write path)


🔴 Critical: risk of a list "resurrecting" after DEL

Location: adapter/redis_compat_helpers.go, listDeleteAllElems()

// Delete delta keys.
deltaPrefix := store.ListMetaDeltaScanPrefix(key)
deltas, derr := r.store.ScanAt(ctx, deltaPrefix, store.PrefixScanEnd(deltaPrefix), maxDeltaScanLimit, readTS)
// ↑ truncated at maxDeltaScanLimit = 10000 entries

for _, d := range deltas {
    elems = append(elems, &kv.Elem[kv.OP]{Op: kv.Del, Key: bytes.Clone(d.Key)})
}

// Claim keys are likewise truncated at 10000 entries
claims, cerr := r.store.ScanAt(ctx, claimPrefix, store.PrefixScanEnd(claimPrefix), maxDeltaScanLimit, readTS)

The DEL command deletes the base meta key (!lst|meta|<key>) and all item keys, but the deletion of delta / claim keys is cut off at maxDeltaScanLimit = 10000 entries.

If DEL runs against a list holding more than 10000 delta or claim keys, the base meta is deleted while some delta keys remain orphaned. When resolveListMeta() is called afterwards:

func (r *RedisServer) resolveListMeta(...) (store.ListMeta, bool, error) {
    baseMeta, exists, err := r.loadListMetaAt(...)  // → exists = false (already deleted)
    deltas, err := r.store.ScanAt(...)              // → orphaned deltas still present
    ...
    return baseMeta, exists || len(deltas) > 0, nil  // ← evaluates to true!
}

Result: the supposedly deleted list is reported with exists = true, and its Head/Len/Tail are polluted by the accumulated values of the orphaned deltas. From the user's perspective, a deleted list reappears in a corrupted state.

Trigger condition: sustained high-frequency PUSH while the compactor is down for a long period. Accumulation is normally bounded by the soft limit of 64 and the hard limit of 256 entries, but when the compactor is unhealthy nothing stops it from growing.

Suggested fix: inside listDeleteAllElems, delete the delta/claim keys in multiple batches until none remain, or implement a DEL-specific flush (a full scan loop).


🟡 Undetected delta truncation in compactList()

Location: adapter/list_delta_compactor.go, compactList()

deltas, err := c.store.ScanAt(ctx, prefix, store.PrefixScanEnd(prefix), maxDeltaScanLimit, readTS)
// no check for len(deltas) == maxDeltaScanLimit

resolveListMeta() returns an error when len(deltas) == maxDeltaScanLimit, but compactList() lacks the same check, so it can aggregate only the first 10000 entries and still write base meta as if compaction had fully completed. When more than 10000 deltas have accumulated, double aggregation with the leftover deltas can corrupt Head/Len.
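The missing guard amounts to treating a full batch as possibly truncated. A small sketch under assumed types — `delta` and `foldDeltas` are illustrative, not the compactor's actual code:

```go
package main

import "fmt"

const maxDeltaScanLimit = 4 // toy value; the review cites 10000

// delta carries the metadata contributions of one delta key.
type delta struct{ headDelta, lenDelta int64 }

// foldDeltas aggregates head/len contributions from a scanned delta
// batch. When the scan returned exactly maxDeltaScanLimit entries it
// may be truncated, so the caller must treat the fold as partial and
// must NOT advance base metadata as if compaction were complete.
func foldDeltas(deltas []delta) (head, length int64, partial bool) {
	for _, d := range deltas {
		head += d.headDelta
		length += d.lenDelta
	}
	return head, length, len(deltas) == maxDeltaScanLimit
}

func main() {
	small := []delta{{0, 1}, {0, 1}}
	_, n, partial := foldDeltas(small)
	fmt.Println(n, partial) // 2 false

	full := []delta{{0, 1}, {0, 1}, {0, 1}, {0, 1}}
	_, n, partial = foldDeltas(full)
	fmt.Println(n, partial) // 4 true: fold this batch, leave the rest for the next tick
}
```

A `partial` result is still safe to apply as long as the deleted delta keys are exactly the ones that were folded; what must never happen is writing base metadata while silently skipping unscanned deltas.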


✅ Other aspects — no issues

  • Delta key uniqueness (commitTS + seqInTxn) avoiding collisions under concurrent PUSH: as designed
  • resolveListMeta() returning an error on reaching maxDeltaScanLimit: appropriate
  • LPUSH sequence-number underflow guard: correctly implemented
  • Compaction OCC protection (IsTxn: true, StartTS: readTS): correct
  • Leader-only execution of the compactor: correct

@claude

claude bot commented Apr 12, 2026

[Consistency] Review findings

Assessment: serious concerns present ⚠️


Issue 1: Delta key generation that bypasses Raft

Generating delta keys from timestamps taken directly from the HLC (Hybrid Logical Clock) (allocateCommitTS()) means they are not based on Raft-ordered commitTS, which carries the following risks:

  • Timestamp collisions in a multi-node environment → delta key uniqueness may be compromised
  • If the HLC rolls backwards during failover, delta keys could be regenerated
  • If the sort order of delta keys (commitTS(8) + seqInTxn(4)) diverges from Raft order, linearizability is not guaranteed

Issue 2: Read consistency of resolveListMeta()

Base metadata and deltas are fetched by two independent scans, raising the following concerns:

  • The design does not make explicit that both scans execute against the same readTS snapshot
  • Delta aggregation (Head += delta.HeadDelta) depends on scan order, so that ordering must be guaranteed

Issue 3: Background compactor and Raft integration

If ListDeltaCompactor merges and deletes old deltas as a background operation that bypasses Raft, the following risks apply:

  • If the compaction commit does not go through Raft, the timing of replication to followers is undefined
  • There is no visible logic that explicitly guarantees consistency with TTL keys (redisTTLKey) after compaction

Recommendations

  1. Explicitly guarantee in the design document that delta key timestamps follow Raft order
  2. Confirm that the two scans in resolveListMeta() use the same MVCC snapshot
  3. Make explicit that ListDeltaCompactor's compaction runs as a Raft entry

bootjp added 3 commits April 13, 2026 03:40
…ase L5)

- TestConcurrentRPush_EventualSuccess: concurrent RPUSH with retries
  on item key conflicts verifies eventual success
- TestConcurrentLPush_EventualSuccess: same for LPUSH
- TestConcurrentRPush_ThenCompact: data integrity after compaction
- TestResolveListMeta_ScalesWithDeltaCount: correctness at 1/10/50/100 deltas

Item key conflicts are expected under concurrent PUSH since writers
compute sequences from stale snapshots. The Delta pattern eliminates
metadata key conflicts; item key conflicts require retry at the caller.
When buildListElems writes base metadata (MULTI/EXEC path), atomically
delete all existing delta keys for that list. Without this, standalone
RPUSH deltas accumulate and get double-counted by resolveListMeta after
MULTI/EXEC writes base metadata containing the aggregated values.

Also clean up delta and claim keys in DEL and purge paths within
MULTI/EXEC transactions.

This fixes a Jepsen-detected consistency violation (G-single-item-realtime,
incompatible-order) where elements appeared to be lost due to incorrect
metadata after mixed standalone/MULTI operations.
Address @claude Data Loss and Consistency reviews:

Data Loss fixes:
- listDeleteAllElems now loops with batched scans to delete ALL delta
  and claim keys, preventing truncation at maxDeltaScanLimit (10000)
  that could leave orphaned deltas causing list resurrection after DEL
- compactList documents that truncated scans compact what is available
  and let the next tick handle the remainder

Consistency documentation:
- allocateCommitTS: document that the timestamp flows through
  OperationGroup.CommitTS to the Coordinator which observes it in HLC,
  ensuring monotonicity and Raft-commit-order alignment
- resolveListMeta: document that both GetAt and ScanAt use the same
  MVCC readTS snapshot, guaranteeing point-in-time consistency
- ListDeltaCompactor: already uses Coordinator.Dispatch (Raft-committed)
@claude

claude bot commented Apr 12, 2026

[Data Loss] Review findings

PR #485: List Delta design implementation

Found 1 issue related to data loss.


[High] listDeltaKeys ignores scan errors → orphaned delta keys cause double counting

File: adapter/redis.go

func (t *txnContext) listDeltaKeys(userKey []byte) [][]byte {
    prefix := store.ListMetaDeltaScanPrefix(userKey)
    deltas, err := t.server.store.ScanAt(...)
    if err != nil {
        return nil  // error ignored; an empty slice is returned
    }
    ...
}

Problem scenario:

  1. A list holds base metadata (!lst|meta|<key>) plus N delta keys
  2. The base-meta write path runs (a rebuild triggered by LSET or similar)
  3. The rebuild aggregates all deltas and writes base metadata containing the correct Len
  4. ScanAt fails inside listDeltaKeys, which returns nil
  5. The delta keys are never deleted and remain behind (orphaned deltas)
  6. Subsequent resolveListMeta calls add the orphaned deltas on top of the base metadata (which already includes them) → Len is double-counted and exceeds the actual item count

The claim-key cleanup (appendDeltaClaimDelOps) likewise skips its work when ScanAt errors.
The delete path (listDeleteAllElems) is safe thanks to its unbounded batch loop, but the rebuild path has the bug above.

Impact:

  • The resolved Len is larger than the actual item count
  • Head/tail positions are miscomputed by range/index/pop reads, causing access to nonexistent indexes or the wrong items being returned

Suggested fix:
Change the signature of listDeltaKeys to also return an error and propagate it to the caller, or fail the operation when the scan errors.


[Low] appendDeltaDelOps is capped at 10,000 entries (theoretical risk)

appendDeltaDelOps is truncated at maxDeltaScanLimit. While the compactor is healthy, compaction kicks in beyond 64 entries, so this is normally not a problem; but if the compactor fails repeatedly and deltas exceed 10,000, surplus deltas remain after a rebuild and double counting occurs. This is asymmetric with the delete path, which uses an unbounded loop.

bootjp added 6 commits April 13, 2026 04:19
Add a comprehensive status table covering all phases (L1-L5, H1-H4,
S1-S4, Z1-Z4, X1-X2) with per-item completion status. This makes it
easy to see what has been implemented and what remains.
Known issue: LLEN returns 4 after DEL+RPUSH in MULTI/EXEC due to old
delta keys not being cleaned up. Full fix requires buildListElems delta
integration on the leader node.
… handling

- buildListElems now writes delta keys instead of base metadata,
  eliminating the dual-path consistency issue (Jepsen G0-realtime)
- appendDeltaClaimDelOps returns errors instead of silently ignoring
  scan failures, preventing orphaned delta keys (addresses @claude
  Data Loss review)
- commit() pre-allocates commitTS for delta key generation
- Remove unused appendDeltaDelOps/listDeltaKeys

Known: TestRedis_MultiExec_DelThenRPushRecreatesList still fails (LLEN=4)
due to old deltas not being cleaned up in the purge path. Root cause
under investigation.
… cleanup issue)

The MULTI/EXEC DEL+RPUSH test fails because the purge path does not
clean up old delta keys from prior standalone RPUSH operations. This
is a known limitation of the current delta integration that needs
further investigation into the LeaderRoutedStore scan behavior during
MULTI/EXEC processing.
Revert buildListElems to write base metadata (not delta) for MULTI/EXEC
transactions. The delta-emit approach (0c33888) caused Jepsen
G0-realtime anomalies because commitTS pre-allocation in commit()
interacted poorly with the coordinator's timestamp assignment.

The base-meta + delta-cleanup approach (4cf3a8e) passed Jepsen and
correctly handles the dual-path by atomically deleting existing deltas
when writing base metadata.

Also re-enable TestRedis_MultiExec_DelThenRPushRecreatesList which now
passes with the base-meta write path.
The root cause of the Jepsen G0-realtime anomaly: maxLatestCommitTS
only checked base metadata keys (!lst|meta|<key>), but standalone
RPUSH writes only delta keys (!lst|meta|d|...) and item keys. This
caused txnStartTS to return a stale timestamp that predated recent
standalone RPUSH commits.

When MULTI/EXEC used this stale startTS, resolveListMeta missed
recent deltas, leading to incorrect base metadata writes and delta
cleanup that erased contributions from concurrent standalone RPUSHes.

Fix: appendLatestDeltaKeys reverse-scans the delta key prefix for
each list key in the transaction to find the most recent delta commit,
ensuring startTS accounts for all standalone RPUSH operations.
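The reverse-scan fix relies on the big-endian commitTS suffix making lexicographic key order equal timestamp order, so a limit-1 reverse scan yields the newest delta. A sketch with illustrative helpers — `deltaKey` and `latestDeltaCommitTS` are not the repo's actual functions:

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// deltaKey builds a toy delta key: prefix + big-endian commitTS.
// Big-endian encoding makes lexicographic key order equal commitTS
// order, so the latest delta is simply the last key in the range.
func deltaKey(prefix []byte, commitTS uint64) []byte {
	var ts [8]byte
	binary.BigEndian.PutUint64(ts[:], commitTS)
	return append(append([]byte{}, prefix...), ts[:]...)
}

// latestDeltaCommitTS extracts the commitTS of the lexicographically
// greatest delta key — what a limit-1 reverse scan would return —
// so txnStartTS can be raised above all standalone PUSH commits.
func latestDeltaCommitTS(keys [][]byte, prefixLen int) (uint64, bool) {
	var latest uint64
	found := false
	for _, k := range keys {
		ts := binary.BigEndian.Uint64(k[prefixLen:])
		if !found || ts > latest {
			latest, found = ts, true
		}
	}
	return latest, found
}

func main() {
	prefix := []byte("!lst|meta|d|mylist|")
	keys := [][]byte{
		deltaKey(prefix, 100),
		deltaKey(prefix, 307),
		deltaKey(prefix, 205),
	}
	ts, ok := latestDeltaCommitTS(keys, len(prefix))
	fmt.Println(ts, ok) // 307 true
}
```

Raising startTS to at least this value ensures the MULTI/EXEC snapshot sees every delta written by earlier standalone pushes, which is exactly the gap behind the G0-realtime anomaly.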