Adding a design document for delta-forming collection metadata. #485
Conversation
Code Review
This pull request introduces a new design for list metadata using a Delta and Claim pattern to eliminate write-write conflicts during high-concurrency operations like RPUSH and LPOP. The design shifts from a Read-Modify-Write model on a single metadata key to appending unique delta entries and using CAS-based claim keys for item removal, supported by a background compaction worker. The review feedback highlights critical improvements needed for key layout to prevent prefix collisions, strategies for handling delta scan limits to ensure data integrity, and the necessity of bounding transactions and migration plans to ensure operational stability.
docs/list_metadata_delta_design.md
Outdated
```go
func ListMetaDeltaKey(userKey []byte, commitTS uint64, seqInTxn uint32) []byte {
	buf := make([]byte, 0, len(ListMetaDeltaPrefix)+len(userKey)+8+4)
	buf = append(buf, ListMetaDeltaPrefix...)
	buf = append(buf, userKey...)
	var ts [8]byte
	binary.BigEndian.PutUint64(ts[:], commitTS)
	buf = append(buf, ts[:]...)
	var seq [4]byte
	binary.BigEndian.PutUint32(seq[:], seqInTxn)
	buf = append(buf, seq[:]...)
	return buf
}
```
The Delta key layout is susceptible to prefix collisions. For example, a prefix scan for userKey = "foo" will incorrectly include keys for userKey = "foobar". To prevent this, the userKey should be followed by a separator (e.g., a null byte) or its length should be encoded before the fixed-length suffix.
Suggested change:

```go
func ListMetaDeltaKey(userKey []byte, commitTS uint64, seqInTxn uint32) []byte {
	buf := make([]byte, 0, len(ListMetaDeltaPrefix)+len(userKey)+1+8+4)
	buf = append(buf, ListMetaDeltaPrefix...)
	buf = append(buf, userKey...)
	buf = append(buf, 0) // Separator to prevent prefix collisions
	var ts [8]byte
	binary.BigEndian.PutUint64(ts[:], commitTS)
	buf = append(buf, ts[:]...)
	var seq [4]byte
	binary.BigEndian.PutUint32(seq[:], seqInTxn)
	buf = append(buf, seq[:]...)
	return buf
}
```
docs/list_metadata_delta_design.md
Outdated
```go
func ListClaimKey(userKey []byte, seq int64) []byte {
	var raw [8]byte
	encodeSortableInt64(raw[:], seq)
	buf := make([]byte, 0, len(ListClaimPrefix)+len(userKey)+8)
	buf = append(buf, ListClaimPrefix...)
	buf = append(buf, userKey...)
	buf = append(buf, raw[:]...)
	return buf
}
```
The Claim key layout suffers from the same prefix collision risk as the Delta keys. Disambiguate the userKey from the sequence suffix using a separator or length encoding to ensure that operations on one key do not affect another key that shares its prefix.
Suggested change:

```go
func ListClaimKey(userKey []byte, seq int64) []byte {
	var raw [8]byte
	encodeSortableInt64(raw[:], seq)
	buf := make([]byte, 0, len(ListClaimPrefix)+len(userKey)+1+8)
	buf = append(buf, ListClaimPrefix...)
	buf = append(buf, userKey...)
	buf = append(buf, 0) // Separator
	buf = append(buf, raw[:]...)
	return buf
}
```
docs/list_metadata_delta_design.md
Outdated
```
@@ -0,0 +1,384 @@
# List Metadata Delta Design
```
The design document should detail potential mitigation strategies for rolling upgrades, such as implementing a temporary 'bridge' or 'proxy' mode, to address the operational risk of transitioning to the new Delta layout.
References
- When a design document identifies a significant operational risk, such as the inability to perform rolling upgrades, it must also detail potential mitigation strategies, like implementing a temporary "bridge" or "proxy" mode.
docs/list_metadata_delta_design.md
Outdated
```
@@ -0,0 +1,384 @@
# List Metadata Delta Design
```
The plan for migrating to the new metadata layout must consider and detail strategies for live or zero-downtime cutovers (e.g., blue/green, dual-write proxy, or logical replication) to avoid service interruption during the transition.
References
- When designing a production-grade system migration, the plan must consider and detail strategies for live or zero-downtime cutovers (e.g., blue/green, dual-write proxy, logical replication) to avoid service interruption.
docs/list_metadata_delta_design.md
Outdated
```go
// 2. Fetch Deltas via prefix scan
prefix := ListMetaDeltaScanPrefix(userKey)
deltas, err := r.store.ScanAt(ctx, prefix, prefixEnd(prefix), maxDeltaScanLimit, readTS)
```
docs/list_metadata_delta_design.md
Outdated
```
3. Loop:
   a. Check for Claim key at !lst|claim|<key><candidateSeq>
   b. If exists: candidateSeq++ and retry (Already claimed by another POP)
   c. If not exists:
```
docs/list_metadata_delta_design.md
Outdated
```
2. Claim keys with a sequence less than Head are no longer needed (Head has already passed them).
3. Within the compaction transaction:
   - Advance the base meta Head by the number of claimed items.
   - Delete corresponding Claim and Item keys.
```
docs/list_metadata_delta_design.md
Outdated
> Performing synchronous compaction during a write could cause write conflicts on the base metadata for the compaction transaction itself, introducing retries to what should be a conflict-free `PUSH` path. Read latency degradation due to Delta accumulation will be managed by tuning `scanInterval` and `maxDeltaCount` for Background Compaction.
>
> If Delta accumulation becomes exceptionally high, a warning log will be emitted on the read side to allow operators to adjust compaction parameters.
Relying only on warning logs for high delta accumulation is risky. Following the general rule of applying fixed bounds to collections, the system should enforce a hard limit on unapplied deltas. When reached, the system should either reject new writes or force a synchronous compaction to protect read performance.
References
- To prevent unbounded memory growth and potential OOM issues, apply a fixed bound to collections that can grow from external requests, such as pending configuration changes. Reject new requests when the bound is reached.
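For concreteness, a minimal sketch of such a bound on the write path, assuming hypothetical `countUnappliedDeltas`/`requestCompaction` helpers and reusing the soft/hard limits (64/256) that appear later in this thread; the names and wiring are illustrative, not the actual elastickv code:

```go
// Illustrative limits; the real values would be configuration.
const (
	softDeltaLimit = 64  // above this, ask the compactor to run soon
	hardDeltaLimit = 256 // above this, refuse new deltas (or compact synchronously)
)

var ErrTooManyDeltas = errors.New("list metadata: unapplied delta limit reached")

// enforceDeltaBound is a hypothetical guard on the PUSH path.
// countUnappliedDeltas and requestCompaction stand in for project helpers.
func enforceDeltaBound(ctx context.Context, userKey []byte) error {
	n, err := countUnappliedDeltas(ctx, userKey)
	if err != nil {
		return err
	}
	switch {
	case n >= hardDeltaLimit:
		return ErrTooManyDeltas // caller may instead trigger a synchronous compaction and retry once
	case n >= softDeltaLimit:
		requestCompaction(userKey) // non-blocking nudge to the background compactor
	}
	return nil
}
```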
Pull request overview
Adds a design document describing a “delta + claim key” approach for list metadata updates to reduce write conflicts in Redis list operations within the KV/MVCC storage model.
Changes:
- Document new key layout for list metadata deltas (`!lst|meta|d|...`) and claim keys (`!lst|claim|...`).
- Describe write/read paths using read-time delta aggregation and conflict-free delta writes.
- Propose a background compactor phase to fold deltas (and GC claims) into base list metadata.
docs/list_metadata_delta_design.md
Outdated
> - `commitTS` is an 8-byte big-endian timestamp determined at Raft apply time.
> - `seqInTxn` is a 4-byte big-endian sequence number within the same transaction (needed if `LPUSH` is called multiple times for the same key in one `MULTI/EXEC`).
> - Since all Delta keys for a `userKey` share the prefix `!lst|meta|d|<userKey>`, they are physically contiguous in the LSM tree, allowing for fast Prefix Scans.
The design says commitTS is determined at Raft apply time, but in the current architecture commitTS is resolved by the coordinator during dispatch (see kv.OperationGroup.CommitTS). Since the Delta key embeds commitTS, the doc should clarify how the write path obtains it before generating the key (e.g., explicitly pin CommitTS up front, or have the FSM rewrite Delta keys at apply time).
Suggested change:

> - `commitTS` is an 8-byte big-endian timestamp pinned by the coordinator before the Delta key is generated (for example on `kv.OperationGroup.CommitTS` during dispatch), then carried through Raft and used unchanged at apply time.
> - `seqInTxn` is a 4-byte big-endian sequence number within the same transaction (needed if `LPUSH` is called multiple times for the same key in one `MULTI/EXEC`).
> - Since all Delta keys for a `userKey` share the prefix `!lst|meta|d|<userKey>`, they are physically contiguous in the LSM tree, allowing for fast Prefix Scans.
>
> Because the Delta key embeds `commitTS`, the write path must know the final timestamp before emitting the key bytes. This design therefore assumes `CommitTS` is explicitly allocated once during dispatch and reused during Raft apply; it does not rely on the FSM rewriting Delta keys at apply time.
docs/list_metadata_delta_design.md
Outdated
```go
// 2. Fetch Deltas via prefix scan
prefix := ListMetaDeltaScanPrefix(userKey)
deltas, err := r.store.ScanAt(ctx, prefix, prefixEnd(prefix), maxDeltaScanLimit, readTS)

// 3. Aggregate
for _, d := range deltas {
	delta := UnmarshalListMetaDelta(d.Value)
	baseMeta.Head += delta.HeadDelta
	baseMeta.Len += delta.LenDelta
```
The resolveListMeta() code block reads like concrete Go code but currently drops error handling (err is ignored after both loadListMetaAt and ScanAt) and references prefixEnd, while the codebase helper is prefixScanEnd (e.g., adapter/prefix_scan.go). Either add the missing error checks + correct helper name, or mark the block explicitly as simplified pseudocode to avoid an implementer copying a broken version.
Suggested change:

```go
if err != nil {
	return ListMeta{}, false, err
}

// 2. Fetch Deltas via prefix scan
prefix := ListMetaDeltaScanPrefix(userKey)
deltas, err := r.store.ScanAt(ctx, prefix, prefixScanEnd(prefix), maxDeltaScanLimit, readTS)
if err != nil {
	return ListMeta{}, false, err
}

// 3. Aggregate
for _, d := range deltas {
	delta := UnmarshalListMetaDelta(d.Value)
	baseMeta.Head += delta.HeadDelta
	baseMeta.Len += delta.LenDelta
```
docs/list_metadata_delta_design.md
Outdated
> Accumulated Claim keys do not affect read performance (they use the `!lst|claim|` prefix and are outside the metadata scan scope). GC is handled by the Background Compactor.
The statement that accumulated Claim keys “do not affect read performance” conflicts with the later plan to “Skip claimed items” by checking Claim keys during fetchListRange() (which would add extra reads/lookups). Please clarify the intended read-time strategy (e.g., range-scan Claim keys in bulk, embed claim status elsewhere, or accept the additional read amplification).
Suggested change:

> Read-time strategy:
>
> - Claim keys are outside the `!lst|meta|` namespace, so they do not affect the metadata-only read path.
> - However, `fetchListRange()` must skip logically deleted items. To do that, it should **range-scan Claim keys in bulk** for the candidate sequence interval being materialized, then filter claimed sequences in memory while assembling the result.
> - This means Claim keys do introduce bounded read amplification for list reads, but it is **one additional range scan per fetched window**, not one extra point lookup per item.
> - Background Compaction keeps this bounded by deleting Claim keys whose sequence is below the effective Head and by collapsing old deltas.
>
> So the intended guarantee is: accumulated Claim keys do not affect metadata scans, but they can affect `fetchListRange()` until compaction removes obsolete claims.
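As a rough illustration of the one-scan-per-window idea above; every helper name here (`scanAt`, `getAt`, `listClaimKey`, `listItemKey`, `sequenceFromClaimKey`) is an assumption standing in for the project's internals, not the actual elastickv API:

```go
// fetchWindowSkippingClaims materializes the unclaimed items in [startSeq, endSeq)
// with a single claim-key range scan for the whole window, rather than one
// point lookup per item.
func fetchWindowSkippingClaims(ctx context.Context, userKey []byte, startSeq, endSeq int64, readTS uint64) ([][]byte, error) {
	// 1. One bulk range scan over the claim keys covering the window.
	claims, err := scanAt(ctx, listClaimKey(userKey, startSeq), listClaimKey(userKey, endSeq), readTS)
	if err != nil {
		return nil, err
	}
	claimed := make(map[int64]struct{}, len(claims))
	for _, c := range claims {
		claimed[sequenceFromClaimKey(c.Key)] = struct{}{}
	}

	// 2. Assemble the result, filtering claimed sequences in memory.
	var out [][]byte
	for seq := startSeq; seq < endSeq; seq++ {
		if _, skip := claimed[seq]; skip {
			continue // logically popped; do not return it
		}
		val, ok, err := getAt(ctx, listItemKey(userKey, seq), readTS)
		if err != nil {
			return nil, err
		}
		if ok {
			out = append(out, val)
		}
	}
	return out, nil
}
```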
docs/list_metadata_delta_design.md
Outdated
> - Scan the entire `!lst|meta|d|` prefix every `scanInterval`.
> - If the number of Deltas for a `userKey` exceeds `maxDeltaCount`, mark it for compaction.
> - Compaction is performed as a transaction (`IsTxn: true`), protecting the base metadata read via the `readSet` (using OCC to prevent concurrent compaction conflicts).
“Scan the entire !lst|meta|d| prefix every scanInterval” has unbounded cost as Delta volume grows (O(total deltas) work every interval), which could become a background read hotspot. Consider documenting a more scalable trigger mechanism (e.g., incremental cursored scan, per-shard partitioning, maintaining per-list delta counters, or sampling) so operators don’t face a sudden compactor-induced load spike at scale.
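One possible shape for a cheaper trigger, sketched with an in-memory per-list counter; none of these names exist in the codebase, and such counters are advisory (lost on restart), so an occasional cursored full scan would still be needed as a safety net:

```go
// deltaCounter is a hypothetical trigger: the write path bumps a per-list
// counter on every Delta write, and the compactor drains only the lists whose
// counter crossed the threshold instead of rescanning the whole !lst|meta|d|
// prefix each interval.
type deltaCounter struct {
	mu     sync.Mutex
	counts map[string]int
}

func (c *deltaCounter) OnDeltaWrite(userKey []byte) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.counts == nil {
		c.counts = make(map[string]int)
	}
	c.counts[string(userKey)]++
}

// DrainHot returns and resets the keys whose pending delta count reached limit.
func (c *deltaCounter) DrainHot(limit int) [][]byte {
	c.mu.Lock()
	defer c.mu.Unlock()
	var hot [][]byte
	for k, n := range c.counts {
		if n >= limit {
			hot = append(hot, []byte(k))
			delete(c.counts, k)
		}
	}
	return hot
}
```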
@copilot apply changes based on the comments in this thread
Applied all four review comments in commit 69f4220:
@copilot apply changes based on the comments in #485 (review)
Applied all changes from gemini review 4093919615 in commit cf57136:
/gemini review
Code Review
This pull request introduces a design for conflict-free list operations using a Delta pattern for metadata updates and a Claim mechanism for concurrent pops. The review feedback highlights several critical areas for refinement: ensuring PUSH operations clear stale claim keys to prevent list corruption, ensuring the compactor only advances the list Head through contiguous claimed sequences to avoid data loss, and explicitly defining the RPOP flow and garbage collection logic for the Tail of the list to support symmetric operations and prevent resource leaks.
docs/list_metadata_delta_design.md
Outdated
```
1. Read !lst|meta|<key>                       ← Necessary (for seq calculation), but NOT registered in readSet
2. Scan !lst|meta|d|<key>                     ← Read unapplied deltas to recalculate head/len
3. Put !lst|itm|<key><seq> ...                ← Item write (unique key)
4. Put !lst|meta|d|<key><commitTS><seqInTxn>  ← Delta write (unique key)
```
The PUSH flow should explicitly include a deletion of the claim key for the sequence being occupied. Since sequences are recycled after POP operations, a stale claim key might exist for the sequence being reused. If not removed during the PUSH operation, a subsequent POP will see the stale claim and incorrectly skip the item, leading to orphaned items and list corruption. Additionally, as this design involves a production-grade system migration, the plan must detail strategies for live or zero-downtime cutovers (e.g., dual-write proxy) to avoid service interruption.
References
- When designing a production-grade system migration, the plan must consider and detail strategies for live or zero-downtime cutovers (e.g., blue/green, dual-write proxy, logical replication) to avoid service interruption.
docs/list_metadata_delta_design.md
Outdated
```
3. Within the compaction transaction (bounded to at most `maxKeysPerCompactionTx` deletions
   to avoid Raft proposal timeouts or LSM performance issues; suggested default: 256,
   chosen to keep proposal sizes well under the typical 1 MiB Raft entry limit):
   - Advance the base meta Head by the number of claimed items.
```
When advancing the base metadata Head during compaction, the compactor must only proceed through contiguous claimed sequences starting from the current Head. Advancing the Head past an unclaimed sequence (a gap) would logically delete items that have not been popped, causing data loss. Since this identifies a significant operational risk, the design must also detail potential mitigation strategies, such as implementing a temporary "bridge" or "proxy" mode, to ensure safety during rolling upgrades.
References
- When a design document identifies a significant operational risk, such as the inability to perform rolling upgrades, it must also detail potential mitigation strategies, like implementing a temporary "bridge" or "proxy" mode.
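In code form, the contiguity rule might look like the following sketch, with claimed sequences modeled as a set (the real compactor would derive it from a claim-key range scan; the function name is hypothetical):

```go
// advanceHeadContiguously consumes only the contiguous run of claimed
// sequences starting at head. Stopping at the first gap ensures the compactor
// never moves Head past an item that has not actually been popped.
func advanceHeadContiguously(head int64, claimed map[int64]bool) int64 {
	for claimed[head] {
		head++
	}
	return head
}
```

For example, with `head = 5` and claims for sequences 5, 6, and 8, the new head is 7; sequence 8 is not folded in until 7 is also claimed.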
docs/list_metadata_delta_design.md
Outdated
> #### 6.2. POP Claim Flow (LPOP example)
>
> For LPOP:
The design should explicitly define the RPOP claim flow. Unlike LPOP which searches forward from Head, RPOP must search backward from Tail - 1. This requires a reverse range scan for claims within the window, and the logic for picking the candidate sequence differs from the LPOP example provided. Clarifying this will prevent implementation errors regarding sequence selection.
docs/list_metadata_delta_design.md
Outdated
```
1. Determine the base meta Head for the target userKey.
2. Claim keys with a sequence less than Head are no longer needed (Head has already passed them).
```
Garbage collection of claim keys should also handle the right side of the list. Items and claims with a sequence greater than or equal to the effective Tail (resulting from RPOP operations) are obsolete and should be cleaned up by the compactor, similar to how sequences less than Head are handled. Restricting GC to only sequences less than Head will lead to resource leaks for RPOP-heavy workloads.
…review

- Address PR #485 gemini review (4094524421):
  - Add stale claim key cleanup in PUSH write path
  - Enforce contiguous-only Head/Tail advancement in compaction
  - Add explicit RPOP claim flow (reverse scan from Tail-1)
  - Add Tail-side GC for RPOP claim keys
- Extend delta metadata pattern to Hash, Set, and ZSet:
  - Hash: wide-column decomposition + delta
  - Set: wide-column decomposition + delta
  - ZSet: delta on existing wide-column (PR #483)
  - Unified compactor, shared delta limits, transition plan per type
- Rename list_metadata_delta_design.md to collection_metadata_delta_design.md
/gemini review
Code Review
This pull request introduces a design for a Delta-based metadata pattern for Redis collections (List, Hash, Set, and ZSet) to mitigate write-write conflicts. The design incorporates wide-column decomposition and a claim mechanism for List operations. Feedback identifies critical issues regarding binary safety in key layouts, potential race conditions during the deletion of base metadata in background compaction, and missing truncation checks in the List metadata resolution logic.
> > **Note on separator**: The null byte (`\x00`) between `userKey` and the fixed-length suffix prevents prefix-collision bugs where scanning for `userKey = "foo"` would incorrectly include keys for `userKey = "foobar"`.
>
> - `commitTS` is an 8-byte big-endian timestamp pinned by the coordinator before the Delta key is generated (via `kv.OperationGroup.CommitTS` during dispatch), then carried through Raft and used unchanged at apply time.
> - `seqInTxn` is a 4-byte big-endian sequence number within the same transaction (needed if `LPUSH` is called multiple times for the same key in one `MULTI/EXEC`).
> - Since all Delta keys for a `userKey` share the prefix `!lst|meta|d|<userKey>`, they are physically contiguous in the LSM tree, allowing for fast Prefix Scans.
The use of a null byte (\x00) separator for List Delta keys is inconsistent with the length-prefix approach (userKeyLen) adopted for Hash, Set, and ZSet types. While a null byte prevents collisions for standard strings, it is insufficient for binary-safe keys because the key itself can contain null bytes. Furthermore, changing the key format presents a significant operational risk for rolling upgrades and requires a clear migration strategy. The design must detail mitigation strategies (e.g., a 'bridge' mode) and a plan for live, zero-downtime cutover (e.g., dual-write proxy or logical replication) to avoid service interruption during the transition.
References
- When a design document identifies a significant operational risk, such as the inability to perform rolling upgrades, it must also detail potential mitigation strategies, like implementing a temporary "bridge" or "proxy" mode.
- When designing a production-grade system migration, the plan must consider and detail strategies for live or zero-downtime cutovers (e.g., blue/green, dual-write proxy, logical replication) to avoid service interruption.
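For illustration only, a length-prefixed variant of the List Delta key along the lines described above might look like this; the field widths mirror the design doc, but the function name is hypothetical and this is a sketch rather than the final layout:

```go
// ListMetaDeltaKeyLP sketches a length-prefixed Delta key:
//   <prefix><userKeyLen(4)><userKey><commitTS(8)><seqInTxn(4)>
// The 4-byte length makes the boundary between userKey and the fixed-length
// suffix unambiguous even when userKey itself contains null bytes.
func ListMetaDeltaKeyLP(userKey []byte, commitTS uint64, seqInTxn uint32) []byte {
	buf := make([]byte, 0, len(ListMetaDeltaPrefix)+4+len(userKey)+8+4)
	buf = append(buf, ListMetaDeltaPrefix...)
	buf = binary.BigEndian.AppendUint32(buf, uint32(len(userKey)))
	buf = append(buf, userKey...)
	buf = binary.BigEndian.AppendUint64(buf, commitTS)
	buf = binary.BigEndian.AppendUint32(buf, seqInTxn)
	return buf
}
```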
> For all collection types, empty collection deletion is deferred to Background Compaction (same reasoning as List Section 11.3):
>
> - Immediate deletion would require writing to the base metadata, risking inconsistency with concurrent Delta writes.
> - When the compactor detects `Len == 0`, it atomically deletes all keys for that collection (base metadata, deltas, data keys, and for List: claim keys).
Deleting the baseMeta and data keys (items, fields, members) during background compaction when Len == 0 is dangerous in this lock-free design. Since writers do not register the baseMeta key in their readSet, a concurrent pusher might calculate a sequence number based on the existing metadata and unapplied deltas. If the compactor deletes the baseMeta before the pusher commits, the next pusher will see no metadata and start sequence numbering from zero, leading to sequence collisions or a corrupted collection structure. This represents a significant operational risk; the design must detail mitigation strategies to maintain correctness, such as updating the baseMeta to reflect Len=0 rather than deleting the key entirely, or implementing a safety bridge for concurrent operations.
References
- When a design document identifies a significant operational risk, such as the inability to perform rolling upgrades, it must also detail potential mitigation strategies, like implementing a temporary "bridge" or "proxy" mode.
```go
deltas, err := r.store.ScanAt(ctx, prefix, prefixScanEnd(prefix), maxDeltaScanLimit, readTS)
if err != nil {
	return ListMeta{}, false, err
}

// 3. Aggregate
```
The resolveListMeta pseudocode is missing the truncation check for maxDeltaScanLimit. As noted in Section 4 and implemented in the Hash version (Section 12), the function must return an error if len(deltas) == maxDeltaScanLimit to prevent returning incorrect metadata when deltas are truncated.
Suggested change:

```go
deltas, err := r.store.ScanAt(ctx, prefix, prefixScanEnd(prefix), maxDeltaScanLimit, readTS)
if err != nil {
	return ListMeta{}, false, err
}
if len(deltas) == maxDeltaScanLimit {
	return ListMeta{}, false, ErrDeltaScanTruncated
}

// 3. Aggregate
```
/gemini review
Code Review
This pull request introduces a design for a Delta-based metadata management system for Redis collections, aimed at eliminating write conflicts during concurrent operations across List, Hash, Set, and ZSet types. The design utilizes wide-column decomposition and a Claim mechanism to ensure conflict-free throughput. Review feedback identifies several critical issues, including potential livelocks during synchronous compaction, the risk of Delta key collisions if commitTS uniqueness is not guaranteed, and concerns regarding Redis compatibility when using tombstones for empty collections. Additionally, the reviewer suggested optimizing performance for sequence checks via range scans and requested more detailed strategies for zero-downtime migrations.
> When the number of Deltas is small (< 100), the cost of a Prefix Scan is negligible. Since Delta keys are physically contiguous in the LSM tree, I/O can be performed in a single sequential read.
>
> **`maxDeltaScanLimit` overflow**: If the number of unapplied Deltas exceeds `maxDeltaScanLimit`, `resolveListMeta` cannot aggregate them all in a single scan pass, which would produce an incorrect `ListMeta`. To preserve correctness, `resolveListMeta` must return an error when the scan result is truncated (i.e., when `len(deltas) == maxDeltaScanLimit`). The caller should then either surface the error or trigger an immediate synchronous compaction before retrying. This behaviour is the enforcement backstop for the hard-limit policy described in Section 11.1.
Triggering an immediate synchronous compaction when maxDeltaScanLimit is reached poses a risk of livelock under high concurrency. Since multiple writers hitting the limit will all attempt to compact the same metadata key, they will likely conflict with each other on the baseMeta write. As this represents a significant operational risk, the design must detail potential mitigation strategies, such as serializing these compaction attempts or implementing a backoff strategy, to ensure system progress.
References
- When a design document identifies a significant operational risk, such as the inability to perform rolling upgrades, it must also detail potential mitigation strategies, like implementing a temporary "bridge" or "proxy" mode.
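One way to serialize those compaction attempts, sketched with `golang.org/x/sync/singleflight` so that writers hitting the limit at the same time share a single compaction rather than racing on the baseMeta write; the `compactList` call is a placeholder for the project's compaction transaction:

```go
import "golang.org/x/sync/singleflight"

// compactGroup collapses concurrent "scan was truncated, compact now" requests
// for the same list into one compaction attempt.
var compactGroup singleflight.Group

func compactOnce(ctx context.Context, userKey []byte) error {
	_, err, _ := compactGroup.Do(string(userKey), func() (any, error) {
		return nil, compactList(ctx, userKey) // assumed compaction transaction
	})
	return err
}
```

Writers that did not win the flight would still back off briefly (ideally with jitter) before retrying their read, since the shared compaction may itself need OCC retries.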
> - Immediate deletion would require writing to the base metadata, risking inconsistency with concurrent Delta writes.
> - When the compactor detects `Len == 0`, it performs the following in a single transaction:
>   1. **Update** the base metadata to `Len = 0` (and for List: `Head`/`Tail` reflecting the final state). The base metadata key is **not deleted** — it is retained as a tombstone to prevent concurrent writers from misinterpreting a missing metadata key as a fresh collection and restarting sequence numbering from zero.
Retaining the base metadata key as a tombstone with Len = 0 deviates from standard Redis behavior. Furthermore, the 'restart from zero' approach mentioned here must be evaluated against production migration requirements. For a production-grade system migration, the plan must detail strategies for live or zero-downtime cutovers (e.g., blue/green or dual-write proxy) to avoid service interruption, rather than relying on clearing state.
References
- When designing a production-grade system migration, the plan must consider and detail strategies for live or zero-downtime cutovers (e.g., blue/green, dual-write proxy, logical replication) to avoid service interruption.
> - `userKeyLen` is a 4-byte big-endian length prefix that unambiguously separates `userKey` from the fixed-length suffix, even when `userKey` contains arbitrary bytes (including null bytes). This is the same approach used by Hash, Set, and ZSet key layouts, ensuring consistency across all collection types and full binary safety.
> - `commitTS` is an 8-byte big-endian timestamp pinned by the coordinator before the Delta key is generated (via `kv.OperationGroup.CommitTS` during dispatch), then carried through Raft and used unchanged at apply time.
The design relies on commitTS being part of the Delta key to avoid write conflicts. However, if multiple coordinators can concurrently process requests for the same userKey, there is a risk of Delta key collisions if they happen to allocate the same commitTS. While HLCs generally provide uniqueness, the document should explicitly state the requirement for commitTS to be globally unique per transaction (or at least per userKey) to prevent silent overwrites of Delta entries.
```
3. For each target sequence, check for stale Claim keys:
   - Scan !lst|claim|<keyLen><key>[seq …] for sequences being written
   - If a stale Claim key exists (left over from a prior POP on a recycled sequence),
     emit a Del for that Claim key in the same transaction
```
The "New Flow" for RPUSH mentions checking for stale Claim keys for each target sequence. To maintain high throughput, this check should be implemented as a single range scan covering all sequences in the batch (e.g., [Tail, Tail + numItems)) rather than individual point lookups or scans per item. This minimizes the number of round-trips to the storage engine.
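A sketch of that single range scan over the batch's sequence window, using the `ListClaimKey` helper from the design doc; `scanAt` and its signature, as well as the function name, are assumptions:

```go
// staleClaimDeletes finds claim keys left over in [tail, tail+numItems), the
// sequence range an RPUSH batch is about to occupy, with one range scan for
// the whole batch instead of a lookup per item.
func staleClaimDeletes(ctx context.Context, userKey []byte, tail, numItems int64, readTS uint64) ([][]byte, error) {
	start := ListClaimKey(userKey, tail)        // inclusive lower bound
	end := ListClaimKey(userKey, tail+numItems) // exclusive upper bound
	claims, err := scanAt(ctx, start, end, readTS)
	if err != nil {
		return nil, err
	}
	dels := make([][]byte, 0, len(claims))
	for _, c := range claims {
		dels = append(dels, c.Key) // each becomes a Del op in the same PUSH transaction
	}
	return dels, nil
}
```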
/gemini review
Code Review
This pull request outlines a design for implementing a Delta pattern across Redis collection types to mitigate write conflicts and improve concurrency. The design includes wide-column decomposition for Hash and Set types and a CAS-based Claim mechanism for List operations. Review feedback suggests refining the LPOP and RPOP claim loops with explicit boundary checks to ensure termination and addressing a potential race condition in the background compactor's bulk deletion logic for Hash, Set, and ZSet data keys.
```
- Put !lst|claim|<keyLen><key><candidateSeq> → {claimerTS}   (Write Claim)
- Put !lst|meta|d|<keyLen><key><commitTS><seqInTxn(4)> → {HeadDelta: +1, LenDelta: -1}
- Commit via dispatchElems()
If no candidate found in window: advance window and repeat from step 3
```
The LPOP claim loop should explicitly check the candidateSeq against meta.Tail to ensure it does not scan beyond the current logical end of the list or enter an infinite loop if all items in the range are claimed.
Suggested change:

```
If no candidate found in window: advance window and repeat from step 3 (stop if candidateSeq >= meta.Tail)
```
```
- Put !lst|claim|<keyLen><key><candidateSeq> → {claimerTS}   (Write Claim)
- Put !lst|meta|d|<keyLen><key><commitTS><seqInTxn(4)> → {HeadDelta: 0, LenDelta: -1}
- Commit via dispatchElems()
If no candidate found in window: retreat window and repeat from step 3
```
Similarly to LPOP, the RPOP claim loop should check candidateSeq against meta.Head to ensure it terminates if no unclaimed items are found within the logical bounds of the list.
Suggested change:

```
If no candidate found in window: retreat window and repeat from step 3 (stop if candidateSeq < meta.Head)
```
> - When the compactor detects `Len == 0`, it performs the following in a single transaction:
>   1. **Update** the base metadata to `Len = 0` (and for List: `Head`/`Tail` reflecting the final state). The base metadata key is **not deleted** — it is retained as a tombstone to prevent concurrent writers from misinterpreting a missing metadata key as a fresh collection and restarting sequence numbering from zero.
>   2. Delete all applied Delta keys.
>   3. Delete all data keys (items for List, fields for Hash, members for Set, member+score keys for ZSet).
Bulk deletion of data keys for Hash, Set, and ZSet during background compaction introduces a race condition. Since writers (HSET, SADD, ZADD) do not register the base metadata in their readSet, the compactor could perform a prefix delete that removes keys added by concurrent transactions that have not yet been merged into the base metadata. For these types, it is safer to rely on the individual mutation operations (HDEL, SREM, ZREM) which already handle data key deletion.
Suggested change:

> 3. For List: delete corresponding Item keys for the contiguous range only. For Hash, Set, and ZSet, data keys are already deleted by mutation operations (HDEL/SREM/ZREM); the compactor should avoid bulk prefix deletion to prevent races with concurrent writers.
/gemini review
Code Review
This pull request introduces a comprehensive design for a Delta-based metadata pattern across all Redis collection types to eliminate write conflicts during concurrent operations. The proposal includes wide-column decomposition for Hash and Set types and a CAS-based claim mechanism for List POP operations to ensure conflict-free throughput. The review feedback identifies critical integer overflow vulnerabilities in key extraction functions and points out inconsistencies regarding data key cleanup for empty collections. Additionally, the feedback emphasizes the need for a detailed migration and zero-downtime cutover strategy to mitigate operational risks associated with changes to the base metadata key format.
```
Base Metadata (Existing):
  !lst|meta|<userKeyLen(4)><userKey> → [Head(8)][Tail(8)][Len(8)]
```
The document labels the base metadata as (Existing) but describes a new key format !lst|meta|<userKeyLen(4)><userKey>. The current implementation in store/list_helpers.go uses !lst|meta|<userKey>. This change in key format introduces a significant operational risk for rolling upgrades. Per repository rules, please detail potential mitigation strategies (e.g., a temporary "bridge" or "proxy" mode) and strategies for zero-downtime cutover (e.g., blue/green, dual-write proxy, or logical replication) to avoid service interruption during the migration.
References
- When a design document identifies a significant operational risk, such as the inability to perform rolling upgrades, it must also detail potential mitigation strategies, like implementing a temporary 'bridge' or 'proxy' mode.
- When designing a production-grade system migration, the plan must consider and detail strategies for live or zero-downtime cutovers (e.g., blue/green, dual-write proxy, logical replication) to avoid service interruption.
```go
func ExtractListUserKeyFromDelta(key []byte) []byte {
	trimmed := bytes.TrimPrefix(key, []byte(ListMetaDeltaPrefix))
	if len(trimmed) < 4+8+4 { // 4(userKeyLen) + 8(commitTS) + 4(seqInTxn)
		return nil
	}
	ukLen := binary.BigEndian.Uint32(trimmed[:4])
	if uint32(len(trimmed)) < 4+ukLen+8+4 {
		return nil
	}
	return trimmed[4 : 4+ukLen]
}
```
The length check if uint32(len(trimmed)) < 4+ukLen+8+4 is vulnerable to an integer overflow if ukLen is a very large uint32. This can lead to a panic due to an out-of-bounds slice access. A safer way to perform this check is to avoid arithmetic that can overflow.
Suggested change:

```go
func ExtractListUserKeyFromDelta(key []byte) []byte {
	trimmed := bytes.TrimPrefix(key, []byte(ListMetaDeltaPrefix))
	const suffixLen = 8 + 4 // commitTS + seqInTxn
	if len(trimmed) < 4+suffixLen { // 4(userKeyLen) + suffix
		return nil
	}
	ukLen := binary.BigEndian.Uint32(trimmed[:4])
	// Check if the declared userKey length would exceed the available buffer length.
	if ukLen > uint32(len(trimmed)-(4+suffixLen)) {
		return nil
	}
	return trimmed[4 : 4+ukLen]
}
```
```go
func ExtractListUserKeyFromClaim(key []byte) []byte {
	trimmed := bytes.TrimPrefix(key, []byte(ListClaimPrefix))
	if len(trimmed) < 4+8 { // 4(userKeyLen) + 8(seq)
		return nil
	}
	ukLen := binary.BigEndian.Uint32(trimmed[:4])
	if uint32(len(trimmed)) < 4+ukLen+8 {
		return nil
	}
	return trimmed[4 : 4+ukLen]
}
```
Similar to ExtractListUserKeyFromDelta, the length check if uint32(len(trimmed)) < 4+ukLen+8 is vulnerable to an integer overflow if ukLen is large. This can cause a panic. The check should be rewritten to be safe against overflows.
Suggested change:

```go
func ExtractListUserKeyFromClaim(key []byte) []byte {
	trimmed := bytes.TrimPrefix(key, []byte(ListClaimPrefix))
	const suffixLen = 8 // seq
	if len(trimmed) < 4+suffixLen { // 4(userKeyLen) + suffix
		return nil
	}
	ukLen := binary.BigEndian.Uint32(trimmed[:4])
	// Check if the declared userKey length would exceed the available buffer length.
	if ukLen > uint32(len(trimmed)-(4+suffixLen)) {
		return nil
	}
	return trimmed[4 : 4+ukLen]
}
```
> - Before deleting Deltas, the worker ensures their `commitTS` is older than `ActiveTimestampTracker.Oldest()` to avoid breaking in-flight reads.
> - Deltas within the MVCC retention window are not deleted to guarantee consistency for historical reads.
The conditions for deleting deltas during compaction could be stated more clearly. The document lists two conditions: being older than ActiveTimestampTracker.Oldest() and being outside the MVCC retention window. It would be clearer to state that a delta is eligible for deletion only if its commitTS is older than both the oldest active read timestamp and the MVCC retention window boundary. For example: "To be eligible for deletion, a Delta's commitTS must be older than both ActiveTimestampTracker.Oldest() and the MVCC retention window boundary (now - retention_window)."
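Stated as a predicate, under the assumption that both boundary timestamps are available to the compactor (the names here are placeholders):

```go
// deltaEligibleForDeletion combines both guards: a delta may be dropped only
// when neither an in-flight read nor a historical read inside the MVCC
// retention window could still need it.
func deltaEligibleForDeletion(commitTS, oldestActiveReadTS, retentionBoundaryTS uint64) bool {
	return commitTS < oldestActiveReadTS && commitTS < retentionBoundaryTS
}
```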
> Immediate cleanup will not occur even if `Len == 0` after aggregating Deltas. When Background Compaction detects `Len == 0`:
>
> - The base metadata key is **updated** to reflect `Len = 0` (not deleted). This is critical because writers do not register the base metadata in their `readSet`. If the key were deleted, a concurrent writer would see no metadata and incorrectly assume a fresh collection, restarting sequence numbering from zero and causing sequence collisions or data corruption.
> - All applied Deltas, data keys (items, fields, members, score keys), and Claim keys (for List) are deleted.
There's an inconsistency in the description of empty collection cleanup. This line states that "All applied Deltas, data keys (items, fields, members, score keys), and Claim keys (for List) are deleted" when Len == 0. However, other sections (e.g., Section 13 for Hashes) correctly state that data keys for Hash, Set, and ZSet are not deleted by the compactor to avoid race conditions. Please clarify this statement to reflect that data keys are only deleted for Lists during compaction. The more detailed explanation in Section 28 is correct and should be reflected here for consistency.
/gemini review
Code Review
This pull request introduces a comprehensive design for a Delta + Claim pattern to resolve write conflicts on Redis collection metadata (List, Hash, Set, and ZSet). The design includes wide-column decomposition for Hash and Set types, a background compaction mechanism, and a detailed transition plan. Feedback focuses on correcting internal section references regarding delta limits, addressing a potential race condition in the lazy tombstone cleanup strategy for empty collections, and suggesting an additional helper function for user key extraction to maintain consistency with the new metadata format.
```go
	}
	return trimmed[4 : 4+ukLen]
}
```
The design introduces a new base metadata key format with a length prefix (!lst|meta|<userKeyLen(4)>). To maintain consistency and support the migration, an updated helper function for extracting the user key from the base metadata should be included in Section 8. Furthermore, as this is a production-grade system migration, the plan must detail strategies for live or zero-downtime cutovers (e.g., blue/green, dual-write proxy, or logical replication) to avoid service interruption.
Suggested change:

```go
func ExtractListUserKeyFromBaseMeta(key []byte) []byte {
	trimmed := bytes.TrimPrefix(key, []byte(ListMetaPrefix))
	if len(trimmed) < 4 {
		return nil
	}
	ukLen := binary.BigEndian.Uint32(trimmed[:4])
	if uint32(len(trimmed)) < 4+ukLen {
		return nil
	}
	return trimmed[4 : 4+ukLen]
}
```
References
- When designing a production-grade system migration, the plan must consider and detail strategies for live or zero-downtime cutovers (e.g., blue/green, dual-write proxy, logical replication) to avoid service interruption.
> - `DEL key` / `UNLINK key`: performs full metadata deletion (including the base metadata key) via the standard transactional `DEL` flow with proper `readSet` registration. This is the only path that removes the base metadata key.
> - On first write to a `Len = 0` tombstone (e.g., `RPUSH` to an empty list): the writer sees the existing metadata with `Len = 0` and correctly resumes sequence numbering from the current `Head`/`Tail`, avoiding the sequence-collision problem that would occur if the key had been fully deleted.
>
> **Lazy tombstone cleanup**: To prevent indefinite accumulation of `Len = 0` tombstone metadata keys, the Background Compactor tracks how long a key has remained at `Len = 0`. If a tombstone persists beyond a configurable TTL (default: 24 hours) with no new writes, the compactor may perform a full deletion as a leader-only operation with `readSet` registration, behaving like an implicit `DEL`.
There is a potential race condition in the "Lazy tombstone cleanup" strategy. Since writers do not register the base metadata key in their readSet, a concurrent writer could start an operation after the compactor has decided to delete the tombstone but before the deletion is committed, leading to sequence collisions. This represents a significant operational risk. The design must detail potential mitigation strategies, such as implementing a temporary "bridge" or "proxy" mode, to ensure that no writers are active or to prevent sequence reuse if the metadata key is fully removed.
References
- When a design document identifies a significant operational risk, such as the inability to perform rolling upgrades, it must also detail potential mitigation strategies, like implementing a temporary "bridge" or "proxy" mode.
> When the number of Deltas is small (< 100), the cost of a Prefix Scan is negligible. Since Delta keys are physically contiguous in the LSM tree, I/O can be performed in a single sequential read.
>
> **`maxDeltaScanLimit` overflow**: If the number of unapplied Deltas exceeds `maxDeltaScanLimit`, `resolveListMeta` cannot aggregate them all in a single scan pass, which would produce an incorrect `ListMeta`. To preserve correctness, `resolveListMeta` must return an error when the scan result is truncated (i.e., when `len(deltas) == maxDeltaScanLimit`). The caller should then either surface the error or trigger an immediate synchronous compaction before retrying. This behaviour is the enforcement backstop for the hard-limit policy described in Section 11.1.
>
> ### 27. Shared Delta Limits
>
> The delta accumulation limits from List Section 11.1 apply uniformly to all collection types:
/gemini review

/gemini review
Code Review
This pull request introduces a comprehensive design for a Delta + Claim pattern to resolve write conflicts on Redis collection metadata (List, Hash, Set, and ZSet). The design transitions Hash and Set to a wide-column format and uses individual Delta keys to avoid Read-Modify-Write (RMW) bottlenecks on a single metadata key. Feedback focuses on optimizing the scanWindow for LPOP operations to handle high-throughput scenarios, ensuring commitTS uniqueness during leader transitions, and addressing potential consistency issues in MULTI/EXEC transactions when metadata is read without readSet registration. Additionally, there are suggestions to refine the migration strategy for legacy data and to implement a lazy reaping mechanism for Len = 0 tombstones to prevent long-term metadata storage leaks.
```
1. resolveListMeta(key, readTS) → Effective meta (Determine Head, Len)
2. candidateSeq = meta.Head
3. Bulk-scan existing Claim keys in range [candidateSeq, candidateSeq+scanWindow):
   - scanWindow is a configurable constant (default: 32) that determines how many
```
A scanWindow of 32 might be too small for high-throughput workloads, especially if background compaction lags. If thousands of items are popped but not yet compacted, LPOP could require many sequential range scans to find the next available item, leading to significant latency spikes. Consider increasing the default window size (e.g., to 128 or 256) to reduce read amplification, as Pebble range scans are highly efficient for these sizes.
Suggested change:

```
   - scanWindow is a configurable constant (default: 128) that determines how many
```
```
a. First batch: write !hs|meta|<key> with Len, create first N field keys, keep legacy blob.
b. Subsequent batches: create next N field keys each.
c. Final batch: create remaining field keys, delete legacy !redis|hash|<key>.
- During batched migration, reads must merge wide-column fields with remaining legacy fields.
```
During the batched migration phase for Hash and Set, the design should explicitly define the behavior for concurrent writes (HSET, SADD) and deletes (HDEL, SREM). To ensure a zero-downtime cutover and avoid service interruption, the plan must detail strategies such as a dual-write proxy or logical replication. Specifically, writes should target the new wide-column format immediately, and the "merge" logic in reads must prioritize wide-column field keys over the legacy blob. Additionally, deletes must check both formats to avoid leaving orphaned data in the legacy blob.
References
- When designing a production-grade system migration, the plan must consider and detail strategies for live or zero-downtime cutovers (e.g., blue/green, dual-write proxy, logical replication) to avoid service interruption.
> > **Note on base metadata format**: The base metadata key retains the existing format `!lst|meta|<userKey>` without a length prefix. No migration of the base metadata key is required. Only the newly introduced Delta and Claim keys use the `userKeyLen(4)` length-prefix convention (consistent with Hash, Set, and ZSet). This avoids any ambiguity in key parsing during rolling upgrades — the base metadata prefix `!lst|meta|` is distinct from the Delta prefix `!lst|dlt|`, so there is no collision risk.
>
> - `userKeyLen` (in Delta/Claim keys only) is a 4-byte big-endian length prefix that unambiguously separates `userKey` from the fixed-length suffix, even when `userKey` contains arbitrary bytes (including null bytes).
> - `commitTS` is an 8-byte big-endian timestamp allocated by the **Raft group leader during dispatch** (before the Raft proposal is created). **`commitTS` must be globally unique per transaction** — the combination of `commitTS + seqInTxn` forms the Delta key's identity, and a collision would silently overwrite a prior Delta entry, causing metadata corruption. In this system, all write operations are routed to the leader via `proxyToLeader()` — the leader IS the coordinator for writes. The leader allocates `commitTS` from its HLC (Hybrid Logical Clock) during dispatch, embeds it in the Delta key bytes, and includes the pre-built keys in the Raft proposal. At apply time, the FSM uses these keys as-is without rewriting. Since Raft's safety property ensures at most one leader exists at a time, and the leader's HLC monotonically advances, `commitTS` uniqueness is guaranteed without requiring a node identifier.
To guarantee commitTS uniqueness across leader transitions, the design should explicitly state that a new leader must initialize its HLC to a value strictly greater than the maximum commitTS present in its Raft log. Without this initialization, a new leader with a lagging system clock could potentially issue duplicate timestamps for the same user key, leading to silent metadata corruption in the LSM tree due to key collisions.
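A minimal sketch of that invariant, assuming the HLC can be represented as a single uint64 and that the maximum commitTS in the applied Raft log is known at election time; all names here are illustrative, not the actual elastickv HLC:

```go
// hlc is a minimal stand-in for the real hybrid logical clock.
type hlc struct {
	mu   sync.Mutex
	last uint64
}

// forward moves the clock to at least ts; used when observing a higher timestamp.
func (c *hlc) forward(ts uint64) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if ts > c.last {
		c.last = ts
	}
}

// now returns a strictly increasing timestamp even if the wall clock lags.
func (c *hlc) now(wall uint64) uint64 {
	c.mu.Lock()
	defer c.mu.Unlock()
	if wall > c.last {
		c.last = wall
	} else {
		c.last++ // logical component keeps timestamps strictly increasing
	}
	return c.last
}

// onBecomeLeader enforces the invariant: before issuing any commitTS, a new
// leader forwards its HLC past the largest commitTS found in its applied log,
// so it can never reuse a timestamp an earlier leader embedded in a Delta key.
func (c *hlc) onBecomeLeader(maxCommitTSInLog uint64) {
	c.forward(maxCommitTSInLog + 1)
}
```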
```go
}
```

- In `buildListElems()`, replace metadata `Put` with Delta `Put`.
The integration with MULTI/EXEC should clarify the consistency guarantees for transactions that read metadata (e.g., HLEN, LLEN) and then perform conditional logic. Since the design avoids readSet registration on the base metadata to improve throughput, such transactions may observe "phantom" updates if concurrent writers add deltas. If strict serializability is required for these cases, the design may need to support optional read-set tracking for the delta range or the base metadata key.
> - `DEL key` / `UNLINK key`: performs full deletion via the standard transactional `DEL` flow with proper `readSet` registration. This must delete **all** associated keys: base metadata, all Delta keys (`!*|dlt|` / `!*|meta|d|` namespace), all Claim keys (for List), and all data keys (items, fields, members, score keys). Without cleaning up Delta keys, a subsequently recreated collection could aggregate stale deltas from a prior incarnation.
> - On first write to a `Len = 0` tombstone (e.g., `RPUSH` to an empty list): the writer sees the existing metadata with `Len = 0` and correctly resumes sequence numbering from the current `Head`/`Tail`, avoiding the sequence-collision problem that would occur if the key had been fully deleted.
>
> **No lazy tombstone cleanup**: The `Len = 0` tombstone is retained indefinitely until an explicit `DEL` command removes it. Automatic deletion of the base metadata key by the compactor is intentionally **not** performed, because it introduces an unresolvable race condition: since writers do not register the base metadata in their `readSet`, a concurrent writer could commit a new Delta between the compactor's "no deltas exist" check and its deletion of the base metadata. Registering the delta prefix scan in the compactor's `readSet` would not help because the writer's delta key is unique and was not yet visible at scan time. The storage overhead of retaining one tombstone key (24 bytes for List, 8 bytes for Hash/Set/ZSet) per empty collection is negligible compared to the data integrity risk of automatic deletion.
While retaining Len = 0 tombstones is necessary for correctness during concurrent operations, an indefinite accumulation of these keys for transient collections could lead to a slow leak in metadata storage. Consider a "lazy reaping" mechanism where tombstones older than the MVCC retention window with no associated deltas can be safely removed during a specialized compaction pass or a full DEL scan.
/gemini review

Warning: You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Agent-Logs-Url: https://github.com/bootjp/elastickv/sessions/05304128-8703-4e0a-a431-5f1c46fbf27f Co-authored-by: bootjp <1306365+bootjp@users.noreply.github.com>
Agent-Logs-Url: https://github.com/bootjp/elastickv/sessions/46c614ab-ff2b-4a09-a71c-ad4a48ea9194 Co-authored-by: bootjp <1306365+bootjp@users.noreply.github.com>
…empty collection safety, add truncation check

- Unify List delta/claim key format from null-byte separator to userKeyLen(4) length-prefix, consistent with Hash/Set/ZSet and binary-safe for keys containing null bytes
- Fix empty collection compaction: update baseMeta to Len=0 instead of deleting it, preventing sequence collision when concurrent writers see missing metadata
- Add missing maxDeltaScanLimit truncation check to resolveListMeta pseudocode, matching the Hash resolveHashMeta implementation
Force-pushed from 8d42195 to ab2555b.
- Increase LPOP scanWindow default from 32 to 128 to reduce read amplification under high-throughput POP workloads
- Document HLC leader initialization invariant ensuring commitTS uniqueness across leader transitions
- Clarify concurrent write behavior during Hash/Set legacy migration (atomic migration + OCC retry semantics)
- Document MULTI/EXEC consistency guarantees for metadata reads (snapshot isolation, optional strict serializability opt-in)
- Add lazy reaping mechanism for Len=0 tombstones to prevent metadata storage leak from transient collections
Add ListMetaDelta struct with fixed 16-byte binary marshal/unmarshal, Delta key helpers (ListMetaDeltaKey, ListMetaDeltaScanPrefix, IsListMetaDeltaKey, ExtractListUserKeyFromDelta), Claim key helpers (ListClaimKey, ListClaimScanPrefix, IsListClaimKey, ExtractListUserKeyFromClaim), and PrefixScanEnd utility. Fix IsListMetaKey to exclude Delta keys (both share the !lst|meta| prefix). Add comprehensive unit tests for all new helpers including ordering, prefix scan bounds, binary safety, and edge cases.
- Add resolveListMeta() that aggregates base metadata + unapplied deltas
- Add allocateCommitTS() and dispatchElemsWithCommitTS() for Delta key gen
- Modify buildRPushOps/buildLPushOps to emit Delta keys instead of base meta
- Switch listRPush/listLPush to use resolveListMeta + pre-allocated commitTS
- Update all read paths (rangeList, isListKeyAt, listValuesAt) to resolve deltas
- Update rawKeyTypeAt to detect delta-only lists via resolveListMeta
- Update deleteLogicalKeyElems to clean up delta and claim keys
- MULTI/EXEC buildListElems still writes base metadata (delta support deferred)

Known limitation: 2 MULTI/EXEC tests fail because the txn path writes base metadata while standalone ops write deltas, causing stale delta accumulation after DEL+RPUSH sequences. Full MULTI/EXEC delta support requires Phase L2 transaction integration.
…3 partial)" This reverts commit 31c53ba.
- Add resolveListMeta() that aggregates base metadata + unapplied deltas
- Add allocateCommitTS() and dispatchElemsWithCommitTS()
- Modify buildRPushOps/buildLPushOps to emit Delta keys
- Switch listRPush/listLPush to resolveListMeta + pre-allocated commitTS
- Switch all read paths to resolveListMeta: rangeList, listValuesAt, isListKeyAt, loadListState (lua context)
- Add delta-only list detection in rawKeyTypeAt (fallback after point reads)
- isListKeyAt uses fast two-stage check: point read then limit-1 scan
- Add listDeleteAllElems helper for cleanup of delta/claim keys on DEL
- MULTI/EXEC buildListElems still writes base metadata (delta deferred)
Implement a cursor-based incremental compactor that periodically scans the list delta key space and folds accumulated deltas into base metadata when the per-key delta count exceeds a configurable threshold. The compactor:

- Scans the !lst|meta|d| prefix incrementally (maxKeysPerTick per pass)
- Groups delta keys by user key and checks against maxDeltaCount
- Folds deltas into base metadata via a single OCC transaction
- Deletes all applied delta keys atomically
- Skips when not leader to avoid redundant work
- Configurable interval, threshold, and timeout

This enables the Delta pattern to run in production without unbounded delta accumulation degrading resolveListMeta performance.
[Data Loss] Review findings

Target PR: #485 (List Delta Compactor + delta write path)

🔴 Critical: a list can "resurrect" after DEL

Location:

```go
// Delete delta keys.
deltaPrefix := store.ListMetaDeltaScanPrefix(key)
deltas, derr := r.store.ScanAt(ctx, deltaPrefix, store.PrefixScanEnd(deltaPrefix), maxDeltaScanLimit, readTS)
// ↑ truncated at maxDeltaScanLimit = 10000 entries
for _, d := range deltas {
	elems = append(elems, &kv.Elem[kv.OP]{Op: kv.Del, Key: bytes.Clone(d.Key)})
}
// Claim keys are likewise truncated at 10000 entries
claims, cerr := r.store.ScanAt(ctx, claimPrefix, store.PrefixScanEnd(claimPrefix), maxDeltaScanLimit, readTS)
```

The DEL command deletes the base meta key. If DEL runs against a list that holds more than 10000 delta or claim keys, the base meta is removed but some delta keys are left behind as orphans. Afterwards:

```go
func (r *RedisServer) resolveListMeta(...) (store.ListMeta, bool, error) {
	baseMeta, exists, err := r.loadListMetaAt(...) // → exists = false (already deleted)
	deltas, err := r.store.ScanAt(...)             // → orphaned deltas are still present
	...
	return baseMeta, exists || len(deltas) > 0, nil // ← ends up true!
}
```

Result: the list that was supposed to be deleted "resurrects".

Trigger condition: sustained high-frequency PUSH while the compactor has been down for a long time. Accumulation is normally kept in check by the soft limit of 64 and the hard limit of 256 deltas, but when the compactor is unhealthy it grows without bound.

Suggested fix:

🟡
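The direction later commits in this thread describe (loop with batched scans until every delta and claim key is collected) could look roughly like this; `scanAt`, the batch element type, and `batchLimit` are assumptions standing in for the project's store API, while `PrefixScanEnd` matches the helper named above:

```go
// collectAllKeysForDelete scans a prefix in batches until exhaustion, so a DEL
// cannot silently stop at maxDeltaScanLimit and leave orphaned delta or claim
// keys behind.
func collectAllKeysForDelete(ctx context.Context, prefix []byte, readTS uint64) ([][]byte, error) {
	const batchLimit = 10000 // same order as maxDeltaScanLimit; illustrative only
	var keys [][]byte
	start := prefix
	end := PrefixScanEnd(prefix)
	for {
		batch, err := scanAt(ctx, start, end, batchLimit, readTS) // assumed MVCC range scan
		if err != nil {
			return nil, err
		}
		for _, kvp := range batch {
			keys = append(keys, kvp.Key)
		}
		if len(batch) < batchLimit {
			return keys, nil // final, non-truncated batch
		}
		// Resume strictly after the last key of the truncated batch.
		last := batch[len(batch)-1].Key
		start = append(append([]byte{}, last...), 0x00)
	}
}
```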
[Consistency] Review findings

Assessment: serious concerns.

Issue 1: Delta key generation that does not go through Raft — the design obtains a timestamp directly from the HLC (Hybrid Logical Clock) and generates the Delta key from it (…

Issue 2: …
…ase L5)

- TestConcurrentRPush_EventualSuccess: concurrent RPUSH with retries on item key conflicts verifies eventual success
- TestConcurrentLPush_EventualSuccess: same for LPUSH
- TestConcurrentRPush_ThenCompact: data integrity after compaction
- TestResolveListMeta_ScalesWithDeltaCount: correctness at 1/10/50/100 deltas

Item key conflicts are expected under concurrent PUSH since writers compute sequences from stale snapshots. The Delta pattern eliminates metadata key conflicts; item key conflicts require retry at the caller.
When buildListElems writes base metadata (MULTI/EXEC path), atomically delete all existing delta keys for that list. Without this, standalone RPUSH deltas accumulate and get double-counted by resolveListMeta after MULTI/EXEC writes base metadata containing the aggregated values. Also clean up delta and claim keys in DEL and purge paths within MULTI/EXEC transactions. This fixes a Jepsen-detected consistency violation (G-single-item-realtime, incompatible-order) where elements appeared to be lost due to incorrect metadata after mixed standalone/MULTI operations.
Address @claude Data Loss and Consistency reviews:

Data Loss fixes:
- listDeleteAllElems now loops with batched scans to delete ALL delta and claim keys, preventing truncation at maxDeltaScanLimit (10000) that could leave orphaned deltas causing list resurrection after DEL
- compactList documents that truncated scans compact what is available and let the next tick handle the remainder

Consistency documentation:
- allocateCommitTS: document that the timestamp flows through OperationGroup.CommitTS to the Coordinator which observes it in HLC, ensuring monotonicity and Raft-commit-order alignment
- resolveListMeta: document that both GetAt and ScanAt use the same MVCC readTS snapshot, guaranteeing point-in-time consistency
- ListDeltaCompactor: already uses Coordinator.Dispatch (Raft-committed)
[Data Loss] Review findings

PR #485: List Delta design implementation

Found one data-loss-related issue. [High]
Add a comprehensive status table covering all phases (L1-L5, H1-H4, S1-S4, Z1-Z4, X1-X2) with per-item completion status. This makes it easy to see what has been implemented and what remains.
Known issue: LLEN returns 4 after DEL+RPUSH in MULTI/EXEC due to old delta keys not being cleaned up. Full fix requires buildListElems delta integration on the leader node.
… handling

- buildListElems now writes delta keys instead of base metadata, eliminating the dual-path consistency issue (Jepsen G0-realtime)
- appendDeltaClaimDelOps returns errors instead of silently ignoring scan failures, preventing orphaned delta keys (addresses @claude Data Loss review)
- commit() pre-allocates commitTS for delta key generation
- Remove unused appendDeltaDelOps/listDeltaKeys

Known: TestRedis_MultiExec_DelThenRPushRecreatesList still fails (LLEN=4) due to old deltas not being cleaned up in the purge path. Root cause under investigation.
… cleanup issue) The MULTI/EXEC DEL+RPUSH test fails because the purge path does not clean up old delta keys from prior standalone RPUSH operations. This is a known limitation of the current delta integration that needs further investigation into the LeaderRoutedStore scan behavior during MULTI/EXEC processing.
Revert buildListElems to write base metadata (not delta) for MULTI/EXEC transactions. The delta-emit approach (0c33888) caused Jepsen G0-realtime anomalies because commitTS pre-allocation in commit() interacted poorly with the coordinator's timestamp assignment. The base-meta + delta-cleanup approach (4cf3a8e) passed Jepsen and correctly handles the dual-path by atomically deleting existing deltas when writing base metadata. Also re-enable TestRedis_MultiExec_DelThenRPushRecreatesList which now passes with the base-meta write path.
The root cause of the Jepsen G0-realtime anomaly: maxLatestCommitTS only checked base metadata keys (!lst|meta|<key>), but standalone RPUSH writes only delta keys (!lst|meta|d|...) and item keys. This caused txnStartTS to return a stale timestamp that predated recent standalone RPUSH commits. When MULTI/EXEC used this stale startTS, resolveListMeta missed recent deltas, leading to incorrect base metadata writes and delta cleanup that erased contributions from concurrent standalone RPUSHes. Fix: appendLatestDeltaKeys reverse-scans the delta key prefix for each list key in the transaction to find the most recent delta commit, ensuring startTS accounts for all standalone RPUSH operations.
No description provided.