storage: Explicitly block cloud GC if archival STM not present#29674
Closed
oleiman wants to merge 2 commits intoredpanda-data:devfrom
Closed
storage: Explicitly block cloud GC if archival STM not present#29674oleiman wants to merge 2 commits intoredpanda-data:devfrom
oleiman wants to merge 2 commits intoredpanda-data:devfrom
Conversation
Contributor
There was a problem hiding this comment.
Pull request overview
This PR prevents a critical data loss scenario that occurs when cloud_storage_enabled is changed from false to true at runtime without a restart. In this situation, cloud retention becomes active but the archival_metadata_stm (which tracks uploaded segments) was never instantiated, causing max_removable_local_log_offset() to return offset::max() and allowing eviction of data not yet uploaded to cloud storage.
Changes:
- Added
archivalenum value tostm_typeand implementedtype()method inarchival_metadata_stm - Added
has_archival_stm()method tostm_managerto check for archival STM presence - Added safety guards in
set_cloud_gc_offset()andget_reclaimable_offsets()to block operations when cloud retention is active but no archival STM is present
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| src/v/storage/types.h | Adds archival enum value to stm_type, adds has_archival_stm() method and _has_archival_stm tracking flag to stm_manager |
| src/v/cluster/archival/archival_metadata_stm.h | Implements type() override to return stm_type::archival |
| src/v/storage/disk_log_impl.cc | Adds safety guards in set_cloud_gc_offset() and get_reclaimable_offsets() to prevent eviction without archival STM |
| src/v/storage/tests/log_retention_test.cc | Adds comprehensive regression test simulating the runtime config change scenario |
If cloud retention is active but no archival STM is registered, it means cloud_storage_enabled was false at startup (so the archival_metadata_stm was never created) but was later changed to true at runtime. The config value propagates live but the STM requires a restart to be instantiated. Without the archival STM there is no safety clamp to prevent eviction of data that has not yet been uploaded to tiered storage, so refuse to report any reclaimable offsets to avoid data loss. Signed-off-by: Oren Leiman <oren.leiman@redpanda.com>
948f467 to
253aa50
Compare
If cloud retention is active but no archival STM is registered, it means cloud_storage_enabled was toggled on at runtime without a restart. The archival subsystem (and its STM) requires a restart to be instantiated. Applying aggressive local retention overrides without the archival STM could evict data that has never been uploaded to tiered storage, causing permanent data loss. Fall back to regular Kafka retention until the node is restarted. Signed-off-by: Oren Leiman <oren.leiman@redpanda.com>
253aa50 to
69bb4f7
Compare
Member
Author
|
/ci-repeat 1 |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
If cloud retention is active but no archival STM is registered, it means cloud_storage_enabled was false at startup (so the archival_metadata_stm was never created) but was later changed to true at runtime. The config value propagates live but the STM requires a restart to be instantiated. Without the archival STM there is no safety clamp to prevent eviction of data that has not yet been uploaded to tiered storage, so refuse to report any reclaimable offsets to avoid data loss.
Backports Required
Release Notes
Bug Fixes