
storage: Explicitly block cloud GC if archival STM not present#29674

Closed
oleiman wants to merge 2 commits into redpanda-data:dev from oleiman:ts/noticket/block-gc-if-no-archival


@oleiman
Member

@oleiman oleiman commented Feb 23, 2026

If cloud retention is active but no archival STM is registered, it means cloud_storage_enabled was false at startup (so the archival_metadata_stm was never created) but was later changed to true at runtime. The config value propagates live but the STM requires a restart to be instantiated. Without the archival STM there is no safety clamp to prevent eviction of data that has not yet been uploaded to tiered storage, so refuse to report any reclaimable offsets to avoid data loss.
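The guard described above can be pictured with a small model. This is a minimal sketch, not the code in disk_log_impl.cc: the `partition_state` struct, its fields, and `reclaimable_offset()` are hypothetical stand-ins for the inputs that `get_reclaimable_offsets()` consults. It assumes, for illustration, that only cloud-backed data is reclaimable on this path.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <optional>

// Hypothetical, simplified partition state; field names are illustrative
// and do not match Redpanda's disk_log_impl.
struct partition_state {
    bool cloud_retention_active;     // live config: tiered storage enabled
    bool has_archival_stm;           // fixed at startup when STMs are created
    int64_t uploaded_offset;         // highest offset known to be uploaded
    int64_t retention_target_offset; // offset retention would like to trim to
};

// Returns the offset up to which local data may be reclaimed, or
// std::nullopt if nothing may be reclaimed.
std::optional<int64_t> reclaimable_offset(const partition_state& p) {
    if (!p.cloud_retention_active) {
        // In this sketch, only data backed by cloud storage is reclaimable.
        return std::nullopt;
    }
    if (!p.has_archival_stm) {
        // Tiered storage was enabled at runtime without a restart: nothing
        // tracks what has been uploaded, so report nothing as reclaimable.
        return std::nullopt;
    }
    // Normal case: never reclaim past the last uploaded offset.
    return std::min(p.retention_target_offset, p.uploaded_offset);
}
```

With the archival STM present, a retention target of 80 is clamped to the upload point of 50; without it, the same inputs yield no reclaimable offset at all.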

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v25.3.x
  • v25.2.x
  • v25.1.x

Release Notes

Bug Fixes

  • Prevent a data loss scenario when tiered storage is partially initialized, i.e. when cloud_storage_enabled is changed to true at runtime without restarting the node.

@oleiman oleiman self-assigned this Feb 23, 2026
@oleiman oleiman marked this pull request as ready for review February 23, 2026 05:54
Copilot AI review requested due to automatic review settings February 23, 2026 05:54
@oleiman oleiman marked this pull request as draft February 23, 2026 05:55
Contributor

Copilot AI left a comment


Pull request overview

This PR prevents a critical data loss scenario that occurs when cloud_storage_enabled is changed from false to true at runtime without a restart. In this situation, cloud retention becomes active but the archival_metadata_stm (which tracks uploaded segments) was never instantiated, causing max_removable_local_log_offset() to return offset::max() and allowing eviction of data not yet uploaded to cloud storage.
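Why a missing STM means "no limit" can be shown with a minimal sketch. It assumes the removable offset is the minimum over every registered STM's `max_removable_local_log_offset()` and that no other STM imposes a tighter bound; the `stm` interface and `fake_archival_stm` below are hypothetical simplifications, not Redpanda's classes, and plain `int64_t` stands in for the offset type.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <limits>
#include <memory>
#include <vector>

// Hypothetical simplified STM interface (not Redpanda's actual class).
struct stm {
    virtual ~stm() = default;
    // Highest local offset this STM allows to be removed.
    virtual int64_t max_removable_local_log_offset() const = 0;
};

// Hypothetical archival STM: only allows removing data already uploaded.
struct fake_archival_stm final : stm {
    explicit fake_archival_stm(int64_t uploaded) : uploaded_(uploaded) {}
    int64_t max_removable_local_log_offset() const override { return uploaded_; }
    int64_t uploaded_;
};

// Minimum across all registered STMs. With no STMs registered, the result
// is the identity of min (the largest representable offset), which places
// no limit on eviction.
int64_t max_removable(const std::vector<std::unique_ptr<stm>>& stms) {
    int64_t result = std::numeric_limits<int64_t>::max();
    for (const auto& s : stms) {
        result = std::min(result, s->max_removable_local_log_offset());
    }
    return result;
}
```

An empty registry returns the maximum offset, which is the analogue of `offset::max()` in the overview; registering an archival STM that has uploaded up to offset 100 caps the result at 100.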

Changes:

  • Added archival enum value to stm_type and implemented type() method in archival_metadata_stm
  • Added has_archival_stm() method to stm_manager to check for archival STM presence
  • Added safety guards in set_cloud_gc_offset() and get_reclaimable_offsets() to block operations when cloud retention is active but no archival STM is present
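The first two changes can be sketched as follows. This is an illustrative model only: the real `stm_type` enum in src/v/storage/types.h has other members that are not reproduced here (the `other` value is a placeholder), and `stm_manager_sketch`, `archival_metadata_stm_sketch`, and `other_stm_sketch` are hypothetical names. It assumes the flag is recorded when an STM is registered, so the later check is a constant-time lookup.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <utility>
#include <vector>

enum class stm_type : uint8_t {
    other,    // placeholder for pre-existing members (assumption)
    archival, // the member this PR adds
};

struct stm {
    virtual ~stm() = default;
    virtual stm_type type() const = 0;
};

// Stands in for archival_metadata_stm's new type() override.
struct archival_metadata_stm_sketch final : stm {
    stm_type type() const override { return stm_type::archival; }
};

struct other_stm_sketch final : stm {
    stm_type type() const override { return stm_type::other; }
};

class stm_manager_sketch {
public:
    void add_stm(std::shared_ptr<stm> s) {
        // Record archival presence once, at registration time.
        if (s->type() == stm_type::archival) {
            _has_archival_stm = true;
        }
        _stms.push_back(std::move(s));
    }
    bool has_archival_stm() const { return _has_archival_stm; }

private:
    std::vector<std::shared_ptr<stm>> _stms;
    bool _has_archival_stm{false};
};
```

Because STMs are registered only at startup in this model, a manager built while cloud_storage_enabled was false keeps reporting `has_archival_stm() == false` until restart, which is exactly the condition the guards check.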

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.

Files changed:
  • src/v/storage/types.h: Adds the archival enum value to stm_type, and adds the has_archival_stm() method and _has_archival_stm tracking flag to stm_manager
  • src/v/cluster/archival/archival_metadata_stm.h: Implements the type() override to return stm_type::archival
  • src/v/storage/disk_log_impl.cc: Adds safety guards in set_cloud_gc_offset() and get_reclaimable_offsets() to prevent eviction without the archival STM
  • src/v/storage/tests/log_retention_test.cc: Adds a regression test simulating the runtime config change scenario

If cloud retention is active but no archival STM is registered, it means
cloud_storage_enabled was false at startup (so the archival_metadata_stm
was never created) but was later changed to true at runtime. The config
value propagates live but the STM requires a restart to be instantiated.
Without the archival STM there is no safety clamp to prevent eviction of
data that has not yet been uploaded to tiered storage, so refuse to
report any reclaimable offsets to avoid data loss.

Signed-off-by: Oren Leiman <oren.leiman@redpanda.com>
@oleiman oleiman force-pushed the ts/noticket/block-gc-if-no-archival branch from 948f467 to 253aa50 Compare February 23, 2026 06:46
If cloud retention is active but no archival STM is registered, it
means cloud_storage_enabled was toggled on at runtime without a
restart. The archival subsystem (and its STM) requires a restart to
be instantiated. Applying aggressive local retention overrides
without the archival STM could evict data that has never been
uploaded to tiered storage, causing permanent data loss. Fall back
to regular Kafka retention until the node is restarted.

Signed-off-by: Oren Leiman <oren.leiman@redpanda.com>
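The fallback in the second commit can be modeled as choosing which retention period governs local data. This is a hedged sketch: `retention_config` and `effective_local_retention()` are hypothetical, loosely named after the topic properties retention.ms and retention.local.target.ms, and the actual logic in disk_log_impl.cc is not reproduced here.

```cpp
#include <cassert>
#include <chrono>
#include <optional>

// Hypothetical retention inputs; not Redpanda's config types.
struct retention_config {
    std::chrono::milliseconds kafka_retention;             // like retention.ms
    std::optional<std::chrono::milliseconds> local_target; // like retention.local.target.ms
};

// Picks the retention period that governs local data.
std::chrono::milliseconds effective_local_retention(
  const retention_config& cfg, bool cloud_retention_active, bool has_archival_stm) {
    if (cloud_retention_active && has_archival_stm && cfg.local_target.has_value()) {
        // Safe to apply the (usually shorter) local override: the archival
        // STM keeps eviction behind the upload point.
        return *cfg.local_target;
    }
    // Tiered storage off, or enabled at runtime without a restart:
    // fall back to regular Kafka retention.
    return cfg.kafka_retention;
}
```

With a one-week Kafka retention and a one-hour local target, the one-hour override applies only when the archival STM exists; a node that had tiered storage switched on at runtime keeps data for the full week until it is restarted.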
@oleiman oleiman force-pushed the ts/noticket/block-gc-if-no-archival branch from 253aa50 to 69bb4f7 Compare February 23, 2026 06:47
@oleiman
Member Author

oleiman commented Feb 23, 2026

/ci-repeat 1

@oleiman oleiman marked this pull request as ready for review February 23, 2026 06:51
@oleiman oleiman closed this Apr 10, 2026

Projects

None yet


2 participants