
Allow to read iceberg table data from any location #1461

Open
zvonand wants to merge 6 commits into antalya-26.1 from backport/antalya-26.1/90740


Collaborator

@zvonand zvonand commented Feb 27, 2026

Supersedes #1092, #1163, #1212

Changelog category (leave one):

  • Improvement

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

Support Iceberg tables whose files are stored outside the table location or on a different storage.

CI/CD Options

Exclude tests:

  • Fast test
  • Integration Tests
  • Stateless tests
  • Stateful tests
  • Performance tests
  • All with ASAN
  • All with TSAN
  • All with MSAN
  • All with UBSAN
  • All with Coverage
  • All with Aarch64
  • All Regression
  • Disable CI Cache

Regression jobs to run:

  • Fast suites (mostly <1h)
  • Aggregate Functions (2h)
  • Alter (1.5h)
  • Benchmark (30m)
  • ClickHouse Keeper (1h)
  • Iceberg (2h)
  • LDAP (1h)
  • Parquet (1.5h)
  • RBAC (1.5h)
  • SSL Server (1h)
  • S3 (2h)
  • Tiered Storage (2h)


github-actions bot commented Feb 27, 2026

Workflow [PR], commit [82c6f88]

@zvonand zvonand force-pushed the backport/antalya-26.1/90740 branch from 66845e0 to c424634 on March 3, 2026 14:18
@vzakaznikov
Collaborator

AI audit note: This review comment was generated by AI (gpt-5.3-codex).

Audit update for PR #1461 (Iceberg external location support):

Confirmed defects:

  • Medium: Inconsistent same-storage key normalization after prefetch exhaustion

    • Impact: Iceberg reads can fail when the number of absolute-path files exceeds the prefetch window (max_threads), because later tasks may keep a non-normalized path while still using the base storage.
    • Anchor: src/Storages/ObjectStorage/StorageObjectStorageSource.cpp / ReadTaskIterator::next() vs constructor prefetch branch.
    • Trigger: absolute same-storage paths with file count > prefetch size.
    • Why defect: path normalization behavior differs by iterator branch, so correctness depends on task position.
    • Fix direction (short): always apply resolved key rewrite when absolute path is present; keep storage-switch decision separate.
    • Regression test direction (short): integration case with >max_threads files using absolute same-storage paths and validation that all tasks resolve key consistently.
  • Low: Secondary storage creation is performed under shared cache mutex

    • Impact: high contention and elevated deadlock-risk surface under concurrent cold-cache resolutions.
    • Anchor: src/Storages/ObjectStorage/Utils.cpp / getOrCreateStorageAndKey.
    • Trigger: concurrent scans that resolve many external locations simultaneously.
    • Why defect: potentially heavy ObjectStorageFactory::create(...) is executed inside locked section.
    • Fix direction (short): use double-checked creation (check under lock, create outside lock, emplace under lock).
    • Regression test direction (short): concurrent resolver stress test over many distinct cache keys; assert no deadlock and bounded latency.
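The "check under lock, create outside lock, emplace under lock" fix direction could look roughly like the sketch below. It is an illustration only, not the code in src/Storages/ObjectStorage/Utils.cpp: `Storage`, `StorageCache`, and the `Factory` callback are hypothetical stand-ins for the real storage type, the cache behind getOrCreateStorageAndKey, and ObjectStorageFactory::create(...).

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <mutex>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical stand-in for an object storage handle; not a ClickHouse type.
struct Storage
{
    std::string endpoint;
};

using StoragePtr = std::shared_ptr<Storage>;

// Hypothetical cache keyed by a storage identifier (for example scheme + bucket).
class StorageCache
{
public:
    using Factory = std::function<StoragePtr(const std::string &)>;

    explicit StorageCache(Factory factory_) : factory(std::move(factory_))
    {
        assert(factory); // a callable factory is required
    }

    StoragePtr getOrCreate(const std::string & key)
    {
        {
            // Step 1: cheap lookup under the lock.
            std::lock_guard<std::mutex> lock(mutex);
            auto it = storages.find(key);
            if (it != storages.end())
                return it->second;
        }

        // Step 2: potentially slow creation, done WITHOUT holding the mutex.
        StoragePtr created = factory(key);

        // Step 3: publish under the lock. try_emplace leaves an existing entry
        // untouched, so if another thread won the race its storage is returned
        // and the one created here is simply dropped.
        std::lock_guard<std::mutex> lock(mutex);
        auto [it, inserted] = storages.try_emplace(key, std::move(created));
        (void)inserted;
        return it->second;
    }

private:
    Factory factory;
    std::mutex mutex;
    std::unordered_map<std::string, StoragePtr> storages;
};
```

The trade-off in this sketch is that two threads resolving the same cold key may both run the factory, and one result is discarded. If duplicate creation is too expensive, a per-key future or per-key lock would be an alternative, at the cost of more bookkeeping.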

Coverage summary:

  • Scope reviewed: resolver/iterator/position-delete/protocol paths touched by this PR.
  • Categories failed: path normalization parity; cache-lock concurrency.
  • Categories passed: protocol compatibility and unsupported-scheme error handling.
  • Assumptions/limits: static audit only; no runtime test execution in this pass.
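For the path normalization parity failure, one way to make the behavior independent of task position is to route both the constructor prefetch branch and ReadTaskIterator::next() through a single helper that always rewrites the key for an absolute path and reports the storage-switch decision separately. The sketch below is a simplified assumption of what such a helper might look like; `ResolvedPath`, `resolvePath`, and the `base_prefix` convention (scheme plus bucket plus trailing slash) are hypothetical and do not come from the PR.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical result type: the two decisions are kept separate on purpose.
struct ResolvedPath
{
    bool same_storage; // can the base storage be reused?
    std::string key;   // normalized key, always rewritten for absolute paths
};

// base_prefix is assumed to look like "s3://bucket/" (with a trailing slash).
// path is either a key relative to the base storage or an absolute URI.
ResolvedPath resolvePath(std::string_view base_prefix, std::string_view path)
{
    assert(!base_prefix.empty() && base_prefix.back() == '/');

    const std::size_t scheme_end = path.find("://");
    if (scheme_end == std::string_view::npos)
        return {true, std::string(path)}; // already a relative key

    // Absolute path on the base storage: strip the prefix, keep the storage.
    if (path.substr(0, base_prefix.size()) == base_prefix)
        return {true, std::string(path.substr(base_prefix.size()))};

    // Absolute path elsewhere: strip scheme and bucket, flag a storage switch.
    const std::size_t key_start = path.find('/', scheme_end + 3);
    if (key_start == std::string_view::npos)
        return {false, std::string()};
    return {false, std::string(path.substr(key_start + 1))};
}
```

With a helper of this shape, a task produced after the prefetch window is exhausted gets the same key as a prefetched one, since neither branch carries its own normalization logic.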

@zvonand zvonand added the port-antalya PRs to be ported to all new Antalya releases label Mar 3, 2026

Labels

antalya-26.1 antalya-26.1.3.20001 port-antalya PRs to be ported to all new Antalya releases
