
Delta: Updating Delta to Iceberg conversion #15407

Open
vladislav-sidorovich wants to merge 37 commits into apache:main from vladislav-sidorovich:delta-conversion

Conversation

@vladislav-sidorovich (Contributor) commented Feb 22, 2026

This PR contains an initial version of the code that updates the existing functionality (https://iceberg.apache.org/docs/1.4.3/delta-lake-migration/) to a recent Delta Lake protocol version (reader: 3, writer: 7). The motivation of the PR is to get the earliest possible feedback from the community.

Note: the PR doesn't remove the old logic but adds a new interface implementation, so it will be easier to compare/review. Also, based on the usage scenario of the module, such an approach will not introduce any issues.
More detailed development doc.

The PR scope:

  1. Supports the existing interface
  2. Uses the Delta Lake kernel library instead of the deprecated Delta Lake standalone
  3. Contains the basic flow
  4. Converts all data types
  5. Converts the table schema and partition spec
  6. Supports the INSERT operation (Delta Lake Add action)
  7. Supports UPDATEs and DELETEs (Delta Lake Remove action)
  8. Supports the Delta VACUUM scenario
  9. Supports DVs (deletion vectors)

Future steps:

  1. Support all Delta Lake actions
  2. Support all Delta Lake features (column mapping, generated columns, and so on)
  3. Handle edge cases for partitions and generated columns
  4. Handle schema evolution
  5. Incremental conversion (from/to a specific Delta version)

Tests:
Unit tests: cover all supported data types, including complex arrays and structures.
Integration tests: cover an inserts-only scenario with Spark 3.5. The test must be updated for a newer Delta Lake version once the previous solution is deleted from the code.

In the following PRs, I will add all the tables from: Delta golden tables

@anoopj (Contributor) left a comment


Thank you for the PR. Moving to the Delta kernel is a great improvement. Here is my initial feedback.

Comment thread delta-lake/src/test/resources/delta/golden/README.md
import io.delta.kernel.exceptions.TableNotFoundException;
import io.delta.kernel.internal.DeltaHistoryManager;
import io.delta.kernel.internal.DeltaLogActionUtils;
import io.delta.kernel.internal.SnapshotImpl;
Contributor

We are using internal APIs of the kernel. This is fragile - can we refactor this to use the public APIs instead? Snapshot, Table etc. Or are we doing this because we are trying to preserve the table history during the conversion? I would try to avoid this as much as possible.
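For reference, a minimal sketch of the public kernel entry points (assuming a recent delta-kernel 3.x release and the Hadoop-based default engine; exact signatures differ between kernel versions):

```java
import org.apache.hadoop.conf.Configuration;

import io.delta.kernel.Snapshot;
import io.delta.kernel.Table;
import io.delta.kernel.defaults.engine.DefaultEngine;
import io.delta.kernel.engine.Engine;
import io.delta.kernel.types.StructType;

public class KernelPublicApiSketch {
  public static void main(String[] args) {
    // Hypothetical table location; replace with a real Delta table path.
    String tablePath = "/tmp/delta/my_table";

    // The default engine ships with the delta-kernel-defaults artifact.
    Engine engine = DefaultEngine.create(new Configuration());

    // Public API: resolve the table and load its latest snapshot.
    Table table = Table.forPath(engine, tablePath);
    Snapshot snapshot = table.getLatestSnapshot(engine);

    // The snapshot exposes version and logical schema without internal classes;
    // Table.getSnapshotAsOfVersion(engine, v) covers per-version access.
    long version = snapshot.getVersion(engine);
    StructType schema = snapshot.getSchema(engine);
    System.out.println("version=" + version + ", schema=" + schema);
  }
}
```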

Contributor Author

No, there is no public API available for the purposes we need.

Yes, I want to go through the table history step by step, so we will have exactly the same granularity in the history.

At the same time, it's quite safe to use the internal API because it depends on the Delta protocol, which is stable.

Contributor

The internal APIs can change or disappear without any notice. I would think hard about avoiding dependencies on internal APIs, even if that means changing semantics (e.g. not preserving all the history by default).

while (rows.hasNext()) {
  Row row = rows.next();
  if (DeltaLakeActionsTranslationUtil.isAdd(row)) {
    AddFile addFile = DeltaLakeActionsTranslationUtil.toAdd(row);
Contributor

Can we avoid the use of the internal AddFile class and read fields directly from the Row using ordinals defined by the scan file schema?
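To illustrate, a hedged sketch of reading the add entry straight off the row by ordinal (the nested field names follow the Delta scan-file/log-action schema and should be verified against the rows the kernel actually returns):

```java
import io.delta.kernel.data.Row;
import io.delta.kernel.types.StructType;

class ScanRowReader {
  // Hypothetical helper: reads path and size from an "add" row without the
  // internal AddFile class, using ordinals looked up from the row schema.
  static void readAdd(Row actionRow) {
    StructType rowSchema = actionRow.getSchema();
    Row add = actionRow.getStruct(rowSchema.indexOf("add"));

    StructType addSchema = add.getSchema();
    String path = add.getString(addSchema.indexOf("path"));
    long size = add.getLong(addSchema.indexOf("size"));

    System.out.println("add file: " + path + " (" + size + " bytes)");
  }
}
```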

Contributor Author

Yes, I will refactor this part after all the conversion features are in place.

@vladislav-sidorovich changed the title from "Delta: Updating Delta to Iceberg conversion - Inserts only" to "Delta: Updating Delta to Iceberg conversion" on Mar 1, 2026
@github-actions bot added the data label on Mar 22, 2026
@github-actions bot added the INFRA label on Mar 22, 2026
@vladislav-sidorovich (Contributor Author)

@nastra since you kindly reviewed the earlier version, I'd love to get your thoughts on the updated core logic before I do the final refactoring to remove internal Delta classes.

@aokolnychyi Since you’ve contributed so much to the Deletion Vectors implementation in Iceberg, I wanted to reach out. Could you take a quick look at the DV conversion logic in my PR to make sure I’ve wired everything up correctly?

@laskoviymishka left a comment

Thanks for the great work on this — migrating to delta-kernel and adding DV support is a substantial lift, and the golden-table test scaffolding is going to pay off long-term.

I've left some blocker notes inline, mostly around delta-protocol correctness:

  • DV-update (Add+Remove of the same path) is the central DELETE/MERGE/UPDATE path on DV-enabled Delta tables and currently lands as addRows+removeRows on the same file, which won't produce a valid Iceberg snapshot.
  • Per-ColumnarBatch commits break single-Delta-version atomicity; the conversion needs to buffer per version and emit one Iceberg commit per Delta commit (see the sketch at the end of this comment).
  • Unhandled metaData/protocol actions are silently dropped, so mid-history schema evolution produces a corrupt target table.
  • DV test asserts presence (hasDeleteFiles == true) rather than row-level correctness, which is why the above bugs aren't caught.

Given the PR size (~2k lines + golden tables) my review will be slower than usual, but I wanted to surface the blockers now so you can start on them in parallel.

Happy to iterate once the above are addressed — nothing here is structural, more about tightening the action-translation loop and strengthening the DV assertions.

One ask for follow-ups: if we can scope future PRs into smaller chunks (e.g., kernel migration as one PR, DV support as another, golden-table suite as a third), it would be much easier to give each piece the review attention it deserves and land changes faster.

Totally understand that the initial cut benefits from being end-to-end to get community feedback, just flagging for the next iterations.
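For the per-version atomicity point, roughly the shape I have in mind (a sketch with hypothetical names; the real implementation would plug into the existing action translation):

```java
import java.util.List;

import org.apache.iceberg.DataFile;
import org.apache.iceberg.DeleteFile;
import org.apache.iceberg.RowDelta;
import org.apache.iceberg.Transaction;

class PerVersionCommitter {
  // Hypothetical per-version buffer: all actions of one Delta commit are
  // translated first, then written as a single Iceberg commit.
  static void commitDeltaVersion(
      Transaction transaction, List<DataFile> addedFiles, List<DeleteFile> deleteFiles) {
    RowDelta rowDelta = transaction.newRowDelta();
    addedFiles.forEach(rowDelta::addRows);
    deleteFiles.forEach(rowDelta::addDeletes);
    // One Iceberg snapshot per Delta version preserves version-level atomicity.
    rowDelta.commit();
  }
}
```

Tagging each resulting snapshot as delta-version-N, as the PR already does, then maps one tag to exactly one Delta commit.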

// Avoid validation for multiple DVs added in transaction
// org/apache/iceberg/MergingSnapshotProducer.java:854
// since we do the conversion sequentially in a single Iceberg transaction
rowDelta.validateFromSnapshot(transaction.table().currentSnapshot().snapshotId());

Setting the validate-from snapshot to currentSnapshot() disables RowDelta's conflict validation wholesale, which is what's masking the Add+Remove-same-path bug above in tests. Please remove this or scope it to the narrower concern (e.g. validateDataFilesExist on a specific set) and let the real validation run.
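For example, something along these lines (a hedged sketch; rewrittenDataFilePaths is a hypothetical collection of the data file paths referenced by the new DVs in this version):

```java
import org.apache.iceberg.RowDelta;

class ScopedDvValidation {
  // Hypothetical helper: rather than resetting validate-from to the current snapshot,
  // only add the narrower checks this commit actually needs, i.e. that the data
  // files the new DVs reference still exist.
  static void commitWithScopedValidation(RowDelta rowDelta, Iterable<String> rewrittenDataFilePaths) {
    rowDelta
        .validateDataFilesExist(rewrittenDataFilePaths)
        .validateDeletedFiles()
        .commit();
  }
}
```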

}

@Test
public void testDeltaTableDVSupported() throws Exception {

This only checks hasDeleteFiles == true and isDV(deleteFile). It never scans the converted Iceberg table to confirm row counts match the post-DV Delta row count, or that the specific rows Delta marked deleted are absent from Iceberg reads.

The dv-partitioned-with-checkpoint golden exercises DV updates and file removes (V6–V13) — please assert row-level parity (count + sampled content) so the bugs in comments 1–3 cannot pass silently.

Same concern for goldenDeltaTableConversion at line 322–335, which only asserts execute() doesn't throw.
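As a concrete example of the assertion meant here (a sketch for the existing Spark-based integration fixture; the helper name and parameters are placeholders):

```java
import static org.assertj.core.api.Assertions.assertThat;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

class RowParityCheck {
  // Hypothetical test helper: compares the converted Iceberg table against the
  // source Delta table row by row, not just by delete-file presence.
  static void assertRowParity(SparkSession spark, String deltaTablePath, String icebergTableIdent) {
    Dataset<Row> deltaRows = spark.read().format("delta").load(deltaTablePath);
    Dataset<Row> icebergRows = spark.table(icebergTableIdent);

    assertThat(icebergRows.count()).isEqualTo(deltaRows.count());
    // Empty set differences in both directions => identical row multisets.
    assertThat(icebergRows.exceptAll(deltaRows).count()).isZero();
    assertThat(deltaRows.exceptAll(icebergRows).count()).isZero();
  }
}
```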

Contributor Author

This unit test verifies the conversion flow and some execution conditions. Additionally, it verifies which Delta table features are supported and not supported.

Testing of the table's data is done in the integration test org.apache.iceberg.delta.TestSnapshotDeltaLakeKernelTable#testConversionWithDeletionVectors.


Yes, testConversionWithDeletionVectors does compare row contents, but it only exercises a single UPDATE (one DV-update commit).

The DV-update bugs surface on:

  1. Two consecutive DV-updates on the same file:
  • first UPDATE writes DV1
  • second UPDATE writes DV2 with no retraction of DV1 → orphan Puffin, violating the v3 "≤1 DV per data file" invariant. SELECT * on the final snapshot can still pass while the metadata is broken.
  2. Time-travel via delta-version-N tags: the test asserts tag presence but not that SELECT * AS OF VERSION delta-version-K matches Delta's VERSION AS OF K. Per-version correctness is the actual user-facing contract for this conversion.

It may be worth extending the integration test with both; each is a few lines on top of the existing fixture.

I'm not sure this can be opted out of here, since it may highlight a bug.
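For the time-travel point, roughly this loop on top of the same fixture (a sketch; the Iceberg "tag" read option and the exact delta-version-N tag names are assumptions to verify against the PR):

```java
import static org.assertj.core.api.Assertions.assertThat;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

class PerVersionParityCheck {
  // Hypothetical test helper: every converted Delta version should match the
  // Iceberg snapshot tagged for it.
  static void assertPerVersionParity(
      SparkSession spark, String deltaTablePath, String icebergTableIdent,
      long firstVersion, long lastVersion) {
    for (long v = firstVersion; v <= lastVersion; v++) {
      Dataset<Row> deltaAtV =
          spark.read().format("delta").option("versionAsOf", v).load(deltaTablePath);
      Dataset<Row> icebergAtV =
          spark.read().format("iceberg").option("tag", "delta-version-" + v).load(icebergTableIdent);

      assertThat(icebergAtV.count()).isEqualTo(deltaAtV.count());
      assertThat(icebergAtV.exceptAll(deltaAtV).count()).isZero();
    }
  }
}
```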

@HonahX (Contributor) left a comment

Thanks for the great effort in moving to delta kernel given that delta standalone is archived and deprecated! Left some comments, please let me know WDYT!

Comment on lines +188 to +190
assertThat(deleteDeltaLogFile("00000000000000000000.json")).isTrue();
assertThat(deleteDeltaLogFile("00000000000000000001.json")).isTrue();
assertThat(deleteDeltaLogFile("00000000000000000002.json")).isTrue();
Contributor

Does the log clean-up always happen with VACUUM? Will there be a case where the old logs are not cleaned up and our conversion fails because of a missing data file (deleted by VACUUM)?
This seems to indicate some requirement for the conversion to work when VACUUM is called on the source table; could you please elaborate on the scope of supporting VACUUM in this PR? Thanks!

Contributor Author

The short answers:

  1. No, log clean-up is not the same as VACUUM.
  2. No, the conversion will not fail.

Explanation:
Log clean-up and VACUUM are two parallel and different processes. In this test I intentionally delete some logs (not all) to make the Delta versions not re-creatable and not continuous (starting from table creation). For example, in this test Delta versions 3-9 exist but are not re-creatable because the data files were deleted by VACUUM.
So the first re-creatable version in this test is 10; this version is used as the initial version in the conversion, and the test covers that scenario.

To sum up:

  1. VACUUM and log clean-up are two different processes.
  2. In this test I first execute VACUUM with RETAIN 0 HOURS.
  3. After some DML operations I simulated (as written in the comment) log clean-up.
  4. Verify that the conversion works as expected.


private void commitDeltaSnapshotToIcebergTransaction(
    SnapshotImpl deltaSnapshot, Transaction transaction, Set<String> processedDataFiles)
    throws IOException {
Contributor

In the old implementation, commitInitialDeltaSnapshotToIcebergTransaction will first find the actual earliest reconstructable version that has all the data files available. That could help reduce conversion failures due to VACUUM cleaning up files and breaking the time travel ability. What's the design choice behind not using that in the new impl?

Contributor Author

The intention is to support the VACUUM operation as well, and there is a dedicated test for it: org.apache.iceberg.delta.TestSnapshotDeltaLakeKernelTable#testConversionAfterVacuum.

The logic is very similar to the previous one:

  1. Find the first re-creatable commit => create an Iceberg commit.
  2. Go one by one per Delta version after the re-creatable commit.

The difference is in commitDeltaSnapshotToIcebergTransaction vs convertEachDeltaVersion. While the methods are similar, the way the required data files are collected is different.

Comment thread build.gradle Outdated
Comment thread delta-lake/src/test/resources/delta/golden/README.md
Comment thread delta-lake/src/main/java/org/apache/iceberg/delta/DeletionVectorConverter.java Outdated
@TempDir private File sourceLocation;
@TempDir private File destinationLocation;

public TestSnapshotDeltaLakeKernelTable() {
Contributor

Shall we add tests for an unpartitioned table and a table with all possible data types? It would also be good to have a check similar to checkDataFilePathsIntegrity to verify that the data file paths are absolute and match those in the Delta table.

Contributor Author

There is a unit test covering all possible data types, TestDeltaLakeKernelTypeToType, including structs.
An unpartitioned table is also covered by a unit test in TestBaseSnapshotDeltaLakeKernelTableAction. Actually, I tried to move more logic into unit tests.
I will also add more tables from https://github.com/delta-io/delta/tree/master/connectors/golden-tables/src, so the unit tests will be robust. I removed these tables from the initial PR to reduce the number of files in the PR.

Regarding checkDataFilePathsIntegrity, I'm not 100% sure. There is no simple utility method that helps collect all data files, so we would need to introduce logic into the test similar to what we have in the conversion. At the same time, this test has checkSnapshotIntegrityForQuery, which verifies the full data for the table per tag/version. Could you clarify which test scenarios are missing and what you would like to test more?

Refactored getFullFilePath to use org.apache.hadoop.fs.Path for more robust absolute and local path handling.
Expanded TestDeltaLakePathHandling with additional edge cases including nested paths and special characters.
Fixed several PatternMatchingInstanceof warnings and resolved var usage in BaseSnapshotDeltaLakeKernelTableAction.
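In case it helps the review, a minimal sketch of the kind of resolution getFullFilePath performs (a hypothetical shape, assuming Delta log paths are percent-encoded and relative to the table root unless already absolute):

```java
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.fs.Path;

class DeltaPathResolution {
  // Hypothetical helper mirroring the idea behind getFullFilePath.
  static String fullFilePath(String tableRoot, String addFilePath) {
    // Delta log `path` entries are percent-encoded; decode before building the filesystem path.
    String decoded = URLDecoder.decode(addFilePath, StandardCharsets.UTF_8);
    Path candidate = new Path(decoded);
    // Absolute paths/URIs are kept as-is; relative paths resolve against the table root.
    return candidate.isAbsolute() ? candidate.toString() : new Path(tableRoot, decoded).toString();
  }
}
```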