
Delta: Updating Delta to Iceberg conversion #15407

Open
vladislav-sidorovich wants to merge 37 commits into apache:main from vladislav-sidorovich:delta-conversion

Conversation

@vladislav-sidorovich (Contributor) commented Feb 22, 2026

This PR contains an initial version of the code that updates the existing functionality (https://iceberg.apache.org/docs/1.4.3/delta-lake-migration/) to a recent Delta Lake protocol version (reader: 3, writer: 7). The motivation of the PR is to get the earliest possible feedback from the community.

Note: the PR doesn't remove the old logic but adds a new interface implementation, so it will be easier to compare/review. Also, based on the usage scenario of the module, such an approach will not introduce any issues.
More detailed development doc.

The PR scope:

  1. Supports the existing interface
  2. Uses the Delta Lake kernel library instead of the deprecated Delta Lake standalone
  3. Contains the basic flow
  4. Converts all data types
  5. Converts the table schema and partition spec
  6. Supports the INSERT operation (Delta Lake Add action)
  7. Supports UPDATEs and DELETEs (Delta Lake Remove action)
  8. Supports the Delta VACUUM scenario
  9. Supports DVs (deletion vectors)

Future steps:

  1. Support all Delta Lake actions
  2. Support all Delta Lake features (column mapping, generated columns, and so on)
  3. Handle edge cases for partitions and generated columns
  4. Handle schema evolution
  5. Incremental conversion (from/to a specific Delta version)

Tests:
Unit tests: cover all supported data types, including complex arrays and structures.
Integration tests: cover an inserts-only scenario with Spark 3.5. The test must be updated for a newer Delta Lake version once the previous solution is deleted from the code.

In the following PRs, I will add all the tables from: Delta golden tables

@anoopj (Contributor) left a comment


Thank you for the PR. Moving to the Delta kernel is a great improvement. Here is my initial feedback.

Comment thread delta-lake/src/test/resources/delta/golden/README.md
import io.delta.kernel.exceptions.TableNotFoundException;
import io.delta.kernel.internal.DeltaHistoryManager;
import io.delta.kernel.internal.DeltaLogActionUtils;
import io.delta.kernel.internal.SnapshotImpl;
Contributor

We are using internal APIs of the kernel. This is fragile - can we refactor this to use the public APIs instead? Snapshot, Table etc. Or are we doing this because we are trying to preserve the table history during the conversion? I would try to avoid this as much as possible.
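For reference, a minimal sketch of the public kernel entry points (assuming a recent delta-kernel 3.x release and the Hadoop-based default engine; exact signatures differ between kernel versions):

```java
import org.apache.hadoop.conf.Configuration;

import io.delta.kernel.Snapshot;
import io.delta.kernel.Table;
import io.delta.kernel.defaults.engine.DefaultEngine;
import io.delta.kernel.engine.Engine;
import io.delta.kernel.types.StructType;

public class KernelPublicApiSketch {
  public static void main(String[] args) {
    // Hypothetical table location; replace with a real Delta table path.
    String tablePath = "/tmp/delta/my_table";

    // The default engine ships with the delta-kernel-defaults artifact.
    Engine engine = DefaultEngine.create(new Configuration());

    // Public API: resolve the table and load its latest snapshot.
    Table table = Table.forPath(engine, tablePath);
    Snapshot snapshot = table.getLatestSnapshot(engine);

    // The snapshot exposes version and logical schema without internal classes;
    // Table.getSnapshotAsOfVersion(engine, v) covers per-version access.
    long version = snapshot.getVersion(engine);
    StructType schema = snapshot.getSchema(engine);
    System.out.println("version=" + version + ", schema=" + schema);
  }
}
```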

Contributor Author

No, there is no public API available for the purposes we need.

Yes, I want to go through the table history step by step, so we will have exactly the same granularity in the history.

At the same time, it's quite safe to use the internal API because it depends on the Delta protocol, which is stable.

Contributor

The internal APIs can change or disappear without any notice. I would think hard about avoiding dependencies on internal APIs, even if that means changing semantics (e.g. not preserving all the history by default).

while (rows.hasNext()) {
  Row row = rows.next();
  if (DeltaLakeActionsTranslationUtil.isAdd(row)) {
    AddFile addFile = DeltaLakeActionsTranslationUtil.toAdd(row);
Contributor

Can we avoid the use of the internal AddFile class and read fields directly from the Row using ordinals defined by the scan file schema?
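To illustrate, a hedged sketch of reading the add entry straight off the row by ordinal (the nested field names follow the Delta scan-file/log-action schema and should be verified against the rows the kernel actually returns):

```java
import io.delta.kernel.data.Row;
import io.delta.kernel.types.StructType;

class ScanRowReader {
  // Hypothetical helper: reads path and size from an "add" row without the
  // internal AddFile class, using ordinals looked up from the row schema.
  static void readAdd(Row actionRow) {
    StructType rowSchema = actionRow.getSchema();
    Row add = actionRow.getStruct(rowSchema.indexOf("add"));

    StructType addSchema = add.getSchema();
    String path = add.getString(addSchema.indexOf("path"));
    long size = add.getLong(addSchema.indexOf("size"));

    System.out.println("add file: " + path + " (" + size + " bytes)");
  }
}
```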

Contributor Author

Yes, I will refactor this part after all the conversion features are in place.

@vladislav-sidorovich changed the title from "Delta: Updating Delta to Iceberg conversion - Inserts only" to "Delta: Updating Delta to Iceberg conversion" on Mar 1, 2026
@github-actions bot added the data label on Mar 22, 2026
@github-actions bot added the INFRA label on Mar 22, 2026
@vladislav-sidorovich (Contributor Author)

@nastra since you kindly reviewed the earlier version, I'd love to get your thoughts on the updated core logic before I do the final refactoring to remove internal Delta classes.

@aokolnychyi Since you’ve contributed so much to the Deletion Vectors implementation in Iceberg, I wanted to reach out. Could you take a quick look at the DV conversion logic in my PR to make sure I’ve wired everything up correctly?

@laskoviymishka left a comment

Thanks for the great work on this — migrating to delta-kernel and adding DV support is a substantial lift, and the golden-table test scaffolding is going to pay off long-term.

I've left some blocker notes inline, mostly around delta-protocol correctness:

  • DV-update (Add+Remove of the same path) is the central DELETE/MERGE/UPDATE path on DV-enabled Delta tables and currently lands as addRows+removeRows on the same file, which won't produce a valid Iceberg snapshot.
  • Per-ColumnarBatch commits break single-Delta-version atomicity; the conversion needs to buffer per version and emit one Iceberg commit per Delta commit (see the sketch at the end of this comment).
  • Unhandled metaData/protocol actions are silently dropped, so mid-history schema evolution produces a corrupt target table.
  • DV test asserts presence (hasDeleteFiles == true) rather than row-level correctness, which is why the above bugs aren't caught.

Given the PR size (~2k lines + golden tables) my review will be slower than usual, but I wanted to surface the blockers now so you can start on them in parallel.

Happy to iterate once the above are addressed — nothing here is structural, more about tightening the action-translation loop and strengthening the DV assertions.

One ask for follow-ups: if we can scope future PRs into smaller chunks (e.g., kernel migration as one PR, DV support as another, golden-table suite as a third), it would be much easier to give each piece the review attention it deserves and land changes faster.

Totally understand that the initial cut benefits from being end-to-end to get community feedback, just flagging for the next iterations.
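For the per-version atomicity point, roughly the shape I have in mind (a sketch with hypothetical names; the real implementation would plug into the existing action translation):

```java
import java.util.List;

import org.apache.iceberg.DataFile;
import org.apache.iceberg.DeleteFile;
import org.apache.iceberg.RowDelta;
import org.apache.iceberg.Transaction;

class PerVersionCommitter {
  // Hypothetical per-version buffer: all actions of one Delta commit are
  // translated first, then written as a single Iceberg commit.
  static void commitDeltaVersion(
      Transaction transaction, List<DataFile> addedFiles, List<DeleteFile> deleteFiles) {
    RowDelta rowDelta = transaction.newRowDelta();
    addedFiles.forEach(rowDelta::addRows);
    deleteFiles.forEach(rowDelta::addDeletes);
    // One Iceberg snapshot per Delta version preserves version-level atomicity.
    rowDelta.commit();
  }
}
```

Tagging each resulting snapshot as delta-version-N, as the PR already does, then maps one tag to exactly one Delta commit.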

// Avoid validation for multiple DVs added in transaction
// org/apache/iceberg/MergingSnapshotProducer.java:854
// since we do the conversion sequentially in a single Iceberg transaction
rowDelta.validateFromSnapshot(transaction.table().currentSnapshot().snapshotId());

Setting the validate-from snapshot to currentSnapshot() disables RowDelta's conflict validation wholesale, which is what's masking the Add+Remove-same-path bug above in tests. Please remove this or scope it to the narrower concern (e.g. validateDataFilesExist on a specific set) and let the real validation run.
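For example, something along these lines (a hedged sketch; rewrittenDataFilePaths is a hypothetical collection of the data file paths referenced by the new DVs in this version):

```java
import org.apache.iceberg.RowDelta;

class ScopedDvValidation {
  // Hypothetical helper: rather than resetting validate-from to the current snapshot,
  // only add the narrower checks this commit actually needs, i.e. that the data
  // files the new DVs reference still exist.
  static void commitWithScopedValidation(RowDelta rowDelta, Iterable<String> rewrittenDataFilePaths) {
    rowDelta
        .validateDataFilesExist(rewrittenDataFilePaths)
        .validateDeletedFiles()
        .commit();
  }
}
```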

}

@Test
public void testDeltaTableDVSupported() throws Exception {

This only checks hasDeleteFiles == true and isDV(deleteFile). It never scans the converted Iceberg table to confirm row counts match the post-DV Delta row count, or that the specific rows Delta marked deleted are absent from Iceberg reads.

The dv-partitioned-with-checkpoint golden exercises DV updates and file removes (V6–V13) — please assert row-level parity (count + sampled content) so the bugs in comments 1–3 cannot pass silently.

Same concern for goldenDeltaTableConversion at line 322–335, which only asserts execute() doesn't throw.
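As a concrete example of the assertion meant here (a sketch for the existing Spark-based integration fixture; the helper name and parameters are placeholders):

```java
import static org.assertj.core.api.Assertions.assertThat;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

class RowParityCheck {
  // Hypothetical test helper: compares the converted Iceberg table against the
  // source Delta table row by row, not just by delete-file presence.
  static void assertRowParity(SparkSession spark, String deltaTablePath, String icebergTableIdent) {
    Dataset<Row> deltaRows = spark.read().format("delta").load(deltaTablePath);
    Dataset<Row> icebergRows = spark.table(icebergTableIdent);

    assertThat(icebergRows.count()).isEqualTo(deltaRows.count());
    // Empty set differences in both directions => identical row multisets.
    assertThat(icebergRows.exceptAll(deltaRows).count()).isZero();
    assertThat(deltaRows.exceptAll(icebergRows).count()).isZero();
  }
}
```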

Contributor Author

This unit test verifies the conversion flow and some execution conditions. Additionally, it verifies which Delta table features are supported and not supported.

Testing of the table's data is done in the integration test org.apache.iceberg.delta.TestSnapshotDeltaLakeKernelTable#testConversionWithDeletionVectors.


Yes, testConversionWithDeletionVectors does compare row contents, but it only exercises a single UPDATE (one DV-update commit).

The DV-update bugs surface on:

  1. Two consecutive DV-updates on the same file:
  • first UPDATE writes DV1
  • second UPDATE writes DV2 with no retraction of DV1 → orphan Puffin, violating the v3 "≤1 DV per data file" invariant. SELECT * on the final snapshot can still pass while the metadata is broken.
  2. Time-travel via delta-version-N tags: the test asserts tag presence but not that SELECT * AS OF VERSION delta-version-K matches Delta's VERSION AS OF K. Per-version correctness is the actual user-facing contract for this conversion.

It may be worth extending the integration test with both; each is a few lines on top of the existing fixture.

I'm not sure this can be opted out of here, since it may highlight a bug.
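For the time-travel point, roughly this loop on top of the same fixture (a sketch; the Iceberg "tag" read option and the exact delta-version-N tag names are assumptions to verify against the PR):

```java
import static org.assertj.core.api.Assertions.assertThat;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

class PerVersionParityCheck {
  // Hypothetical test helper: every converted Delta version should match the
  // Iceberg snapshot tagged for it.
  static void assertPerVersionParity(
      SparkSession spark, String deltaTablePath, String icebergTableIdent,
      long firstVersion, long lastVersion) {
    for (long v = firstVersion; v <= lastVersion; v++) {
      Dataset<Row> deltaAtV =
          spark.read().format("delta").option("versionAsOf", v).load(deltaTablePath);
      Dataset<Row> icebergAtV =
          spark.read().format("iceberg").option("tag", "delta-version-" + v).load(icebergTableIdent);

      assertThat(icebergAtV.count()).isEqualTo(deltaAtV.count());
      assertThat(icebergAtV.exceptAll(deltaAtV).count()).isZero();
    }
  }
}
```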

@HonahX (Contributor) left a comment

Thanks for the great effort in moving to delta kernel given that delta standalone is archived and deprecated! Left some comments, please let me know WDYT!

Comment on lines +188 to +190
assertThat(deleteDeltaLogFile("00000000000000000000.json")).isTrue();
assertThat(deleteDeltaLogFile("00000000000000000001.json")).isTrue();
assertThat(deleteDeltaLogFile("00000000000000000002.json")).isTrue();
Contributor

Does the log clean-up always happen with VACUUM? Will there be a case where the old logs are not cleaned up and our conversion fails because of a missing data file (deleted by VACUUM)?
This seems to indicate some requirement for the conversion to work when VACUUM is called on the source table; could you please elaborate on the scope of supporting VACUUM in this PR? Thanks!

Contributor Author

The short answers:

  1. No, log clean-up is not the same as VACUUM.
  2. No, the conversion will not fail.

Explanation:
Log clean-up and VACUUM are two parallel and different processes. In this test I intentionally delete some logs (not all) to make the Delta versions not re-creatable and not continuous (starting from table creation). For example, in this test Delta versions 3-9 exist but are not re-creatable because the data files were deleted by VACUUM.
So the first re-creatable version in this test is 10; this version is used as the initial version in the conversion, and the test covers that scenario.

To sum up:

  1. VACUUM and log clean-up are two different processes.
  2. In this test I first execute VACUUM with RETAIN 0 HOURS.
  3. After some DML operations I simulated (as written in the comment) log clean-up.
  4. Verify that the conversion works as expected.


private void commitDeltaSnapshotToIcebergTransaction(
    SnapshotImpl deltaSnapshot, Transaction transaction, Set<String> processedDataFiles)
    throws IOException {
Contributor

In the old implementation, commitInitialDeltaSnapshotToIcebergTransaction will first find the actual earliest reconstructable version that has all the data files available. That could help reduce conversion failures due to VACUUM cleaning up files and breaking the time travel ability. What's the design choice behind not using that in the new impl?

Contributor Author

The intention is to support the VACUUM operation as well, and there is a dedicated test for it: org.apache.iceberg.delta.TestSnapshotDeltaLakeKernelTable#testConversionAfterVacuum.

The logic is very similar to the previous one:

  1. Find the first re-creatable commit => create an Iceberg commit.
  2. Go one by one per Delta version after the re-creatable commit.

The difference is in commitDeltaSnapshotToIcebergTransaction vs convertEachDeltaVersion. While the methods are similar, the way the required data files are collected is different.

Comment thread build.gradle Outdated
Comment thread delta-lake/src/test/resources/delta/golden/README.md
Comment thread delta-lake/src/main/java/org/apache/iceberg/delta/DeletionVectorConverter.java Outdated
@TempDir private File sourceLocation;
@TempDir private File destinationLocation;

public TestSnapshotDeltaLakeKernelTable() {
Contributor

Shall we add tests for an unpartitioned table and a table with all possible data types? It would also be good to have a check similar to checkDataFilePathsIntegrity to verify that the data file paths are absolute and match those in the Delta table.

Contributor Author

There is a unit test covering all possible data types, TestDeltaLakeKernelTypeToType, including structs.
An unpartitioned table is also covered by a unit test in TestBaseSnapshotDeltaLakeKernelTableAction. Actually, I tried to move more logic into unit tests.
I will also add more tables from https://github.com/delta-io/delta/tree/master/connectors/golden-tables/src, so the unit tests will be robust. I removed these tables from the initial PR to reduce the number of files in the PR.

Regarding checkDataFilePathsIntegrity, I'm not 100% sure. There is no simple utility method that helps collect all data files, so we would need to introduce logic into the test similar to what we have in the conversion. At the same time, this test has checkSnapshotIntegrityForQuery, which verifies the full data for the table per tag/version. Could you clarify which test scenarios are missing and what you would like to test more?

Refactored getFullFilePath to use org.apache.hadoop.fs.Path for more robust absolute and local path handling.
Expanded TestDeltaLakePathHandling with additional edge cases including nested paths and special characters.
Fixed several PatternMatchingInstanceof warnings and resolved var usage in BaseSnapshotDeltaLakeKernelTableAction.
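In case it helps the review, a minimal sketch of the kind of resolution getFullFilePath performs (a hypothetical shape, assuming Delta log paths are percent-encoded and relative to the table root unless already absolute):

```java
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.fs.Path;

class DeltaPathResolution {
  // Hypothetical helper mirroring the idea behind getFullFilePath.
  static String fullFilePath(String tableRoot, String addFilePath) {
    // Delta log `path` entries are percent-encoded; decode before building the filesystem path.
    String decoded = URLDecoder.decode(addFilePath, StandardCharsets.UTF_8);
    Path candidate = new Path(decoded);
    // Absolute paths/URIs are kept as-is; relative paths resolve against the table root.
    return candidate.isAbsolute() ? candidate.toString() : new Path(tableRoot, decoded).toString();
  }
}
```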