Upsert in PyIceberg: Use Cases, Trade Offs, and Strategy
I’ve previously worked on upsert internals (e.g. #1878 on complex type comparison fallbacks and #1995 on transaction semantics and batch scanning). While working in this area, I started thinking that we may benefit from clarifying the intended positioning and strategy of upsert.
I’d like to frame this discussion from workload impact first, then implementation.
1. Start from impact
In certain workloads, especially sparse updates into large tables, upsert can cause disproportionate:
- IO
- memory usage
- small file creation

relative to the number of rows modified.
This suggests we may be implicitly optimizing for one class of workload while others pay a high cost.
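To make "disproportionate" concrete, here is a back-of-the-envelope estimate of write amplification for a sparse Copy on Write upsert. Every number below (row size, file size, files touched) is an illustrative assumption, not a measurement:

```python
# Illustrative write-amplification estimate for a sparse CoW upsert.
# All sizes here are assumptions chosen for the example, not measurements.

rows_updated = 10
row_size_bytes = 100                 # assumed average row size
data_file_size = 128 * 1024 * 1024   # assumed 128 MiB target file size
files_touched = 10                   # worst case: each row lands in a different file

logical_change = rows_updated * row_size_bytes    # ~1 KB of changed data
bytes_rewritten = files_touched * data_file_size  # ~1.28 GiB rewritten
amplification = bytes_rewritten / logical_change

print(f"changed: {logical_change} B, rewritten: {bytes_rewritten} B")
print(f"write amplification: {amplification:,.0f}x")
```

Under these assumptions a 10-row update rewrites over a million times more bytes than it logically changes, which is the cost profile the sparse scenarios below run into.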
2. Representative use cases
Some distinct scenarios:
1. Sparse micro updates
Upsert 10 random rows into a 10M row table, bucketed by user_id.
Expectation:
- Minimal file rewrites
- Predictable memory usage
- No explosion of small files
2. Incremental ingestion
10k to 200k rows per batch, mostly inserts, some updates.
Expectation:
- Good write throughput
- Stable resource usage
3. Large backfill / bulk merge
Millions of rows, high match rate.
Expectation:
- Efficient batch processing
- Predictable rewrite cost
4. Memory constrained environments
Small containers or serverless.
Expectation:
- No pathological memory spikes
- No extreme expression or join planning overhead
It would help to clarify which of these upsert is primarily optimized for.
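For the sparse micro-update scenario, the number of distinct buckets touched is a lower bound on the files a CoW upsert must rewrite. The sketch below estimates it; Python's `hash()` stands in for Iceberg's actual bucket transform (which uses murmur3), since the distribution, not the exact hash, is the point:

```python
import random

# Estimate how many distinct buckets a sparse upsert touches.
# hash() is a stand-in for Iceberg's bucket transform (murmur3);
# only the distribution matters for this illustration.

def buckets_touched(user_ids, num_buckets):
    return len({hash(uid) % num_buckets for uid in user_ids})

random.seed(42)
updated = random.sample(range(10_000_000), 10)  # 10 random rows out of 10M
n = buckets_touched(updated, num_buckets=16)

# With 10 random keys and 16 buckets, most keys land in distinct buckets,
# so a CoW upsert rewrites close to 10 files even though only 10 rows changed.
print(f"{n} of 16 buckets touched by a 10-row upsert")
```

In expectation roughly 8 of 16 buckets are hit, so the rewrite cost scales with the number of distinct buckets touched, not with the 10 rows actually modified.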
3. Current conceptual model
Conceptually, upsert behaves like:
1. Delete matching rows
2. Insert new rows
For Copy on Write behavior, the delete step already rewrites the data files containing matching rows, even if only a single row matches.
The insert step then writes updated rows again as new data files.
In sparse scenarios, this can mean:
- Full file rewrites for single row updates
- Additional small files created by the insert step
- Redundant IO for rows that are effectively being replaced
This is logically correct, but not always optimal.
4. Alternative framing: file level rewrite
An alternative way to conceptualize upsert (for Copy on Write style tables) would be:
1. Identify affected data files
2. Load each affected file once
3. Apply updates in memory
4. Write merged files
5. Commit as file replacements in a single snapshot
Instead of modeling upsert as delete + insert, model it as:
Rewrite affected files with merged content.
This could reduce redundant IO and small file creation, at the cost of different memory and planning trade offs. This is similar to how AWS Athena does a MERGE INTO.
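The five steps above can be sketched with plain dicts standing in for data files. The file names and `{key: row}` layout are invented for illustration; real data files are Parquet and the commit is an Iceberg snapshot:

```python
# Sketch of the file-level rewrite strategy. Dicts of {key: row} stand in
# for Parquet data files; names and layout are invented for illustration.

data_files = {
    "file-a.parquet": {1: "a", 2: "b"},
    "file-b.parquet": {3: "c"},
    "file-c.parquet": {5: "e"},  # untouched by this upsert
}
updates = {2: "B", 4: "d"}  # key 2 matches file-a; key 4 is a brand-new row

# Step 1: identify affected files (those containing a matching key).
affected = {name for name, rows in data_files.items()
            if rows.keys() & updates.keys()}

# Steps 2-4: load each affected file once, merge updates in memory,
# and produce the replacement file contents.
replacements = {}
matched = set()
for name in affected:
    merged = dict(data_files[name])
    for k in updates.keys() & merged.keys():
        merged[k] = updates[k]
        matched.add(k)
    replacements[name] = merged

# Unmatched keys are pure inserts; they go into one new file rather than
# extra small files produced by a separate insert step.
inserts = {k: v for k, v in updates.items() if k not in matched}
if inserts:
    replacements["file-new.parquet"] = inserts

# Step 5: commit = atomically swap affected files for their replacements.
committed = {**{n: r for n, r in data_files.items() if n not in affected},
             **replacements}
print(sorted(committed))
```

Each affected file is read and written exactly once, and untouched files (`file-c.parquet` here) never move, which is the IO saving this framing is after.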
More broadly, there appear to be at least two conceptual strategies:
- Merge on Read style: delete files + append
- Copy on Write style: rewrite affected data files
It may be worth documenting or formalizing this distinction, even if not immediately user configurable.
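For contrast with the CoW rewrite, the Merge on Read style can be sketched at toy scale: the write path only appends, deletions live in a separate delete set, and readers reconcile the two using sequence numbers (loosely mirroring Iceberg's equality deletes; the structures below are simplifications):

```python
# Toy sketch of the Merge-on-Read style. Each data file carries a sequence
# number; a delete set applies only to files written earlier (lower
# sequence number), loosely mirroring Iceberg's equality deletes.

data_files = [(1, {1: "a", 2: "b"}), (1, {3: "c"})]
updates = {2: "B", 4: "d"}

# Write path of an upsert: append the new rows plus an equality-delete set
# at the next sequence number. No existing file is rewritten.
seq = 2
delete_sets = [(seq, set(updates))]
data_files.append((seq, dict(updates)))

# Read path: a row survives unless a delete set with a strictly newer
# sequence number contains its key.
def scan(files, deletes):
    out = {}
    for file_seq, rows in files:
        for k, v in rows.items():
            if any(k in keys for dseq, keys in deletes if dseq > file_seq):
                continue  # superseded row, filtered at read time
            out[k] = v
    return out

print(scan(data_files, delete_sets))
```

The trade off is visible even at this scale: MoR makes the write cheap (pure appends) but pushes the reconciliation work onto every subsequent read until a compaction rewrites the files.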
5. Python vs Rust execution
In some discussions, there is an implicit suggestion that deeper optimization should wait for a Rust execution layer.
That is a valid long term direction. However, it would help to clarify:
- Is the Python upsert intended to be production grade and performance competitive?
- Or is it primarily a functional implementation that will eventually defer heavy execution to Rust?
If Python is first class, then strategy clarity and optimization likely matter now.
If not, that should be documented so expectations are aligned.
6. Main questions
- Which workload is upsert primarily optimized for?
- Should strategy (MoR vs CoW style planning) be explicit or configurable?
- Should we document clear trade offs and expected behavior per workload?
- What is the intended long term role of Python vs Rust in execution?
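If strategy did become configurable, one hypothetical shape is an explicit enum plus a heuristic default. To be clear, the `UpsertStrategy` enum and `choose_strategy` heuristic below are invented for discussion; PyIceberg's upsert exposes no such option today:

```python
from enum import Enum

# Hypothetical API sketch for discussion only. PyIceberg's upsert does not
# currently expose a strategy option; all names here are invented.

class UpsertStrategy(Enum):
    COPY_ON_WRITE = "copy-on-write"   # rewrite affected data files
    MERGE_ON_READ = "merge-on-read"   # delete files + append

def choose_strategy(rows_in_batch: int, match_rate: float) -> UpsertStrategy:
    """Toy default heuristic: sparse updates favor MoR (cheap writes),
    large high-match-rate merges favor CoW (no read-time reconciliation)."""
    if rows_in_batch < 10_000 and match_rate < 0.01:
        return UpsertStrategy.MERGE_ON_READ
    return UpsertStrategy.COPY_ON_WRITE

print(choose_strategy(10, 1e-6))        # sparse micro update
print(choose_strategy(2_000_000, 0.8))  # large backfill / bulk merge
```

Even a crude heuristic like this would at least make the per-workload trade offs explicit instead of implicit in the implementation.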
I’m happy to help structure documentation or experiments around these workload categories.