Upsert in PyIceberg: Use Cases, Trade Offs, and Strategy
I’ve previously worked on upsert internals (e.g. #1878 on complex type comparison fallbacks and #1995 on transaction semantics and batch scanning). While working in this area, I started thinking that we may benefit from clarifying the intended positioning and strategy of upsert.
I’d like to frame this discussion from workload impact first, then implementation.
1. Start from impact
In certain workloads, especially sparse updates into large tables, upsert can cause disproportionate:
- IO
- memory usage
- small file creation

relative to the number of rows modified.
This suggests we may be implicitly optimizing for one class of workload while others pay a high cost.
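To make "disproportionate" concrete, here is a back-of-the-envelope estimate of write amplification for a sparse Copy on Write upsert. Every number below (row size, file size, files touched) is an illustrative assumption, not a measurement:

```python
# Illustrative write-amplification estimate for a sparse CoW upsert.
# All sizes here are assumptions chosen for the example, not measurements.

rows_updated = 10
row_size_bytes = 100                 # assumed average row size
data_file_size = 128 * 1024 * 1024   # assumed 128 MiB target file size
files_touched = 10                   # worst case: each row lands in a different file

logical_change = rows_updated * row_size_bytes    # ~1 KB of changed data
bytes_rewritten = files_touched * data_file_size  # ~1.28 GiB rewritten
amplification = bytes_rewritten / logical_change

print(f"changed: {logical_change} B, rewritten: {bytes_rewritten} B")
print(f"write amplification: {amplification:,.0f}x")
```

Under these assumptions a 10-row update rewrites over a million times more bytes than it logically changes, which is the cost profile the sparse scenarios below run into.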
2. Representative use cases
Some distinct scenarios:
1. Sparse micro updates
Upsert 10 random rows into a 10M row table, bucketed by user_id.
Expectation:
- Minimal file rewrites
- Predictable memory usage
- No explosion of small files
2. Incremental ingestion
10k to 200k rows per batch, mostly inserts, some updates.
Expectation:
- Good write throughput
- Stable resource usage
3. Large backfill / bulk merge
Millions of rows, high match rate.
Expectation:
- Efficient batch processing
- Predictable rewrite cost
4. Memory constrained environments
Small containers or serverless.
Expectation:
- No pathological memory spikes
- No extreme expression or join planning overhead
It would help to clarify which of these upsert is primarily optimized for.
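For the sparse micro-update scenario, the number of distinct buckets touched is a lower bound on the files a CoW upsert must rewrite. The sketch below estimates it; Python's `hash()` stands in for Iceberg's actual bucket transform (which uses murmur3), since the distribution, not the exact hash, is the point:

```python
import random

# Estimate how many distinct buckets a sparse upsert touches.
# hash() is a stand-in for Iceberg's bucket transform (murmur3);
# only the distribution matters for this illustration.

def buckets_touched(user_ids, num_buckets):
    return len({hash(uid) % num_buckets for uid in user_ids})

random.seed(42)
updated = random.sample(range(10_000_000), 10)  # 10 random rows out of 10M
n = buckets_touched(updated, num_buckets=16)

# With 10 random keys and 16 buckets, most keys land in distinct buckets,
# so a CoW upsert rewrites close to 10 files even though only 10 rows changed.
print(f"{n} of 16 buckets touched by a 10-row upsert")
```

In expectation roughly 8 of 16 buckets are hit, so the rewrite cost scales with the number of distinct buckets touched, not with the 10 rows actually modified.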
3. Current conceptual model
Conceptually, upsert behaves like:
1. Delete matching rows
2. Insert new rows
For Copy on Write behavior, the delete step already rewrites the data files containing matching rows, even if only a single row matches.
The insert step then writes updated rows again as new data files.
In sparse scenarios, this can mean:
- Full file rewrites for single row updates
- Additional small files created by the insert step
- Redundant IO for rows that are effectively being replaced
This is logically correct, but not always optimal.
4. Alternative framing: file level rewrite
An alternative way to conceptualize upsert (for Copy on Write style tables) would be:
1. Identify affected data files
2. Load each affected file once
3. Apply updates in memory
4. Write merged files
5. Commit as file replacements in a single snapshot
Instead of modeling upsert as delete + insert, model it as:
Rewrite affected files with merged content.
This could reduce redundant IO and small file creation, at the cost of different memory and planning trade offs. This is similar to how AWS Athena does a MERGE INTO.
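The five steps above can be sketched with plain dicts standing in for data files. The file names and `{key: row}` layout are invented for illustration; real data files are Parquet and the commit is an Iceberg snapshot:

```python
# Sketch of the file-level rewrite strategy. Dicts of {key: row} stand in
# for Parquet data files; names and layout are invented for illustration.

data_files = {
    "file-a.parquet": {1: "a", 2: "b"},
    "file-b.parquet": {3: "c"},
    "file-c.parquet": {5: "e"},  # untouched by this upsert
}
updates = {2: "B", 4: "d"}  # key 2 matches file-a; key 4 is a brand-new row

# Step 1: identify affected files (those containing a matching key).
affected = {name for name, rows in data_files.items()
            if rows.keys() & updates.keys()}

# Steps 2-4: load each affected file once, merge updates in memory,
# and produce the replacement file contents.
replacements = {}
matched = set()
for name in affected:
    merged = dict(data_files[name])
    for k in updates.keys() & merged.keys():
        merged[k] = updates[k]
        matched.add(k)
    replacements[name] = merged

# Unmatched keys are pure inserts; they go into one new file rather than
# extra small files produced by a separate insert step.
inserts = {k: v for k, v in updates.items() if k not in matched}
if inserts:
    replacements["file-new.parquet"] = inserts

# Step 5: commit = atomically swap affected files for their replacements.
committed = {**{n: r for n, r in data_files.items() if n not in affected},
             **replacements}
print(sorted(committed))
```

Each affected file is read and written exactly once, and untouched files (`file-c.parquet` here) never move, which is the IO saving this framing is after.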
More broadly, there appear to be at least two conceptual strategies:
- Merge on Read style: delete files + append
- Copy on Write style: rewrite affected data files
It may be worth documenting or formalizing this distinction, even if not immediately user configurable.
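For contrast with the CoW rewrite, the Merge on Read style can be sketched at toy scale: the write path only appends, deletions live in a separate delete set, and readers reconcile the two using sequence numbers (loosely mirroring Iceberg's equality deletes; the structures below are simplifications):

```python
# Toy sketch of the Merge-on-Read style. Each data file carries a sequence
# number; a delete set applies only to files written earlier (lower
# sequence number), loosely mirroring Iceberg's equality deletes.

data_files = [(1, {1: "a", 2: "b"}), (1, {3: "c"})]
updates = {2: "B", 4: "d"}

# Write path of an upsert: append the new rows plus an equality-delete set
# at the next sequence number. No existing file is rewritten.
seq = 2
delete_sets = [(seq, set(updates))]
data_files.append((seq, dict(updates)))

# Read path: a row survives unless a delete set with a strictly newer
# sequence number contains its key.
def scan(files, deletes):
    out = {}
    for file_seq, rows in files:
        for k, v in rows.items():
            if any(k in keys for dseq, keys in deletes if dseq > file_seq):
                continue  # superseded row, filtered at read time
            out[k] = v
    return out

print(scan(data_files, delete_sets))
```

The trade off is visible even at this scale: MoR makes the write cheap (pure appends) but pushes the reconciliation work onto every subsequent read until a compaction rewrites the files.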
5. Python vs Rust execution
In some discussions, there is an implicit suggestion that deeper optimization should wait for a Rust execution layer.
That is a valid long term direction. However, it would help to clarify:
- Is the Python upsert intended to be production grade and performance competitive?
- Or is it primarily a functional implementation that will eventually defer heavy execution to Rust?
If Python is first class, then strategy clarity and optimization likely matter now.
If not, that should be documented so expectations are aligned.
6. Main questions
- Which workload is upsert primarily optimized for?
- Should strategy (MoR vs CoW style planning) be explicit or configurable?
- Should we document clear trade offs and expected behavior per workload?
- What is the intended long term role of Python vs Rust in execution?
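If strategy did become configurable, one hypothetical shape is an explicit enum plus a heuristic default. To be clear, the `UpsertStrategy` enum and `choose_strategy` heuristic below are invented for discussion; PyIceberg's upsert exposes no such option today:

```python
from enum import Enum

# Hypothetical API sketch for discussion only. PyIceberg's upsert does not
# currently expose a strategy option; all names here are invented.

class UpsertStrategy(Enum):
    COPY_ON_WRITE = "copy-on-write"   # rewrite affected data files
    MERGE_ON_READ = "merge-on-read"   # delete files + append

def choose_strategy(rows_in_batch: int, match_rate: float) -> UpsertStrategy:
    """Toy default heuristic: sparse updates favor MoR (cheap writes),
    large high-match-rate merges favor CoW (no read-time reconciliation)."""
    if rows_in_batch < 10_000 and match_rate < 0.01:
        return UpsertStrategy.MERGE_ON_READ
    return UpsertStrategy.COPY_ON_WRITE

print(choose_strategy(10, 1e-6))        # sparse micro update
print(choose_strategy(2_000_000, 0.8))  # large backfill / bulk merge
```

Even a crude heuristic like this would at least make the per-workload trade offs explicit instead of implicit in the implementation.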
I’m happy to help structure documentation or experiments around these workload categories.