Core, Spark: Add JMH benchmarks for Variants #15629

Open
steveloughran wants to merge 31 commits into apache:main from steveloughran:pr/benchmark-variant

Conversation

@steveloughran (Contributor) commented Mar 13, 2026

Fixes #15628

core: VariantSerializationBenchmark

Separate benchmarks for

  • serializing a prebuilt object
  • deserializing

Variables are (a harness sketch follows the list):

  • depth: [shallow, nested, deep-nested]
  • percentage of fields shredded: [0, 33, 67, 100]
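As a rough illustration of the harness shape, here is a minimal JMH sketch. The `buildVariant()`/`roundTrip()` helpers are hypothetical placeholders, not the Iceberg variant API; only the `@Param`/`@State`/`Blackhole` structure reflects the benchmark described above.

```java
import java.nio.ByteBuffer;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Fork;
import org.openjdk.jmh.annotations.Param;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.infra.Blackhole;

@State(Scope.Benchmark)
@Fork(1) // the commit notes below use fork(0) for fast iteration; fork(1) for real numbers
public class VariantSerializationSketch {

  @Param({"shallow", "nested", "deep-nested"})
  public String depth;

  @Param({"0", "33", "67", "100"})
  public int shreddedPercent;

  private ByteBuffer serialized;

  @Setup
  public void setup() {
    // Stand-in: build and serialize one variant shaped by the two parameters.
    serialized = buildVariant(depth, shreddedPercent);
  }

  @Benchmark
  public void serializePrebuilt(Blackhole bh) {
    // Stand-in for serializing a prebuilt variant object.
    bh.consume(buildVariant(depth, shreddedPercent));
  }

  @Benchmark
  public void deserialize(Blackhole bh) {
    // Stand-in for deserializing the variant from its binary form.
    bh.consume(roundTrip(serialized.duplicate()));
  }

  // Placeholder bodies so the sketch compiles; the real benchmark builds
  // genuine variant objects here.
  private static ByteBuffer buildVariant(String depth, int pct) {
    return ByteBuffer.wrap(new byte[] {(byte) depth.length(), (byte) pct});
  }

  private static ByteBuffer roundTrip(ByteBuffer buf) {
    return buf.asReadOnlyBuffer();
  }
}
```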

spark-4.1: IcebergSourceVariantReadBenchmark

Generate Avro, unshredded Parquet, and shredded Parquet tables with the same variant data, then compare performance for basic filter and project operations against the normal columns and the variant fields. (Representative query shapes are sketched below.)
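A sketch of the three query shapes being compared, assuming a hypothetical table `t` with a normal `id` column and a variant column `nested`; the `variant_get` paths match those appearing in the benchmark source further down.

```java
import org.apache.spark.sql.SparkSession;

public class VariantQuerySketch {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .master("local[*]")
        .appName("variant-query-sketch")
        .getOrCreate();

    // Project a normal column: fast for Parquet, shredded or not.
    spark.sql("SELECT id FROM t").count();

    // Project a field out of the variant: comparable across shredded and unshredded.
    spark.sql("SELECT variant_get(nested, '$.varid', 'int') FROM t").count();

    // Filter on a variant field: the slow case, markedly worse when shredded.
    spark.sql("SELECT id FROM t WHERE variant_get(nested, '$.varcategory', 'int') = 0").count();

    spark.stop();
  }
}
```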

Key findings:

  • Although they have the smallest file size, Parquet files with shredded variants perform significantly worse than unshredded ones when working with the variant structs.
  • Avro is fastest for the variant data, though since every operation has to read the entire file, operations on the other columns are (as expected) slower.
  • Filtering is the slow operation. Projecting a variant column, shredded or unshredded, is as fast as projecting a normal Parquet column.

I'm not reaching any conclusion about why this is the case. I am looking at improving the performance of reconstructing string fields in parquet-java, as those benchmarks show needless byte-string-byte conversion. For the Iceberg benchmark and the layers below it, I think knowing where issues like these lie is enough for this change.

Writeup

See https://steveloughran.github.io/benchmarking-variants/ for the writeup and the interactive benchmark results of Iceberg and Parquet benchmarks.

@github-actions bot added the core label Mar 13, 2026
@steveloughran changed the title from "Add JMH benchmarks for Variants" to "Core: Add JMH benchmarks for Variants" Mar 13, 2026
@steveloughran marked this pull request as draft March 16, 2026 16:52
@steveloughran reopened this Mar 24, 2026
@steveloughran changed the title from "Core: Add JMH benchmarks for Variants" to "Core, Spark: Add JMH benchmarks for Variants" Mar 24, 2026
@steveloughran (Contributor, Author) commented Apr 1, 2026

@rashworld-max still a WIP, I'm afraid. I need to know I'm measuring the right thing. Also, I can't tell from your profile whether or not you are a human.

Fixes apache#15628

Separate benchmarks for
* building
* serializing a prebuilt object
* deserializing

Variables are:
 - fields: [1000, 10000]
 - depth: [shallow, nested]
 - percentage of fields shredded: [0, 33, 67, 100]

Note: the current benchmark does NOT fork the JVM, as this allows for fast iterative development.
A final merge should switch to fork(1).

The benchmarks don't show any surprises, which is good.
Vectorized Parquet read is disabled.

Results (1M rows, 10 files, vectorization disabled)

| Benchmark            | Unshredded (s/op) | Shredded (s/op) | Ratio       |
|----------------------|-------------------|-----------------|-------------|
| Full read            | 0.969 ±0.152      | 1.819 ±0.737    | 1.9x slower |
| Projection (id only) | 0.223 ±0.038      | 0.273 ±0.121    | 1.2x slower |
| Filter (category=0)  | 0.864 ±0.413      | 1.574 ±0.164    | 1.8x slower |
| variant_get($.value) | 1.351 ±0.070      | 2.415 ±0.192    | 1.8x slower |

Analysis

1. Projection now works: selecting just id (skipping the variant) is ~4x faster than a full read, confirming the variant column is the bottleneck.
2. Shredded is consistently ~1.8-1.9x slower for all operations reading variant data. The shredded reader must reconstruct the variant object from multiple Parquet columns (metadata + value + typed_value per field), which currently costs more than reading a single binary blob.
3. The projection gap is small (0.223 vs 0.273 s/op): when skipping the variant column entirely, the shredded table is only slightly slower, due to marginally more metadata/schema overhead.
4. variant_get doesn't exploit shredding: extracting a single field from shredded data (2.415s) is slower than from unshredded (1.351s), meaning the reader isn't short-circuiting to read just the typed Parquet column.
5. Filter provides no file-skipping: category values 0-9 are uniformly distributed across all files, so every file must be read regardless. Filter times are close to full-read times minus some row-level filtering benefit.

The key takeaway: the current read path doesn't yet take advantage of shredding optimizations (column pruning within variants, predicate pushdown to typed columns). These benchmarks provide a baseline to measure improvements as those optimizations are implemented. A sketch of the multi-column layout described in point 2 follows.
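To make point 2 concrete, here is a sketch of the multi-column layout a shredded variant reader must reassemble, expressed with parquet-java's schema builder. The group and field names follow the description above (metadata + value + typed_value per field) and the benchmark's nested/varid naming; treat the exact shape as illustrative rather than as the shredding spec.

```java
import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName;
import org.apache.parquet.schema.Types;

public class ShreddedLayoutSketch {
  public static void main(String[] args) {
    // One variant column "nested" with a single shredded field "varid":
    // the reader must stitch the metadata, the residual value blob, and the
    // typed column back into one variant object per row.
    MessageType schema =
        Types.buildMessage()
            .requiredGroup()
                .required(PrimitiveTypeName.BINARY).named("metadata")
                .optional(PrimitiveTypeName.BINARY).named("value")
                .optionalGroup()
                    .optionalGroup()
                        .optional(PrimitiveTypeName.BINARY).named("value")
                        .optional(PrimitiveTypeName.INT64).named("typed_value")
                    .named("varid")
                .named("typed_value")
            .named("nested")
            .named("table");
    System.out.println(schema);
  }
}
```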
* Made ParquetVariantUtil public @VisibleForTesting
* includes SQL query for filtering on a variant field

Dev setups imply shredding is slower all round, at least with this test data.
* deep nesting generates deeply nested structures
* benchmark cost of construction alone

Shows that there's a penalty for deep objects, inevitably
due to the hashtable.
But anything else is really complex, and deep is niche.
Shows least compression, best performance.
Now tuning specs for all queries to only return the row ID, avoiding reconstruction costs (should boost Parquet perf all round).
Selecting only the ID column restores general Parquet performance.
Queries and data to emphasise the value of columnar data formats

* variant to add a string based on category (so 20 values)
* always filter on ID before count
* benchmark names tuned to look better in reports
Note that the logs will show the localfs iostats in teardown, which includes bytes read.
Spark benchmark now also writes a file, so renamed.
Shows that shredded write is faster than Avro or unshredded,
presumably as less data is written.

Promote various setupBenchmark variables to fields to ease this.
Doing a parquet benchmark in sync with this, so aligning.
+ restore more iterations and larger files now that debugging is done
* no dataframe/rdd operations, everything is a single SQL statement
* avro listed first in the parameterized benchmark
* shredding/unshredding files correctly set up.

-class ParquetVariantUtil {
+@VisibleForTesting
+public final class ParquetVariantUtil {
Contributor commented:
Is it possible to relocate the tests rather than expose this? We can do this, but generally prefer not to if we can avoid it.

@steveloughran (Author) replied Apr 12, 2026:

happy to do that...I have done it in the parquet PR


so ParquetVariantUtil can revert to being package private
@steveloughran marked this pull request as ready for review April 14, 2026 20:03
@steveloughran (Contributor, Author) commented:

I think this is ready for review. I've got the initial results, and it's good for PRs like #3477 to be able to run before/after benchmarks.

More stuff can go in later; I've outlined it in my report. Equality deletes would be a fun one.

No pathologically bad numbers seen here. Arrays are anonymous so there's
no field lookup to add a cost.
*/
private long materializeNonEmpty(String operation, Dataset<?> ds) {
  LOG.info("{} table={}", operation, tableType);
  final long count = ds.count();
Member commented:

Spark doesn't need to evaluate projection to count records.

@steveloughran (Author) replied:

I needed something to force the entire compute, and count() worked. Otherwise it's evaluating every row and feeding it to a black hole. What do you prefer?

Member replied:

Spark can count records without evaluating projection so it's not really testing the projection here.

@steveloughran (Author) replied:

it seemed to work, but I will get and discard each record instead
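A sketch of what "get and discard each record" could look like with the Spark Java API; the class and method names here are illustrative, not the PR's actual code.

```java
import org.apache.spark.api.java.function.ForeachFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.util.LongAccumulator;

final class MaterializeAllSketch {
  /** Force full evaluation of every row, counting as a side effect. */
  static long materializeAll(SparkSession spark, Dataset<Row> ds) {
    LongAccumulator rows = spark.sparkContext().longAccumulator("rows");
    // foreach runs on the executors and deserializes each row, so the
    // projection (including any variant reconstruction) really executes;
    // count() alone can be answered without evaluating the projection.
    ds.foreach((ForeachFunction<Row>) row -> rows.add(1));
    return rows.value();
  }
}
```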


"variant_get(nested, '$.varcategory', 'int')";

/** Get the ID field from inside the variant: {@value}. */
private static final String VARIANT_GET_NESTED_ID = "variant_get(nested, '$.varid', 'int')";
Member commented:

should this be int64 as in the comments above?

@steveloughran (Author) replied:

will review

TODO: create an issue
…oups it is much slower.

Reason? Many more Spark tasks running, and the overhead of that.
@github-actions bot added the build label Apr 30, 2026
benchmark changes of 7d4ec4823a