Core, Spark: Add JMH benchmarks for Variants #15629
steveloughran wants to merge 31 commits into apache:main
Conversation
Force-pushed from 7c4f806 to 2be00b9.
@rashworld-max still a WIP I'm afraid. Need to know I'm measuring the right thing. Also I can't tell from your profile whether or not you are a human.
Fixes apache#15628

Separate benchmarks for
* building
* serializing a prebuilt object
* deserializing

Variables are:
- fields: [1000, 10000]
- depth: [shallow, nested]
- percentage of fields shredded: [0, 33, 67, 100]

Note: the current benchmark does NOT fork the JVM, as this allows for fast iterative development. A final merge should switch to fork(1).

The benchmarks don't show any surprises, which is good.
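For reference, a minimal JMH sketch of that parameter matrix, assuming nothing about the real benchmark class: every name below (VariantSerializationSketch, buildVariant, serializeVariant) is illustrative, and the placeholder bodies stand in for the actual variant construction and serialization.

```java
import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.infra.Blackhole;

@State(Scope.Benchmark)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
@Fork(0) // no fork for fast iterative development; a final merge would switch to @Fork(1)
public class VariantSerializationSketch {

  @Param({"1000", "10000"})
  private int fields;

  @Param({"shallow", "nested"})
  private String depth;

  @Param({"0", "33", "67", "100"})
  private int shreddedPercent;

  // built once per trial so that the serialization benchmark measures only its own cost
  private Object prebuilt;

  @Setup(Level.Trial)
  public void setup() {
    prebuilt = buildVariant(fields, depth, shreddedPercent);
  }

  @Benchmark
  public void build(Blackhole bh) {
    bh.consume(buildVariant(fields, depth, shreddedPercent));
  }

  @Benchmark
  public void serialize(Blackhole bh) {
    bh.consume(serializeVariant(prebuilt));
  }

  private static Object buildVariant(int fields, String depth, int shreddedPercent) {
    return new Object(); // placeholder: build a variant with the given shape
  }

  private static byte[] serializeVariant(Object variant) {
    return new byte[0]; // placeholder: serialize the prebuilt variant
  }
}
```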
Vectorized parquet read disabled.

Results (1M rows, 10 files, vectorization disabled):

| Benchmark            | Unshredded (s/op) | Shredded (s/op) | Ratio       |
|----------------------|-------------------|-----------------|-------------|
| Full read            | 0.969 ±0.152      | 1.819 ±0.737    | 1.9x slower |
| Projection (id only) | 0.223 ±0.038      | 0.273 ±0.121    | 1.2x slower |
| Filter (category=0)  | 0.864 ±0.413      | 1.574 ±0.164    | 1.8x slower |
| variant_get($.value) | 1.351 ±0.070      | 2.415 ±0.192    | 1.8x slower |

Analysis

1. Projection now works: selecting just id (skipping the variant) is ~4x faster than a full read, confirming the variant column is the bottleneck.
2. Shredded is consistently ~1.8-1.9x slower for all operations reading variant data. The shredded reader must reconstruct the variant object from multiple Parquet columns (metadata + value + typed_value per field), which currently costs more than reading a single binary blob.
3. The projection gap is small (0.223 s vs 0.273 s): when the variant column is skipped entirely, the shredded table is only slightly slower due to marginally more metadata/schema overhead.
4. variant_get doesn't exploit shredding: extracting a single field from shredded data (2.415 s) is slower than from unshredded (1.351 s), meaning the reader isn't short-circuiting to read just the typed Parquet column.
5. Filter provides no file-skipping: category values 0-9 are uniformly distributed across all files, so every file must be read regardless. Filter times are close to full-read times, minus some row-level filtering benefit.

The key takeaway: the current read path doesn't yet take advantage of shredding optimizations (column pruning within variants, predicate pushdown to typed columns). These benchmarks provide a baseline to measure improvements as those optimizations are implemented.
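To make the four benchmark shapes concrete, here is a hedged Spark SQL sketch of the queries being compared; the table name (variant_bench), the column names (id, category, payload) and the 'int' target type are assumptions for illustration, not the benchmark's actual identifiers.

```java
import java.util.List;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

final class VariantReadShapes {

  /** The four read shapes; the harness still has to fully materialize each dataset. */
  static List<Dataset<Row>> queries(SparkSession spark) {
    return List.of(
        // Full read: every column, variant included
        spark.sql("SELECT * FROM variant_bench"),
        // Projection: id only, so the variant column can be skipped entirely
        spark.sql("SELECT id FROM variant_bench"),
        // Filter on a plain column; the variant is still read for matching rows
        spark.sql("SELECT * FROM variant_bench WHERE category = 0"),
        // variant_get: extract a single typed field from inside the variant
        spark.sql("SELECT variant_get(payload, '$.value', 'int') FROM variant_bench"));
  }
}
```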
* Made ParquetVariantUtil public with @VisibleForTesting
* Includes SQL query for filtering on a variant field

Dev-setup runs imply shredding is slower all round, at least with the test data.
* Deep nesting generates deeply nested structures
* Benchmark cost of construction alone

Shows that there's a penalty for deep objects, inevitably due to the hashtable. But anything else is really complex, and deep nesting is a niche case.
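As a rough illustration of the two shapes being compared, here is a sketch with plain HashMaps standing in for the real variant builder (whose API this deliberately does not use): the shallow case pays one hash insert per field, while the deep case adds another hashtable to build and later traverse at every level.

```java
import java.util.HashMap;
import java.util.Map;

final class NestingShapes {

  /** Shallow: a single object with `fields` keys, one hash insert per field. */
  static Map<String, Object> shallow(int fields) {
    Map<String, Object> obj = new HashMap<>();
    for (int i = 0; i < fields; i++) {
      obj.put("field_" + i, i);
    }
    return obj;
  }

  /** Deep: a chain of nested objects, one extra hashtable per level of depth. */
  static Map<String, Object> deep(int levels) {
    Map<String, Object> current = new HashMap<>();
    current.put("value", 0);
    for (int i = levels - 1; i >= 0; i--) {
      Map<String, Object> parent = new HashMap<>();
      parent.put("level_" + i, current);
      current = parent;
    }
    return current;
  }
}
```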
Shows least compression, best performance. Now tuning specs for all queries to only return the row ID, avoiding reconstruction costs (should boost Parquet performance all round).
Selecting only the ID column restores general Parquet performance.
Queries and data to emphasise the value of columnar data formats:
* variant adds a string based on category (so 20 values)
* always filter on ID before count
* benchmark names tuned to look better in reports
Note that the logs will show the localfs iostats in teardown, which include bytes read.
Spark benchmark now also writes a file, so renamed. Shows that the shredded write is faster than Avro or unshredded, presumably because less data is written. Promotes various setupBenchmark variables to fields to ease this.
Doing a Parquet benchmark in sync with this, so aligning.
+ restore more iterations and larger files now that debugging is done
* No dataframe/RDD operations; everything is a single SQL statement
* Avro listed first in the parameterized benchmark
* Shredded/unshredded files correctly set up
Force-pushed from 70c69f8 to 25c6f29.
```diff
-class ParquetVariantUtil {
+@VisibleForTesting
+public final class ParquetVariantUtil {
```
Is it possible to relocate the tests rather than expose this? We can do this, but generally prefer not to if we can avoid it.
happy to do that...I have done it in the parquet PR
so ParquetVariantUtil can revert to being package private
I think this is ready for review. I've got the initial results, and it's good for PRs like #3477 to be able to do before/after benchmarks. More stuff can go in later; I've outlined it in my report. Equality deletes would be a fun one.
No pathologically bad numbers seen here. Arrays are anonymous so there's no field lookup to add a cost.
```java
 */
private long materializeNonEmpty(String operation, Dataset<?> ds) {
  LOG.info("{} table={}", operation, tableType);
  final long count = ds.count();
```
Spark doesn't need to evaluate projection to count records.
I needed something to do the entire compute and count() worked. Otherwise it's evaluate every row and feed to a black hole. What do you prefer?
Spark can count records without evaluating projection so it's not really testing the projection here.
it seemed to work, but I will get and discard each record instead
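A minimal sketch of that "get and discard each record" approach, assuming a JMH Blackhole is available to swallow the rows; the class and method names are illustrative, not the PR's code.

```java
import java.util.Iterator;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.openjdk.jmh.infra.Blackhole;

final class MaterializeSketch {

  /** Force full evaluation of every row instead of relying on count(). */
  static long materialize(Dataset<Row> ds, Blackhole blackhole) {
    long rows = 0;
    // toLocalIterator() streams partitions back one at a time, so the projection
    // really is evaluated for every row before it is discarded into the blackhole.
    for (Iterator<Row> it = ds.toLocalIterator(); it.hasNext(); ) {
      blackhole.consume(it.next());
      rows++;
    }
    return rows;
  }
}
```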
| "variant_get(nested, '$.varcategory', 'int')"; | ||
|
|
||
| /** Get the ID field from inside the variant: {@value}. */ | ||
| private static final String VARIANT_GET_NESTED_ID = "variant_get(nested, '$.varid', 'int')"; |
should this be int64 as in the comments above?
TODO: create an issue
…oups it is much slower. Reason: many more Spark tasks running, and the overhead of that.
benchmark changes of 7d4ec4823a
Fixes #15628
core:VariantSerializationBenchmark
Separate benchmarks for
* building
* serializing a prebuilt object
* deserializing

Variables are:
- fields: [1000, 10000]
- depth: [shallow, nested]
- percentage of fields shredded: [0, 33, 67, 100]
spark-4.1:IcebergSourceVariantReadBenchmark
Generate Avro, unshredded Parquet and shredded Parquet tables with the same variant data, then compare performance for basic filter and project operations against the normal columns and the variant fields.
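A hedged sketch of how that three-way comparison can be parameterized in a single JMH benchmark; the identifiers, table naming and master URL are assumptions, and only the variant_get expression echoes the constants shown earlier in this thread.

```java
import java.util.concurrent.TimeUnit;
import org.apache.spark.sql.SparkSession;
import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.infra.Blackhole;

@State(Scope.Benchmark)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.SECONDS)
@Fork(1)
public class VariantReadComparisonSketch {

  // Avro listed first, matching the ordering noted earlier in the thread
  @Param({"avro", "parquet_unshredded", "parquet_shredded"})
  private String tableType;

  private SparkSession spark;
  private String table;

  @Setup(Level.Trial)
  public void setupBenchmark() {
    spark = SparkSession.builder().master("local[4]").getOrCreate();
    table = "variant_bench_" + tableType; // assumes one pre-generated table per format
  }

  @TearDown(Level.Trial)
  public void tearDownBenchmark() {
    spark.stop();
  }

  @Benchmark
  public void filterOnVariantField(Blackhole bh) {
    // A single SQL statement per benchmark: count ids after filtering on a variant field
    bh.consume(
        spark.sql("SELECT count(id) FROM " + table
                + " WHERE variant_get(nested, '$.varcategory', 'int') = 0")
            .collectAsList());
  }
}
```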
Key findings:
I'm not reaching any conclusion on why this is the case. I am looking at improving the performance of reconstructing string fields in parquet-java, as those benchmarks show needless byte-to-string-to-byte conversion. For the Iceberg benchmark and the layers below, I think knowing where issues like this lie is enough for this change.
Writeup
See https://steveloughran.github.io/benchmarking-variants/ for the writeup and the interactive results of the Iceberg and Parquet benchmarks.