
#15510 Parquet: Support row group skipping for shredded variant columns #16133

Draft
steveloughran wants to merge 8 commits into apache:main from steveloughran:pr/variant-rowgroups

Conversation

Contributor

@steveloughran steveloughran commented Apr 27, 2026

ParquetMetricsRowGroupFilter.compareVariant() implements comparisons for variants, including NaN and null handling.

  • The new ParquetVariantUtil splitter regexp can't cope with columns called ]. That's OK, as normalisation forbids both that and empty paths.
  • Copilot wrote the tests, so they are over-verbose but thorough, including NaN behaviour and string truncation on max values.
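
For readers new to statistics-based skipping, here is a minimal sketch of the kind of decision involved. It is not the code in this PR: the class `RowGroupSkipSketch`, the method `mightMatchEq` and the `maxTruncated` flag are all hypothetical names. It assumes a row group exposes min/max bounds plus value and null counts, that a NaN bound makes the range unusable, and that a truncated max is a plain prefix of the true max (not rounded up). String ordering here is Java's `compareTo`, which may differ from Parquet's byte ordering outside ASCII.

```java
/** Illustrative sketch only; not the PR's ParquetMetricsRowGroupFilter code. */
final class RowGroupSkipSketch {
  private RowGroupSkipSketch() {}

  /** True if a row group might hold a row whose value equals the literal. */
  public static boolean mightMatchEq(
      Double min, Double max, long valueCount, long nullCount, double literal) {
    if (valueCount > 0 && nullCount == valueCount) {
      return false; // every value is null, so equality can never match
    }
    if (min == null || max == null) {
      return true; // no statistics: the row group cannot be skipped
    }
    if (Double.isNaN(min) || Double.isNaN(max) || Double.isNaN(literal)) {
      return true; // NaN breaks ordering, so the bounds prove nothing
    }
    return literal >= min && literal <= max;
  }

  /**
   * String variant. Assumption: when maxTruncated is set, max is a prefix of
   * the true upper bound, so any literal starting with it may still be in range.
   * A truncated min prefix is still a valid lower bound and needs no special case.
   */
  public static boolean mightMatchEq(
      String min, String max, boolean maxTruncated, String literal) {
    if (min == null || max == null) {
      return true; // no statistics: the row group cannot be skipped
    }
    if (literal.compareTo(min) < 0) {
      return false; // below the lower bound
    }
    if (literal.compareTo(max) <= 0) {
      return true; // inside the recorded range
    }
    return maxTruncated && literal.startsWith(max);
  }

  public static void main(String[] args) {
    System.out.println(mightMatchEq(1.0, 5.0, 10, 0, 7.0));
    System.out.println(mightMatchEq(1.0, Double.NaN, 10, 0, 7.0));
    System.out.println(mightMatchEq("a", "apple", true, "applepie"));
  }
}
```

The conservative rule throughout is to return true (read the row group) whenever the statistics cannot prove a miss.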

Fixes #15510

Testing notes

The full branch with this change, Qlong's API changes and the benchmark is at https://github.com/steveloughran/iceberg/tree/pr/variant-rowgroups-benchmark
It does need a matching Spark version.

@steveloughran steveloughran marked this pull request as draft April 27, 2026 16:55
Contributor Author

ParquetMetricsRowGroupFilter.compareVariant() implements comparisons
for variants, including NaN and null handling.

* The new ParquetVariantUtil splitter regexp can't cope with columns called ].
  That's OK, as normalisation forbids both that and empty paths.
* Copilot wrote the tests, so they are over-verbose but thorough,
  including NaN behaviour and string truncation on max values.
Goal: a speedup rather than a slowdown.
There's no concurrency handling here in the build-up of that lazy structure,
BTW. Once I'm happy with the design it'll need to be locked down better.
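
One possible way to lock down a lazily built structure is a memoizing holder with double-checked locking. This is a generic sketch, not the design this PR will adopt: `LazySketch` is a hypothetical name, and it assumes the loader never returns null and is safe to run once under a lock.

```java
import java.util.function.Supplier;

/** Illustrative sketch only: thread-safe, build-once lazy holder. */
final class LazySketch<T> implements Supplier<T> {
  private final Supplier<T> loader;
  // volatile so a fully built value is safely published to other threads
  private volatile T value;

  LazySketch(Supplier<T> loader) {
    this.loader = loader;
  }

  @Override
  public T get() {
    T result = value; // single volatile read on the fast path
    if (result == null) {
      synchronized (this) {
        result = value; // re-check: another thread may have built it
        if (result == null) {
          result = loader.get(); // assumed to return non-null
          value = result;
        }
      }
    }
    return result;
  }

  public static void main(String[] args) {
    LazySketch<String> lazy = new LazySketch<>(() -> "built");
    System.out.println(lazy.get());
  }
}
```

The trade-off is a lock on first access in exchange for building the structure at most once; later reads cost a single volatile read.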
…kipping

Copilot's fix for why pushdown wasn't working, independent of
qlong's apache#15385

I plan to take qlong's change and pull across whatever is extra in this one.
+ add set membership probe
+ logging at info
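
One reading of "set membership probe" is an IN-style predicate checked against row group bounds. The sketch below illustrates that reading only; `SetMembershipSketch` and `mightContainAny` are hypothetical names, and the commit may mean something different. It assumes exact (untruncated) min/max bounds and values with a natural ordering.

```java
import java.util.Collection;
import java.util.List;

/** Illustrative sketch only: IN-predicate check against min/max bounds. */
final class SetMembershipSketch {
  private SetMembershipSketch() {}

  /** True if any literal falls inside [min, max], so the row group must be read. */
  public static <T extends Comparable<T>> boolean mightContainAny(
      T min, T max, Collection<T> literals) {
    if (min == null || max == null) {
      return true; // no statistics: the row group cannot be skipped
    }
    for (T literal : literals) {
      if (literal.compareTo(min) >= 0 && literal.compareTo(max) <= 0) {
        return true; // at least one literal is within the bounds
      }
    }
    return false; // every literal is outside the bounds (or the set is empty)
  }

  public static void main(String[] args) {
    System.out.println(mightContainAny(10, 20, List.of(1, 15)));
    System.out.println(mightContainAny(10, 20, List.of(1, 25)));
  }
}
```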

off pr/variant-rowgroups-benchmark
removed the IcebergSourceVariantIOBenchmark changes.
+ adjust ParquetMetricsRowGroupFilter to keep filter complexity below 13.

Development

Successfully merging this pull request may close these issues.

Support row group skipping for shredded variant columns
