Spark: Support writing shredded variant in Iceberg-Spark by aihuaxu · Pull Request #14297 · apache/iceberg

aihuaxu · 2025-10-11T21:02:14Z

What it does

This PR adds support for writing shredded variants from Spark into Iceberg tables. Variant shredding extracts commonly-typed fields from semi-structured VARIANT columns into dedicated typed Parquet columns (typed_value), enabling predicate pushdown, column pruning, and better read performance.

Key design: Buffered schema inference

Because the shredded schema isn't known at Spark's planning time (DSv2 creates DataWriterFactory on the driver before seeing data), the PR uses a lazy/buffered approach:

A new BufferedFileAppender buffers the first N rows.
A VariantShreddingAnalyzer analyzes the buffered rows to infer the shredded schema.
Once the schema is determined, the real Parquet writer is created and the buffer is flushed.

Shredding heuristics

Most common type wins: for each field, the type that appears most frequently becomes the typed_value type.
Frequency pruning: fields appearing in less than 10% of sampled rows are dropped.
Field cap: maximum 300 shredded fields.
Deterministic tie-breaking: explicit priority maps to ensure stable schemas regardless of record order.
Decimal special handling: precision/scale must be consistent; if not, decimal is not shredded.
Null fields are skipped: JSON null values ({"field": null}) don't create shredded columns.

Co-Authored by: @nssalian

aihuaxu · 2025-10-15T01:57:15Z

@amogh-jahagirdar @Fokko @huaxingao Can you help take a look at this PR and if we have better approach for this?

aihuaxu · 2025-10-21T04:59:31Z

cc @RussellSpitzer, @pvary and @rdblue Seems it's better to have the implementation with new File Format proposal but want to check if this is acceptable approach as an interim solution or you see a better alternative.

pvary · 2025-10-21T10:01:30Z

@aihuaxu: Don't we want to do the same but instead of wrapping the ParquetWriter, we could wrap the DataWriter. The schema would be created near the SparkWrite.WriterFactory and it would be easier to move to the new API when it is ready. The added benefit would be that when other formats implement the Variant, we could reuse the code.

Would this be prohibitively complex?

huaxingao · 2025-10-21T18:32:50Z

In Spark DSv2, planning/validation happens on the driver. BatchWrite#createBatchWriterFactory runs on the driver and returns a DataWriterFactory that is serialized to executors. That factory must already carry the write schema the executors will use when they create DataWriters.

For shredded variant, we don’t know the shredded schema at planning time. We have to inspect some records to derive it. Doing a read on the driver during createBatchWriterFactory would mean starting a second job inside planning, which is not how DSv2 is intended to work.

Because of that, the current proposed Spark approach is: put the logical variant in the writer factory, on the executor, buffer the first N rows, infer the shredded schema from data, then initialize the concrete writer and flush the buffer. I believe this PR follow the same approach, which seems like a practical solution to me given DSV2's constraints.

pvary · 2025-10-22T08:47:11Z

Thanks for the explanation, @huaxingao! I see several possible workarounds for the DataWriterFactory serialization issue, but I have some more fundamental concerns about the overall approach.
I believe shredding should be driven by future reader requirements rather than by the actual data being written. Ideally, it should remain relatively stable across data files within the same table and originate from a writer job configuration—or even better, from a table-level configuration.

Even if we accept that the written data should dictate the shredding logic, Spark’s implementation—while dependent on input order—is at least somewhat stable. It drops rarely used fields, handles inconsistent types, and limits the number of columns.
I understand this is only a PoC implementation for shredding, but I’m concerned that the current simplifications make it very unstable. If I’m interpreting correctly, the logic infers the type from the first occurrence of each field and creates a column for every field. This could lead to highly inconsistent column layouts within a table, especially in IoT scenarios where multiple sensors produce vastly different data.
Did I miss anything?

aihuaxu · 2025-10-24T16:28:26Z

Thanks @huaxingao and @pvary for reviewing, and thanks to Huaxin for explaining how the writer works in Spark.

Regarding the concern about unstable schemas, Spark's approach makes sense:

If a field appears consistently with a consistent type, create both value and typed_value
If a field appears with inconsistent types, create only value
Drop fields that occur in less than 10% of sampled rows
Cap the total at 300 fields (counting value and typed_value separately)

We could implement similar heuristics. Additionally, making the shredded schema configurable would allow users to choose which fields to shred at write time based on their read patterns.

For this POC, I'd like any feedback on whether there are any significant high-level design options to consider first and if this approach is acceptable. This seems hacky. I may have missed big picture on how the writers work across Spark + Iceberg + Parquet and we may have better way.

github-actions · 2025-11-24T00:19:23Z

This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@iceberg.apache.org list. Thank you for your contributions.

Tishj · 2025-11-30T21:28:05Z

This PR caught my eye, as I've implemented the equivalent in DuckDB: duckdb/duckdb#19336

The PR description doesn't give much away, but I think the approach is similar to the proposed (interim) solution here: buffer the first rowgroup, infer the shredded schema from this, then finalize the file schema and start writing data.

We've opted to create a typed_value even though the type isn't 100% consistent within the buffered data, as long as it's the most common. I think you're losing potential compression by not doing that.

We've also added a copy option to force the shredded schema, for debugging purposes and for power users.

As for DECIMAL, it's kind of a special case in the shredding inference. We only shred on a DECIMAL type if all the decimal values we've seen for a column/field have the same width+scale, if any decimal value differs, DECIMAL won't be considered anymore when determining the shredded type of the column/field

yguy-ryft · 2025-12-24T17:43:26Z

This PR is super exciting!
Does this rely on variant shredding support in Spark? Is it supported in Spark 4.1 already, or planned for future releases?

Regarding the heuristics - I'd like to propose adding table properties as hints for variant shredding.
Similarly to properties used for bloom filters, it could be good to introduce something like write.parquet.variant-shredding-enabled.column.col1, which will hint to the writer that this column is important for shredding.
Many variants have important fields for which shredding should be enforced, and other fields which are less central and can be managed with simpler heuristics.
Would love to hear your thoughts!

aihuaxu · 2026-01-09T19:55:50Z

This PR caught my eye, as I've implemented the equivalent in DuckDB: duckdb/duckdb#19336

The PR description doesn't give much away, but I think the approach is similar to the proposed (interim) solution here: buffer the first rowgroup, infer the shredded schema from this, then finalize the file schema and start writing data.

That is correct.

We've opted to create a typed_value even though the type isn't 100% consistent within the buffered data, as long as it's the most common. I think you're losing potential compression by not doing that.

I'm still trying to improve the heuristics to use the most common one as shredding type rather than the first one and probably cap the number of shredded fields, etc. but it doesn't need 100% consistent type to be shredded.

We've also added a copy option to force the shredded schema, for debugging purposes and for power users.

Yeah. I think that makes sense for advanced user to determine the shredded schema since they may know the read pattern.

As for DECIMAL, it's kind of a special case in the shredding inference. We only shred on a DECIMAL type if all the decimal values we've seen for a column/field have the same width+scale, if any decimal value differs, DECIMAL won't be considered anymore when determining the shredded type of the column/field

Why is DECIMAL special here? If we determine DECIMAL4 to be shredded type, then we may shred as DECIMAL4 or not shred if they cannot fit in DECIMAL4, right?

aihuaxu · 2026-01-09T19:58:25Z

This PR is super exciting! Does this rely on variant shredding support in Spark? Is it supported in Spark 4.1 already, or planned for future releases?

Regarding the heuristics - I'd like to propose adding table properties as hints for variant shredding. Similarly to properties used for bloom filters, it could be good to introduce something like write.parquet.variant-shredding-enabled.column.col1, which will hint to the writer that this column is important for shredding. Many variants have important fields for which shredding should be enforced, and other fields which are less central and can be managed with simpler heuristics. Would love to hear your thoughts!

Yeah. I'm also thinking of that too. Will address that separately. Basically based on read pattern, the user can specify the shredding schema.

gkpanda4

When processing JSON objects containing null field values (e.g., {"field": null}), the variant shredding creates schema columns for these null fields instead of omitting them entirely. This would cause schema bloat.

Adding a null check in ParquetVariantUtil.java:386 in the object() method should fix it.

aihuaxu · 2026-01-15T19:39:27Z

When processing JSON objects containing null field values (e.g., {"field": null}), the variant shredding creates schema columns for these null fields instead of omitting them entirely. This would cause schema bloat.

Adding a null check in ParquetVariantUtil.java:386 in the object() method should fix it.

I addressed this null value check in VariantShreddingAnalyzer.java instead. If it's NULL, then we will not add the shredded field.

qlong · 2026-04-30T16:35:17Z

I looked at VariantShreddingAnalyzer and SparkVariantShreddingAnalyzer, implementation looks good, just minor nit.

The current strategy is to shred aggressively, including fields with multiple incompatible types by picking the most common one. When a field has mixed types, the shredded typed_value is only populated for rows whose value matches the chosen type; other rows still carry the full binary value. This means bounded column reads are not available for mixed-type fields, and the performance gain relative to the added column overhead is not clear.

I am not suggesting the current design is flawed. Shredding parameters like MIN_FIELD_FREQUENCY and MAX_SHREDDED_FIELDS can be tuned or new strategies introduced in follow-ups without breaking existing files. But more performance testing on real query patterns would help inform whether these thresholds need to be user-tunable. I would not block merge on this, assuming the community agrees.

qlong · 2026-04-30T15:29:33Z

+  private static final String VALUE = "value";
+  private static final String ELEMENT = "element";
+  private static final double MIN_FIELD_FREQUENCY = 0.10;
+  private static final int MAX_SHREDDED_FIELDS = 300;


Maybe we can make those shredding params configurable in the future, or after more performance testing?

qlong · 2026-04-30T15:37:12Z

+
+  private static class PathNode {
+    private final String fieldName;
+    private final Map<String, PathNode> objectChildren = Maps.newTreeMap();


Performance nit: this map is on the hot path, tree map requires string comparison. Maybe change it to hashmap since ordering is not important until after pruning createObjectTypedValue? We do sort there:

// createObjectTypedValue: sort once here private static Type createObjectTypedValue(PathNode node) { List<PathNode> sorted = Lists.newArrayList(node.objectChildren.values()); sorted.sort(Comparator.comparing(child -> child.fieldName)); ... }

nssalian · 2026-04-30T17:20:20Z

Thanks for the reviews @steveloughran @qlong - all great points. I'd like to land this PR as-is and I can follow up with a PR to address these since the PR is already large. I summarized here:

Configurable shredding parameters for workload tuning
TreeMap to HashMap optimization in PathNode, sort once at schema build time
TIE_BREAK_PRIORITY javadoc + reorder STRING above BINARY
Debug logging in buildShreddedAppender
Switch statement in ParquetFormatModel.set()
Docs: qualify query performance claim

None of these affect correctness. Happy to open the follow-up immediately after merge if there is agreement.

qlong

I focused on shredding analyzer and it looks good to me

nssalian · 2026-05-05T16:39:02Z

Will address @huaxingao's comments in an upcoming commit. I also realized that this PR was originally only on Spark 4.1. I'll can add the changes to Spark 4.0 too. Or should I do that in a follow up PR after this is merged?
The sequence would be

This PR merges
I'll follow up with the items here: Spark: Support writing shredded variant in Iceberg-Spark #14297 (comment)
PR for Spark 4.0 with all the above changes.
@aihuaxu @pvary let me know.

huaxingao · 2026-05-06T05:38:15Z

+
+      GroupType typedValue = variantGroup.getType("typed_value").asGroupType();
+      assertThat(typedValue.containsField("a")).isTrue();
+      assertThat(typedValue.containsField("b")).isTrue();


The test verifies the shredded schema and the data round-trip. Should we also verify the data is in the typed columns to prove the data is really shredded?

Updated the test with check for the data in the typed_value

huaxingao · 2026-05-06T19:05:15Z

+    // Verify data is in typed columns by reading raw Parquet groups
+    try (ParquetReader<Group> rawReader =
+        ParquetReader.builder(
+                new GroupReadSupport(), new org.apache.hadoop.fs.Path(outputFile.location()))


nit: import org.apache.hadoop.fs.Path. You can fix this in the followup PR.

I saw that in another test too here and the TestParquetDataWriter has import java.nio.file.Path so it would conflict. I'm not sure if there is a better way.

huaxingao

LGTM

huaxingao · 2026-05-06T20:14:46Z

Thanks @aihuaxu @nssalian for the PR! Thanks every one for the review!

nssalian · 2026-05-07T14:32:50Z

I'll open a follow-up PR to address the pending items here after @pvary's backport PR goes in for Spark 4.0.

…6241) backports #14297

github-actions Bot added spark parquet labels Oct 11, 2025

aihuaxu force-pushed the spark-write-iceberg-variant branch from 16b7a09 to dc4f72e Compare October 11, 2025 21:03

aihuaxu marked this pull request as ready for review October 11, 2025 21:15

aihuaxu force-pushed the spark-write-iceberg-variant branch 3 times, most recently from 97851f0 to b87e999 Compare October 13, 2025 16:47

huaxingao reviewed Oct 21, 2025

View reviewed changes

Comment thread parquet/src/main/java/org/apache/iceberg/parquet/ParquetWriter.java Outdated

deniskuzZ mentioned this pull request Oct 31, 2025

HIVE-29287: Iceberg: [V3] Variant Shredding support apache/hive#6152

Merged

github-actions Bot added the stale label Nov 24, 2025

github-actions Bot removed the stale label Dec 1, 2025

gkpanda4 reviewed Jan 14, 2026

View reviewed changes

aihuaxu force-pushed the spark-write-iceberg-variant branch 2 times, most recently from 2e81d79 to 7e1b608 Compare January 15, 2026 19:35

aihuaxu force-pushed the spark-write-iceberg-variant branch 4 times, most recently from 7c805f6 to 67dbe97 Compare January 15, 2026 22:50

qlong reviewed Apr 30, 2026

View reviewed changes

qlong approved these changes Apr 30, 2026

View reviewed changes

aihuaxu requested review from RussellSpitzer and huaxingao May 4, 2026 19:01