Spec: Add content stats to spec by nastra · Pull Request #14234 · apache/iceberg

nastra · 2025-10-02T09:43:34Z

These spec changes are related to the content stats proposal. Note that the root content_stats field hasn't been added yet, as its ID and position in the manifest currently depends on the v4 re-structuring at the manifest layer.

github-actions · 2025-11-02T00:19:02Z

This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@iceberg.apache.org list. Thank you for your contributions.

danielcweeks · 2026-02-04T16:06:18Z

+| avg_value_count  | `int`               | 4                                   | false    | The avg value count for variable-length types (string/binary)                                                                                                                  |
+| max_value_count  | `long`              | 5                                   | false    | The max value count for variable-length types (string/binary)                                                                                                                  |


Should this be avg_value_length and max_value_length. I don't think this represents a count.

@rdblue since you brought up those two in the design doc, any thoughts about the exact naming of these two fields?

Dan is correct. These should be avg_value_size_in_bytes.

We should check with engine people to see if we want to include the max. If it isn't going to be valuable in the short term, we should add it later. Also, I think that we need to be consistent about the types. This is long but the average is int.

At least for Spark I think avg/max would be useful as we could report those through SupportsReportStatistics to Spark for CBO

@ebyhr could the avg/max value size be used in Trino to have a better estimate for the data_size statistic?

Is this size on disk or in memory? Trino CBO (TableScanStatsRule) uses the average size. The max value size is unused as far as I know.

@nastra, did you confirm whether max is used by Spark?

@rdblue yes and I think the only place where the max would be used by Spark is in the DSv2 Column Stats: https://github.com/apache/spark/blob/dc57455589a9f160638ab3526ca359eb84f76d67/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/Statistics.scala#L101, which we could pass via the SupportsReportStatistics API

rdblue · 2026-04-01T23:18:41Z


 When calculating upper and lower bounds for `geometry` and `geography`, null or NaN values in a coordinate dimension are skipped; for example, POINT (1 NaN) contributes a value to X but no values to Y, Z, or M dimension bounds. If a dimension has only null or NaN values, that dimension is omitted from the bounding box. If either the X or Y dimension is missing then the bounding box itself is not produced.

+#### Content Stats


This location is fine for this content. However, I think that it should be merged with the section above and it should cover stats in v3 and v4. The purpose is to direct people to the section on file granularity stats and have all of the information here, rather than needing to know whether to go to the section above for v3 and "Content Stats" for v4.

rdblue · 2026-05-06T17:14:20Z

+
+`stats_field_id = stats_struct_id + offset`
+
+**Metadata columns (reserved table field ids)**


I don't think that we should have a space reserved for metadata fields. I think that the only metadata fields that need stats are _row_id and _last_updated_sequence_number. The other reserved fields are not stored in data files and are assigned IDs so that we can expose them as metadata columns in reads or changelog scans.

Both _row_id and _last_updated_sequence_number are useful to identify changes and are useful to include in stats. But because they are inherited values, we need to be careful and specify how to handle them. As long as we have those called out specifically, I think it makes sense to assign their stats specific ID ranges instead of reserving a big range (2,400+ stats structs?) for all reserved IDs.

I think that we should store _row_id stats starting at 9,600 and _last_updated_sequence_number starting at 9,800. We can assign any other metadata columns specific ranges as well, if and when we need them.

Dealing with all of the reserved field IDs was a big discussion point in the design doc, hence why I came up with a calculation to generally deal with reserved field IDs. Also having custom defined ID ranges for a particular subset of field IDs seems brittle to me, because we'd have to define a new range whenever a new reserved field ID is added anywhere to the schema

I updated this to a static table with this description:

Reserved metadata fields must use the stats ID ranges from the following table. Stats for metadata fields not in the table are not tracked.

We can still have stats for reserved fields, but I think it is better to reserve the stats ranges specifically as well.

rdblue · 2026-05-06T21:26:48Z

+The formula is defined as:
+`stats_struct_id = stats_space_field_id_start_for_metadata_fields + (num_supported_stats_per_column * (num_reserved_field_ids - (Integer.MAX_VALUE - table_field_id)))`
+
+Valid data field ids support stats structs with ids from `10_000` through `200_010_000`, so the highest supported **data** field id is `1_000_000`.


I don't like that this puts an arbitrary limit on table field IDs that was not previously in the spec. It is easy to interpret this as "there can be no table field ID higher than 1,000,000".

What we actually want to do is to allocate a range of stats field IDs that correspond to the first 1,000,000 (ish) fields. That way, we don't impose a new limit on the number of fields in a table. We just impose a limit on the fields that can be represented in the new stats structure.

To do that, I think we should change this to say that IDs in metadata files between 10,000 (inclusive) and 200,000,000 (exclusive) are reserved for column stats structs. If a table field has an ID that would be outside of that range, then it cannot store stats but is still valid.

I updated this in nastra#139 to:

IDs in the range 10_000 (inclusive) to 200_000_000 (exclusive) are reserved for column stats structs in content_stats. Stats for table fields with IDs outside that range cannot be stored in content_stats.

I also removed the line about the "highest supported data field id".

rdblue · 2026-05-06T21:33:50Z

+Per-column metrics used for filtering and planning are stored at **file** granularity on the `data_file` struct.
+In v3, implementations use maps such as `value_counts`, `lower_bounds`, and `upper_bounds`, keyed by column id, with bounds serialized as binary (see note 1 under [Data File Fields](#data-file-fields)).
+Iceberg v4 adds the optional `content_stats` struct, which holds the same *logical* metrics (without `column_sizes` as that was deprecated) using nested structs and typed bounds (see [Content Stats](#content-stats)).
+Stats for primitive types and Geometry / Geography / Variant are supported. Stats for nested types (`struct`, `list`, `map`) are not supported.


This statement is confusing and probably unnecessary. I think that this is trying to state that column ranges (lower/upper bounds) are tracked for primitive columns. We don't need to disallow value count and null count for nested types, although I don't think that we would encourage people to produce them.

This also mixes in bounds for variant, geometry, and geography but that's the very next section. We should make sure that section still makes sense with content stats, but the point is not that bounds are supported, it is that bounds means something different: for geo types bounds are a bounding box, and for variant bounds are lower bounds for fields within the variant. But as I said, it is the next section so it doesn't need to be mentioned here.

In nastra#139, I handled this by specifying that lower and upper bounds can be stored for all primitives and variant. That way maps, lists, and structs are excluded from lower and upper bounds, but value and null count are allowed.

rdblue · 2026-05-06T21:46:05Z

+###### ID assignment for stats fields
+
+ID assignment follows a deterministic mapping from the **table ID space** to the **stats ID space**, where a given field ID from the **table ID space** gets an ID assigned from the **stats ID space** for each field-level statistics struct.
+Each field-level statistic listed in the [field stats types section](#field-stats-types) has a fixed offset. Its stats field ID is the enclosing stats struct's ID plus that offset.


I think that this section should not focus on ID assignment, but rather how to construct the struct that holds a field's statistics. To me, that starts with a short description: the stats are assigned an ID range based on the table field ID. The fields of the struct are assigned by adding an offset to the beginning of that range. Then we should have a table with the offsets.

rdblue · 2026-05-06T21:58:57Z

+
+###### Name assignment for `content_stats` fields
+
+Each nested stats struct is a **child field** of the root `content_stats` struct. Its **name** is the numerical string of the table column's field id (for example id `103` uses the name `"103"`).


I don't think that this requirement is a good idea. We are going to rewrite this when reading to produce a struct that uses the current field name. Names don't matter, so we should just use the current name for readability.

I updated this to:

Content status must be resovled by ID; field names used for stats structs are informational. The recommended name for each field is the full name of the field in the table schema.

rdblue · 2026-05-06T22:07:47Z

+| 2                   | 10_400                    |
+| 5                   | 11_000                    |
+| 100                 | 30_000                    |
+| 1_000_000           | 200_010_000               |


I think it would be more helpful to have a full example of an int field with an ID and the corresponding full struct type.

rdblue · 2026-05-08T22:30:18Z

@nastra, I opened nastra#139 with a lot of edits to simplify and apply some of the edits I noted. I also cleaned up the relationship to the existing variant and geo bounds, although variant still has some outstanding work (how to shred).

I'm going to move on to the section for how to maintain stats in manifest files next.

github-actions Bot added the Specification Issues that may introduce spec changes. label Oct 2, 2025

github-actions Bot added the stale label Nov 2, 2025

nastra added not-stale and removed stale labels Nov 3, 2025

nastra force-pushed the content-stats-spec branch from e86f035 to a1f6ad4 Compare November 24, 2025 14:07

nastra requested a review from danielcweeks November 24, 2025 14:08

nastra force-pushed the content-stats-spec branch 2 times, most recently from ae11c0a to 919c682 Compare November 24, 2025 14:32

nastra requested review from amogh-jahagirdar and singhpk234 January 9, 2026 15:21

nastra commented Jan 22, 2026

View reviewed changes