Skip to content

Spec: Add content stats to spec#14234

Open
nastra wants to merge 8 commits intoapache:mainfrom
nastra:content-stats-spec
Open

Spec: Add content stats to spec#14234
nastra wants to merge 8 commits intoapache:mainfrom
nastra:content-stats-spec

Conversation

@nastra
Copy link
Copy Markdown
Contributor

@nastra nastra commented Oct 2, 2025

These spec changes are related to the content stats proposal. Note that the root content_stats field hasn't been added yet, as its ID and position in the manifest currently depends on the v4 re-structuring at the manifest layer.

@github-actions github-actions Bot added the Specification Issues that may introduce spec changes. label Oct 2, 2025
@github-actions
Copy link
Copy Markdown

github-actions Bot commented Nov 2, 2025

This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@iceberg.apache.org list. Thank you for your contributions.

@github-actions github-actions Bot added the stale label Nov 2, 2025
@nastra nastra added not-stale and removed stale labels Nov 3, 2025
@nastra nastra force-pushed the content-stats-spec branch from e86f035 to a1f6ad4 Compare November 24, 2025 14:07
@nastra nastra requested a review from danielcweeks November 24, 2025 14:08
@nastra nastra force-pushed the content-stats-spec branch 2 times, most recently from ae11c0a to 919c682 Compare November 24, 2025 14:32
Comment thread format/spec.md Outdated
Comment thread format/spec.md Outdated
Comment thread format/spec.md Outdated
Comment on lines +723 to +724
| avg_value_count | `int` | 4 | false | The avg value count for variable-length types (string/binary) |
| max_value_count | `long` | 5 | false | The max value count for variable-length types (string/binary) |
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be avg_value_length and max_value_length. I don't think this represents a count.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rdblue since you brought up those two in the design doc, any thoughts about the exact naming of these two fields?

Copy link
Copy Markdown
Contributor

@rdblue rdblue Apr 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dan is correct. These should be avg_value_size_in_bytes.

We should check with engine people to see if we want to include the max. If it isn't going to be valuable in the short term, we should add it later. Also, I think that we need to be consistent about the types. This is long but the average is int.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At least for Spark I think avg/max would be useful as we could report those through SupportsReportStatistics to Spark for CBO

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ebyhr could the avg/max value size be used in Trino to have a better estimate for the data_size statistic?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this size on disk or in memory? Trino CBO (TableScanStatsRule) uses the average size. The max value size is unused as far as I know.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nastra, did you confirm whether max is used by Spark?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rdblue yes and I think the only place where the max would be used by Spark is in the DSv2 Column Stats: https://github.com/apache/spark/blob/dc57455589a9f160638ab3526ca359eb84f76d67/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/Statistics.scala#L101, which we could pass via the SupportsReportStatistics API

Comment thread format/spec.md Outdated
Comment thread format/spec.md Outdated
Comment thread format/spec.md Outdated

When calculating upper and lower bounds for `geometry` and `geography`, null or NaN values in a coordinate dimension are skipped; for example, POINT (1 NaN) contributes a value to X but no values to Y, Z, or M dimension bounds. If a dimension has only null or NaN values, that dimension is omitted from the bounding box. If either the X or Y dimension is missing then the bounding box itself is not produced.

#### Content Stats
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This location is fine for this content. However, I think that it should be merged with the section above and it should cover stats in v3 and v4. The purpose is to direct people to the section on file granularity stats and have all of the information here, rather than needing to know whether to go to the section above for v3 and "Content Stats" for v4.

Comment thread format/spec.md Outdated
Comment thread format/spec.md Outdated
Comment thread format/spec.md Outdated
Comment thread format/spec.md Outdated
Comment thread format/spec.md Outdated
Comment thread format/spec.md Outdated
Comment thread format/spec.md Outdated
Comment thread format/spec.md Outdated
Comment thread format/spec.md Outdated
Comment thread format/spec.md Outdated
Comment thread format/spec.md
Comment thread format/spec.md Outdated
Comment thread format/spec.md Outdated

`stats_field_id = stats_struct_id + offset`

**Metadata columns (reserved table field ids)**
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think that we should have a space reserved for metadata fields. I think that the only metadata fields that need stats are _row_id and _last_updated_sequence_number. The other reserved fields are not stored in data files and are assigned IDs so that we can expose them as metadata columns in reads or changelog scans.

Both _row_id and _last_updated_sequence_number are useful to identify changes and are useful to include in stats. But because they are inherited values, we need to be careful and specify how to handle them. As long as we have those called out specifically, I think it makes sense to assign their stats specific ID ranges instead of reserving a big range (2,400+ stats structs?) for all reserved IDs.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that we should store _row_id stats starting at 9,600 and _last_updated_sequence_number starting at 9,800. We can assign any other metadata columns specific ranges as well, if and when we need them.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dealing with all of the reserved field IDs was a big discussion point in the design doc, hence why I came up with a calculation to generally deal with reserved field IDs. Also having custom defined ID ranges for a particular subset of field IDs seems brittle to me, because we'd have to define a new range whenever a new reserved field ID is added anywhere to the schema

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I updated this to a static table with this description:

Reserved metadata fields must use the stats ID ranges from the following table. Stats for metadata fields not in the table are not tracked.

We can still have stats for reserved fields, but I think it is better to reserve the stats ranges specifically as well.

Comment thread format/spec.md
The formula is defined as:
`stats_struct_id = stats_space_field_id_start_for_metadata_fields + (num_supported_stats_per_column * (num_reserved_field_ids - (Integer.MAX_VALUE - table_field_id)))`

Valid data field ids support stats structs with ids from `10_000` through `200_010_000`, so the highest supported **data** field id is `1_000_000`.
Copy link
Copy Markdown
Contributor

@rdblue rdblue May 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't like that this puts an arbitrary limit on table field IDs that was not previously in the spec. It is easy to interpret this as "there can be no table field ID higher than 1,000,000".

What we actually want to do is to allocate a range of stats field IDs that correspond to the first 1,000,000 (ish) fields. That way, we don't impose a new limit on the number of fields in a table. We just impose a limit on the fields that can be represented in the new stats structure.

To do that, I think we should change this to say that IDs in metadata files between 10,000 (inclusive) and 200,000,000 (exclusive) are reserved for column stats structs. If a table field has an ID that would be outside of that range, then it cannot store stats but is still valid.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I updated this in nastra#139 to:

IDs in the range 10_000 (inclusive) to 200_000_000 (exclusive) are reserved for column stats structs in content_stats. Stats for table fields with IDs outside that range cannot be stored in content_stats.

I also removed the line about the "highest supported data field id".

Comment thread format/spec.md Outdated
Comment thread format/spec.md Outdated
Per-column metrics used for filtering and planning are stored at **file** granularity on the `data_file` struct.
In v3, implementations use maps such as `value_counts`, `lower_bounds`, and `upper_bounds`, keyed by column id, with bounds serialized as binary (see note 1 under [Data File Fields](#data-file-fields)).
Iceberg v4 adds the optional `content_stats` struct, which holds the same *logical* metrics (without `column_sizes` as that was deprecated) using nested structs and typed bounds (see [Content Stats](#content-stats)).
Stats for primitive types and Geometry / Geography / Variant are supported. Stats for nested types (`struct`, `list`, `map`) are not supported.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This statement is confusing and probably unnecessary. I think that this is trying to state that column ranges (lower/upper bounds) are tracked for primitive columns. We don't need to disallow value count and null count for nested types, although I don't think that we would encourage people to produce them.

This also mixes in bounds for variant, geometry, and geography but that's the very next section. We should make sure that section still makes sense with content stats, but the point is not that bounds are supported, it is that bounds means something different: for geo types bounds are a bounding box, and for variant bounds are lower bounds for fields within the variant. But as I said, it is the next section so it doesn't need to be mentioned here.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In nastra#139, I handled this by specifying that lower and upper bounds can be stored for all primitives and variant. That way maps, lists, and structs are excluded from lower and upper bounds, but value and null count are allowed.

Comment thread format/spec.md Outdated
Comment thread format/spec.md Outdated
Comment thread format/spec.md Outdated
Comment thread format/spec.md Outdated
###### ID assignment for stats fields

ID assignment follows a deterministic mapping from the **table ID space** to the **stats ID space**, where a given field ID from the **table ID space** gets an ID assigned from the **stats ID space** for each field-level statistics struct.
Each field-level statistic listed in the [field stats types section](#field-stats-types) has a fixed offset. Its stats field ID is the enclosing stats struct's ID plus that offset.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that this section should not focus on ID assignment, but rather how to construct the struct that holds a field's statistics. To me, that starts with a short description: the stats are assigned an ID range based on the table field ID. The fields of the struct are assigned by adding an offset to the beginning of that range. Then we should have a table with the offsets.

Comment thread format/spec.md Outdated
Comment thread format/spec.md Outdated
Comment thread format/spec.md Outdated
Comment thread format/spec.md

###### Name assignment for `content_stats` fields

Each nested stats struct is a **child field** of the root `content_stats` struct. Its **name** is the numerical string of the table column's field id (for example id `103` uses the name `"103"`).
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think that this requirement is a good idea. We are going to rewrite this when reading to produce a struct that uses the current field name. Names don't matter, so we should just use the current name for readability.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I updated this to:

Content status must be resovled by ID; field names used for stats structs are informational. The recommended name for each field is the full name of the field in the table schema.

Comment thread format/spec.md Outdated
Comment thread format/spec.md Outdated
Comment thread format/spec.md Outdated
Comment thread format/spec.md Outdated
Comment thread format/spec.md Outdated
Comment thread format/spec.md Outdated
Comment thread format/spec.md
| 2 | 10_400 |
| 5 | 11_000 |
| 100 | 30_000 |
| 1_000_000 | 200_010_000 |
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be more helpful to have a full example of an int field with an ID and the corresponding full struct type.

Comment thread format/spec.md Outdated
Comment thread format/spec.md Outdated
Comment thread format/spec.md Outdated
Comment thread format/spec.md Outdated
@nastra nastra force-pushed the content-stats-spec branch from ee3f024 to 94c221a Compare May 7, 2026 10:42
@rdblue
Copy link
Copy Markdown
Contributor

rdblue commented May 8, 2026

@nastra, I opened nastra#139 with a lot of edits to simplify and apply some of the edits I noted. I also cleaned up the relationship to the existing variant and geo bounds, although variant still has some outstanding work (how to shred).

I'm going to move on to the section for how to maintain stats in manifest files next.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

not-stale Specification Issues that may introduce spec changes.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants