
fix: respect Parquet column order when pruning #22118

Open
kumarUjjawal wants to merge 1 commit into apache:main from kumarUjjawal:fix/parquet_order_prunning


@kumarUjjawal
Contributor

Which issue does this PR close?

Rationale for this change

Parquet min/max statistics are only safe to use when DataFusion knows the order used to write them. If DataFusion uses stats with the wrong order, it can skip data that should be read and return wrong results.
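A small illustration of the failure mode, assuming a hypothetical writer that computed min/max over single-byte values using signed comparison while the reader interprets the same bytes as unsigned. The function names are made up for this sketch and are not DataFusion or parquet APIs.

```rust
/// Hypothetical writer: computes (min, max) by comparing bytes as signed i8.
fn signed_min_max(values: &[u8]) -> (u8, u8) {
    let min = *values.iter().min_by_key(|v| **v as i8).unwrap();
    let max = *values.iter().max_by_key(|v| **v as i8).unwrap();
    (min, max)
}

/// Hypothetical reader: decides whether a chunk may contain `target`,
/// interpreting the stored min/max as unsigned values.
fn may_contain_unsigned(min: u8, max: u8, target: u8) -> bool {
    min <= target && target <= max
}

fn main() {
    let values = [1u8, 100, 200];

    // 200 reinterpreted as i8 is -56, so the signed writer stores
    // min = 200 and max = 100.
    let (min, max) = signed_min_max(&values);
    assert_eq!((min, max), (200, 100));

    // The value 1 is present in the data...
    assert!(values.contains(&1));
    // ...but an unsigned reader concludes it cannot be, and would skip the chunk.
    assert!(!may_contain_unsigned(min, max, 1));
}
```

The mismatch makes the reader prune a chunk that holds a matching row, which is the wrong-results case described above.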

What changes are included in this PR?

This PR checks Parquet column order before using min/max statistics for file, row group, and page pruning.

It also skips unsafe deprecated min/max statistics for columns whose natural order is not signed.

For files without column order metadata, DataFusion now uses a conservative fallback. It only uses min/max pruning for signed-order columns.
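The rules described above can be sketched as a single decision function. This is only an illustration of the stated behavior, not the code in this PR: the enum names, the `can_prune_with_min_max` function, and the assumption that known column order metadata always permits pruning are all hypothetical.

```rust
/// Hypothetical: the natural sort order of a column's type.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum NaturalOrder {
    Signed,
    Unsigned,
}

/// Hypothetical: whether the file carries column order metadata.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ColumnOrderMeta {
    Known,
    Missing,
}

/// Returns true when min/max statistics may be used for pruning.
fn can_prune_with_min_max(
    meta: ColumnOrderMeta,
    natural: NaturalOrder,
    deprecated_stats: bool,
) -> bool {
    // Deprecated min/max stats are skipped unless the natural order is signed.
    if deprecated_stats && natural != NaturalOrder::Signed {
        return false;
    }
    match meta {
        // Assumption for this sketch: known order metadata makes the stats usable.
        ColumnOrderMeta::Known => true,
        // Conservative fallback: only signed-order columns are pruned.
        ColumnOrderMeta::Missing => natural == NaturalOrder::Signed,
    }
}

fn main() {
    use ColumnOrderMeta::*;
    use NaturalOrder::*;

    assert!(can_prune_with_min_max(Known, Unsigned, false));
    assert!(can_prune_with_min_max(Missing, Signed, false));
    assert!(!can_prune_with_min_max(Missing, Unsigned, false));
    assert!(!can_prune_with_min_max(Known, Unsigned, true));
    assert!(can_prune_with_min_max(Known, Signed, true));
}
```

Returning `false` here only disables pruning for that column, so the affected data is read in full rather than skipped.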

Are these changes tested?

Yes.

Are there any user-facing changes?

Yes.

Queries over older or external Parquet files without column order metadata may scan more data for unsigned integers, strings, binary values, and booleans. This is intentional to avoid wrong results.

@github-actions (bot) added the documentation (Improvements or additions to documentation) and datasource (Changes to the datasource crate) labels on May 12, 2026
}
}

#[test]
Contributor

Is there some way to reduce the boilerplate in these tests? Also, are we sure they are all necessary?


Labels

datasource Changes to the datasource crate documentation Improvements or additions to documentation


Development

Successfully merging this pull request may close these issues.

Parquet Statistics Pruning Ignores ColumnOrder, resulting in potentially incorrect statistics

2 participants