fix: scan all partitions and use FilterExec for pre-scan #14
Merged
anoop-narang merged 1 commit into main on Mar 24, 2026
The pre-scan previously called `execute(0)` on the `DataSourceExec`, reading only the first partition's file group and missing valid keys from the rest of the dataset. This was a correctness bug: selectivity calculations and `valid_key` collection were based on partial data.

Wrap the pre-scan as `CoalescePartitionsExec → FilterExec → DataSourceExec`:

- `CoalescePartitionsExec` merges all partitions into a single stream
- `FilterExec` evaluates the predicate per partition (DataFusion's physical optimizer pushes it into the Parquet reader for pruning)
- The stream yields only matching rows, so no manual `evaluate_filters` is needed

This also removes `prescan_filters`, `evaluate_filters`, and the manual physical filter compilation from `SearchParams`, simplifying the code.
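To make the failure mode concrete, here is a minimal sketch in plain Rust that models partitions as in-memory vectors. It does not use DataFusion's API. `Row`, `matching_keys`, `prescan_partition_zero`, `prescan_all_partitions`, and `sample_partitions` are hypothetical names invented for illustration, and the `price` column is an assumed example of a filterable field.

```rust
/// Hypothetical row shape: a key plus one column that the predicate reads.
#[derive(Debug, Clone, PartialEq)]
pub struct Row {
    pub key: u64,
    pub price: f64,
}

/// Keys of the rows in one partition that satisfy the predicate
/// (a stand-in for filtering a single partition's stream).
fn matching_keys(partition: &[Row], pred: &dyn Fn(&Row) -> bool) -> Vec<u64> {
    partition
        .iter()
        .filter_map(|row| pred(row).then_some(row.key))
        .collect()
}

/// Models the old behaviour: only partition 0 is read, so valid keys
/// stored in every other partition are silently dropped.
pub fn prescan_partition_zero(parts: &[Vec<Row>], pred: &dyn Fn(&Row) -> bool) -> Vec<u64> {
    match parts.first() {
        Some(partition) => matching_keys(partition, pred),
        None => Vec::new(),
    }
}

/// Models the fixed behaviour: filter every partition, then merge the
/// results into a single sequence of valid keys.
pub fn prescan_all_partitions(parts: &[Vec<Row>], pred: &dyn Fn(&Row) -> bool) -> Vec<u64> {
    parts
        .iter()
        .flat_map(|partition| matching_keys(partition, pred))
        .collect()
}

/// Made-up sample data: three partitions with two rows each.
pub fn sample_partitions() -> Vec<Vec<Row>> {
    vec![
        vec![Row { key: 1, price: 5.0 }, Row { key: 2, price: 20.0 }],
        vec![Row { key: 3, price: 7.5 }, Row { key: 4, price: 12.0 }],
        vec![Row { key: 5, price: 30.0 }, Row { key: 6, price: 1.0 }],
    ]
}
```

With the predicate `price < 10.0` on the sample data, the partition-zero variant returns only key 1, while the all-partitions variant returns keys 1, 3, and 6. Any selectivity estimate derived from the first result would be computed over a third of the rows.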
Summary
- The pre-scan called `execute(0)` on the `DataSourceExec`, reading only the first partition's file group and missing valid keys from the rest of the dataset. Selectivity calculations and `valid_key` collection were based on partial data.
- Wrap the pre-scan as `CoalescePartitionsExec → FilterExec → DataSourceExec`. DataFusion's physical optimizer pushes the predicate from `FilterExec` into the Parquet reader for pruning. No more manual `evaluate_filters` or `prescan_filters`.
- No filters are passed to `scan_provider.scan()`, since `FilterExec` handles filtering via the optimizer.
- Removes `prescan_filters`, `physical_filters`, `evaluate_filters`, and manual physical filter compilation from `SearchParams`.

Test plan