feat(NET-92) Handle missing parquet columns as nulls#59

Open
define-null wants to merge 5 commits into master from defnull/net-92-add-default-null-columns-support

define-null (Contributor) commented Mar 10, 2026

Contributes: https://linear.app/sqd-ai/issue/NET-92/correctly-propagate-errors-from-the-query-engine#comment-fb1dc348

What is this PR about?

Graceful handling of missing columns in parquet files. When a query requests a field that doesn't exist in the underlying parquet data, the system now returns null values instead of failing.

How does it work?

  • Scan builder accepts a list of default-null columns via with_default_null_columns(). When a projected column is missing from the parquet file, it is injected as a NullArray into the resulting RecordBatch.
  • Table gains a set_nullable() method for declaring which columns may be absent. Currently all field columns are marked nullable via columns() generated by the item_field_selection! macro.
  • ChunkWithDefaults wrapper implements Chunk and transparently attaches default-null column info to every scan_table() call — covering both direct scans and relation lookups.
  • Based on discussion with @tmcgroul and @kalabukdima, authorization_list is excluded from the nullable column list for now.
  • Added optional tracing support for the query crate.
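The wrapper approach from the list above can be sketched roughly as follows. This is a simplified model using only the standard library, not the code from this PR: `Scan`, `Chunk`, `ChunkWithDefaults`, `scan_table()` and `with_default_null_columns()` are names taken from the description, while their signatures, the `PlainChunk` type, the per-table `nullable` map and the `gas_used` column in the usage note are assumptions made for illustration.

```rust
use std::collections::{HashMap, HashSet};

/// Simplified stand-in for the scan builder (fields are assumed).
#[derive(Debug, Clone, PartialEq)]
pub struct Scan {
    pub table: String,
    pub default_null_columns: Option<HashSet<String>>,
}

impl Scan {
    pub fn new(table: &str) -> Self {
        Scan { table: table.to_string(), default_null_columns: None }
    }

    /// Builder method: remember which columns may be absent from the file.
    pub fn with_default_null_columns(mut self, columns: HashSet<String>) -> Self {
        self.default_null_columns = Some(columns);
        self
    }
}

/// Simplified stand-in for the `Chunk` trait (real signature not shown in the PR text).
pub trait Chunk {
    fn scan_table(&self, name: &str) -> Scan;
}

/// Hypothetical chunk that knows nothing about default-null columns.
pub struct PlainChunk;

impl Chunk for PlainChunk {
    fn scan_table(&self, name: &str) -> Scan {
        Scan::new(name)
    }
}

/// Decorator: forwards to the inner chunk and attaches the default-null
/// columns declared for the requested table, if any.
pub struct ChunkWithDefaults<C> {
    inner: C,
    nullable: HashMap<String, HashSet<String>>,
}

impl<C: Chunk> ChunkWithDefaults<C> {
    pub fn new(inner: C, nullable: HashMap<String, HashSet<String>>) -> Self {
        ChunkWithDefaults { inner, nullable }
    }
}

impl<C: Chunk> Chunk for ChunkWithDefaults<C> {
    fn scan_table(&self, name: &str) -> Scan {
        let scan = self.inner.scan_table(name);
        match self.nullable.get(name) {
            Some(columns) => scan.with_default_null_columns(columns.clone()),
            None => scan,
        }
    }
}
```

Because the wrapper itself implements `Chunk`, any code path that calls `scan_table()` picks up the defaults without changes, which is presumably why both direct scans and relation lookups are covered. For example, wrapping `PlainChunk` with a map that declares `gas_used` nullable for `transactions` yields a scan carrying that set, while a scan of an undeclared table is returned unchanged.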

Limitations

There is no reliable source of schema information today. The schema may vary within a single dataset and is commonly different across datasets of the same kind. As a result, we cannot precisely declare which columns are truly nullable — instead, all field columns are currently marked as such. This is a temporary mitigation as discussed with @kalabukdima: queries will return nulls for missing columns rather than error, but the proper fix requires a well-defined schema source.

**`TableReader` trait** (`reader.rs`) — Added `default_null_columns: Option<&HashSet<Name>>` parameter to `read()`.

**`Scan`** (`scan.rs`) — Added `default_null_columns` field and `with_default_null_columns()` builder method. Passes it through to `reader.read()`.

**`ParquetFile::read()`** (`parquet/file.rs`) — In Stage 3, columns in `default_null_columns` that are missing from the parquet schema are skipped instead of erroring. After reading (Stage 4), `NullArray` columns are injected for them into every record batch. This handles both projection and predicate columns — predicates will see NullArrays and naturally evaluate to false/null for comparisons.
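The skip-then-inject behaviour described for `ParquetFile::read()` can be illustrated with a small standard-library model. This is a sketch under stated assumptions, not the actual implementation: the `Column` enum stands in for Arrow arrays (with `Column::Null` playing the role of `NullArray`), and `read_batch`, `eq_at`, `ReadError` and the column names in the usage note are hypothetical.

```rust
use std::collections::{BTreeMap, HashSet};

/// Toy column type; `Null(n)` models an all-null column of length `n`.
#[derive(Debug, Clone, PartialEq)]
pub enum Column {
    Int64(Vec<Option<i64>>),
    Null(usize),
}

#[derive(Debug, PartialEq)]
pub enum ReadError {
    MissingColumn(String),
}

/// Resolve each projected column against the file. A missing column is
/// an error unless it is listed in `default_null_columns`, in which case
/// an all-null column of the batch length is injected instead.
pub fn read_batch(
    file: &BTreeMap<String, Vec<Option<i64>>>,
    num_rows: usize,
    projection: &[&str],
    default_null_columns: Option<&HashSet<String>>,
) -> Result<Vec<(String, Column)>, ReadError> {
    let mut batch = Vec::with_capacity(projection.len());
    for &name in projection {
        let column = match file.get(name) {
            Some(values) => Column::Int64(values.clone()),
            None if default_null_columns.map_or(false, |set| set.contains(name)) => {
                Column::Null(num_rows)
            }
            None => return Err(ReadError::MissingColumn(name.to_string())),
        };
        batch.push((name.to_string(), column));
    }
    Ok(batch)
}

/// SQL-style equality: `None` means the comparison result is null,
/// so a filter built on it does not select the row.
pub fn eq_at(column: &Column, row: usize, target: i64) -> Option<bool> {
    match column {
        Column::Int64(values) => values[row].map(|v| v == target),
        Column::Null(_) => None,
    }
}
```

In this model, projecting a hypothetical `gas_used` column that is absent from the file returns `Column::Null` when the column is declared as default-null, and `ReadError::MissingColumn` otherwise. A comparison against the injected column yields `None` for every row, which mirrors the note above that predicates over missing columns evaluate to false/null.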

**`SnapshotTableReader::read()`** (`storage/reader.rs`) — Accepts the new parameter (unused for now since storage tables are expected to always have all columns).

**`execute_output`** (`plan.rs`) — Simplified to use `scan.with_default_null_columns()` instead of manual missing-column detection and null-array injection.
…upport default-null columns automatically across all phases
define-null changed the title from "feat(NET-92) add default null columns support" to "feat(NET-92) Handle missing parquet columns as nulls" on Mar 10, 2026
define-null requested a review from kalabukdima on March 10, 2026 15:58