GH-3411 Expose row group index via Parquet reader #3412

Open
uros7251brick wants to merge 1 commit into apache:master from uros7251brick:expose-row-group-idx


@uros7251brick

Rationale for this change

Engines like Apache Spark need to know which row group a record belongs to — for example, to expose row group metadata as a hidden column, or to correlate records with row group-level statistics. Without this API, callers have no way to determine the current row group index during sequential reads.

What changes are included in this PR?

Similar to how getCurrentRowIndex() was introduced to expose the current row's file-level index, this adds getCurrentRowGroupIndex() to expose the index of the row group currently being read.

New API:

  • ParquetFileReader.getCurrentRowGroupIndex() — returns the 0-based index of the last row group read via readNextRowGroup() / readNextFilteredRowGroup(). Returns -1 before any row group has been read.
  • ParquetReader.getCurrentRowGroupIndex() — same semantics, for the high-level record reader.
  • ParquetRecordReader.getCurrentRowGroupIndex() — same, for the Hadoop MapReduce record reader.

The returned index is the actual file-level row group index, meaning it correctly reflects gaps when empty row groups are skipped (e.g. if row group 1 is empty, the indices reported will be 0, 2, ... not 0, 1, ...).
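
The semantics described above can be illustrated with a small, self-contained toy model. This is only a sketch and not the parquet-java implementation: `ToyFileReader`, its `long[]` of per-row-group row counts, and the `boolean` return of `readNextRowGroup()` are assumptions made for illustration (the real method returns a page store rather than a flag). It shows the two behaviors the description calls out: -1 before any row group has been read, and file-level indices that keep their gaps when an empty row group is skipped.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of the proposed index semantics; NOT the parquet-java implementation. */
final class RowGroupIndexSketch {

    /** Hypothetical stand-in for a file reader: each entry is one row group's row count. */
    static final class ToyFileReader {
        private final long[] rowCounts;
        private int nextIndex = 0;      // next file-level row group to consider
        private int currentIndex = -1;  // -1 until a row group has been read

        ToyFileReader(long... rowCounts) {
            this.rowCounts = rowCounts.clone();
        }

        /** Advances to the next non-empty row group; returns false once the file is exhausted. */
        boolean readNextRowGroup() {
            while (nextIndex < rowCounts.length) {
                int candidate = nextIndex++;
                if (rowCounts[candidate] > 0) {
                    // Record the file-level index, not a count of row groups returned so far.
                    currentIndex = candidate;
                    return true;
                }
            }
            return false;
        }

        /** 0-based file-level index of the last row group read, or -1 if none has been read. */
        int getCurrentRowGroupIndex() {
            return currentIndex;
        }
    }

    /** Collects the index seen before the first read and after each successful read. */
    static List<Integer> observedIndices(long... rowCounts) {
        ToyFileReader reader = new ToyFileReader(rowCounts);
        List<Integer> seen = new ArrayList<>();
        seen.add(reader.getCurrentRowGroupIndex());
        while (reader.readNextRowGroup()) {
            seen.add(reader.getCurrentRowGroupIndex());
        }
        return seen;
    }

    public static void main(String[] args) {
        // Row group 1 is empty, so the reported indices skip from 0 to 2.
        System.out.println(observedIndices(100, 0, 250)); // prints [-1, 0, 2]
    }
}
```

Reporting the file-level index rather than a running count of returned row groups is what lets a caller look up the matching row group entry in the file footer directly.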

Are these changes tested?

Yes.

Are there any user-facing changes?

Only additive ones: the three new public getCurrentRowGroupIndex() methods listed above.

Closes #3411


Development

Successfully merging this pull request may close these issues.

Expose row group index in Parquet readers
