feat: implement iter_arrow for skip, take and step iterables#7972

Open
Edge-Explorer wants to merge 5 commits into huggingface:main from Edge-Explorer:optimize-streaming-arrow-iter

Contributor

@Edge-Explorer Edge-Explorer commented Jan 30, 2026

This commit optimizes streaming operations by implementing _iter_arrow for SkipExamplesIterable, TakeExamplesIterable, and StepExamplesIterable.

Key Changes:

  • Fast Batch Processing: Enabled batch-level slicing for .skip(n) and .take(n) on streaming datasets, bypassing slow row-by-row iteration.
  • Optimized Sharding: Updated StepExamplesIterable (used in distributed training) to use Arrow's .take() to extract multiple records from a batch simultaneously.
  • State Preservation: Updated _init_state_dict and load_state_dict so that checkpointing and resumption keep working under Arrow iteration.
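
As a rough illustration of the batch-level slicing described for .skip(n) and .take(n), here is a minimal sketch. It is not the code from this PR: the names `skip_batches` and `take_batches` are hypothetical, and plain Python lists stand in for Arrow tables so the example runs without extra dependencies. With real Arrow tables, the list slices would likely be calls to `pyarrow.Table.slice(offset, length)`.

```python
from typing import Iterable, Iterator, List

# Stand-in for an Arrow table (assumption: a real batch would be a pyarrow.Table).
Batch = List[int]


def skip_batches(batches: Iterable[Batch], n: int) -> Iterator[Batch]:
    """Drop the first n rows overall by slicing batches, not by iterating rows."""
    remaining = n
    for batch in batches:
        if remaining >= len(batch):
            # The whole batch lies inside the skipped prefix.
            remaining -= len(batch)
            continue
        # With pyarrow this slice would be batch.slice(remaining).
        yield batch[remaining:]
        remaining = 0


def take_batches(batches: Iterable[Batch], n: int) -> Iterator[Batch]:
    """Yield the first n rows overall, truncating the final batch if needed."""
    remaining = n
    if remaining <= 0:
        return
    for batch in batches:
        # With pyarrow this slice would be batch.slice(0, remaining).
        out = batch[:remaining]
        remaining -= len(out)
        yield out
        if remaining == 0:
            return


batches = [[0, 1, 2], [3, 4, 5], [6, 7]]
skipped = list(skip_batches(batches, 4))
taken = list(take_batches(batches, 4))
combined = list(take_batches(skip_batches(batches, 2), 3))
```

Only the batch that straddles the boundary is sliced; every other batch is either dropped or passed through whole, which is where the saving over row-by-row iteration would come from.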

Performance Impact:

Skipping or taking examples in streaming mode should be noticeably faster. By staying on the "Arrow path" and avoiding the conversion of each row to a Python dictionary, these operations add less data-loading overhead, which matters most for large-scale training jobs.

Testing:

Integrated 6 new unit tests into tests/test_iterable_dataset.py to verify:

  • Functional correctness for skip, take, and step using Arrow iteration.
  • Reliable state checkpointing and resumption after partial iteration.
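
The resumption check in the second bullet can be pictured with a toy example. This is a sketch under stated assumptions, not the tests added in this PR: `ResumableSkip` is a hypothetical class, its state holds only a batch index, and lists stand in for Arrow tables.

```python
from typing import Dict, Iterator, List


class ResumableSkip:
    """Hypothetical illustration: skip n rows over batches and remember progress."""

    def __init__(self, batches: List[List[int]], n: int) -> None:
        self.batches = batches
        self.n = n
        self._state: Dict[str, int] = {"batch_idx": 0}

    def state_dict(self) -> Dict[str, int]:
        # Return a copy so callers cannot mutate the live state.
        return dict(self._state)

    def load_state_dict(self, state: Dict[str, int]) -> None:
        self._state = dict(state)

    def __iter__(self) -> Iterator[List[int]]:
        start_idx = self._state["batch_idx"]
        # Rows in batches consumed before the checkpoint count toward the skip.
        seen = sum(len(b) for b in self.batches[:start_idx])
        for idx in range(start_idx, len(self.batches)):
            batch = self.batches[idx]
            start = min(max(self.n - seen, 0), len(batch))
            seen += len(batch)
            # Record progress before yielding, so a checkpoint taken after
            # receiving this batch resumes at the next one.
            self._state["batch_idx"] = idx + 1
            if start < len(batch):
                yield batch[start:]


batches = [[0, 1, 2], [3, 4, 5], [6, 7]]

full = list(ResumableSkip(batches, 4))

# Iterate partially, checkpoint, then resume in a fresh object.
partial = ResumableSkip(batches, 4)
first = next(iter(partial))
checkpoint = partial.state_dict()

resumed = ResumableSkip(batches, 4)
resumed.load_state_dict(checkpoint)
rest = list(resumed)
```

The property being checked is that the partial run followed by the resumed run reproduces the uninterrupted run.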

This commit optimizes streaming operations by implementing _iter_arrow for SkipExamplesIterable, TakeExamplesIterable, and StepExamplesIterable.

Key Changes:
- Enabled fast batch-level processing for .skip(n) and .take(n) on streaming datasets.
- Optimized distributed sharding (StepExamplesIterable) to use Arrow's .take() for picking multiple records from a batch simultaneously.
- Updated _init_state_dict and load_state_dict to ensure seamless checkpointing while using Arrow iteration.
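
To make the step case concrete, here is a minimal sketch of picking several records from each batch at once. It assumes that a step iterable keeps rows whose global index i satisfies i % step == offset, which may differ in detail from the library's behavior. `step_batches` is a hypothetical name, and lists stand in for Arrow tables; with pyarrow the gather would likely be `batch.take(indices)`.

```python
from typing import Iterable, Iterator, List

# Stand-in for an Arrow table (assumption: a real batch would be a pyarrow.Table).
Batch = List[str]


def step_batches(batches: Iterable[Batch], step: int, offset: int) -> Iterator[Batch]:
    """Keep rows whose global index i satisfies i % step == offset (0 <= offset < step)."""
    seen = 0  # rows consumed from earlier batches
    for batch in batches:
        # First local index whose global index is congruent to offset modulo step.
        first = (offset - seen) % step
        indices = list(range(first, len(batch), step))
        seen += len(batch)
        if indices:
            # With pyarrow this gather would be batch.take(indices).
            yield [batch[i] for i in indices]


batches = [["a", "b", "c"], ["d", "e", "f"], ["g", "h"]]
every_third = list(step_batches(batches, step=3, offset=1))
every_second = list(step_batches(batches, step=2, offset=0))
```

Computing all matching indices for a batch up front is what allows a single gather per batch instead of one check per row.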

Performance Impact:
Skipping or taking examples in streaming mode should be noticeably faster, as the dataset no longer needs to fall back to row-by-row Python dictionary conversion for these operations.

Testing:
Added 6 new unit tests to tests/test_iterable_dataset.py covering functional correctness and state resumption for all three iterable types.
…urce

This commit addresses several documentation quality issues found across
the repository — fixing typos, grammar errors, brand name inconsistencies,
and adding modern tooling references for new contributors.

## Changes

### README.md
- Fix duplicate word: "frameworks frameworks" → "frameworks"
- Standardize brand name: "HuggingFace Datasets Hub" → "Hugging Face Datasets Hub"
- Add `uv` installation section for faster environment setup

### CONTRIBUTING.md
- Add `uv pip install -e ".[dev]"` as an alternative setup command
- Fix grammar: "To do, go" → "To do so, go"
- Fix punctuation: trailing space before period in pre-commit note (`again .` → `again.`)
- Standardize brand name: "HuggingFace [code of conduct]" → "Hugging Face [code of conduct]"

### docs/source/stream.mdx
- Fix article usage: "a [`IterableDataset`]" → "an [`IterableDataset`]" (vowel sound rule)
- Fix code comment: "shuffles the shards order and use" → "uses" (subject-verb agreement)
- Fix phrase: "as soon one of the dataset runs out" → "as soon as one of the datasets runs out"
- Fix pluralization: "every samples in every dataset" → "every sample in every dataset"
- Fix abbreviation punctuation: "i.e the" → "i.e. the"

### docs/source/quickstart.mdx
- Standardize brand name: "a HuggingFace [`~datasets.Dataset`]" → "a Hugging Face [`~datasets.Dataset`]" (3 occurrences)

### docs/README.md
- Standardize copyright notice: "The HuggingFace Team" → "The Hugging Face Team"

### notebooks/README.md
- Standardize copyright notice: "The HuggingFace Team" → "The Hugging Face Team"

### src/datasets/iterable_dataset.py
- Fix typo in `map()` docstring: "simulatenous" → "simultaneous"
- Fix typo in `filter()` docstring: "simulatenous" → "simultaneous"
- Add type hints to `identity_func`: `(x)` → `(x: Any) -> Any`
- Add missing return type hint to `_rename_columns_fn`: `-> dict`
