feat: implement iter_arrow for skip, take and step iterables by Edge-Explorer · Pull Request #7972 · huggingface/datasets

Edge-Explorer · 2026-01-30T05:47:13Z

This commit optimizes streaming operations by implementing _iter_arrow for SkipExamplesIterable, TakeExamplesIterable, and StepExamplesIterable.

Key Changes:

- Fast Batch Processing: Enabled batch-level slicing for .skip(n) and .take(n) on streaming datasets, bypassing slow row-by-row iteration.
- Optimized Sharding: Updated StepExamplesIterable (used in distributed training) to use Arrow's .take() to extract multiple records from a batch in a single call.
- State Preservation: Updated _init_state_dict and load_state_dict so that checkpointing and resumption keep working while using Arrow iteration.
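
The batch-level logic described above can be sketched roughly as follows. This is not the code from the PR: plain Python lists stand in for Arrow tables so the example runs without pyarrow, and the function names (`skip_batches`, `take_batches`, `step_batches`) are hypothetical. With real Arrow data, the list slices would correspond to `pyarrow.Table.slice(offset, length)` and the index selection to `pyarrow.Table.take(indices)`. The step rule (keep rows whose global index satisfies `i % step == offset`) is an assumption about what StepExamplesIterable selects.

```python
from typing import Iterable, Iterator, List

# Stand-in for a pyarrow.Table: a list of rows (assumption made for illustration).
Batch = List[int]


def skip_batches(batches: Iterable[Batch], n: int) -> Iterator[Batch]:
    """Drop the first n rows by slicing batches rather than iterating rows."""
    remaining = n
    for batch in batches:
        if remaining >= len(batch):
            # The whole batch falls inside the skipped region.
            remaining -= len(batch)
            continue
        # With Arrow: batch.slice(remaining)
        yield batch[remaining:]
        remaining = 0


def take_batches(batches: Iterable[Batch], n: int) -> Iterator[Batch]:
    """Yield at most n rows in total, truncating the last batch if needed."""
    remaining = n
    if remaining <= 0:
        return
    for batch in batches:
        # With Arrow: batch.slice(0, remaining)
        out = batch[:remaining]
        if out:
            yield out
        remaining -= len(out)
        if remaining <= 0:
            return


def step_batches(batches: Iterable[Batch], step: int, offset: int) -> Iterator[Batch]:
    """Keep rows whose global index i satisfies i % step == offset."""
    seen = 0  # rows consumed from earlier batches
    for batch in batches:
        # First local index whose global index is congruent to offset.
        first = (offset - seen) % step
        indices = range(first, len(batch), step)
        seen += len(batch)
        if indices:
            # With Arrow: batch.take(list(indices)) selects all rows in one call.
            yield [batch[i] for i in indices]
```

Each function touches a batch once and emits a sub-batch, so no per-row Python dictionary is built along the way.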

Performance Impact:

Users should see faster skipping and taking of examples in streaming mode. Staying in the "Arrow path" avoids per-row Python dictionary conversions, which reduces data loading overhead, especially for large-scale training jobs.

Testing:

Integrated 6 new unit tests into tests/test_iterable_dataset.py to verify:

- Functional correctness for skip, take, and step using Arrow iteration.
- Reliable state checkpointing and resumption after partial iteration.
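
A minimal sketch of the kind of resumption check described above, assuming a batch-level "take" that records its progress in a small state dictionary. The class `ResumableTake` and its state keys are hypothetical and only illustrate the idea; they are not the library's `_init_state_dict` implementation, and lists again stand in for Arrow tables.

```python
from typing import Dict, Iterator, List


class ResumableTake:
    """Hypothetical batch-level take(n) that can be checkpointed and resumed."""

    def __init__(self, batches: List[List[int]], n: int) -> None:
        self.batches = batches
        self.n = n
        # Progress record: next batch to read and rows already yielded.
        self._state: Dict[str, int] = {"batch_idx": 0, "num_taken": 0}

    def state_dict(self) -> Dict[str, int]:
        # Return a copy so later iteration does not mutate the checkpoint.
        return dict(self._state)

    def load_state_dict(self, state: Dict[str, int]) -> None:
        self._state = dict(state)

    def __iter__(self) -> Iterator[List[int]]:
        for idx in range(self._state["batch_idx"], len(self.batches)):
            remaining = self.n - self._state["num_taken"]
            if remaining <= 0:
                return
            out = self.batches[idx][:remaining]
            # Update the state before yielding, so a checkpoint taken after
            # receiving this batch resumes at the next one.
            self._state["batch_idx"] = idx + 1
            self._state["num_taken"] += len(out)
            yield out
```

A test in this style iterates partway, saves the state, loads it into a fresh object, and checks that the two halves together equal an uninterrupted run.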

This commit optimizes streaming operations by implementing _iter_arrow for SkipExamplesIterable, TakeExamplesIterable, and StepExamplesIterable.

Key Changes:
- Enabled fast batch-level processing for .skip(n) and .take(n) on streaming datasets.
- Optimized distributed sharding (StepExamplesIterable) to use Arrow's .take() for picking multiple records from a batch in a single call.
- Updated _init_state_dict and load_state_dict so that checkpointing keeps working while using Arrow iteration.

Performance Impact: Users should see speedups when skipping or taking examples in streaming mode, as the dataset no longer needs to fall back to row-by-row Python dictionary conversion for these operations.

Testing: Added 6 new unit tests to tests/test_iterable_dataset.py covering functional correctness and state resumption for all three iterable types.

…urce

This commit addresses several documentation quality issues found across the repository: fixing typos, grammar errors, and brand name inconsistencies, and adding modern tooling references for new contributors.

## Changes

### README.md
- Fix duplicate word: "frameworks frameworks" → "frameworks"
- Standardize brand name: "HuggingFace Datasets Hub" → "Hugging Face Datasets Hub"
- Add `uv` installation section for faster environment setup

### CONTRIBUTING.md
- Add `uv pip install -e ".[dev]"` as an alternative setup command
- Fix grammar: "To do, go" → "To do so, go"
- Fix punctuation: trailing space before period in pre-commit note (`again .` → `again.`)
- Standardize brand name: "HuggingFace [code of conduct]" → "Hugging Face [code of conduct]"

### docs/source/stream.mdx
- Fix article usage: "a [`IterableDataset`]" → "an [`IterableDataset`]" (vowel sound rule)
- Fix code comment: "shuffles the shards order and use" → "uses" (subject-verb agreement)
- Fix phrase: "as soon one of the dataset runs out" → "as soon as one of the datasets runs out"
- Fix pluralization: "every samples in every dataset" → "every sample in every dataset"
- Fix abbreviation punctuation: "i.e the" → "i.e. the"

### docs/source/quickstart.mdx
- Standardize brand name: "a HuggingFace [`~datasets.Dataset`]" → "a Hugging Face [`~datasets.Dataset`]" (3 occurrences)

### docs/README.md
- Standardize copyright notice: "The HuggingFace Team" → "The Hugging Face Team"

### notebooks/README.md
- Standardize copyright notice: "The HuggingFace Team" → "The Hugging Face Team"

### src/datasets/iterable_dataset.py
- Fix typo in `map()` docstring: "simulatenous" → "simultaneous"
- Fix typo in `filter()` docstring: "simulatenous" → "simultaneous"
- Add type hints to `identity_func`: `(x)` → `(x: Any) -> Any`
- Add return type hint to `_rename_columns_fn`: missing `-> dict` return type

Edge-Explorer added 5 commits January 30, 2026 11:15

Merge branch 'main' into optimize-streaming-arrow-iter

a53e770

Merge branch 'main' into optimize-streaming-arrow-iter

cc99442

Merge main into feature

c2514f8

Edge-Explorer wants to merge 5 commits into huggingface:main from Edge-Explorer:optimize-streaming-arrow-iter
