[stacked 2/3, multi-datapipes] transform updates by coreyjadams · Pull Request #1506 · NVIDIA/physicsnemo

coreyjadams · 2026-03-15T02:20:07Z

[stack 2/3]

THis will rebase but contains the transform updates too.

Add multi datapipes + corresponding tests.
Updated tests for updated transforms; Add resize and reshape transforms; add in-memory numpy reader for small single file datasets like darcy flow
Adding most of the multi-dataset cleanly

PhysicsNeMo Pull Request

Description

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.
The CHANGELOG.md is up to date with these changes.
An issue is linked to this pull request.
If I am implementing a new model or modifying any existing model, I have followed the Models Implementation Coding Standards.

Dependencies

Review Process

All PRs are reviewed by the PhysicsNeMo team before merging.

Depending on which files are changed, GitHub may automatically assign a maintainer for review.

We are also testing AI-based code review tools (e.g., Greptile), which may add automated comments with a confidence score.
This score reflects the AI’s assessment of merge readiness and is not a qualitative judgment of your work, nor is
it an indication that the PR will be accepted / rejected.

AI-generated feedback should be reviewed critically for usefulness.
You are not required to respond to every AI comment, but they are intended to help both authors and reviewers.
Please react to Greptile comments with 👍 or 👎 to provide feedback on their accuracy.

…ms; add in-memory numpy reader for small single file datasets like darcy flow

greptile-apps · 2026-03-15T02:23:42Z

Greptile Summary

This PR adds several features to the datapipes subsystem: a new MultiDataset class for composing multiple Dataset instances behind a single index space, new Resize and Reshape transforms, a preload_to_cpu option for NumpyReader, a fix to Normalize to accept collections.abc.Mapping (not just dict), and a DataLoader.__len__ fix to respect sampler length. All new features come with comprehensive tests.

NumpyReader forced float32 cast is a breaking change: The _load_from_npz and _load_sample_from_preloaded methods now unconditionally cast all arrays to np.float32 (previously preserved original dtype). This silently corrupts integer fields like labels, masks, and indices.
Reshape missing from top-level datapipes/__init__.py: Resize was properly added to the top-level exports, but Reshape was only added to transforms/__init__.py, making it inaccessible via physicsnemo.datapipes.Reshape.
MultiDataset is well-designed with proper index mapping, output strictness validation, prefetch/close delegation, and dataset_index metadata enrichment.
Resize and Reshape transforms are clean implementations with appropriate edge-case handling.

Important Files Changed

Filename	Overview
physicsnemo/datapipes/multi_dataset.py	New MultiDataset class composing multiple Datasets behind a single index space. Well-structured with proper validation, prefetch delegation, and metadata enrichment. Clean implementation.
physicsnemo/datapipes/readers/numpy.py	Adds preload_to_cpu feature and forces float32 conversion. The unconditional float32 cast is a breaking change for integer fields (labels, masks, indices) — both in _load_from_npz and _load_sample_from_preloaded.
physicsnemo/datapipes/init.py	Adds MultiDataset and Resize to top-level exports. Reshape is missing from both the import and all, creating an inconsistency with the transforms/init.py exports.
physicsnemo/datapipes/transforms/spatial.py	Adds Resize transform for grid tensor interpolation. Clean implementation using F.interpolate with proper handling of 2D/3D, channel dimensions, and non-float skipping.
physicsnemo/datapipes/transforms/utility.py	Adds Reshape transform using torch.reshape. Simple, correct implementation that clones data before modifying.
physicsnemo/datapipes/transforms/normalize.py	Broadens dict type check from isinstance(stats, dict) to isinstance(stats, collections.abc.Mapping) to accept OrderedDict and other Mapping types. Correct fix.
physicsnemo/datapipes/transforms/init.py	Properly exports new Resize and Reshape transforms.
physicsnemo/datapipes/dataloader.py	Minor fix to len to prefer sampler length over dataset length when sampler has len. Correct for custom samplers that subset the dataset.
docs/api/datapipes/physicsnemo.datapipes.rst	Adds MultiDataset documentation section with autoclass directive. Clear description of usage and output_strict behavior.
test/datapipes/core/test_multi_dataset.py	Comprehensive test suite for MultiDataset covering basic operations, strict validation, prefetch delegation, DataLoader integration, error cases, and repr. Well-organized test classes.
test/datapipes/readers/test_numpy_consolidated.py	Adds thorough tests for preload_to_cpu and float32 conversion. Tests cover basic preload, default values, missing fields, directory mode, coordinated subsampling, and close/repr behavior.
test/datapipes/transforms/test_spatial.py	Adds comprehensive tests for the new Resize transform covering 2D/3D, modes, align_corners, missing/non-float keys, multiple keys, and invalid size.
test/datapipes/transforms/test_utility.py	Adds tests for Reshape transform covering basic reshape, inferred dimensions, missing keys, preserved fields, multiple keys, and extra_repr.
test/datapipes/core/test_transforms.py	Adds test verifying Normalize accepts OrderedDict stats, validating the collections.abc.Mapping change.

_{Last reviewed commit: 6c0d513}

greptile-apps · 2026-03-15T02:23:46Z

                    arr = arr[:]

-                data[field] = torch.from_numpy(np.array(arr))
+                data[field] = torch.from_numpy(np.asarray(arr, dtype=np.float32))


Forced float32 cast breaks integer fields

The change from torch.from_numpy(np.array(arr)) to torch.from_numpy(np.asarray(arr, dtype=np.float32)) unconditionally casts every loaded field to float32. This silently corrupts integer fields (e.g., labels, masks, indices stored as int64/int32). The conftest fixture numpy_data_dir even creates label=np.array([i], dtype=np.int64).

Consider only casting floating-point arrays to float32 and leaving integer arrays untouched:

Suggested change

data[field] = torch.from_numpy(np.asarray(arr, dtype=np.float32))

data[field] = torch.from_numpy(np.asarray(arr, dtype=np.float32 if np.issubdtype(arr.dtype, np.floating) else arr.dtype))

greptile-apps · 2026-03-15T02:23:47Z

    NormalizeVectors,
    Purge,
    Rename,
+    Resize,


Reshape missing from top-level exports

Reshape is exported from physicsnemo.datapipes.transforms (added to both the import and __all__ in transforms/__init__.py), but it is not imported or listed in this top-level datapipes/__init__.py. This means users cannot do from physicsnemo.datapipes import Reshape or dp.Reshape(...), unlike Resize which was properly added. This appears to be an oversight.

from physicsnemo.datapipes.transforms import ( ... Rename, Resize, Reshape, # <-- add this Scale, ... ) __all__ = [ ... "ConstantField", "Reshape", # <-- add this ... ]

greptile-apps · 2026-03-15T02:23:47Z

+                arr = np.array(self._preloaded[field][index], copy=False)
+                if subsample_slice is not None and field in target_keys_set:
+                    arr = arr[subsample_slice]
+                data[field] = torch.from_numpy(np.asarray(arr, dtype=np.float32))


Same float32 cast issue in preloaded path

Same issue as in _load_from_npz: unconditional dtype=np.float32 cast will corrupt integer fields loaded from preloaded arrays. The fix should mirror whatever approach is chosen for _load_from_npz (line 318).

Updated the documentation for PhysicsNeMo Datapipes to improve clarity and consistency. Adjusted wording and structure for better readability.

… a list, etc.

coreyjadams added 4 commits March 13, 2026 22:04

Add multi datapipes + corresponding tests.

a506f37

Updated tests for updated transforms; Add resize and reshape transfor…

29c55d7

…ms; add in-memory numpy reader for small single file datasets like darcy flow

Adding most of the multi-dataset cleanly

916ea03

Merge branch 'multi-dataset-clean' into transform-updates

6c0d513

coreyjadams requested review from ktangsali and megnvidia as code owners March 15, 2026 02:20

coreyjadams self-assigned this Mar 15, 2026

coreyjadams changed the title ~~transform updates~~ [stacked 2/3, multi-datapipes] transform updates Mar 15, 2026

greptile-apps bot reviewed Mar 15, 2026

View reviewed changes

megnvidia and others added 7 commits March 17, 2026 14:24

Refine documentation for PhysicsNeMo Datapipes

1b98bec

Updated the documentation for PhysicsNeMo Datapipes to improve clarity and consistency. Adjusted wording and structure for better readability.

Merge branch 'main' into multi-dataset-clean

06a07e3

update api docs.

56593a3

Update multidata set interface to accept an unpacked tuple instead of…

f91db31

… a list, etc.

Merge branch 'main' into multi-dataset-clean

6dbd384

Merge branch 'multi-dataset-clean' into transform-updates

6a9e26b

Merge branch 'main' into transform-updates

1ab4e04

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[stacked 2/3, multi-datapipes] transform updates#1506

[stacked 2/3, multi-datapipes] transform updates#1506
coreyjadams wants to merge 11 commits intoNVIDIA:mainfrom
coreyjadams:transform-updates

coreyjadams commented Mar 15, 2026 •

edited

Loading

Uh oh!

greptile-apps bot commented Mar 15, 2026

Uh oh!

greptile-apps bot Mar 15, 2026

Uh oh!

greptile-apps bot Mar 15, 2026

Uh oh!

greptile-apps bot Mar 15, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

	data[field] = torch.from_numpy(np.asarray(arr, dtype=np.float32))
	data[field] = torch.from_numpy(np.asarray(arr, dtype=np.float32 if np.issubdtype(arr.dtype, np.floating) else arr.dtype))

Conversation

coreyjadams commented Mar 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PhysicsNeMo Pull Request

Description

Checklist

Dependencies

Review Process

Uh oh!

greptile-apps bot commented Mar 15, 2026

Greptile Summary

Important Files Changed

Uh oh!

greptile-apps bot Mar 15, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps bot Mar 15, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps bot Mar 15, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

coreyjadams commented Mar 15, 2026 •

edited

Loading