Skip to content

[stacked 2/3, multi-datapipes] transform updates#1506

Open
coreyjadams wants to merge 11 commits intoNVIDIA:mainfrom
coreyjadams:transform-updates
Open

[stacked 2/3, multi-datapipes] transform updates#1506
coreyjadams wants to merge 11 commits intoNVIDIA:mainfrom
coreyjadams:transform-updates

Conversation

@coreyjadams
Copy link
Copy Markdown
Collaborator

@coreyjadams coreyjadams commented Mar 15, 2026

[stack 2/3]

THis will rebase but contains the transform updates too.

  • Add multi datapipes + corresponding tests.
  • Updated tests for updated transforms; Add resize and reshape transforms; add in-memory numpy reader for small single file datasets like darcy flow
  • Adding most of the multi-dataset cleanly

PhysicsNeMo Pull Request

Description

Checklist

Dependencies

Review Process

All PRs are reviewed by the PhysicsNeMo team before merging.

Depending on which files are changed, GitHub may automatically assign a maintainer for review.

We are also testing AI-based code review tools (e.g., Greptile), which may add automated comments with a confidence score.
This score reflects the AI’s assessment of merge readiness and is not a qualitative judgment of your work, nor is
it an indication that the PR will be accepted / rejected.

AI-generated feedback should be reviewed critically for usefulness.
You are not required to respond to every AI comment, but they are intended to help both authors and reviewers.
Please react to Greptile comments with 👍 or 👎 to provide feedback on their accuracy.

@coreyjadams coreyjadams self-assigned this Mar 15, 2026
@coreyjadams coreyjadams changed the title transform updates [stacked 2/3, multi-datapipes] transform updates Mar 15, 2026
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps bot commented Mar 15, 2026

Greptile Summary

This PR adds several features to the datapipes subsystem: a new MultiDataset class for composing multiple Dataset instances behind a single index space, new Resize and Reshape transforms, a preload_to_cpu option for NumpyReader, a fix to Normalize to accept collections.abc.Mapping (not just dict), and a DataLoader.__len__ fix to respect sampler length. All new features come with comprehensive tests.

  • NumpyReader forced float32 cast is a breaking change: The _load_from_npz and _load_sample_from_preloaded methods now unconditionally cast all arrays to np.float32 (previously preserved original dtype). This silently corrupts integer fields like labels, masks, and indices.
  • Reshape missing from top-level datapipes/__init__.py: Resize was properly added to the top-level exports, but Reshape was only added to transforms/__init__.py, making it inaccessible via physicsnemo.datapipes.Reshape.
  • MultiDataset is well-designed with proper index mapping, output strictness validation, prefetch/close delegation, and dataset_index metadata enrichment.
  • Resize and Reshape transforms are clean implementations with appropriate edge-case handling.

Important Files Changed

Filename Overview
physicsnemo/datapipes/multi_dataset.py New MultiDataset class composing multiple Datasets behind a single index space. Well-structured with proper validation, prefetch delegation, and metadata enrichment. Clean implementation.
physicsnemo/datapipes/readers/numpy.py Adds preload_to_cpu feature and forces float32 conversion. The unconditional float32 cast is a breaking change for integer fields (labels, masks, indices) — both in _load_from_npz and _load_sample_from_preloaded.
physicsnemo/datapipes/init.py Adds MultiDataset and Resize to top-level exports. Reshape is missing from both the import and all, creating an inconsistency with the transforms/init.py exports.
physicsnemo/datapipes/transforms/spatial.py Adds Resize transform for grid tensor interpolation. Clean implementation using F.interpolate with proper handling of 2D/3D, channel dimensions, and non-float skipping.
physicsnemo/datapipes/transforms/utility.py Adds Reshape transform using torch.reshape. Simple, correct implementation that clones data before modifying.
physicsnemo/datapipes/transforms/normalize.py Broadens dict type check from isinstance(stats, dict) to isinstance(stats, collections.abc.Mapping) to accept OrderedDict and other Mapping types. Correct fix.
physicsnemo/datapipes/transforms/init.py Properly exports new Resize and Reshape transforms.
physicsnemo/datapipes/dataloader.py Minor fix to len to prefer sampler length over dataset length when sampler has len. Correct for custom samplers that subset the dataset.
docs/api/datapipes/physicsnemo.datapipes.rst Adds MultiDataset documentation section with autoclass directive. Clear description of usage and output_strict behavior.
test/datapipes/core/test_multi_dataset.py Comprehensive test suite for MultiDataset covering basic operations, strict validation, prefetch delegation, DataLoader integration, error cases, and repr. Well-organized test classes.
test/datapipes/readers/test_numpy_consolidated.py Adds thorough tests for preload_to_cpu and float32 conversion. Tests cover basic preload, default values, missing fields, directory mode, coordinated subsampling, and close/repr behavior.
test/datapipes/transforms/test_spatial.py Adds comprehensive tests for the new Resize transform covering 2D/3D, modes, align_corners, missing/non-float keys, multiple keys, and invalid size.
test/datapipes/transforms/test_utility.py Adds tests for Reshape transform covering basic reshape, inferred dimensions, missing keys, preserved fields, multiple keys, and extra_repr.
test/datapipes/core/test_transforms.py Adds test verifying Normalize accepts OrderedDict stats, validating the collections.abc.Mapping change.

Last reviewed commit: 6c0d513

arr = arr[:]

data[field] = torch.from_numpy(np.array(arr))
data[field] = torch.from_numpy(np.asarray(arr, dtype=np.float32))
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Forced float32 cast breaks integer fields

The change from torch.from_numpy(np.array(arr)) to torch.from_numpy(np.asarray(arr, dtype=np.float32)) unconditionally casts every loaded field to float32. This silently corrupts integer fields (e.g., labels, masks, indices stored as int64/int32). The conftest fixture numpy_data_dir even creates label=np.array([i], dtype=np.int64).

Consider only casting floating-point arrays to float32 and leaving integer arrays untouched:

Suggested change
data[field] = torch.from_numpy(np.asarray(arr, dtype=np.float32))
data[field] = torch.from_numpy(np.asarray(arr, dtype=np.float32 if np.issubdtype(arr.dtype, np.floating) else arr.dtype))

NormalizeVectors,
Purge,
Rename,
Resize,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reshape missing from top-level exports

Reshape is exported from physicsnemo.datapipes.transforms (added to both the import and __all__ in transforms/__init__.py), but it is not imported or listed in this top-level datapipes/__init__.py. This means users cannot do from physicsnemo.datapipes import Reshape or dp.Reshape(...), unlike Resize which was properly added. This appears to be an oversight.

from physicsnemo.datapipes.transforms import (
    ...
    Rename,
    Resize,
    Reshape,   # <-- add this
    Scale,
    ...
)

__all__ = [
    ...
    "ConstantField",
    "Reshape",   # <-- add this
    ...
]

arr = np.array(self._preloaded[field][index], copy=False)
if subsample_slice is not None and field in target_keys_set:
arr = arr[subsample_slice]
data[field] = torch.from_numpy(np.asarray(arr, dtype=np.float32))
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same float32 cast issue in preloaded path

Same issue as in _load_from_npz: unconditional dtype=np.float32 cast will corrupt integer fields loaded from preloaded arrays. The fix should mirror whatever approach is chosen for _load_from_npz (line 318).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants