Conversation

@coreyjadams
Collaborator

PhysicsNeMo Pull Request

Description

This PR implements PhysicsNeMo DataPipes as configurable pipelines built from reusable components. It is meant to provide a foundation for training and inference pipelines, though full-scale replacement, updating, or deprecation of the existing targeted pipelines is left for future work and is not implemented here.

I will defer a description of the datapipe design to the README and tutorials, located in examples/minimal/datapipes/.

The core components of the datapipe are all present:

  • Reader abstraction, for getting data from disk into CPU memory. Implementations for Zarr and .npz are included here.
  • Transforms, for manipulating data on the GPU. Several transforms are included, and the set is easy to extend.
  • Datasets, for preloading and staging data to the GPU.
  • Dataloader, for drop-in replacement with PyTorch.
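
Below is a rough sketch of how these pieces are meant to compose. Class names and constructor arguments are illustrative rather than verbatim from this PR (the reader class name and the Dataset/DataLoader signatures are assumptions); see examples/minimal/datapipes/ for the real tutorials.

```python
# Hedged sketch; names marked as assumptions are not verbatim from the PR.
from physicsnemo.datapipes.core import DataLoader, Dataset          # assumed exports
from physicsnemo.datapipes.core.readers import ZarrReader           # hypothetical name
from physicsnemo.datapipes.core.transforms import Compose, SubsamplePoints

reader = ZarrReader("data/train.zarr")                    # disk -> CPU memory
transforms = Compose([SubsamplePoints(n_points=10_000)])  # runs on the GPU
dataset = Dataset(reader=reader, transforms=transforms, device="cuda")
loader = DataLoader(dataset, batch_size=4)                # drop-in for PyTorch loops

for batch, metadata in loader:  # batch arrives as a TensorDict staged on the GPU
    ...                         # training / inference step
```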

Checklist

Dependencies

Review Process

All PRs are reviewed by the PhysicsNeMo team before merging.

Depending on which files are changed, GitHub may automatically assign a maintainer for review.

We are also testing AI-based code review tools (e.g., Greptile), which may add automated comments with a confidence score. This score reflects the AI's assessment of merge readiness; it is not a qualitative judgment of your work, nor an indication that the PR will be accepted or rejected.

AI-generated feedback should be reviewed critically for usefulness.
You are not required to respond to every AI comment, but they are intended to help both authors and reviewers.
Please react to Greptile comments with 👍 or 👎 to provide feedback on their accuracy.

@coreyjadams self-assigned this Jan 5, 2026
@greptile-apps
Contributor

greptile-apps bot commented Jan 5, 2026

Greptile Summary

  • Implements a new PhysicsNeMo DataPipes system with GPU-first data loading infrastructure featuring readers, transforms, datasets, and dataloaders as configurable, reusable components
  • Adds comprehensive support for multiple data formats (Zarr, NPZ, HDF5, VTK) with high-performance async reading capabilities and coordinated subsampling for large datasets
  • Includes extensive test coverage, configuration files, and tutorials demonstrating the modular architecture with Hydra-based configuration management

Important Files Changed

| Filename | Overview |
| --- | --- |
| physicsnemo/datapipes/core/transforms/geometric.py | Implements geometric transforms but contains incorrect class names in examples and potential device handling issues |
| test/datapipes/core/test_readers.py | Comprehensive reader tests but contains a debug print statement that should be removed |
| physicsnemo/datapipes/core/transforms/field_processing.py | BroadcastGlobalFeatures transform with potential tensor broadcasting errors that could cause runtime failures |
| physicsnemo/datapipes/core/transforms/spatial.py | Spatial transforms with documentation referencing undefined classes and potential indexing errors |
| test/datapipes/core/transforms/test_subsample.py | Subsampling tests with broken comment syntax but proper testing infrastructure |

Contributor

@greptile-apps bot left a comment

Additional Comments (47)

  1. test/datapipes/core/test_readers.py, line 200 (link)

    syntax: Debug print statement should be removed

  2. test/datapipes/core/transforms/test_subsample.py, line 126-127 (link)

    syntax: Comment has a line break in the middle of a word

  3. v2.0-MIGRATION-GUIDE.md, line 58 (link)

    syntax: Extra asterisk in 'DataPipes**' should be 'DataPipes'

  4. physicsnemo/datapipes/core/__init__.py, line 37 (link)

    syntax: Inconsistent naming in example - uses dp.Downsample but exports SubsamplePoints

  5. examples/minimal/datapipes/tutorial_04_hydra_config.py, line 54 (link)

    style: torch is imported but not used in this tutorial

    Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

  6. test/datapipes/core/test_dataset.py, line 238 (link)

    style: Inconsistent device specification - uses cuda:0 in line 138 but cuda here without device index

  7. examples/minimal/datapipes/generate_variable_points_data.py, line 298 (link)

    syntax: Backend choice inconsistency: example shows '-b npy' but valid choices are only 'npz' and 'zarr'

  8. test/datapipes/core/test_collate.py, line 227-228 (link)

    logic: Test function returns first sample without proper tuple unpacking - should return (data, metadata) tuple for consistency

  9. examples/minimal/datapipes/README.md, line 25 (link)

    syntax: Typo: 'ascynchronous' should be 'asynchronous'

  10. examples/minimal/datapipes/README.md, line 30 (link)

    syntax: Typo: 'reproducability' should be 'reproducibility'

  11. examples/minimal/datapipes/README.md, line 81 (link)

    syntax: Grammar: 'for understand' should be 'for understanding'

  12. physicsnemo/datapipes/core/dataloader.py, line 56 (link)

    syntax: Import statement uses incorrect module name - should be from physicsnemo.datapipes import ... instead of from datapipe import ...

  13. examples/minimal/datapipes/tutorial_02_transforms.py, line 292 (link)

    syntax: Grammar: 'it's' should be 'its' (possessive, not contraction)

  14. .github/CODEOWNERS, line 62 (link)

    style: Trailing whitespace after @pzharrington should be removed

  15. examples/minimal/datapipes/tutorial_03_custom_gnn_datapipe.py, line 241-242 (link)

    logic: Variable n_edges is misleading - this represents edges per node (k value), not total edges in the graph.

  16. examples/minimal/datapipes/tutorial_03_custom_gnn_datapipe.py, line 246-247 (link)

    logic: The calculation n_edges / n_nodes is incorrect for average degree - should be n_edges since n_edges already represents k neighbors per node.

  17. physicsnemo/datapipes/core/collate.py, line 169 (link)

    logic: Using torch.stack() on TensorDict objects directly - should verify this is the intended API as TensorDict.stack() might be the correct method

    Should this be TensorDict.stack(data_list, dim=self.stack_dim) instead of torch.stack()?

  18. physicsnemo/datapipes/core/collate.py, line 353 (link)

    logic: The function signature indicates it returns a tuple but uses a DefaultCollator instance that may not collate metadata by default

    Should the _default_collator be initialized with collate_metadata=True to match the return type annotation?

  19. test/datapipes/core/test_transforms.py, line 517-612 (link)

    style: Large blocks of commented test code should be removed or converted to proper TODO issues rather than left in the codebase

    Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

  20. test/datapipes/core/test_transforms.py, line 619-659 (link)

    style: Another large block of commented test code that should be cleaned up

    Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

  21. examples/minimal/datapipes/tutorial_01_getting_started.py, line 300 (link)

    logic: This line will cause a TypeError. batch_data is a TensorDict, not a tuple/list, so batch_data[1] is invalid. Remove this debug print statement or access a valid key.

  22. test/datapipes/core/test_transforms.py, line 764-767 (link)

    style: Remove commented test code to keep the file clean

    Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

  23. physicsnemo/datapipes/core/transforms/compose.py, line 93-95 (link)

    style: Missing return type annotation for __iter__ method. Should be Iterator[Transform].

  24. physicsnemo/datapipes/core/readers/numpy.py, line 186 (link)

    style: This random sampling could cause reproducibility issues if no seed is set. Consider using a seeded random state or documenting the need for external seed management. Should this method accept a random state parameter or use a class-level random state to ensure reproducible subsampling?

    Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

  25. physicsnemo/datapipes/core/readers/numpy.py, line 252 (link)

    style: The np.array(arr) wrapping may create an unnecessary copy. Since arr is already a numpy array from npz, torch.from_numpy(arr) should work directly.

    Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

  26. physicsnemo/datapipes/core/transforms/geometric.py, line 53-57 (link)

    syntax: Example uses undefined Sample class - should use TensorDict

  27. physicsnemo/datapipes/core/transforms/geometric.py, line 225-228 (link)

    syntax: Example references TranslationInvariance but class is named Translate

  28. physicsnemo/datapipes/core/transforms/geometric.py, line 282-287 (link)

    logic: Logic error: device assignment after type error could be unreachable

  29. physicsnemo/datapipes/core/transforms/geometric.py, line 305-308 (link)

    syntax: Example references ScaleInvariance but class is named ReScale

  30. physicsnemo/datapipes/core/transforms/subsample.py, line 147-150 (link)

    style: The example uses Sample but the actual parameter type is TensorDict. This inconsistency could confuse users.

    Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

  31. physicsnemo/datapipes/core/transforms/subsample.py, line 244 (link)

    logic: Missing validation that weights tensor has the same length as the data tensors being sampled.

  32. physicsnemo/datapipes/core/readers/tensorstore_zarr.py, line 34 (link)

    style: Using check_version_spec instead of the recommended check_min_version from MOD-011. The coding standards specify using check_min_version(package, version, hard_fail=False) for optional dependency handling.

    Context Used: File from greptile.json - CODING_STANDARDS/MODELS_IMPLEMENTATION.md (source)

  33. physicsnemo/datapipes/core/readers/tensorstore_zarr.py, line 179-185 (link)

    style: The _spec_template is defined but never used in the implementation. Consider removing it or documenting its intended purpose.

    Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

  34. physicsnemo/datapipes/core/readers/tensorstore_zarr.py, line 272 (link)

    style: Field discovery is called twice (line 167 and here) for the same group, which is inefficient. Consider caching the result from the first discovery.

    Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

  35. physicsnemo/datapipes/core/transforms/spatial.py, line 53-58 (link)

    logic: Example uses undefined Sample class instead of TensorDict. Should be consistent with actual data structure used in the transforms.

  36. physicsnemo/datapipes/core/transforms/normalize.py, line 285-291 (link)

    style: Device transfer modifies internal state during forward pass which could cause issues in multi-threaded environments or when the same transform is used across different devices simultaneously

    Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

  37. physicsnemo/datapipes/core/transforms/normalize.py, line 197 (link)

    syntax: Return type annotation is imprecise - should be dict[str, dict[str, torch.Tensor]] to match the actual structure returned

  38. physicsnemo/datapipes/core/transforms/normalize.py, line 357-371 (link)

    logic: Inconsistent with forward pass - inverse method doesn't update internal state when transferring to device, which could cause device mismatch issues

  39. physicsnemo/datapipes/core/transforms/spatial.py, line 285-286 (link)

    logic: Potential indexing error when k=1. Slicing [:, 1:] on a tensor with shape [M, 1, ...] results in an empty tensor [M, 0, ...]. Should there be validation that k > 1, or should the slicing logic handle the k=1 case differently? (A short demo of this failure mode appears after this list.)

  40. physicsnemo/datapipes/core/dataset.py, line 77 (link)

    syntax: Import path in docstring example uses from datapipe import but should be from physicsnemo.datapipes.core import to match the actual module structure

  41. physicsnemo/datapipes/core/readers/hdf5.py, line 128 (link)

    style: Opening HDF5 file handle in constructor without explicit mode could lead to resource leaks if constructor fails after this point.

    Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

  42. physicsnemo/datapipes/core/readers/vtk.py, line 32 (link)

    logic: Using deprecated function name. Should use check_min_version instead of check_version_spec according to MOD-011.

    Context Used: File from greptile.json - CODING_STANDARDS/MODELS_IMPLEMENTATION.md (source)

  43. physicsnemo/datapipes/core/readers/vtk.py, line 165 (link)

    logic: Duplicate key surface_normals exists in both _stl_keys (line 161) and _vtp_keys (line 165). This could cause confusion when filtering keys.

  44. physicsnemo/datapipes/core/readers/vtk.py, line 247-252 (link)

    logic: Inconsistent logic: need_stl defaults to True when keys_to_read is None, but need_vtp and need_vtu require explicit keys. This asymmetry may cause unexpected behavior.

  45. physicsnemo/datapipes/core/transforms/field_processing.py, line 111-112 (link)

    logic: Scalar expansion logic may not handle multi-dimensional features correctly. A 1D feature with shape (5,) would be stacked incorrectly with scalars that get expanded to (1,).

  46. physicsnemo/datapipes/core/transforms/field_processing.py, line 120 (link)

    syntax: broadcast_to(n_points, -1) syntax is invalid. PyTorch requires explicit target shape like (n_points, fx.shape[-1]).

  47. physicsnemo/datapipes/core/transforms/field_processing.py, line 124-128 (link)

    style: __repr__ is missing n_points_key parameter in the representation string

    Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
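
To make comment 39 above concrete, here is a minimal demonstration of the k=1 slicing pitfall it describes (shapes are illustrative):

```python
import torch

# A kNN result that stores each point as its own first neighbor has shape
# [M, k, 3]. Dropping the self-neighbor with [:, 1:] silently empties the
# tensor when k == 1 instead of raising an error.
neighbors = torch.randn(4, 1, 3)  # M=4 points, k=1
trimmed = neighbors[:, 1:]        # intended to drop the self-neighbor
print(trimmed.shape)              # torch.Size([4, 0, 3]) -- no neighbors left
```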

48 files reviewed, 47 comments


Collaborator

@peterdsharpe left a comment

Submitting initial comments so far; continuing with review...

# PhysicsNeMo DataPipes

Dataloading is critical to SciML applications, both for training and inference,
and the physicsnemo datapipe infrastructure aims to deliver a flexible and configurable
Collaborator

Somewhat minor, but I notice that python / physicsnemo / pytorch are consistently not capitalized throughout this *.md and others. My preference would be to capitalize them in any plaintext docs ("Python", "PhysicsNeMo", "PyTorch").

Collaborator Author

Good point, I'll fix it. At least you know that if an LLM had written this, it would have capitalized it :)

in multiple threads, instead of multiple processes, and streams enable multiple
preprocessing pipelines to execute concurrently on the GPU.

3. **Unambiguous Configuration and Serialization** - Datapipes can be a particularly
Collaborator

Would be worth defining "datapipes" here for new readers

Collaborator Author

Good idea, on it.

self._device: Optional[torch.device] = None

@abstractmethod
def __call__(self, data: TensorDict) -> TensorDict:
Collaborator

Are re-implementations allowed to have optional kwargs, or is it really just data allowed and that's it?

Collaborator Author

The design is really "just data and that's it". The call to all transforms is mostly automatic and transparent to the user; it happens in the dataset:

# Apply transforms (data is now on target device if specified)
if self.transforms is not None:
    if stream is not None:
        with torch.cuda.stream(stream):
            data = self.transforms(data)
        # Record event for synchronization
        result.event = torch.cuda.Event()
        result.event.record(stream)
    else:
        data = self.transforms(data)

My intention is that configuration of transforms happens at initialization - and you can have multiple of the same kind with different arguments, that's fine. Is there some dynamic behavior we need that isn't supported?
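
For instance, something like this (a minimal sketch; the Compose usage and SubsamplePoints arguments are assumptions, not verbatim from the PR):

```python
from physicsnemo.datapipes.core.transforms import Compose, SubsamplePoints

# All configuration happens at construction time; __call__(data) takes only
# the TensorDict. Two instances of the same transform class with different
# arguments compose just fine.
pipeline = Compose([
    SubsamplePoints(n_points=100_000),  # coarse pass
    SubsamplePoints(n_points=10_000),   # finer pass: same class, new arguments
])
```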

"""
Move any internal tensors to the specified device.

Override this method if your transform has internal tensor state.
Collaborator

Could we provide a default implementation that checks self.__dict__ for any tensor leaves, and transfers them if so?
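
Something like this minimal sketch, perhaps (assuming the base class keeps the _device attribute shown in the diff above):

```python
import torch

def to(self, device: torch.device) -> None:
    # Default implementation: scan the instance's attributes for tensor
    # leaves and move each one to the target device. Subclasses with
    # nested or non-tensor state can still override this.
    for name, value in self.__dict__.items():
        if isinstance(value, torch.Tensor):
            setattr(self, name, value.to(device))
    self._device = device
```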

Collaborator Author

Sure, good idea; that makes it easier to extend in user space.

def __repr__(self) -> str:
    extra = self.extra_repr()
    if extra:
        return f"{self.__class__.__name__}({extra})"
Collaborator

Super minor, but can't this if statement be removed for readability, since it defaults to ""?

   def __repr__(self) -> str:
       return f"{self.__class__.__name__}({self.extra_repr()})"

@@ -0,0 +1,447 @@
# SPDX-FileCopyrightText: Copyright (c) 2023 - 2025 NVIDIA CORPORATION & AFFILIATES.
Collaborator

Would these tutorials perhaps be clearer as Jupyter notebooks?

Seems like there's a lot of print() going on that might be better presented as a Markdown block. (Plus other benefits, like being able to convert this into a docs page later, etc.)

Collaborator Author

Sure! I didn't do that because we don't have many notebooks floating around in the examples section. But it is cleaner in a notebook, yeah.

... super().__init__(**kwargs)
... self.data = load_my_data(path)
...
... def _load_sample(self, index: int) -> dict[str, torch.Tensor]:
Collaborator

The docs above say "Return (TensorDict, metadata_dict)", but here it looks like everything's returning dict[str, torch.Tensor] rather than TensorDict?

Collaborator Author

The docs are referring to the __getitem__ call, which does the actual TensorDict wrapping:

def __getitem__(self, index: int) -> tuple[TensorDict, dict[str, Any]]:

The idea here is that it's conceptually simple to implement _load_sample(self, index: int) -> dict[str, torch.Tensor] (plus __len__, which is also required), and the base class handles the metadata and TensorDict conversion for you.
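
A minimal sketch of that extension point (the Dataset import is an assumed export, and load_my_data is the placeholder loader from the docstring above):

```python
import torch
from physicsnemo.datapipes.core import Dataset  # assumed export

class MyDataset(Dataset):
    def __init__(self, path: str, **kwargs):
        super().__init__(**kwargs)
        self.data = load_my_data(path)  # placeholder from the docstring

    def __len__(self) -> int:
        return len(self.data)

    def _load_sample(self, index: int) -> dict[str, torch.Tensor]:
        # Return plain tensors; the base class's __getitem__ wraps them
        # in a TensorDict and attaches metadata for you.
        return {"coords": self.data[index]}
```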

coreyjadams and others added 3 commits January 7, 2026 11:16
Co-authored-by: Peter Sharpe <peterdsharpe@gmail.com>
Co-authored-by: Peter Sharpe <peterdsharpe@gmail.com>


@register_transform()
class KNNNeighbors(Transform):
Collaborator

This is a dumb nitpick, but NNN feels very typo-prone.

Collaborator

@Alexey-Kamenev left a comment

LGTM!

print("Tutorial 3 Complete!")
print()
print("Key takeaways:")
print(" 1. KNNNeighbors transform: Computes graph structure from point clouds")
Collaborator

I think Kelvin already pointed that out, but one N is redundant. I would use something like KNearestNeighbors instead.
