mesh: enable ShardTensor support for mesh conversion/geometry paths #1608
loliverhennigh wants to merge 20 commits into
Conversation
  from ._shard_tensor_spec import ShardTensorSpec
- from .shard_tensor import ShardTensor, scatter_tensor
+ from .shard_tensor import ShardTensor, replicated_zeros_like, scatter_tensor
Hey @coreyjadams, if I'm adding the zeros_like correctly here, I might add a few similar ops for consistency, even though they are not strictly needed.
dim = kwargs.get("dim", -1)
if len(args) > 2:
    dim = args[2]

if not isinstance(input_tensor, ShardTensor) or not isinstance(
    other_tensor, ShardTensor
):
    raise RuntimeError(
        "torch.linalg.cross with ShardTensor inputs requires both arguments to be ShardTensor."
    )
torch.cross dim defaulting diverges from original behavior
When _cross_wrapper is invoked via the torch.cross handler (registered on line 1062) and the caller omits dim, the wrapper defaults to dim=-1. However, torch.cross (pre-deprecation) auto-detects the first dimension of size 3, which may not be the last dimension. Any call like torch.cross(a_shard, b_shard) where the cross-product axis isn't the last one will silently produce a wrong result instead of matching the original op's semantics.
For torch.linalg.cross, dim is keyword-only and defaults to -1, so the current default is correct there. For the torch.cross handler, consider either (a) raising explicitly when dim is absent to force callers to use the unambiguous form, or (b) documenting that only the torch.linalg.cross semantic (dim=-1) is supported.
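To make the divergence concrete, here's a minimal repro sketch (plain tensors, no sharding, just to show the legacy dim auto-detection):

```python
import torch

a = torch.randn(3, 5)  # the size-3 axis is dim 0, not the last dim
b = torch.randn(3, 5)

# Legacy torch.cross auto-detects the first dimension of size 3:
legacy = torch.cross(a, b)                    # computed along dim=0
explicit = torch.linalg.cross(a, b, dim=0)    # matches the legacy behavior
assert torch.allclose(legacy, explicit)

# A wrapper defaulting to dim=-1 targets the size-5 axis instead;
# torch.linalg.cross(a, b) would raise here because dim -1 isn't size 3.
```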
if not isinstance(input_tensor, ShardTensor) or not isinstance(
    other_tensor, ShardTensor
):
    raise RuntimeError(
        "torch.linalg.cross with ShardTensor inputs requires both arguments to be ShardTensor."
    )
Hardcoded error message for both torch.cross and torch.linalg.cross
The error string says "torch.linalg.cross with ShardTensor inputs…" but this wrapper is registered for both torch.linalg.cross (line 1060) and torch.cross (line 1062). When the handler fires via torch.cross, the message will mislead users.
  if not isinstance(input_tensor, ShardTensor) or not isinstance(
      other_tensor, ShardTensor
  ):
-     raise RuntimeError(
-         "torch.linalg.cross with ShardTensor inputs requires both arguments to be ShardTensor."
-     )
+     raise RuntimeError(
+         f"{func.__module__}.{func.__name__} with ShardTensor inputs requires both arguments to be ShardTensor."
+     )
Blaa, some stuff got messed up and it's not quite ready for review. I'll fix tonight...
def _is_sharded_tensor(tensor: torch.Tensor) -> bool:
    return hasattr(tensor, "_spec") and hasattr(type(tensor), "from_local")
@coreyjadams is this best-practices for type-narrowing ShardTensor?
Feels like isinstance(tensor, ShardTensor) would be better, unless there's something I'm missing?
(And if this does get replaced with isinstance, then this can be inlined rather than keeping a separate function.)
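Something like this, as a sketch (the ShardTensor import path is assumed):

```python
import torch

from physicsnemo.distributed import ShardTensor  # assumed import path

def _is_sharded_tensor(tensor: torch.Tensor) -> bool:
    # isinstance narrows the type explicitly instead of duck-typing on
    # private attributes; simple enough that callers could inline it.
    return isinstance(tensor, ShardTensor)
```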
)

def _mesh_to_mode(mesh: Mesh, *, mesh_tensor_mode: str, mesh_shard_device_mesh) -> Mesh:
Docstrings needed (here and in the functions above), e.g.:
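A minimal sketch for _mesh_to_mode (parameter descriptions are guesses from the signature, not verified against the implementation):

```python
def _mesh_to_mode(mesh: Mesh, *, mesh_tensor_mode: str, mesh_shard_device_mesh) -> Mesh:
    """Return a copy of ``mesh`` with its data converted to the given tensor mode.

    Parameters
    ----------
    mesh : Mesh
        Mesh whose point/cell data should be converted.
    mesh_tensor_mode : str
        Target tensor mode for the mesh data.
    mesh_shard_device_mesh
        Device mesh used when converting tensors to ``ShardTensor``.
    """
```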
@pytest.fixture(params=_MESH_TENSOR_MODES)
def mesh_tensor_mode(request):
@pytest.fixture
def mesh_shard_device_mesh(mesh_tensor_mode, _single_rank_dist_group):
Type hints + docstrings needed
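For instance, a hedged sketch for the first fixture (the return type is a guess):

```python
import pytest

@pytest.fixture(params=_MESH_TENSOR_MODES)
def mesh_tensor_mode(request: pytest.FixtureRequest) -> str:
    """Tensor mode under test, parametrized over ``_MESH_TENSOR_MODES``."""
    return request.param
```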
    n_cells: int,
    point_placement: Replicate | Shard,
    cell_placement: Replicate | Shard,
):
| r"""Create zeros matching a tensor's device/mesh semantics. | ||
|
|
||
| For ``ShardTensor`` inputs this returns a replicated ``ShardTensor`` on the | ||
| same mesh. For regular tensors this falls back to ``torch.zeros`` on the |
Given that this is part of the public API, please add a complete NumPy-style docstring, e.g.:
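A sketch along these lines (parameter semantics inferred from the surrounding diff; the dtype default is an assumption):

```python
def replicated_zeros_like(
    tensor: torch.Tensor,
    shape: Sequence[int] | torch.Size,
    *,
    dtype: torch.dtype | None = None,
) -> torch.Tensor:
    r"""Create zeros matching a tensor's device/mesh semantics.

    Parameters
    ----------
    tensor : torch.Tensor
        Reference tensor supplying device (and, when sharded, mesh) semantics.
    shape : Sequence[int] or torch.Size
        Shape of the returned tensor.
    dtype : torch.dtype, optional
        Output dtype; defaults to ``tensor.dtype``.

    Returns
    -------
    torch.Tensor
        Zero-filled tensor of ``shape``; a replicated ``ShardTensor`` on the
        same mesh when ``tensor`` is a ``ShardTensor``.
    """
```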
- src_data=cell_values[cell_indices],
+ src_data=cell_values.unsqueeze(1)
+     .expand(-1, n_vertices_per_cell, *cell_values.shape[1:])
+     .reshape(-1, *cell_values.shape[1:]),
Why is Mesh.cell_data_to_point_data() updated, but Mesh.point_data_to_cell_data() is not? The asymmetry seems suspect: it seems like either both need updates for ShardTensor or neither does, but I might be missing something?
Or is this just an unrelated change (and if so, what motivated it)?
It's because we needed a ShardTensor to index, hence the change. In point_data_to_cell_data we use cells to index, but that is already a ShardTensor.
def _convert_data_for_mode(
    data: dict[str, torch.Tensor] | None,
This function incorrectly:
a) accepts dict[str, torch.Tensor] instead of TensorDict, and
b) iterates only over top-level keys.
Combined with downstream uses that pass in a TensorDict, this causes issues with hierarchical dictionaries (specifically, nested values will remain plain torch.Tensor).
Consider using TensorDict.apply() instead, for example:
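A sketch of the apply-based version (`_convert_tensor` is a hypothetical per-leaf helper, not from the PR):

```python
from tensordict import TensorDict

def _convert_data_for_mode(data: TensorDict | None, mesh_tensor_mode: str) -> TensorDict | None:
    if data is None:
        return None
    # TensorDict.apply recurses into nested TensorDicts, so hierarchical
    # entries are converted too; a top-level key loop would miss them.
    # _convert_tensor: hypothetical per-leaf conversion helper.
    return data.apply(lambda t: _convert_tensor(t, mesh_tensor_mode))
```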
coreyjadams left a comment
Hi @loliverhennigh - I have left a number of comments on the PR. I think there are some updates to make, though I'm happy to see the changes overall are not too extensive. I think there are some design choices that cut against the philosophy of ShardTensor that we should tweak, but it doesn't look like it's going to be too much.
Happy to discuss this in as much detail as you'd like offline!
return _ToTorchTensor.apply(self, grad_placements)

def new_replicated_zeros(
Adding this function as an API call on ShardTensor, vs. supporting the underlying dispatch call, has to have a really good motivation.
What is the value of this vs. supporting the backend of torch.zeros_like(a) when a is a ShardTensor? And, in fact, I think that should already work?
return ShardTensor.from_local(
    local,
    self._spec.mesh,
    [Replicate() for _ in range(self._spec.mesh.ndim)],
    sharding_shapes="infer",
This is unusual here: typically I'd object loudly to passing "infer" as sharding shapes, since that can trigger a blocking allreduce and is a major perf headache. But you're replicating on all ranks, so I don't think I understand this function's role, really.
def replicated_zeros_like(
    tensor: torch.Tensor,
    shape: Sequence[int] | torch.Size,
    *,
    dtype: torch.dtype | None = None,
) -> torch.Tensor:
I still think it makes more sense to support torch.zeros_like(a) for a as a replicated tensor.
tl;dr: the ShardTensor design philosophy is to make zero code changes on the user side whenever possible, so we support torch calls on ShardTensors first and foremost rather than introduce new API. Is it possible to implement your work without this?
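A rough sketch of what the dispatch-first version could look like (the registration hook name is assumed, not a verified physicsnemo API):

```python
import torch

def _zeros_like_wrapper(func, types, args, kwargs):
    # Build zeros from the local shard, then rewrap with the source's
    # mesh/placements so torch.zeros_like(a) "just works" for users.
    src = args[0]
    local = torch.zeros_like(src.to_local(), **(kwargs or {}))
    return ShardTensor.from_local(local, src._spec.mesh, src._spec.placements)

# Registration hook below is an assumed name:
# ShardTensor.register_function_handler(torch.zeros_like, _zeros_like_wrapper)
# User code then stays plain torch: b = torch.zeros_like(a)
```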
def _cross_wrapper(func, types, args, kwargs):
    if kwargs is None:
        kwargs = {}

    if kwargs.get("out", None) is not None:
        raise RuntimeError("torch.linalg.cross(out=...) is not supported for ShardTensor.")
If this is a function overload, it's in the wrong place. shard_tensor.py is for the core tensor object only; there is a subfolder for ops.
raise RuntimeError(
    "torch.linalg.cross with ShardTensor inputs requires both arguments to be ShardTensor."
)
It looks like what we want to be doing is implementing a wrapper for torch.linalg.cross on ShardTensor objects. There is no need to check that all objects are ShardTensor at this time.
- aggregated_data = torch.zeros((n_dst, *data_shape), dtype=dtype, device=device)
+ aggregated_data = _replicated_zeros_like(
+     src_data, (n_dst, *data_shape), dtype
+ )
The distributed-friendly update to make here is, if mesh can support it, to port torch.zeros away from this explicit shape and onto torch.zeros_like: whatever is building data_shape as input, we build zeros_like(that_object), and then mesh can work on single-device and sharded inputs too. For example:
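A minimal sketch of that idea (`dst_template` is a hypothetical stand-in for whatever upstream object already carries the (n_dst, *data_shape) shape):

```python
import torch

dst_template = torch.empty(4, 3)  # plain tensor here; a ShardTensor in the sharded path
aggregated_data = torch.zeros_like(dst_template)
# zeros_like dispatches on its input's type, so the same line yields a plain
# tensor on a single device and a sharded/replicated tensor otherwise.
```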
- weights = torch.ones(len(src_to_dst_mapping), dtype=dtype, device=device)
+ if _is_sharded_tensor(src_data):
+     weights = torch.ones_like(src_to_dst_mapping, dtype=dtype)
+ else:
+     weights = torch.ones(len(src_to_dst_mapping), dtype=dtype, device=device)
This is much closer to the "right" way for domain parallelism, but in fact we can probably consolidate to just "ones_like" for all paths.
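E.g., a consolidated sketch, assuming src_to_dst_mapping is a 1-D index tensor (plain or sharded):

```python
import torch

src_to_dst_mapping = torch.arange(8)  # stand-in; a ShardTensor in the sharded path
# ones_like dispatches on the input's type, so one line covers both paths:
weights = torch.ones_like(src_to_dst_mapping, dtype=torch.float32)
```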
-     dtype=dtype,
-     device=device,
- )
+ aggregated_data = _replicated_zeros_like(src_data, (n_dst, *data_shape), dtype)
Same here re: zeros_like.
- weight_sums = torch.zeros(n_dst, dtype=dtype, device=device)
+ weight_sums = _replicated_zeros_like(src_data, (n_dst,), dtype)
Same here re: zeros_like.
converted = self.cell_data.apply(
    lambda cell_values: scatter_aggregate(
        src_data=cell_values[cell_indices],
Indexing a ShardTensor with a ShardTensor index should work fine?
cell_indices is not a ShardTensor the way it was before; it was coming from just the torch.arange function.
\blossom-ci
\blossom-ci
Hi @coreyjadams - quick follow-up on this PR. The latest GitHub Actions checks are green, and I re-triggered Blossom with \blossom-ci.
\blossom-ci
/blossom-ci
I took a look at trying to update the tests to align them with the domain parallelism tests; the test coverage here was only ~60 tests, all on CPU, for domain parallelism. It isn't aligned with the torchrun syntax that domain parallel tests expect (look at /test/plugins/distributed_fixtures.py for more info). The cross operation looks like it's pretty much implemented OK, but the tests aren't quite aligned with the way the op testing is done in the domain parallelism suite, so we're falling short there too.

I just don't think there is sufficient test coverage to say whether we've achieved domain parallelism on PhysicsNeMo mesh, so I don't think this is ready to go for the RC. There are ~2000 tests for mesh, and while we don't need all of them, being able to test against most of the core functionality + things we would want in datapipes, models, preprocessing, etc. for domain parallelism is, I think, a prerequisite for merge here. 60 tests (3% coverage max, if we're generous), all on CPU, is probably short, I think?

@peterdsharpe Any suggestions of what core mesh functionality should be prioritized for testing in a distributed mode? I'm thinking we need generic mesh operations (slicing, scatter_ops, vertex_to_cell and the opposite) as well as, probably, manifold projection operations. What else?

I think we'd also want good distributed-mesh-init testing: how to initialize a mesh with sharded tensors and make sure it works, gradients flow, and we can resize / reshape / redistribute it, etc. Core tensor ops are covered with ShardTensor already, so just mesh-specific things are fine.
Yeah, I think these are all great calls @coreyjadams. Re Mesh testing, the high-risk areas where it might be good to add testing coverage for a sharded Mesh are as follows:
If it's possible to add coverage for these tests in a relatively low-code-duplication way, that would be great; we definitely don't want a full copy of this test suite that will inevitably fall out of sync. Maybe a
/blossom-ci |
This reverts commit 8bc39bf.
)

def _surface_mesh() -> Mesh:
Would recommend using the meshes from physicsnemo.mesh.primitives here, which are intended for reuse and in general can have much more interesting behavior (to more easily catch edge cases).
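Something like this, as a sketch (the exact factory names in physicsnemo.mesh.primitives are assumed, not verified):

```python
from physicsnemo.mesh import primitives  # module named in the comment above

def _surface_mesh() -> Mesh:
    # hypothetical primitive factory; use whichever surface primitive exists
    return primitives.sphere()
```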
/blossom-ci |
PhysicsNeMo Pull Request
Description