Ci deps group #1634

Open

coreyjadams wants to merge 7 commits into main from ci-deps-group

Conversation

@coreyjadams
Collaborator

This PR brings our GitHub CI in line with Blossom CI.

I had to tweak a few things to get tests to pass, so a few folks are getting pulled in as code owners. Please review the changes.

@pzharrington I'm tagging you specifically because I had to tweak the natten tests for Flex attention backward on CPU. I don't know why that passes on Blossom and not GitHub...

PhysicsNeMo Pull Request

Description

Checklist

Dependencies

Review Process

All PRs are reviewed by the PhysicsNeMo team before merging.

Depending on which files are changed, GitHub may automatically assign a maintainer for review.

We are also testing AI-based code review tools (e.g., Greptile), which may add automated comments with a confidence score.
This score reflects the AI's assessment of merge readiness; it is not a qualitative judgment of your work, nor an indication that the PR will be accepted or rejected.

AI-generated feedback should be reviewed critically for usefulness.
You are not required to respond to every AI comment, but they are intended to help both authors and reviewers.
Please react to Greptile comments with 👍 or 👎 to provide feedback on their accuracy.

Address the failures observed after enabling the full extras surface
in CI:

* Drop sparse_dot_mkl entirely.  bsms.py now uses scipy CSR `@` for
  the adjacency-squaring step, which is equivalent to the previous
  dot_product_mkl call without the libmkl_rt runtime dependency.
  Removed from the CI install step, Dockerfile, install-hint registry,
  the four bsms test decorators, and the aero_graph_net README.
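The adjacency-squaring swap can be illustrated with a small sketch (toy graph, not the actual bsms.py code): squaring a CSR adjacency matrix with scipy's native `@` gives the same 2-hop reachability counts that `dot_product_mkl(adj, adj)` produced, with no MKL runtime needed.

```python
import numpy as np
import scipy.sparse as sp

# Toy adjacency for a 4-node path graph: 0-1-2-3 (symmetric edges).
rows = np.array([0, 1, 1, 2, 2, 3])
cols = np.array([1, 0, 2, 1, 3, 2])
adj = sp.csr_matrix((np.ones(len(rows)), (rows, cols)), shape=(4, 4))

# Squaring the adjacency matrix counts 2-hop walks between nodes;
# previously dot_product_mkl(adj, adj), now plain CSR `@`.
adj2 = adj @ adj

# Node 0 reaches node 2 in exactly two hops (via node 1), so the
# (0, 2) entry is nonzero; (0, 3) needs three hops, so it stays zero.
print(adj2[0, 2], adj2[0, 3])
```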

* Bump container /dev/shm to 2 GiB.  DALI's multiprocess worker pool
  exhausts the docker default (64 MiB) via SemLock allocations, which
  surfaces as `OSError: No space left on device` in datapipes tests.
  Applied to all 5 container blocks across github-pr.yml and
  github-nightly-uv.yml.

* Force-reinstall the PyG-ecosystem wheels (torch_scatter,
  torch_sparse, torch_cluster, pyg_lib) from the PyG wheel index
  matching the locked torch version.  Without CUDA toolchain visible
  at uv-sync time, uv lands the CPU-only source build, which makes
  FigConvNet (and any model calling segment_csr / similar on CUDA
  tensors) fail with "Not compiled with CUDA support".  URL is held
  in PYG_WHL_INDEX so the torch-version coupling lives in one spot.

* Narrow the natten test_backward skip.  The previous device=="cpu"
  early-skip was too broad; now we wrap the forward call and only
  skip on the specific NotImplementedError raised by FlexAttention's
  CPU-backward guard.  If natten picks a different backend (or
  FlexAttention ever supports CPU backward), the test will run.
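The narrowed skip can be sketched as follows. This is a minimal illustration of the pattern, not the actual test code: `run_forward_or_skip`, `forward_fn`, and the `skip` callback are hypothetical names standing in for the natten na1d/na2d/na3d call and pytest.skip.

```python
def run_forward_or_skip(forward_fn, skip):
    """Run the forward call; skip only on FlexAttention's CPU-backward guard."""
    try:
        return forward_fn()
    except NotImplementedError as err:
        # Only the specific CPU-backward limitation is a skip; any other
        # NotImplementedError is a real failure and should propagate.
        if "backward" in str(err).lower():
            skip(f"skipped: {err}")
            return None
        raise

def fake_flex_forward():
    # Simulates the error FlexAttention's CPU-backward guard raises.
    raise NotImplementedError("FlexAttention does not support backward on CPU")

skipped = []
run_forward_or_skip(fake_flex_forward, skipped.append)
print(skipped)  # the skip reason was recorded instead of failing the test
```

If natten picks a different backend, `forward_fn` returns normally and the test body runs; an unrelated NotImplementedError re-raises and still fails the test.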
@copy-pr-bot

copy-pr-bot Bot commented May 11, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@greptile-apps
Contributor

greptile-apps Bot commented May 11, 2026

Greptile Summary

This PR aligns GitHub CI with the internal Blossom CI environment by expanding installed extras, adding CI-only test dependency installation (PyG wheels, moto, scikit-image, etc.), increasing /dev/shm to 2 GiB for DALI tests, and adding Xvfb for VTK/pyvista headless rendering. It also removes the sparse_dot_mkl/libmkl_rt runtime dependency in favour of scipy's native CSR @ operator, fixes a long-standing double-invocation bug in the check_ort_install decorator, adds the missing physicsnemo/deploy/__init__.py, and works around natten's FlexAttention CPU backward limitation in tests.

  • sparse_dot_mkl removal: BistrideMultiLayerGraph now uses adj_mat @ adj_mat on a scipy CSR matrix, which is functionally equivalent for BFS-hop reachability and removes the MKL runtime dependency from the Dockerfile and all @requires_module guards.
  • check_ort_install decorator fix: Removes the pre-existing bug where the wrapped function (get_ort_session, run_onnx_inference) was called twice per invocation; return value was previously discarded on the first call.
  • Natten CPU backward workaround: Both test_natten.py and the Natten2DSelfAttention doctest now guard against "FlexAttention does not support backward on CPU" so tests skip gracefully rather than error on CPU-only runners.

Important Files Changed

Filename: Overview

.github/actions/bootstrap-cudnn-ci/action.yml: Adds Xvfb headless display support and rendering packages (libgl1, xvfb, etc.) for VTK/pyvista tests; reorganises apt packages into toolchain/rendering groups with clear comments.
.github/actions/setup-uv-env/action.yml: Adds a new "Install CI-only test dependencies" step that layers moto, scikit-image, and PyG wheels on top of the uv lockfile; hardcodes the torch-2.11.0+cu128 wheel index URL, which must be bumped in lockstep with the torch pin.
.github/workflows/github-nightly-uv.yml: Expands EXTRAS_TAG to the full feature set, bumps the UV cache key prefix to force a rebuild, adds --shm-size=2g to all three job containers.
.github/workflows/github-pr.yml: Mirrors the nightly changes: expands EXTRAS_TAG, bumps the UV cache key prefix, adds --shm-size=2g to job containers.
physicsnemo/deploy/onnx/utils.py: Fixes a pre-existing double-call bug in the check_ort_install decorator; switches the ort import to the OptionalImport lazy pattern with an .available check.
physicsnemo/datapipes/gnn/bsms.py: Replaces sparse_dot_mkl.dot_product_mkl with the scipy CSR @ operator, removing the libmkl_rt runtime dependency entirely.
physicsnemo/deploy/__init__.py: New empty __init__.py (license header only) making physicsnemo.deploy a proper package.
test/nn/functional/test_natten.py: Wraps na1d/na2d/na3d calls in try/except to skip when natten picks the FlexAttention backend on CPU (which does not support backward with requires_grad inputs).
physicsnemo/nn/module/dit_layers.py: Wraps the Natten2DSelfAttention doctest in torch.no_grad() to avoid the FlexAttention CPU backward error in doctest runners.
Dockerfile: Removes sparse-dot-mkl from the CI-only install layer.
physicsnemo/core/version_check.py: Removes sparse_dot_mkl from the _PACKAGE_HINTS install-hint registry.
test/datapipes/test_bsms.py: Removes sparse_dot_mkl from the @requires_module guard on all three BSMS test functions.
test/models/meshgraphnet/test_bsms_mgn.py: Removes sparse_dot_mkl from the @requires_module guard on all four test functions.

Comments Outside Diff (1)

  1. physicsnemo/deploy/onnx/utils.py, lines 36-47

    P2 Pre-existing double-call bug fixed here

    Worth calling out explicitly: before this PR the decorator called func(*args, **kwargs) twice — once discarding the return value, then again with return. Every call to get_ort_session created and discarded a first ORT session, and every call to run_onnx_inference ran inference twice. This PR fixes both by removing the redundant bare call. Good catch.
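The bug class is easy to reproduce in isolation. A minimal sketch (illustrative decorator names, not the PhysicsNeMo code): a decorator that calls the wrapped function once with the result discarded, then again for the return value, doubles every side effect of the call.

```python
import functools

calls = {"n": 0}

def check_install_buggy(func):
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        func(*args, **kwargs)           # bug: result discarded, work done once for nothing
        return func(*args, **kwargs)    # second call produces the actual return value
    return wrapper

def check_install_fixed(func):
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        return func(*args, **kwargs)    # fix: single call, result returned directly
    return wrapper

@check_install_buggy
def make_session_buggy():
    calls["n"] += 1                     # stands in for creating an ORT session
    return "session"

make_session_buggy()
buggy_calls = calls["n"]                # two sessions created per invocation

calls["n"] = 0

@check_install_fixed
def make_session_fixed():
    calls["n"] += 1
    return "session"

make_session_fixed()
fixed_calls = calls["n"]                # one session created per invocation
print(buggy_calls, fixed_calls)
```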

Reviews (1): Last reviewed commit: "Flex attention fix for doctests"

Comment on lines +197 to +211
      # wheels matching the locked torch version.
      - name: Install CI-only test dependencies
        shell: bash
        env:
          UV_LINK_MODE: copy
          PYG_WHL_INDEX: "https://data.pyg.org/whl/torch-2.11.0+cu128.html"
          # cuml-cu12 -> cudf -> numba caps numpy at <=2.2; uv.lock pins
          # numpy 2.2.6 under the cu12 extra precisely for this reason.
          # Without an explicit constraint here, uv pip install resolves
          # transitive deps (e.g. tensorstore, pyarrow) freely and bumps
          # numpy to 2.4.x, which crashes every test that touches cuml.
          # Re-applying the cap on each layered install keeps numba happy.
          NUMPY_PIN: "numpy<2.3"
        run: |
          set -euo pipefail
P2 Hardcoded torch wheel index URL must track the torch pin

PYG_WHL_INDEX encodes torch-2.11.0+cu128 literally. If the torch pin in uv.lock is ever bumped, this URL silently serves incompatible wheels (or 404s for the new version) until it is also updated. The inline comment already warns about this, but there is no automated guard (e.g., a CI step that extracts the locked torch version and verifies the URL matches) to catch drift. Consider whether a brief assertion or a shared variable derivation from the lockfile would be feasible to make the coupling more explicit.
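The suggested drift guard could look like the following sketch. The lockfile snippet and URL here are illustrative stand-ins; a real CI step would read the actual uv.lock (TOML) and could use tomllib rather than a regex.

```python
import re

# Hypothetical excerpt of a uv.lock entry for torch (illustrative only).
uv_lock_snippet = '''
[[package]]
name = "torch"
version = "2.11.0"
'''

# The hardcoded wheel index from the CI step under discussion.
pyg_whl_index = "https://data.pyg.org/whl/torch-2.11.0+cu128.html"

# Extract the locked torch version from the lockfile text.
match = re.search(r'name = "torch"\s*\nversion = "([^"]+)"', uv_lock_snippet)
locked_torch = match.group(1)

# Fail the CI step early if the hardcoded index URL drifts from the pin.
url_matches = f"torch-{locked_torch}+" in pyg_whl_index
print(locked_torch, url_matches)
```

Wired into CI as an early assertion, this turns a silent wheel mismatch into an immediate, explicit failure when the torch pin is bumped.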

Collaborator

@pzharrington pzharrington left a comment

Natten tweaks look fine to me
