Ci deps group #1634

Open

coreyjadams wants to merge 7 commits into main from ci-deps-group

Conversation

@coreyjadams
Collaborator

This PR brings our GitHub CI in line with Blossom CI.

I had to tweak a few things to get tests to pass, so a few folks are getting pulled in as code owners. Please review the changes.

@pzharrington I'm tagging you specifically because I had to tweak the natten tests for Flex attention backward on CPU. I don't know why that passes on Blossom and not GitHub...

PhysicsNeMo Pull Request

Description

Checklist

Dependencies

Review Process

All PRs are reviewed by the PhysicsNeMo team before merging.

Depending on which files are changed, GitHub may automatically assign a maintainer for review.

We are also testing AI-based code review tools (e.g., Greptile), which may add automated comments with a confidence score.
This score reflects the AI's assessment of merge readiness; it is not a qualitative judgment of your work, nor an indication that the PR will be accepted or rejected.

AI-generated feedback should be reviewed critically for usefulness.
You are not required to respond to every AI comment, but they are intended to help both authors and reviewers.
Please react to Greptile comments with 👍 or 👎 to provide feedback on their accuracy.

Address the failures observed after enabling the full extras surface
in CI:

* Drop sparse_dot_mkl entirely.  bsms.py now uses scipy CSR `@` for
  the adjacency-squaring step, which is equivalent to the previous
  dot_product_mkl call without the libmkl_rt runtime dependency.
  Removed from the CI install step, Dockerfile, install-hint registry,
  the four bsms test decorators, and the aero_graph_net README.
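The adjacency-squaring swap can be illustrated with a small sketch (toy graph, not the actual bsms.py code): squaring a CSR adjacency matrix with scipy's native `@` gives the same 2-hop reachability counts that `dot_product_mkl(adj, adj)` produced, with no MKL runtime needed.

```python
import numpy as np
import scipy.sparse as sp

# Toy adjacency for a 4-node path graph: 0-1-2-3 (symmetric edges).
rows = np.array([0, 1, 1, 2, 2, 3])
cols = np.array([1, 0, 2, 1, 3, 2])
adj = sp.csr_matrix((np.ones(len(rows)), (rows, cols)), shape=(4, 4))

# Squaring the adjacency matrix counts 2-hop walks between nodes;
# previously dot_product_mkl(adj, adj), now plain CSR `@`.
adj2 = adj @ adj

# Node 0 reaches node 2 in exactly two hops (via node 1), so the
# (0, 2) entry is nonzero; (0, 3) needs three hops, so it stays zero.
print(adj2[0, 2], adj2[0, 3])
```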

* Bump container /dev/shm to 2 GiB.  DALI's multiprocess worker pool
  exhausts the docker default (64 MiB) via SemLock allocations, which
  surfaces as `OSError: No space left on device` in datapipes tests.
  Applied to all 5 container blocks across github-pr.yml and
  github-nightly-uv.yml.

* Force-reinstall the PyG-ecosystem wheels (torch_scatter,
  torch_sparse, torch_cluster, pyg_lib) from the PyG wheel index
  matching the locked torch version.  Without CUDA toolchain visible
  at uv-sync time, uv lands the CPU-only source build, which makes
  FigConvNet (and any model calling segment_csr / similar on CUDA
  tensors) fail with "Not compiled with CUDA support".  URL is held
  in PYG_WHL_INDEX so the torch-version coupling lives in one spot.

* Narrow the natten test_backward skip.  The previous device=="cpu"
  early-skip was too broad; now we wrap the forward call and only
  skip on the specific NotImplementedError raised by FlexAttention's
  CPU-backward guard.  If natten picks a different backend (or
  FlexAttention ever supports CPU backward), the test will run.
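The narrowed skip can be sketched as follows. This is a minimal illustration of the pattern, not the actual test code: `run_forward_or_skip`, `forward_fn`, and the `skip` callback are hypothetical names standing in for the natten na1d/na2d/na3d call and pytest.skip.

```python
def run_forward_or_skip(forward_fn, skip):
    """Run the forward call; skip only on FlexAttention's CPU-backward guard."""
    try:
        return forward_fn()
    except NotImplementedError as err:
        # Only the specific CPU-backward limitation is a skip; any other
        # NotImplementedError is a real failure and should propagate.
        if "backward" in str(err).lower():
            skip(f"skipped: {err}")
            return None
        raise

def fake_flex_forward():
    # Simulates the error FlexAttention's CPU-backward guard raises.
    raise NotImplementedError("FlexAttention does not support backward on CPU")

skipped = []
run_forward_or_skip(fake_flex_forward, skipped.append)
print(skipped)  # the skip reason was recorded instead of failing the test
```

If natten picks a different backend, `forward_fn` returns normally and the test body runs; an unrelated NotImplementedError re-raises and still fails the test.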
@copy-pr-bot

copy-pr-bot Bot commented May 11, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@greptile-apps
Contributor

greptile-apps Bot commented May 11, 2026

Greptile Summary

This PR aligns GitHub CI with the internal Blossom CI environment by expanding installed extras, adding CI-only test dependency installation (PyG wheels, moto, scikit-image, etc.), increasing /dev/shm to 2 GiB for DALI tests, and adding Xvfb for VTK/pyvista headless rendering. It also removes the sparse_dot_mkl/libmkl_rt runtime dependency in favour of scipy's native CSR @ operator, fixes a long-standing double-invocation bug in the check_ort_install decorator, adds the missing physicsnemo/deploy/__init__.py, and works around natten's FlexAttention CPU backward limitation in tests.

  • sparse_dot_mkl removal: BistrideMultiLayerGraph now uses adj_mat @ adj_mat on a scipy CSR matrix, which is functionally equivalent for BFS-hop reachability and removes the MKL runtime dependency from the Dockerfile and all @requires_module guards.
  • check_ort_install decorator fix: Removes the pre-existing bug where the wrapped function (get_ort_session, run_onnx_inference) was called twice per invocation; return value was previously discarded on the first call.
  • Natten CPU backward workaround: Both test_natten.py and the Natten2DSelfAttention doctest now guard against "FlexAttention does not support backward on CPU" so tests skip gracefully rather than error on CPU-only runners.

Important Files Changed

Filename: Overview

.github/actions/bootstrap-cudnn-ci/action.yml: Adds Xvfb headless display support and rendering packages (libgl1, xvfb, etc.) for VTK/pyvista tests; reorganises apt packages into toolchain/rendering groups with clear comments.
.github/actions/setup-uv-env/action.yml: Adds a new "Install CI-only test dependencies" step that layers moto, scikit-image, and PyG wheels on top of the uv lockfile; hardcodes the torch-2.11.0+cu128 wheel index URL, which must be bumped in lockstep with the torch pin.
.github/workflows/github-nightly-uv.yml: Expands EXTRAS_TAG to the full feature set, bumps the UV cache key prefix to force a rebuild, adds --shm-size=2g to all three job containers.
.github/workflows/github-pr.yml: Mirrors the nightly changes: expands EXTRAS_TAG, bumps the UV cache key prefix, adds --shm-size=2g to job containers.
physicsnemo/deploy/onnx/utils.py: Fixes a pre-existing double-call bug in the check_ort_install decorator; switches the ort import to the OptionalImport lazy pattern with an .available check.
physicsnemo/datapipes/gnn/bsms.py: Replaces sparse_dot_mkl.dot_product_mkl with the scipy CSR @ operator, removing the libmkl_rt runtime dependency entirely.
physicsnemo/deploy/__init__.py: New empty __init__.py (license header only) making physicsnemo.deploy a proper package.
test/nn/functional/test_natten.py: Wraps na1d/na2d/na3d calls in try/except to skip when natten picks the FlexAttention backend on CPU (which does not support backward with requires_grad inputs).
physicsnemo/nn/module/dit_layers.py: Wraps the Natten2DSelfAttention doctest in torch.no_grad() to avoid the FlexAttention CPU backward error in doctest runners.
Dockerfile: Removes sparse-dot-mkl from the CI-only install layer.
physicsnemo/core/version_check.py: Removes sparse_dot_mkl from the _PACKAGE_HINTS install-hint registry.
test/datapipes/test_bsms.py: Removes sparse_dot_mkl from the @requires_module guard on all three BSMS test functions.
test/models/meshgraphnet/test_bsms_mgn.py: Removes sparse_dot_mkl from the @requires_module guard on all four test functions.

Comments Outside Diff (1)

  1. physicsnemo/deploy/onnx/utils.py, lines 36-47

    P2 Pre-existing double-call bug fixed here

    Worth calling out explicitly: before this PR the decorator called func(*args, **kwargs) twice — once discarding the return value, then again with return. Every call to get_ort_session created and discarded a first ORT session, and every call to run_onnx_inference ran inference twice. This PR fixes both by removing the redundant bare call. Good catch.
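The bug class is easy to reproduce in isolation. A minimal sketch (illustrative decorator names, not the PhysicsNeMo code): a decorator that calls the wrapped function once with the result discarded, then again for the return value, doubles every side effect of the call.

```python
import functools

calls = {"n": 0}

def check_install_buggy(func):
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        func(*args, **kwargs)           # bug: result discarded, work done once for nothing
        return func(*args, **kwargs)    # second call produces the actual return value
    return wrapper

def check_install_fixed(func):
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        return func(*args, **kwargs)    # fix: single call, result returned directly
    return wrapper

@check_install_buggy
def make_session_buggy():
    calls["n"] += 1                     # stands in for creating an ORT session
    return "session"

make_session_buggy()
buggy_calls = calls["n"]                # two sessions created per invocation

calls["n"] = 0

@check_install_fixed
def make_session_fixed():
    calls["n"] += 1
    return "session"

make_session_fixed()
fixed_calls = calls["n"]                # one session created per invocation
print(buggy_calls, fixed_calls)
```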

Reviews (1): Last reviewed commit: "Flex attention fix for doctests"

Comment on lines +197 to +211
      # wheels matching the locked torch version.
      - name: Install CI-only test dependencies
        shell: bash
        env:
          UV_LINK_MODE: copy
          PYG_WHL_INDEX: "https://data.pyg.org/whl/torch-2.11.0+cu128.html"
          # cuml-cu12 -> cudf -> numba caps numpy at <=2.2; uv.lock pins
          # numpy 2.2.6 under the cu12 extra precisely for this reason.
          # Without an explicit constraint here, uv pip install resolves
          # transitive deps (e.g. tensorstore, pyarrow) freely and bumps
          # numpy to 2.4.x, which crashes every test that touches cuml.
          # Re-applying the cap on each layered install keeps numba happy.
          NUMPY_PIN: "numpy<2.3"
        run: |
          set -euo pipefail
P2 Hardcoded torch wheel index URL must track the torch pin

PYG_WHL_INDEX encodes torch-2.11.0+cu128 literally. If the torch pin in uv.lock is ever bumped, this URL silently serves incompatible wheels (or 404s for the new version) until it is also updated. The inline comment already warns about this, but there is no automated guard (e.g., a CI step that extracts the locked torch version and verifies the URL matches) to catch drift. Consider whether a brief assertion or a shared variable derivation from the lockfile would be feasible to make the coupling more explicit.
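The suggested drift guard could look like the following sketch. The lockfile snippet and URL here are illustrative stand-ins; a real CI step would read the actual uv.lock (TOML) and could use tomllib rather than a regex.

```python
import re

# Hypothetical excerpt of a uv.lock entry for torch (illustrative only).
uv_lock_snippet = '''
[[package]]
name = "torch"
version = "2.11.0"
'''

# The hardcoded wheel index from the CI step under discussion.
pyg_whl_index = "https://data.pyg.org/whl/torch-2.11.0+cu128.html"

# Extract the locked torch version from the lockfile text.
match = re.search(r'name = "torch"\s*\nversion = "([^"]+)"', uv_lock_snippet)
locked_torch = match.group(1)

# Fail the CI step early if the hardcoded index URL drifts from the pin.
url_matches = f"torch-{locked_torch}+" in pyg_whl_index
print(locked_torch, url_matches)
```

Wired into CI as an early assertion, this turns a silent wheel mismatch into an immediate, explicit failure when the torch pin is bumped.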

Collaborator

@pzharrington pzharrington left a comment

Natten tweaks look fine to me
