Skip to content

[DO-NOT-MERGE][MGPU][TEST] Integration bundle: device kwarg + kitless newton + bounded shutdown + Kit-pin#5934

Draft
hujc7 wants to merge 90 commits into
isaac-sim:developfrom
hujc7:jichuanh/mgpu-integration-diagnostic
Draft

[DO-NOT-MERGE][MGPU][TEST] Integration bundle: device kwarg + kitless newton + bounded shutdown + Kit-pin#5934
hujc7 wants to merge 90 commits into
isaac-sim:developfrom
hujc7:jichuanh/mgpu-integration-diagnostic

Conversation

@hujc7
Copy link
Copy Markdown
Collaborator

@hujc7 hujc7 commented Jun 3, 2026

Purpose

Diagnostic-only integration PR. Bundles every fix from the multi-GPU PR chain
onto a single branch and runs the multi-GPU CI workflow end-to-end against the
two failure-prone files (newton + physx test_articulation) to confirm the
chain as a whole is green.

DO NOT MERGE. Each individual fix has its own PR; this branch is for
A/B validation only.

What's bundled

Workflow wiring

Test narrowing (TEMP)

The discover step now keeps only test_articulation.py (newton + physx
variants) so each CI iteration is bounded by the file that triggers the
upstream hang. Reverts to the full discovered set before any merge.

What we expect to see

  • Both newton AND physx test_articulation run on cuda:1/2/3
  • 3 shards complete in ~10-15 min each
  • No SIGHUP, no shutdown_hang, no "Stage already attached"
  • If ISAACLAB_FORCE_EXIT_TIMEOUT=30 triggers (it shouldn't with PIN_KIT_GPU
    in place), we see the new diagnostic line [isaaclab.app] ISAACLAB_FORCE_EXIT_TIMEOUT=30s expired during atexit_close; libc.kill(self, SIGKILL)

hujc7 added 30 commits May 26, 2026 23:10
Re-enables the pull_request trigger on test-fabric-multi-gpu.yaml and
wires it to run the FabricFrameView contract tests with
ISAACLAB_TEST_MULTI_GPU=1, which activates the three cuda:1
-parameterised tests added in isaac-sim#5514.

The cuda:1 tests target FabricFrameView's SelectPrims path on non-zero
CUDA device indices.  They currently hang indefinitely on real
multi-GPU hardware (reproduced locally on 3x RTX 6000 Pro Blackwell
and on the multi-GPU runner pool); the 60-min workflow timeout will
cancel the job and surface the regression in CI for the
FabricFrameView maintainers.

Install pipeline matches isaac-sim#5738's proven-working layout:
- Pin Python 3.12 via SHA-pinned actions/setup-python.
- Pre-install cmake via pip to skip install.py's sudo apt-get branch.
- ./isaaclab.sh --install none (core only, avoids egl_probe libEGL).
- pip install isaacsim[all,extscache]==${vars.ISAACSIM_BASE_VERSION
  || '6.0.0'} --extra-index-url https://pypi.nvidia.com.
- Bypass Kit's interactive EULA via OMNI_KIT_ACCEPT_EULA / ACCEPT_EULA
  / ISAAC_SIM_HEADLESS.

Status: this PR is expected to fail with the 60-min workflow timeout.
Land once the underlying hang in fabric_frame_view.py is fixed.
Adds a single helper, cuda_test_devices(), that converts a 3-position
device mask (env-var ISAACLAB_TEST_DEVICES, default '110') into the
list of device strings tests parametrize over.  Single-GPU CI sees no
change (default mask '110' resolves to [cpu, cuda:0], identical to the
hardcoded lists tests carry today).  The new multi-GPU-pytest workflow
sets ISAACLAB_TEST_DEVICES=001 so migrated tests run on cuda:1 only.

Mask grammar: each position is 0 or 1, optional trailing X expands to
all remaining positions. Position 0 -> cpu; position k>=1 -> cuda:{k-1}.
Strict mode raises on missing devices; non-strict returns empty for
opt-in tests that should skip on hosts that can't satisfy them.

P0 migration (pure-Python utility tests, no Kit):

* source/isaaclab/test/utils/test_math.py: 45 parametrize sites +
  2 inline for-loops migrated.
* source/isaaclab/test/utils/test_wrench_composer.py: 37 sites.
* source/isaaclab/test/utils/test_episode_data.py: 5 sites.

Each migrated site replaces a hardcoded [cpu, cuda:0] (or the reversed
or tuple form) with cuda_test_devices().  Migration is additive - one
import line per file plus the inline edits.  No test logic changes.

Workflow: .github/workflows/test-multi-gpu-pytest.yaml runs on the
[self-hosted, ..., multi-gpu] pool with ISAACLAB_TEST_DEVICES=001.
Triggered on changes to the helper, the P0 test files, or the
workflow itself.

Excluded scope (to follow up after CI validates this MVP):

* P1 light-Kit tests (test_simulation_context, test_views_xform_prim,
  test_newton_model_utils, test_views_xform_prim_newton).
* P2 asset tests (test_articulation / test_rigid_object on physx and
  newton backends).
* FabricFrameView cuda:1 tests (PR isaac-sim#5514) - separate path, the
  SelectPrims deadlock there is tracked independently.

Reverts the fabric-specific .github/workflows/test-fabric-multi-gpu.yaml
edits that were carried on this branch from the earlier PR scope; that
demo is independent of this framework work.
Migrate 16 additional test files (P0 extras + P1 + P2 + P3) to call
cuda_test_devices() in their device parametrize, covering ~280 sites
across articulation/rigid-object/rigid-object-collection/sim/sensors
suites for physx, newton, and ovphysx backends.

Rewrite the workflow's run step to auto-discover any test file calling
cuda_test_devices() via grep, so new opt-ins land without workflow
edits. Files are split into a pure-Python pytest session and per-file
Kit-bound invocations (Kit is a process-wide singleton). A hardcoded
SKIP list parks the known-broken FabricFrameView cuda:1 path.

Per-Kit-file timeout 600 bounds any single hang at 10 minutes so the
job surfaces all failing files rather than blocking on the first.
pytest is not pulled in by --install none or by isaacsim[all,extscache].
Runner state was masking this; pin it explicitly.
flaky and pytest-mock are declared in source/isaaclab/setup.py
install_requires but pip's resolver was silently skipping them when
combined with the pytorch/nvidia extra-index urls in the install step.
Pin them explicitly so the multi-GPU runner is runner-state independent.

SKIP four newton test files that the cuda:1 cold-runner surfaces as
broken (test_contact_sensor hits a pre-existing measure_total kwarg
bug; test_articulation segfaults; test_rigid_object_collection and
test_views_xform_prim_newton have cuda:1 specific failures).  They're
still parametrized via cuda_test_devices() so single-GPU CI continues
to cover cpu+cuda:0.

Accept pytest exit code 5 (no tests collected) so module-level
pytestmark skips (e.g. backend-availability gates in ovphysx) and
device-only parametrize that resolves to [] on incompatible hosts both
count as success.
Four additional test files surfaced cuda:1-specific failures or hangs
on the multi-GPU runner:

* test_simulation_context: passes test_init[cuda:1] then hangs the
  next parametrize variant, 10-min per-file timeout fires.
* newton/test_rigid_object: 41 cuda:1 failures (out of 54).
* physx/test_rigid_object: passes test_initialization[cuda:1-1] then
  hangs at [cuda:1-2] (env-count 2 on cuda:1).
* physx/test_rigid_object_collection: same hang signature.

They keep their cuda_test_devices() parametrize so single-GPU CI
continues to exercise cpu+cuda:0; only multi-GPU CI skips them
pending separate investigation.
When the caller doesn't pass device= explicitly, AppLauncher now
falls back to the ISAACLAB_SIM_DEVICE env var (if set) instead of the
hardcoded cuda:0 default. Kit's active_gpu / physics_gpu are
process-global and locked after SimulationApp init, so per-test
parametrize alone cannot retarget GPU selection once the app is up.
Boot-time alignment is the only path that works.

The multi-GPU pytest workflow now sets ISAACLAB_SIM_DEVICE=cuda:1
alongside ISAACLAB_TEST_DEVICES=001, so PhysX and Warp pin to cuda:1
from process start.

Drops 7 entries from the SKIP list (5 cuda:1 hangs around
active_gpu/cuda mismatch + 2 newton suites likely sharing the same
root cause). Remaining SKIPs:
* FabricFrameView (usdrt SelectPrims cuda:0-only, upstream Kit)
* newton/contact_sensor (Newton PR isaac-sim#2135 measure_total rename, needs
  caller update in newton_manager.py — tracked separately).
Round-5 CI confirmed the AppLauncher fix unblocks test_simulation_context
(was hanging at second parametrize, now 42 passed in 62 s). Other
files surfaced separate, non-AppLauncher root causes that need
independent fixes:

* Newton suites (4 files): Warp allocator failure inside
  mujoco_warp.collision_driver on cuda:1. Reproduces locally on a
  3-device MIG host; root cause is the Warp/mujoco_warp interaction,
  not AppLauncher routing.
* PhysX suites (3 files): hang at test_initialization[cuda:1-2] only
  on the AWS multi-GPU runner. Passes in 11 s locally with the same
  code, so the hang is runner-specific (L40 driver / peer access /
  PCIe topology), not an IsaacLab bug.

test_simulation_context stays in scope (the AppLauncher fix made it
pass deterministically). FabricFrameView usdrt and contact_sensor
Newton API rename remain in SKIP for their pre-existing root causes.
Replaces the pip-install path with ECR pull of the same isaac-lab CI
image used by build.yaml.  ECR auths via the runner EC2 instance's
IAM role (no nvcr.io credentials required at PR-time), so fork PRs
work without exposing NGC_API_KEY.

Benefits:
* Newton 1.2+ pre-installed in the image, fixing the contact_sensor
  measure_total kwarg mismatch without a manual pip pin.
* Eliminates the 9 min cold pip-install step (image pull from ECR
  is tens of seconds when cached).
* Dep matrix matches single-GPU CI exactly, so both gates surface
  the same kind of dep skew.

If ECR cache misses (e.g. build.yaml hasn't completed first), the
action falls back to local build; that path is slow and requires
NGC_API_KEY.  Validating ECR auth on the multi-gpu runner pool is
the primary goal of this commit — drops contact_sensor from SKIP to
test that the version skew is resolved.
The previous attempt missed:
1. The isaac-sim base image has an ENTRYPOINT that launches Kit, so
   bash -c '...' was passed as Kit's argv (Kit booted, my script
   never ran). Mirror run-tests action: --entrypoint bash +
   -c '...'.
2. tools/conftest.py's pytest_ignore_collect returns True for
   every file (subprocess-per-test runner), so pytest collects 0
   items and exits. Pass --ignore=tools/conftest.py
   --ignore=source/isaaclab/test/install_ci, same as run-tests does.
ecr-build-push-pull only pulls locally on the exact-cache-hit path.
On deps-cache-hit (registry-side alias) the image isn't local, so
docker run fails with 'Unable to find image isaac-lab-ci:... locally'
followed by an unauthed Docker Hub pull attempt.  Explicit pull via
the ECR URL covers all paths uniformly.
The ecr-build-push-pull action cleans up its temp DOCKER_CONFIG after
running, so the docker login from inside the action does not persist.
Re-authenticate via aws ecr get-login-password (works via the runner
EC2's IAM role, no AWS creds in the workflow).
Runner's default ~/.docker/config.json declares a credential helper
that fails with 'not implemented'.  Mirror the same workaround the
ecr-build-push-pull action uses: drop a fresh config with credsStore
set to empty string, then docker login + pull work.
Image's default USER is isaaclab (uid 1000), which doesn't own the
volume-mounted host workspace, so it can't ln -s _isaac_sim (perm
denied) — falling back to PATH python3 which doesn't exist in the
image, hence pytest exit 127.
Running container as host uid:gid means the image's default /root
home is not writable, so Warp/numpy/pip cache writes hit
PermissionError [Errno 13] '/root/.cache'. Mount a fresh tmp dir
and point HOME + XDG_CACHE_HOME at it.
The docker image properly installs ovphysx, so module-level
pytestmark.skipif now triggers (no backend init at multi-gpu)
collecting 0 items / 1 skipped. isaaclab.sh's CLI wrapper translates
that exit-5 to exit-1, breaking the workflow's is_ok() check. Skip
the 3 ovphysx files here.
Per ~/.claude/skills/pr/ci-iteration-shortcut.md.  All gated Docker +
Tests jobs (single-GPU build/test matrix) skip via their existing
if-gate. Revert before final review.

PR 5823 iterates the multi-GPU pytest docker conversion; the heavy
single-GPU matrix adds no signal to that work and costs 30+ runner
minutes per push.
Replaces the 237-line custom workflow with the ~100-line shape used by
single-GPU test jobs: pull image via ecr-build-push-pull, run pytest
in container via run-package-tests + run-tests, let tools/conftest.py
handle the Kit-singleton subprocess-per-test pattern.

The 8 docker-runtime quirks I worked around in the previous attempt
(ENTRYPOINT, conftest ignore, deps-cache pull, DOCKER_CONFIG cleanup,
credsStore, uid mismatch, HOME, exit-5 propagation) are all already
handled inside the run-tests action. No reinvention.

Adds one input to run-tests + run-package-tests: extra-env-vars
(multiline KEY=value), used here to inject ISAACLAB_TEST_DEVICES and
ISAACLAB_SIM_DEVICE so the container's pytest parametrize and Kit
boot align on cuda:1.

Test scope: 9 opt-in basenames covering ~512 cuda:1 tests, same
discovery scope as the previous attempt minus the 11 SKIPped files.
ecr-build-push-pull's deps-cache-hit path only creates a registry-side
alias (no local pull). Without a prior build job that establishes the
exact-commit tag in ECR, the test job's internal ecr-build-push-pull
hits exact-cache-miss + deps-cache-hit and leaves no local image, so
docker run fails with 'pull access denied'.

Mirrors the build → test split that build.yaml already uses for the
single-GPU matrix. Build job pre-populates the exact tag (via
buildx imagetools create on deps-cache-hit, or full build on miss);
test job's inner ecr-build-push-pull then hits exact-cache-hit and
pulls locally via the action's existing 'Pull exact image' step.
* Re-applies the run_docker_tests='false' guard in build.yaml's
  changes job (per pr/ci-iteration-shortcut) so the single-GPU
  Docker + Tests matrix skips during this iteration.
* Adds test_views_xform_prim_fabric.py to the multi-GPU include-files
  list. Previously SKIPped because pip-install rounds hung on a
  usdrt SelectPrims cuda:1 deadlock; the docker image carries a
  newer Kit, so the cuda:1 path is worth re-validating here.

Must be reverted before final review.
The docker image's newer Kit resolves the usdrt SelectPrims cuda:1
deadlock that previously kept this file in the SKIP list (pip-install
rounds hit it). Run 26587461494 passed: 36 passed, 3 skipped, 2
xfailed for test_views_xform_prim_fabric.py.

This also restores build.yaml's changes detection (drops the temp
TEMP-iteration skip).
Per ~/.claude/skills/pr/ci-iteration-shortcut.md. Keep the single-GPU
Docker + Tests matrix disabled until iteration is over and the PR is
ready to land. Revert as the last commit before merge.
Folds the helper into the existing isaaclab.test subpackage shape
(sibling of isaaclab.test.benchmark, isaaclab.test.mock_interfaces)
under a new isaaclab.test.utils subpackage. Drops the standalone
isaaclab.testing folder, which was a new top-level namespace with no
precedent.

Import path: from isaaclab.test.utils import cuda_test_devices.
Implements Greptile P2.1 + P2.2 and consolidates the device-skip
mechanism inside test files so the workflow needs no opt-in or
opt-out edits.

API:
* cuda_test_devices() default strict=False — CPU-only dev hosts now
  collect the cpu variant cleanly instead of failing at pytest
  collection (Greptile P2.1).
* cuda_test_devices(skip={device: reason}) — wraps unsupported
  variants in pytest.param(..., marks=pytest.mark.skip(reason=...))
  so pytest still collects them and shows SKIPPED with the reason in
  CI output. Per-call granularity; reason co-located with the test.

Workflow:
* Auto-discovery via grep for cuda_test_devices callers; no SKIP list
  in the workflow. Adding/removing a test from multi-GPU scope is a
  test-file-only edit.

run-tests action:
* extra-env-vars parser now skips only full-line comments (no mid-line
  # stripping) and doesn't xargs-collapse whitespace (Greptile P2.2).

Test file migrations:
* 7 previously-SKIPped files (4 newton + 3 physx) now declare a
  module-level _CUDA_1_BROKEN dict with a tracking-issue URL and apply
  cuda_test_devices(skip=_CUDA_1_BROKEN) per parametrize site.
* test_views_xform_prim_fabric.py migrated to the helper too (was
  using the legacy ISAACLAB_TEST_MULTI_GPU env var pattern), so
  auto-discovery picks it up.
Earlier rounds SKIPped 7 newton+physx files based on failures observed
on the pip-install path or pre-docker rounds. The docker image carries
newer Kit + Newton + Warp that already resolved 2 other categories
(measure_total kwarg, FabricFrameView usdrt deadlock). Re-running the
cuda:1 variants of these 7 files to see which actually still fail on
the docker path.
Enable-all run (0118ea7) confirmed the docker image resolves the
PhysX hangs and FabricFrameView/contact_sensor failures that earlier
rounds suspected. Two narrow categories remain:

* 4 newton files — Warp/mujoco_warp init-order on cuda:1 (issue isaac-sim#5132).
  Same root cause across all four; gated via module-level
  _NEWTON_5132 dict.
* 1 PhysX test — test_rigid_body_no_friction[cuda:1-*] precision
  drift (1.8e-3 vs 1e-5 tolerance); gated via per-test
  _PHYSX_NO_FRICTION_CUDA1 dict.

Everything else previously SKIPped now runs and passes on cuda:1
(test_articulation, test_rigid_object except no_friction,
test_rigid_object_collection, test_views_xform_prim_fabric).
Add torch.cuda.set_device(device) + wp.set_device(device) at the
top of NewtonManager.start_simulation and initialize_solver
so mujoco_warp's collision pipeline allocates against an initialized
cuda:N primary context. Also pass device=device to the standard-
path wp.ScopedCapture (the relaxed-graph sibling already did this).

Local repro confirms: isaaclab_newton/test/assets/test_rigid_object.py
on cuda:1 was 41 failed; now 45 passed / 9 skipped / 0 failed.

Also fixes the test_rigid_body_no_friction tolerance branch in
both isaaclab_physx and isaaclab_newton test_rigid_object.py
files. The author already documented GPU non-determinism and set
tolerance = 1e-2 for cuda:0; the else branch fell through to
the CPU-tight 1e-5 on cuda:1, where PhysX's GPU integrator drift is
the same 1.8e-3 envelope. Gate on device.startswith('cuda') so
all cuda devices share the same loose tolerance.

Drops the temporary _NEWTON_5132 and _PHYSX_NO_FRICTION_CUDA1
skip dicts from the 5 test files now that the underlying bugs are
fixed.

Tracks: isaac-sim#5132.
hujc7 added 17 commits May 31, 2026 02:58
Restore the MULTI_GPU_SKIP_REASON marker on the physx variant only.
Newton test_articulation drops AppLauncher entirely via PR isaac-sim#5883, so it
runs cleanly under concurrent multi-GPU. The physx variant must still
boot Kit for omni.physics; under 3-shard concurrent CI runners (shared
GPU visibility) Kit's shutdown hangs >52s, causing SIGHUP cascade across
sibling shards and "Stage already attached" errors.

Cross-linked upstream at IsaacLab isaac-sim#3475 / OMPE-43816 (deferred past
Isaac Sim 5.0 per the engineering thread).
Bundles the kitless conversion of newton test_articulation +
test_rigid_object_collection into the dynamic-sharding branch so the
multi-GPU CI workflow actually exercises a non-Kit-booted newton
test_articulation alongside the physx skip. Will rebase away when isaac-sim#5883
lands.

Includes the universal schemas.py fix
(``_create_fixed_joint_to_world`` replaces unguarded
``omni.physx.scripts.utils.createJoint``) and the .skip changelog
fragments for the test-only packages.
Adds ``_capture_hang_stacks(pid, pgid, kill_reason)`` and calls it
from the hang-detection path (startup_hang / shutdown_hang / timeout)
before SIGKILL erases the evidence. Captures:

* ``py-spy dump --pid`` -> Python frames showing where Python code is
  parked inside ``app.close()`` or pytest teardown.
* ``gdb -batch -ex "thread apply all bt" -p`` -> C++ frames inside
  ``omniverse_kit`` / ``omni.physx.plugin`` / CUDA driver binaries.
  Critical because Kit core is closed source — without this we have no
  way to localize the hang in IsaacLab isaac-sim#3475 / OMPE-43816.

Walks the entire process group (capped at 8 pids) so any Kit extension
helper child that's the actual culprit is also dumped. Each tool is
optional: missing py-spy or gdb is reported inline rather than
failing the diagnostic capture.

No behavioral change to passing runs. Output lands in the same
``pre_kill_diag`` block that already gets attached to the JUnit error
report when a kill fires.
Two small additions to make ``tools/conftest.py``'s hang capture
actually work in CI:

* ``--cap-add=SYS_PTRACE`` on the per-shard ``docker run``: required
  for ``py-spy dump`` and ``gdb -p`` to attach to the hung Kit process.
  Without it both tools come back as "Permission Denied" (verified
  locally on a synthetic hung subprocess).

* ``py-spy`` added to the in-container ``pip install`` list so the
  capture function can find it on PATH. ``gdb`` is already present in
  the ECR image.

The capture is gated by the existing ``shutdown_hang`` /
``startup_hang`` / ``timeout`` detection in conftest, so on green runs
neither tool is invoked.
After dropping the cherry-pick of the kitless newton conversion to keep
this PR scoped to CI infra, the newton variant of test_articulation
once again boots Kit at module level and is subject to the same
concurrent-Kit shutdown hang / SIGHUP cascade as the physx variant.

Restore the ``MULTI_GPU_SKIP_REASON`` marker on the newton variant so
the multi-GPU discover-step filter excludes it. The marker comment
points at isaac-sim#5883, which
removes the AppLauncher boot from this file and lets the kitless
SimulationContext path carry the test. After isaac-sim#5883 lands and this PR
rebases on develop, the marker can be dropped in the same commit that
re-enables it.

Both newton and physx test_articulation are now consistently skipped
from multi-GPU; both still run in single-GPU CI.
The default ``apps/isaaclab.python.headless.kit`` sets
``renderer.multiGpu.enabled = true`` + ``renderer.multiGpu.autoEnable
= true``, so each Kit process enumerates every visible GPU at startup.
Under concurrent multi-GPU CI shards (``--gpus all`` per container, one
Kit per non-default cuda device), that produces a shared cubric /
PhysX-fabric GPU-interop context across sibling processes -- surfacing
as ``[Error] [omni.physx.plugin] Stage X already attached`` mid-test
and ``SimulationApp.close`` hanging >52s in teardown.

Tracked upstream at IsaacLab isaac-sim#3475 / NVBug 5687364. Kelly Guo's
documented WAR (#omni-kit thread, 2024-2025): set
``renderer.multiGpu.enabled = false`` + ``maxGpuCount = 1`` so each
Kit only touches its assigned GPU.

Adds opt-in ``ISAACLAB_PIN_KIT_GPU`` env var. When truthy, AppLauncher
appends three flags to the Kit command line:
- ``--/renderer/multiGpu/enabled=False``
- ``--/renderer/multiGpu/autoEnable=False``
- ``--/renderer/multiGpu/maxGpuCount=1``

Off by default; single-GPU and user-facing rendering paths are
unchanged. CI workflows that need bounded resource visibility set
``ISAACLAB_PIN_KIT_GPU=1`` on the runner.

Local validation: Blackwell hardware (current Horde) does not
reproduce the upstream hang due to MIG topology limitations (only 3
torch-visible cuda devices), so the change is shipped as a CI A/B
hypothesis test rather than a verified fix. The implementation is
small, opt-in, and reversible.
…p marker

Wires the new ``ISAACLAB_PIN_KIT_GPU`` env var (from the cherry-picked
mgpu-pin-kit-resources commit) into the per-shard ``docker run`` and
re-enables physx test_articulation in the multi-GPU lane by dropping
its MULTI_GPU_SKIP_REASON marker.

Direct CI A/B for Kelly Guo's documented WAR: if the upstream cubric
/ PhysX-fabric GPU-interop race on shared CUDA contexts is the trigger
for the 52s shutdown_hang + SIGHUP cascade observed in run 26698100037,
pinning each Kit to a single GPU should clear it. Three consecutive
green runs on the same SHA would confirm.
Most test callers pass both ``sim_cfg=`` and ``device=`` to
:func:`isaaclab.sim.build_simulation_context`, implicitly expecting the
``device`` kwarg to win. The helper previously dropped the kwarg silently
when ``sim_cfg`` was provided, causing warp kernel-launch device
mismatches on non-default GPUs: the test fixture allocated ``env_ids``
on the requested device while the articulation's ``self.device``
resolved from the untouched ``sim_cfg`` default (``cuda:0``), and
``wp.launch(..., device=self.device)`` failed with::

    RuntimeError: Error launching kernel 'set_root_link_pose_to_sim_index',
    trying to launch on device='cuda:0',
    but input array for argument 'env_ids' is on device=cuda:2.

Change ``device``'s default to ``None`` (sentinel) and apply it as an
override after sim_cfg construction in both branches. The one test that
asserted the old "sim_cfg overrides everything" contract is updated to
cover the new override semantics.
…EXIT_TIMEOUT=30

Integration PR speedup: skip every test file other than test_articulation
(both newton and physx variants) so the CI loop is bounded by the file
that actually triggers the upstream hang — single sample ~10 min instead
of ~22 min for the full set.

Also wires ``ISAACLAB_FORCE_EXIT_TIMEOUT=30`` into the per-shard docker
run so isaac-sim#5886's bounded-shutdown timer is exercised (was dormant in
prior runs because the env var was unset).

Revert before this PR can land. The narrowing is gated by a separate
``narrow`` step so reverting is a single-step delete.
@github-actions github-actions Bot added isaac-lab Related to Isaac Lab team infrastructure labels Jun 3, 2026
@hujc7 hujc7 changed the title [MGPU][DIAGNOSTIC] Integration bundle: device kwarg + kitless newton + bounded shutdown + Kit-pin (DO NOT MERGE) [DO-NOT-MERGE][MGPU][TEST] Integration bundle: device kwarg + kitless newton + bounded shutdown + Kit-pin Jun 3, 2026
…gration-diagnostic

# Conflicts:
#	source/isaaclab/isaaclab/app/app_launcher.py
Copy link
Copy Markdown

@isaaclab-review-bot isaaclab-review-bot Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Integration Bundle Review — Multi-GPU CI Infrastructure

This diagnostic PR bundles several interconnected fixes for multi-GPU CI. Since it's marked DO NOT MERGE and serves as A/B validation for the individual PRs in the chain (#5881, #5883, #5886, #5933), this review focuses on technical observations and potential improvements for when these land as individual PRs.


Overall Assessment

The architecture is well-designed: the scope ∩ runtime device selection model, dynamic work-stealing queue, Kit GPU pinning, and bounded shutdown form a coherent multi-GPU CI strategy. The commit history demonstrates rigorous iteration (reproducing failures, isolating causes, validating fixes). Several items below are worth addressing in the individual PRs before they land.


Findings

1. 🔴 Force-exit timer uses SIGKILL with no JUnit cleanup guarantee

File: source/isaaclab/isaaclab/app/app_launcher.py (lines ~70-100)

The _arm_force_exit_timer daemon thread issues libc.kill(os.getpid(), 9) (SIGKILL) after timeout. While the docstring notes "callers must have already written their JUnit XML before triggering close()", the atexit hook can fire before the test runner finishes writing XML:

def _atexit_close(app=self._app, force_exit_seconds=force_exit_seconds):
    if force_exit_seconds > 0:
        _arm_force_exit_timer(force_exit_seconds, "atexit_close")
    with contextlib.suppress(Exception):
        app.close()

If atexit fires during a normal exit sequence (not just a hang) and app.close() takes longer than the timeout, SIGKILL will terminate mid-JUnit-write. Consider either: (a) gating the timer behind a _is_hang_scenario flag only set during signal handling, or (b) flushing/closing the JUnit file descriptor before arming the timer in the atexit path.


2. 🟡 _claim_queued_file has a TOCTOU window on NFS/network mounts

File: tools/conftest.py (lines ~396-430)

The work-queue claim uses flock which works correctly on local filesystems and overlayfs mounts. However, if the CI runner's workspace is on NFS (or any filesystem where flock degrades to advisory-only), two shards could claim the same file. The current Docker volume mount strategy (-v "$queue_dir:/mgpu:rw") is a host-local tmpfs mount, so this is fine today — but worth documenting the assumption inline so a future refactor doesn't inadvertently break it.


3. 🟡 build.yaml has a hardcoded run_docker_tests: 'false' that must be reverted

File: .github/workflows/build.yaml (lines 77-79)

run_docker_tests: 'false'

The PR description and commit messages acknowledge this is temporary, but there's no CI check or failing assertion that would catch it if someone forgets. The individual PRs that carry pieces of this should not inherit this line. Consider adding a # TEMP: comment pattern that a pre-commit hook or CI lint checks for.


4. 🟡 ISAACLAB_SIM_DEVICE env var override in AppLauncher may surprise users

File: source/isaaclab/isaaclab/app/app_launcher.py (lines ~1093-1100)

if not device_explicitly_passed:
    env_device = os.environ.get("ISAACLAB_SIM_DEVICE")
    if env_device:
        device = env_device
        launcher_args["device"] = env_device

This silently changes the boot device for any script that doesn't pass device= explicitly to AppLauncher. A user who sets this env var for one workflow but then runs interactive scripts in the same shell may get unexpected behavior. The design is correct for CI but warrants mention in the changelog/docs that this env var is for CI automation, not general use. The existing changelog fragment doesn't call out this "ambient override" characteristic clearly.


5. 🟡 _create_fixed_joint_to_world may produce incorrect joint placement for non-identity parent transforms

File: source/isaaclab/isaaclab/sim/schemas/schemas.py (lines ~137-192)

The joint's LocalPos0 and LocalRot0 (world-frame body0 pose) are set from the articulation prim's world transform. This matches the existing omni.physx.scripts.utils.createJoint behavior for the single-selection case. However, the original helper also handles edge cases like prims with non-default pivot points and USD timeSampled transforms. The new implementation uses UsdGeom.XformCache() with no explicit time argument, defaulting to Usd.TimeCode.Default(). If any downstream caller ever sets time-varying transforms before calling modify_articulation_root_properties, the joint will be placed at the default time's pose. This is unlikely to matter today but is a subtle semantic narrowing.


6. 🟡 build_simulation_context default change is a breaking API change

File: source/isaaclab/isaaclab/sim/simulation_context.py (lines ~960-1007)

Changing device: str = "cuda:0" to device: str | None = None is a public API change. Callers that relied on the previous default getting "cuda:0" when passing sim_cfg=None will now get SimulationCfg's default device (which happens to also be "cuda:0" today, but if SimulationCfg ever changes its default, behavior diverges). The migration is safe right now but should be noted as a breaking change in the changelog fragment.


7. 🟢 Newton torch.cuda.set_device + wp.set_device pins are correct but could race

File: source/isaaclab_newton/isaaclab_newton/physics/newton_manager.py (lines ~979-990, ~1255-1265)

The device pinning in start_simulation and initialize_solver is the right fix for issue #5132. The import torch inside the method body is fine (avoids import-time side effects), and pinning both torch and warp before allocations is necessary. One minor concern: if start_simulation and initialize_solver could theoretically be called from different threads (unlikely in Newton's architecture but worth noting), the global torch.cuda.set_device is process-wide and not thread-safe. A CUDA context per-thread approach would be more robust but is overkill for the current single-threaded usage.


8. 🟢 DIAGNOSTIC: test-multi-gpu-pytest.yaml narrow step should be clearly gated

File: .github/workflows/test-multi-gpu-pytest.yaml (step "TEMP narrow to test_articulation only")

The narrow step overrides the discovered file list with only test_articulation.py variants. This is correctly labeled as diagnostic/temp and should be removed before the workflow is used in production. The implementation is clean — it outputs to narrow.outputs which is consumed downstream, making revert straightforward.


9. 🟢 Signal handler robustness: sys.exit() in signal handler is safe here but non-standard

File: source/isaaclab/isaaclab/app/app_launcher.py (lines ~1510-1528)

def _abort_signal_handle_callback(self, signum, frame):
    ...
    with contextlib.suppress(Exception):
        self._app.close()
    sys.exit(128 + signum)

Calling sys.exit() from a signal handler works in CPython (it raises SystemExit which unwinds normally) but is technically not async-signal-safe. In practice this is fine because Python signal handlers run between bytecodes (not during C code), and the previous behavior (returning from the handler without exiting) was strictly worse. The design is correct.


Minor Nits

  • test_views_xform_prim.py: Several tests still have a manual if device == "cuda" and not torch.cuda.is_available(): pytest.skip(...) guard that's now redundant with test_devices() handling — the helper already excludes unavailable devices. These guards could be cleaned up in a follow-up.

  • _expand mask grammar edge case: A mask like "0001" on a 3-device host truncates to [False, False, False] (per the test test_expand). This is intentional (documented) but slightly surprising — a runtime mask pointing past the host's capacity silently becomes empty rather than raising. The ValueError in test_devices() catches the "no devices available" case downstream, so this is fine but worth documenting in the mask grammar section.


Summary

Severity Count
🔴 High 1
🟡 Medium 5
🟢 Low/Info 3

The PR achieves its diagnostic goal. The individual PRs should address the force-exit cleanup ordering (finding #1) and the build.yaml revert safeguard (finding #3) before landing. The build_simulation_context API change (finding #6) needs changelog coverage as a breaking change. Everything else is either informational or a follow-up improvement.


Update (commit fd55ced): The new commits revert several components that were part of the original review:

  • Finding #1 (Force-exit timer): Entirely removed — the _arm_force_exit_timer, _parse_force_exit_seconds, and ISAACLAB_FORCE_EXIT_TIMEOUT mechanism are gone. Original concern is moot.
  • Finding #4 (ISAACLAB_SIM_DEVICE override): Removed from _resolve_device_settings. Original concern is moot.
  • Finding #5 (_create_fixed_joint_to_world): Removed; reverted to omni.physx.scripts.utils.createJoint. Original concern is moot.
  • Finding #9 (Signal handler sys.exit): Handler reverted to bare self._app.close() without sys.exit(128 + signum). Original concern is moot.
  • The SIGHUP handler registration is also removed.
  • Changelog fragments for force-exit and kitless-newton are deleted.
  • Newton tests (test_articulation.py, test_rigid_object_collection.py) reverted to standard AppLauncher(headless=True) boot (no longer kitless).

New change in tools/conftest.py: The JUnit report filename now uses a full-path slug (report_slug = str(test_file).replace("/", "__")) instead of just file_name, preventing collisions when same-basename test files from different packages run concurrently. This is a correct and clean fix — no issues.

Remaining findings still applicable: #2 (flock/NFS assumption), #3 (build.yaml hardcoded false), #6 (build_simulation_context API change), #7 (Newton device pinning), #8 (narrow diagnostic step). These were not touched by the new commits.

⚠️ Regression note: The signal handler (_abort_signal_handle_callback) now only calls self._app.close() without exiting afterward. If a SIGTERM/SIGABRT is received, after close() returns, execution resumes from the interrupted frame — the process does not terminate. The pre-bundle behavior had the same issue, so this is not a regression introduced by this PR, but it is worth noting that finding #9's improvement (explicit exit after close) was removed along with the force-exit code.


Update (commit 8344a02): Single change in tools/conftest.py — adds os.makedirs("tests", exist_ok=True) before writing the full report XML. This prevents a FileNotFoundError when a shard claims zero files from the work queue (the tests/ directory would not have been created by per-test JUnit outputs). Clean, correct fix — no issues.

Remaining findings still applicable: #2 (flock/NFS assumption), #3 (build.yaml hardcoded false), #6 (build_simulation_context API change), #7 (Newton device pinning), #8 (narrow diagnostic step).


Update (a782ec0): Single change removes the "TEMP narrow to test_articulation only" diagnostic step from test-multi-gpu-pytest.yaml and re-wires INCLUDE_FILES/PATHS env vars to reference steps.discover.outputs directly.

  • Finding #8 (Diagnostic narrow step): Removed. The workflow now runs the full discovered test matrix. Clean removal, no issues.

Remaining findings still applicable: #2 (flock/NFS assumption), #3 (build.yaml hardcoded false), #6 (build_simulation_context API change), #7 (Newton device pinning).


Update (6ff3b04): Major refactor of test-multi-gpu-pytest.yaml — collapses the N-container-per-shard architecture into a single container hosting N parallel subshells. Key changes:

  • Shard count now derived dynamically from nvidia-smi -L inside the container (no more CUDA_SHARDS build output)
  • MIG-mode detection on the host gates CUDA_VISIBLE_DEVICES passthrough
  • Per-shard HOME (/tmp/isaaclab-ci-home-${cuda}) provides writable isolation without per-shard Docker volume mounts
  • Torch device-count cross-check caps shards to what's actually addressable

This is a clean simplification that eliminates cross-container race conditions on the shared workspace mount and ~30s of Docker init per shard. No new issues introduced. The queue/flock mechanism is unchanged.

Remaining findings still applicable: #2 (flock/NFS assumption — still correct since mount is host-local), #3 (build.yaml hardcoded false), #6 (build_simulation_context API change), #7 (Newton device pinning).


Update (fb7eb10): Adds container-level HOME=/tmp/mgpu-base-home and PYTHONUSERBASE=/tmp/mgpu-pyuserbase env vars to the Docker run, with corresponding mkdir -p. This fixes pip --user installs being invisible to per-shard subshells (which override HOME for cache isolation but inherit PYTHONUSERBASE). Clean fix for a real issue — the image runs as a host UID with no /etc/passwd entry, so HOME defaults to unwritable /root. No new issues.

Remaining findings still applicable: #2 (flock/NFS assumption), #3 (build.yaml hardcoded false), #6 (build_simulation_context API change), #7 (Newton device pinning).

hujc7 added 8 commits June 3, 2026 05:40
…me files

When two shards in the same multi-GPU CI run pick test files with the
same basename (e.g. isaaclab_newton/test/assets/test_articulation.py vs
isaaclab_physx/test/assets/test_articulation.py), the report path
'tests/test-reports-<basename>.xml' is identical for both. The shared
/workspace/isaaclab mount means both shards write to the same file —
and conftest's hang-detection logic at line ~137 watches
os.path.exists(report_file) to start the SHUTDOWN_GRACE_PERIOD timer.

When shard A (fast / mostly-skipped) writes the report file first, shard
B (still loading Kit + running tests) sees it appear and starts its own
30s grace timer prematurely. Result: shard B gets SIGKILLed mid-test
with kill_reason='shutdown_hang' even though tests are actively running.

Observed in run 26864307397: cuda:3 ran ovphysx test_articulation
(skipped in 2.32s) and wrote the report at 05:34:36. cuda:2 was loading
Kit for the physx variant; the path existed at 05:34:36; grace fired at
05:35:06 → killed pytest mid-test at 05:35:07 with 4 tests in progress.

Fix: derive report file slug from the full relative path (with / and
backslash replaced by __) so collisions across shards are impossible.
Per per-PR minimum-needed analysis:
- isaac-sim#5886 (bounded shutdown) is closed (audit verdict nice-to-have;
  isaac-sim#5933 prevents the hang upstream so the force-exit timer is moot).
  Reverts SIGHUP handler + ISAACLAB_FORCE_EXIT_TIMEOUT timer in
  AppLauncher; drops the workflow env var.
- isaac-sim#5883 (kitless newton) kept open as a separate PR but left out of
  this diagnostic bundle to test whether isaac-sim#5933 alone is enough for
  newton test_articulation (which calls
  build_simulation_context(sim_cfg=, device=) at line 2427, so still
  needs isaac-sim#5881 for the cross-device kwarg fix). Reverts the newton
  test_articulation kitless conversion and the schemas.py
  _create_fixed_joint_to_world helper.

Bundle now contains: isaac-sim#5823 + isaac-sim#5875 base + isaac-sim#5881 + isaac-sim#5933 + the JUnit
XML path-collision fix in conftest. If green, confirms only 4 PRs
are needed for multi-GPU CI green (with test_articulation un-gated).
When a shard's work queue is drained by siblings before it can claim
any file, run_individual_tests returns no xml_reports and tests/ is
never created (it's normally created as a side-effect of pytest
writing per-test JUnit XMLs). The final full_report write then fails
with FileNotFoundError in lxml.

Observed in run 26866196836 cuda:1: 'Found 3 test files after
filtering', then 'failed tests: []', then INTERNALERROR at
full_report.write line 1008.

Fix: os.makedirs('tests', exist_ok=True) before writing. Cheap.
The empty full_report is still meaningful — it confirms cuda:1
participated but had no work.
The narrow-to-test_articulation step skewed work distribution: with only
2-3 files in the queue, flock-based work-stealing gives 0/1/1 split
across 3 shards. cuda:1 ends up with no work; cuda:3 does all the real
testing alone. This obscures the actual multi-shard contention that the
diagnostic is meant to exercise.

Restoring the full discovered set (~10 test files) so each shard runs
3-4 files and we see the chain under realistic CI load. ~30 min wall
per run instead of ~10, but the data is much more meaningful.
Collapse the N-container per-shard fan-out into a single container
that hosts N pytest subshells as parallel siblings. Shard count
derives from nvidia-smi -L inside the container (truth — torch
under-counts MIG slices on the same parent GPU). CUDA_VISIBLE_DEVICES
is set only on MIG hosts; discrete-GPU runners use --gpus all alone.

Hygiene only:
- Single _isaac_sim symlink, single /mgpu queue mount, single docker
  init (~30s saved per shard on cold start).
- Per-shard HOME under /tmp/isaaclab-ci-home-<cuda> for .cache / .local
  / .nvidia-omniverse isolation.
- Container's /isaac-sim/kit/* and .nv/ComputeCache are shared across
  shards; previously per-shard. Acceptable because Kit's writable
  caches are append-mostly and first-write contention is small.

Not claimed: race-fix. PIN_KIT_GPU stays the validated mitigation;
this refactor is hygiene + a CI-side stability sample to come.
Previous push regressed: removed the global -e HOME=/tmp/... from
docker run, so the pre-fan-out pip install (before the per-shard
subshells override HOME) ran without a writable home → pip --user
fallback wrote to /root/.local → permission denied for the non-root
container user.

Set -e HOME=/tmp/mgpu-base-home at the container level. The path is
on tmpfs (1777, world-writable) so the host_uid user can create it.
Per-shard subshells still override to /tmp/isaaclab-ci-home-<cuda>
for test runs.

Detected on CI run 26875915794 attempt 1.
…hards

Previous push installed pip --user under HOME=/tmp/mgpu-base-home but
per-shard subshells override HOME to /tmp/isaaclab-ci-home-<cuda>, so
pytest's import path no longer sees the just-installed packages. Result:
ModuleNotFoundError: junitparser at tools/conftest.py load.

Pin PYTHONUSERBASE=/tmp/mgpu-pyuserbase at the container level so
pip --user writes there and every Python invocation in the container
imports from there, regardless of which HOME the calling subshell has.
Per-shard HOMEs still isolate .cache / .nvidia-omniverse / etc.

Smoke-tested locally with the same image: pip --user respects
PYTHONUSERBASE and subsequent import works with a different HOME.

Detected on CI run 26877916159.
The work queue is now a directory of one-file-per-test (slug-encoded
relative path). Shards claim entries by atomic os.rename(queue/X,
inflight/<shard>/X) — POSIX rename is kernel-serialized on the source
inode, so two shards racing on the same entry get exactly-one-winner
semantics without any lock file or read-rewrite. On a clean test exit
the runner moves the entry to done/<shard>/X, leaving anything still
in inflight/ at job-end as recoverable evidence of a crashed test.

Workflow gains a reconciler step that runs after the docker exits:
asserts queue/ + every inflight/<shard>/ are empty and exits non-zero
when they aren't, with a named list of the orphans. Closes the
silent-drop hole where a shard could pop from queue.txt and crash
before running, leaving no surface signal that a test was skipped.

Local 10/10 PASS on the new design (Blackwell-MIG, 3 shards, 14 files):
all attempts done=14, orphans=0.

ISAACLAB_TEST_QUEUE now names the queue ROOT directory (containing
queue/ + inflight/ + done/), not a queue.txt file path.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

infrastructure isaac-lab Related to Isaac Lab team

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant