[CI] Multi-GPU pytest: dynamic work-stealing + shutdown-hang stack capture by hujc7 · Pull Request #5875 · isaac-sim/IsaacLab

hujc7 · 2026-05-30T01:44:56Z


Summary

Adds dynamic work-stealing across multi-GPU pytest shards: a single job spawns N concurrent docker containers (one per non-default GPU), each pulls test files from a shared flock-locked work queue and runs them on its assigned GPU. Beats fixed round-robin sharding for wall time when test durations vary.
Auto-discovers opt-in test files (any test_*.py whose test_devices() scope advertises a non-default GPU) so adding a test to multi-GPU coverage needs no workflow edit.
Adds a file-level MULTI_GPU_SKIP_REASON opt-out mechanism for files with known concurrency-only failures; the two test_articulation.py files that currently carry the marker keep running in single-GPU CI.
Adds py-spy + gdb stack capture in tools/conftest.py on shutdown_hang / startup_hang / timeout so the upstream Kit shutdown hang (#3475) is observable in CI logs. Walks the process group; safe no-op when py-spy/gdb are missing.
Honors ISAACLAB_SIM_DEVICE env var in AppLauncher so the workflow can boot Kit on cuda:1+ without editing each test's AppLauncher(device=...) call site.

1. Scope

This PR is multi-GPU CI infrastructure only. The actual fixes that unblock the skipped tests live in companion PRs:

| Fix PR | What it does | When `[MGPU]` skip marker drops |
| --- | --- | --- |
| #5883: `[MGPU] Tests: make newton-only tests kitless (drop AppLauncher boot)` | Drops `AppLauncher` boot from newton-only tests; SimulationContext's `has_kit()` gate carries them. No Kit lifecycle exposure. | newton `test_articulation` |
| #5886: `[MGPU] App: bounded shutdown — SIGHUP handler + force-exit on hang` | Adds SIGHUP handler + opt-in `ISAACLAB_FORCE_EXIT_TIMEOUT` (force-exit timer). | physx `test_articulation` (when CI sets the env var and the hang is bounded) |
| #5881: `[MGPU] Sim: honor device kwarg over sim_cfg.device in build_simulation_context` | Fixes device-drift in `build_simulation_context` so newton kernels actually run on the requested cuda:N. | n/a (correctness fix; unblocks downstream tests but no marker drops) |

2. CI mechanics

2.1 Dynamic work-stealing

.github/workflows/test-multi-gpu-pytest.yaml runs N parallel docker shards on the same multi-GPU runner. A shared flock-locked queue file in a runtime dir holds test paths; each shard flocks the file, pops the next path, releases, and runs that file on its GPU. No work duplication; balanced wall time even when file durations vary 10x.
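The pop step can be sketched in Python as below. This is a minimal illustration only, assuming a plain-text queue file with one test path per line; the function name `pop_next_test` and the file layout are hypothetical, and the workflow itself implements the queue in shell.

```python
import fcntl
import os
import tempfile
from typing import Optional


def pop_next_test(queue_path: str) -> Optional[str]:
    """Remove and return the first queued test path, or None when the queue is empty."""
    with open(queue_path, "r+") as fh:
        fcntl.flock(fh, fcntl.LOCK_EX)  # blocks while another shard holds the lock
        try:
            lines = [ln for ln in fh.read().splitlines() if ln.strip()]
            if not lines:
                return None
            head, rest = lines[0], lines[1:]
            # Rewrite the file without the entry we just claimed.
            fh.seek(0)
            fh.truncate()
            fh.writelines(f"{ln}\n" for ln in rest)
            fh.flush()
            return head
        finally:
            fcntl.flock(fh, fcntl.LOCK_UN)


# Demo: a single "shard" draining a three-entry queue (file names are made up).
with tempfile.TemporaryDirectory() as tmp:
    queue = os.path.join(tmp, "queue.txt")
    with open(queue, "w") as fh:
        fh.write("test_a.py\ntest_b.py\ntest_c.py\n")
    drained = []
    while (path := pop_next_test(queue)) is not None:
        drained.append(path)
```

Because each shard holds the exclusive lock for the whole read-modify-write, no two shards can claim the same path, and a shard that finishes a short file simply comes back for the next one.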

2.2 Discover step + opt-in / opt-out

# In .github/workflows/test-multi-gpu-pytest.yaml
mapfile -t candidates < <(grep -rlE 'test_devices\(\)|test_devices\("[^"]*X"\)' source/ --include='test_*.py' | sort -u)
discovered=()
skipped=()
for f in "${candidates[@]}"; do
  if grep -q '^MULTI_GPU_SKIP_REASON' "$f"; then
    skipped+=("$f")
  else
    discovered+=("$f")
  fi
done

Any test that calls test_devices() with no arguments or with a non-default-GPU mask (e.g. "..X") is in scope. To exclude a file from the multi-GPU lane while keeping it in single-GPU CI, declare a module-level MULTI_GPU_SKIP_REASON = "...". The discover step prints each excluded file as a ::notice with the reason.
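For readers who prefer Python to grep, the same opt-in / opt-out classification could look roughly like this. It is an illustrative equivalent, not the workflow's code: the regexes are copied from the shell snippet above, while `discover` and the demo file names are hypothetical.

```python
import re
import tempfile
from pathlib import Path

# Same patterns as the workflow's two grep calls.
OPT_IN = re.compile(r'test_devices\(\)|test_devices\("[^"]*X"\)')
OPT_OUT = re.compile(r"^MULTI_GPU_SKIP_REASON", re.MULTILINE)


def discover(root):
    """Return (discovered, skipped) basenames of test_*.py files under root."""
    discovered, skipped = [], []
    for path in sorted(Path(root).rglob("test_*.py")):
        text = path.read_text(encoding="utf-8", errors="replace")
        if not OPT_IN.search(text):
            continue  # file never opted in to multi-GPU coverage
        (skipped if OPT_OUT.search(text) else discovered).append(path.name)
    return discovered, skipped


# Demo on a throwaway tree with made-up files.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "test_in.py").write_text("devices = test_devices()\n")
    (root / "test_out.py").write_text(
        'MULTI_GPU_SKIP_REASON = "hypothetical reason"\ndevices = test_devices("..X")\n'
    )
    (root / "test_cpu_only.py").write_text('devices = test_devices("110")\n')
    discovered, skipped = discover(root)
```

In the demo, `test_cpu_only.py` is ignored because its mask does not end in `X`, `test_in.py` is discovered, and `test_out.py` is listed as skipped.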

2.3 Hang stack capture

tools/conftest.py:_capture_hang_stacks(pid, pgid, kill_reason) runs before the kill path's SIGKILL. Invokes py-spy dump --pid and gdb -batch -ex "thread apply all bt" -p <pid> against every pid in the process group (cap 8). Each tool is optional; missing tools report inline rather than crashing the diagnostic.

The output gets attached to the JUnit error report so a CI failure surfaces both Python and C++ frames at the moment of the hang — critical because Kit core is closed-source and the hang otherwise terminates with no actionable diagnostic.

Workflow adds --cap-add=SYS_PTRACE to the per-shard docker run (required for both tools to attach) and py-spy to the in-container pip install line (gdb is already in the image).
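A rough sketch of the capture loop follows, assuming the caller has already enumerated the pids in the process group. The function signature, the `tools` table, and the report format are illustrative and do not reproduce the actual `_capture_hang_stacks` in tools/conftest.py; only the two command lines come from the description above.

```python
import shutil
import subprocess

# Command builders for the two tools named in the PR description.
DEFAULT_TOOLS = {
    "py-spy": lambda pid: ["py-spy", "dump", "--pid", str(pid)],
    "gdb": lambda pid: ["gdb", "-batch", "-ex", "thread apply all bt", "-p", str(pid)],
}


def capture_hang_stacks(pids, tools=DEFAULT_TOOLS, max_pids=8, timeout_s=30):
    """Return one report section per (pid, tool); never raises on a missing tool."""
    sections = []
    for pid in list(pids)[:max_pids]:  # cap the number of processes inspected
        for name, build_cmd in tools.items():
            if shutil.which(name) is None:
                sections.append(f"[{name} pid={pid}] tool not found, skipped")
                continue
            try:
                proc = subprocess.run(
                    build_cmd(pid), capture_output=True, text=True, timeout=timeout_s
                )
                sections.append(
                    f"[{name} pid={pid}] exit={proc.returncode}\n{proc.stdout}{proc.stderr}"
                )
            except (OSError, subprocess.TimeoutExpired) as exc:
                sections.append(f"[{name} pid={pid}] failed: {exc}")
    return sections


# Demo with a deliberately missing (hypothetical) tool to show the no-op path.
demo = capture_hang_stacks(
    range(1000, 1012),
    tools={"no-such-stack-tool": lambda pid: ["no-such-stack-tool", str(pid)]},
)
```

The demo passes twelve pids but gets eight sections back, each reporting the tool as missing, which is the degraded behaviour the description calls a safe no-op.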

2.4 AppLauncher honors `ISAACLAB_SIM_DEVICE`

When the test's AppLauncher() call site doesn't pin a device, the env var becomes the device. Used by the workflow so each shard's container can -e ISAACLAB_SIM_DEVICE=cuda:N without editing test source.
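The precedence described here (explicit argument, then env var, then the old default) can be illustrated with a small stand-alone function. `resolve_sim_device` is a hypothetical name; the real logic lives inside AppLauncher and may be structured differently.

```python
import os


def resolve_sim_device(explicit_device=None, env=None, default="cuda:0"):
    """Pick the boot device: explicit argument > ISAACLAB_SIM_DEVICE > default."""
    env = os.environ if env is None else env
    if explicit_device is not None:
        return explicit_device  # a pinned call site always wins
    return env.get("ISAACLAB_SIM_DEVICE") or default


# Demo with an injected environment mapping instead of the real os.environ.
from_env = resolve_sim_device(env={"ISAACLAB_SIM_DEVICE": "cuda:1"})
pinned = resolve_sim_device("cpu", env={"ISAACLAB_SIM_DEVICE": "cuda:1"})
fallback = resolve_sim_device(env={})
```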

3. Files currently skipped from multi-GPU lane

Both test_articulation.py variants carry the MULTI_GPU_SKIP_REASON marker pointing at #3475:

source/isaaclab_newton/test/assets/test_articulation.py: the marker is removed once #5883 lands (the test goes kitless, with no Kit lifecycle exposure).
source/isaaclab_physx/test/assets/test_articulation.py: needs Kit (physx physics is a Kit extension); the marker drops once the upstream hang is fixed, or once #5886 lands and ISAACLAB_FORCE_EXIT_TIMEOUT=10 is set on the runner and proven to clear the failure.

4. Validation

Three consecutive green CI runs at HEAD, taken before the revert of the cross-PR cherry-picks. Relative to that green state, the reverts add net-zero file changes plus the conftest diagnostics, which only fire on the kill path.

5. Non-scope

The 3 fix PRs above are NOT included. The PR description for each makes their independent scope explicit.
The upstream Kit hang itself is not fixed here; it is only instrumented and worked around at the harness layer.

Re-enables the pull_request trigger on test-fabric-multi-gpu.yaml and wires it to run the FabricFrameView contract tests with ISAACLAB_TEST_MULTI_GPU=1, which activates the three cuda:1-parameterised tests added in isaac-sim#5514. The cuda:1 tests target FabricFrameView's SelectPrims path on non-zero CUDA device indices. They currently hang indefinitely on real multi-GPU hardware (reproduced locally on 3x RTX 6000 Pro Blackwell and on the multi-GPU runner pool); the 60-min workflow timeout will cancel the job and surface the regression in CI for the FabricFrameView maintainers.
Install pipeline matches isaac-sim#5738's proven-working layout:
- Pin Python 3.12 via SHA-pinned actions/setup-python.
- Pre-install cmake via pip to skip install.py's sudo apt-get branch.
- ./isaaclab.sh --install none (core only, avoids egl_probe libEGL).
- pip install isaacsim[all,extscache]==${vars.ISAACSIM_BASE_VERSION || '6.0.0'} --extra-index-url https://pypi.nvidia.com.
- Bypass Kit's interactive EULA via OMNI_KIT_ACCEPT_EULA / ACCEPT_EULA / ISAAC_SIM_HEADLESS.
Status: this PR is expected to fail with the 60-min workflow timeout. Land once the underlying hang in fabric_frame_view.py is fixed.

Adds a single helper, cuda_test_devices(), that converts a 3-position device mask (env-var ISAACLAB_TEST_DEVICES, default '110') into the list of device strings tests parametrize over. Single-GPU CI sees no change (default mask '110' resolves to [cpu, cuda:0], identical to the hardcoded lists tests carry today). The new multi-GPU-pytest workflow sets ISAACLAB_TEST_DEVICES=001 so migrated tests run on cuda:1 only.
Mask grammar: each position is 0 or 1, optional trailing X expands to all remaining positions. Position 0 -> cpu; position k>=1 -> cuda:{k-1}. Strict mode raises on missing devices; non-strict returns empty for opt-in tests that should skip on hosts that can't satisfy them.
P0 migration (pure-Python utility tests, no Kit):
* source/isaaclab/test/utils/test_math.py: 45 parametrize sites + 2 inline for-loops migrated.
* source/isaaclab/test/utils/test_wrench_composer.py: 37 sites.
* source/isaaclab/test/utils/test_episode_data.py: 5 sites.
Each migrated site replaces a hardcoded [cpu, cuda:0] (or the reversed or tuple form) with cuda_test_devices(). Migration is additive - one import line per file plus the inline edits. No test logic changes.
Workflow: .github/workflows/test-multi-gpu-pytest.yaml runs on the [self-hosted, ..., multi-gpu] pool with ISAACLAB_TEST_DEVICES=001. Triggered on changes to the helper, the P0 test files, or the workflow itself.
Excluded scope (to follow up after CI validates this MVP):
* P1 light-Kit tests (test_simulation_context, test_views_xform_prim, test_newton_model_utils, test_views_xform_prim_newton).
* P2 asset tests (test_articulation / test_rigid_object on physx and newton backends).
* FabricFrameView cuda:1 tests (PR isaac-sim#5514) - separate path, the SelectPrims deadlock there is tracked independently.
Reverts the fabric-specific .github/workflows/test-fabric-multi-gpu.yaml edits that were carried on this branch from the earlier PR scope; that demo is independent of this framework work.
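To make the mask grammar from this commit message concrete, here is a small parser written from the description alone. It is a sketch, not the helper's source: `devices_from_mask` and its `cuda_count` parameter are hypothetical, and it assumes that a trailing X enables every remaining position and that non-strict mode silently drops devices the host lacks.

```python
def devices_from_mask(mask: str, cuda_count: int, strict: bool = False) -> list:
    """Translate a device mask such as '110' or '00X' into device strings."""
    expand = mask.endswith("X")
    bits = mask[:-1] if expand else mask
    if any(bit not in "01" for bit in bits):
        raise ValueError(f"invalid device mask: {mask!r}")

    flags = [bit == "1" for bit in bits]
    if expand:
        # Assumption: X turns on all positions up to the last available GPU.
        flags += [True] * max(0, (1 + cuda_count) - len(flags))

    devices = []
    for position, enabled in enumerate(flags):
        if not enabled:
            continue
        if position == 0:
            devices.append("cpu")
        elif position - 1 < cuda_count:
            devices.append(f"cuda:{position - 1}")
        elif strict:
            raise RuntimeError(f"mask {mask!r} requests missing cuda:{position - 1}")
        # Non-strict: a requested but missing GPU is dropped.
    return devices


single_gpu_default = devices_from_mask("110", cuda_count=1)
multi_gpu_lane = devices_from_mask("001", cuda_count=2)
unsatisfiable = devices_from_mask("001", cuda_count=1)
expanded = devices_from_mask("00X", cuda_count=3)
cpu_only_host = devices_from_mask("110", cuda_count=0)
try:
    devices_from_mask("001", cuda_count=1, strict=True)
    strict_raised = False
except RuntimeError:
    strict_raised = True
```

Under these assumptions the default mask yields `["cpu", "cuda:0"]` on a single-GPU host, `001` yields `["cuda:1"]` on a two-GPU host, and the same mask yields an empty list on a one-GPU host unless strict mode is on.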

Migrate 16 additional test files (P0 extras + P1 + P2 + P3) to call cuda_test_devices() in their device parametrize, covering ~280 sites across articulation/rigid-object/rigid-object-collection/sim/sensors suites for physx, newton, and ovphysx backends. Rewrite the workflow's run step to auto-discover any test file calling cuda_test_devices() via grep, so new opt-ins land without workflow edits. Files are split into a pure-Python pytest session and per-file Kit-bound invocations (Kit is a process-wide singleton). A hardcoded SKIP list parks the known-broken FabricFrameView cuda:1 path. Per-Kit-file timeout 600 bounds any single hang at 10 minutes so the job surfaces all failing files rather than blocking on the first.

pytest is not pulled in by --install none or by isaacsim[all,extscache]. Runner state was masking this; pin it explicitly.

flaky and pytest-mock are declared in source/isaaclab/setup.py install_requires but pip's resolver was silently skipping them when combined with the pytorch/nvidia extra-index urls in the install step. Pin them explicitly so the multi-GPU runner is runner-state independent. SKIP four newton test files that the cuda:1 cold-runner surfaces as broken (test_contact_sensor hits a pre-existing measure_total kwarg bug; test_articulation segfaults; test_rigid_object_collection and test_views_xform_prim_newton have cuda:1 specific failures). They're still parametrized via cuda_test_devices() so single-GPU CI continues to cover cpu+cuda:0. Accept pytest exit code 5 (no tests collected) so module-level pytestmark skips (e.g. backend-availability gates in ovphysx) and device-only parametrize that resolves to [] on incompatible hosts both count as success.

Four additional test files surfaced cuda:1-specific failures or hangs on the multi-GPU runner:
* test_simulation_context: passes test_init[cuda:1] then hangs the next parametrize variant, 10-min per-file timeout fires.
* newton/test_rigid_object: 41 cuda:1 failures (out of 54).
* physx/test_rigid_object: passes test_initialization[cuda:1-1] then hangs at [cuda:1-2] (env-count 2 on cuda:1).
* physx/test_rigid_object_collection: same hang signature.
They keep their cuda_test_devices() parametrize so single-GPU CI continues to exercise cpu+cuda:0; only multi-GPU CI skips them pending separate investigation.

When the caller doesn't pass device= explicitly, AppLauncher now falls back to the ISAACLAB_SIM_DEVICE env var (if set) instead of the hardcoded cuda:0 default. Kit's active_gpu / physics_gpu are process-global and locked after SimulationApp init, so per-test parametrize alone cannot retarget GPU selection once the app is up. Boot-time alignment is the only path that works.
The multi-GPU pytest workflow now sets ISAACLAB_SIM_DEVICE=cuda:1 alongside ISAACLAB_TEST_DEVICES=001, so PhysX and Warp pin to cuda:1 from process start. Drops 7 entries from the SKIP list (5 cuda:1 hangs around active_gpu/cuda mismatch + 2 newton suites likely sharing the same root cause).
Remaining SKIPs:
* FabricFrameView (usdrt SelectPrims cuda:0-only, upstream Kit)
* newton/contact_sensor (Newton PR isaac-sim#2135 measure_total rename, needs caller update in newton_manager.py; tracked separately).

Round-5 CI confirmed the AppLauncher fix unblocks test_simulation_context (was hanging at second parametrize, now 42 passed in 62 s). Other files surfaced separate, non-AppLauncher root causes that need independent fixes:
* Newton suites (4 files): Warp allocator failure inside mujoco_warp.collision_driver on cuda:1. Reproduces locally on a 3-device MIG host; root cause is the Warp/mujoco_warp interaction, not AppLauncher routing.
* PhysX suites (3 files): hang at test_initialization[cuda:1-2] only on the AWS multi-GPU runner. Passes in 11 s locally with the same code, so the hang is runner-specific (L40 driver / peer access / PCIe topology), not an IsaacLab bug.
test_simulation_context stays in scope (the AppLauncher fix made it pass deterministically). FabricFrameView usdrt and contact_sensor Newton API rename remain in SKIP for their pre-existing root causes.

Replaces the pip-install path with ECR pull of the same isaac-lab CI image used by build.yaml. ECR auths via the runner EC2 instance's IAM role (no nvcr.io credentials required at PR-time), so fork PRs work without exposing NGC_API_KEY.
Benefits:
* Newton 1.2+ pre-installed in the image, fixing the contact_sensor measure_total kwarg mismatch without a manual pip pin.
* Eliminates the 9 min cold pip-install step (image pull from ECR is tens of seconds when cached).
* Dep matrix matches single-GPU CI exactly, so both gates surface the same kind of dep skew.
If ECR cache misses (e.g. build.yaml hasn't completed first), the action falls back to local build; that path is slow and requires NGC_API_KEY. Validating ECR auth on the multi-gpu runner pool is the primary goal of this commit; it drops contact_sensor from SKIP to test that the version skew is resolved.

The previous attempt missed:
1. The isaac-sim base image has an ENTRYPOINT that launches Kit, so bash -c '...' was passed as Kit's argv (Kit booted, my script never ran). Mirror run-tests action: --entrypoint bash + -c '...'.
2. tools/conftest.py's pytest_ignore_collect returns True for every file (subprocess-per-test runner), so pytest collects 0 items and exits. Pass --ignore=tools/conftest.py --ignore=source/isaaclab/test/install_ci, same as run-tests does.

ecr-build-push-pull only pulls locally on the exact-cache-hit path. On deps-cache-hit (registry-side alias) the image isn't local, so docker run fails with 'Unable to find image isaac-lab-ci:... locally' followed by an unauthed Docker Hub pull attempt. Explicit pull via the ECR URL covers all paths uniformly.

The ecr-build-push-pull action cleans up its temp DOCKER_CONFIG after running, so the docker login from inside the action does not persist. Re-authenticate via aws ecr get-login-password (works via the runner EC2's IAM role, no AWS creds in the workflow).

Runner's default ~/.docker/config.json declares a credential helper that fails with 'not implemented'. Mirror the same workaround the ecr-build-push-pull action uses: drop a fresh config with credsStore set to empty string, then docker login + pull work.

Image's default USER is isaaclab (uid 1000), which doesn't own the volume-mounted host workspace, so it can't ln -s _isaac_sim (perm denied) — falling back to PATH python3 which doesn't exist in the image, hence pytest exit 127.

Running container as host uid:gid means the image's default /root home is not writable, so Warp/numpy/pip cache writes hit PermissionError [Errno 13] '/root/.cache'. Mount a fresh tmp dir and point HOME + XDG_CACHE_HOME at it.

The docker image properly installs ovphysx, so module-level pytestmark.skipif now triggers (no backend init at multi-gpu) collecting 0 items / 1 skipped. isaaclab.sh's CLI wrapper translates that exit-5 to exit-1, breaking the workflow's is_ok() check. Skip the 3 ovphysx files here.

Per ~/.claude/skills/pr/ci-iteration-shortcut.md. All gated Docker + Tests jobs (single-GPU build/test matrix) skip via their existing if-gate. Revert before final review. PR 5823 iterates the multi-GPU pytest docker conversion; the heavy single-GPU matrix adds no signal to that work and costs 30+ runner minutes per push.

Replaces the 237-line custom workflow with the ~100-line shape used by single-GPU test jobs: pull image via ecr-build-push-pull, run pytest in container via run-package-tests + run-tests, let tools/conftest.py handle the Kit-singleton subprocess-per-test pattern. The 8 docker-runtime quirks I worked around in the previous attempt (ENTRYPOINT, conftest ignore, deps-cache pull, DOCKER_CONFIG cleanup, credsStore, uid mismatch, HOME, exit-5 propagation) are all already handled inside the run-tests action. No reinvention. Adds one input to run-tests + run-package-tests: extra-env-vars (multiline KEY=value), used here to inject ISAACLAB_TEST_DEVICES and ISAACLAB_SIM_DEVICE so the container's pytest parametrize and Kit boot align on cuda:1. Test scope: 9 opt-in basenames covering ~512 cuda:1 tests, same discovery scope as the previous attempt minus the 11 SKIPped files.

ecr-build-push-pull's deps-cache-hit path only creates a registry-side alias (no local pull). Without a prior build job that establishes the exact-commit tag in ECR, the test job's internal ecr-build-push-pull hits exact-cache-miss + deps-cache-hit and leaves no local image, so docker run fails with 'pull access denied'. Mirrors the build → test split that build.yaml already uses for the single-GPU matrix. Build job pre-populates the exact tag (via buildx imagetools create on deps-cache-hit, or full build on miss); test job's inner ecr-build-push-pull then hits exact-cache-hit and pulls locally via the action's existing 'Pull exact image' step.

This reverts commit 665f0c3.

* Re-applies the run_docker_tests='false' guard in build.yaml's changes job (per pr/ci-iteration-shortcut) so the single-GPU Docker + Tests matrix skips during this iteration.
* Adds test_views_xform_prim_fabric.py to the multi-GPU include-files list. Previously SKIPped because pip-install rounds hung on a usdrt SelectPrims cuda:1 deadlock; the docker image carries a newer Kit, so the cuda:1 path is worth re-validating here.
Must be reverted before final review.

The docker image's newer Kit resolves the usdrt SelectPrims cuda:1 deadlock that previously kept this file in the SKIP list (pip-install rounds hit it). Run 26587461494 passed: 36 passed, 3 skipped, 2 xfailed for test_views_xform_prim_fabric.py. This also restores build.yaml's changes detection (drops the temp TEMP-iteration skip).

Per ~/.claude/skills/pr/ci-iteration-shortcut.md. Keep the single-GPU Docker + Tests matrix disabled until iteration is over and the PR is ready to land. Revert as the last commit before merge.

Folds the helper into the existing isaaclab.test subpackage shape (sibling of isaaclab.test.benchmark, isaaclab.test.mock_interfaces) under a new isaaclab.test.utils subpackage. Drops the standalone isaaclab.testing folder, which was a new top-level namespace with no precedent. Import path: from isaaclab.test.utils import cuda_test_devices.

Implements Greptile P2.1 + P2.2 and consolidates the device-skip mechanism inside test files so the workflow needs no opt-in or opt-out edits.
API:
* cuda_test_devices() default strict=False: CPU-only dev hosts now collect the cpu variant cleanly instead of failing at pytest collection (Greptile P2.1).
* cuda_test_devices(skip={device: reason}): wraps unsupported variants in pytest.param(..., marks=pytest.mark.skip(reason=...)) so pytest still collects them and shows SKIPPED with the reason in CI output. Per-call granularity; reason co-located with the test.
Workflow:
* Auto-discovery via grep for cuda_test_devices callers; no SKIP list in the workflow. Adding/removing a test from multi-GPU scope is a test-file-only edit.
run-tests action:
* extra-env-vars parser now skips only full-line comments (no mid-line # stripping) and doesn't xargs-collapse whitespace (Greptile P2.2).
Test file migrations:
* 7 previously-SKIPped files (4 newton + 3 physx) now declare a module-level _CUDA_1_BROKEN dict with a tracking-issue URL and apply cuda_test_devices(skip=_CUDA_1_BROKEN) per parametrize site.
* test_views_xform_prim_fabric.py migrated to the helper too (was using the legacy ISAACLAB_TEST_MULTI_GPU env var pattern), so auto-discovery picks it up.

Earlier rounds SKIPped 7 newton+physx files based on failures observed on the pip-install path or pre-docker rounds. The docker image carries newer Kit + Newton + Warp that already resolved 2 other categories (measure_total kwarg, FabricFrameView usdrt deadlock). Re-running the cuda:1 variants of these 7 files to see which actually still fail on the docker path.

Enable-all run (0118ea7) confirmed the docker image resolves the PhysX hangs and FabricFrameView/contact_sensor failures that earlier rounds suspected. Two narrow categories remain:
* 4 newton files: Warp/mujoco_warp init-order on cuda:1 (issue isaac-sim#5132). Same root cause across all four; gated via module-level _NEWTON_5132 dict.
* 1 PhysX test: test_rigid_body_no_friction[cuda:1-*] precision drift (1.8e-3 vs 1e-5 tolerance); gated via per-test _PHYSX_NO_FRICTION_CUDA1 dict.
Everything else previously SKIPped now runs and passes on cuda:1 (test_articulation, test_rigid_object except no_friction, test_rigid_object_collection, test_views_xform_prim_fabric).

Add torch.cuda.set_device(device) + wp.set_device(device) at the top of NewtonManager.start_simulation and initialize_solver so mujoco_warp's collision pipeline allocates against an initialized cuda:N primary context. Also pass device=device to the standard-path wp.ScopedCapture (the relaxed-graph sibling already did this). Local repro confirms: isaaclab_newton/test/assets/test_rigid_object.py on cuda:1 was 41 failed; now 45 passed / 9 skipped / 0 failed.
Also fixes the test_rigid_body_no_friction tolerance branch in both isaaclab_physx and isaaclab_newton test_rigid_object.py files. The author already documented GPU non-determinism and set tolerance = 1e-2 for cuda:0; the else branch fell through to the CPU-tight 1e-5 on cuda:1, where PhysX's GPU integrator drift is the same 1.8e-3 envelope. Gate on device.startswith('cuda') so all cuda devices share the same loose tolerance.
Drops the temporary _NEWTON_5132 and _PHYSX_NO_FRICTION_CUDA1 skip dicts from the 5 test files now that the underlying bugs are fixed.
Tracks: isaac-sim#5132.
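The tolerance fix can be pictured with two tiny functions. Both are hypothetical reconstructions from the commit message: the exact condition in the original test is not quoted here, so the "before" version assumes an equality check against cuda:0.

```python
def tolerance_before(device: str) -> float:
    # Assumed shape of the old branch: only cuda:0 got the loose tolerance.
    return 1e-2 if device == "cuda:0" else 1e-5


def tolerance_after(device: str) -> float:
    # New gate: every CUDA device shares the loose tolerance.
    return 1e-2 if device.startswith("cuda") else 1e-5


observed_drift = 1.8e-3  # drift reported for cuda:1 in the commit message
failed_before = observed_drift > tolerance_before("cuda:1")
passes_after = observed_drift <= tolerance_after("cuda:1")
cpu_unchanged = tolerance_after("cpu") == tolerance_before("cpu")
```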

Local 2/3/4-MIG cross-GPU threshold sweep confirms the Kit SIGHUP / shutdown-hang bug in test_articulation fires at >=3 concurrent Kit instances (v8 3-MIG repros it identically to CI's 3-shard config; v7 2-MIG passes cleanly). Cap multi-GPU CI's shard count at 2 to stay below the threshold; un-gate test_articulation so it runs on the multi-GPU lane again. Restores ~5min wall-time cost (2 shards instead of 3) in exchange for re-enabling the file.

isaaclab-review-bot

Update (commit e7d1af7): New commit implements a 2-shard cap as a concurrency-threshold workaround for the Kit lifecycle bug.

Changes in This Commit

1. `.github/workflows/test-multi-gpu-pytest.yaml`

Shard count is now capped at 2, regardless of available GPUs:

# TEMP EXPERIMENT: cap shards at 2 to test concurrency-threshold workaround for
# the Kit lifecycle bug (SIGHUP/shutdown-hang fires at >=3 concurrent Kit
# instances; 2 is below the threshold per local 2-MIG cross-GPU repro).
shards=$(python3 -c "import json; cap=2; print(json.dumps(list(range(1, min($N, cap+1)))))")
count=$(python3 -c "cap=2; print(min($N-1, cap))")

2. Test Files Re-enabled

Both test_articulation.py files (isaaclab_newton and isaaclab_physx) now have their MULTI_GPU_SKIP_REASON markers commented out to test whether the 2-shard cap avoids the Kit lifecycle bug.

Assessment

✅ Hypothesis-driven debugging — The approach systematically tests the theory that ≥3 concurrent Kit instances trigger the lifecycle bug, while ≤2 are stable.

✅ Conservative cap — 2 shards still provides multi-GPU coverage (cuda:1, cuda:2) while staying below the suspected threshold.

⚠️ GPU Utilization — On 4+ GPU systems, only 2 of the non-default GPUs will be utilized. This is acceptable as a diagnostic step but should be reverted once the root cause is understood.

Comments on Implementation

Inline Python for cap logic: The cap calculation uses inline Python:
```
count=$(python3 -c "cap=2; print(min($N-1, cap))")
```
This works but is slightly fragile. Consider extracting cap=2 to a workflow-level variable for easier adjustment:
```
env:
  SHARD_CAP: 2  # TEMP: workaround for Kit lifecycle bug
```
Experiment lifecycle: This is now the third commit in the diagnostic cycle (enable → test → observe). The clear comments document the experiment purpose, which is good practice.

Outstanding Items (Cumulative)

| Item | Status | Location |
| --- | --- | --- |
| `run_docker_tests: 'false'` hardcoding | Pre-existing | build.yaml |
| 45-minute timeout (target 30) | Pre-existing | test-multi-gpu-pytest.yaml |
| 2-shard cap | New (diagnostic) | test-multi-gpu-pytest.yaml |
| Commented MULTI_GPU_SKIP_REASON markers | New (experiment) | Both test_articulation.py |

Verdict

LGTM for experiment ✅ — Clean diagnostic change. The 2-shard cap is a reasonable approach to isolate the concurrency threshold. Results from this CI run will inform whether the cap resolves the issue or if the problem lies elsewhere.

Pre-merge checklist (once experiments conclude):

Revert or formalize the shard cap based on findings
Restore or remove MULTI_GPU_SKIP_REASON markers accordingly
Revert run_docker_tests: 'false' in build.yaml
Adjust 45-minute timeout if validated

Update (commit 4bc8b33): The 2-shard cap experiment has concluded. This commit:

Removes the shard cap — all available GPUs are now used (shards=$(python3 -c "...range(1, $N)"), count=$((N - 1))). The inline Python is simplified and the bash arithmetic is cleaner.
Re-enables MULTI_GPU_SKIP_REASON in both test_articulation.py files — with improved documentation explaining the opt-out mechanism. These files are excluded from concurrent multi-GPU CI while the upstream Kit lifecycle bug persists.

Previous Concerns Status

⚠️ Inline Python for cap logic → ✅ Moot — the cap is removed entirely. The remaining inline Python is minimal and the count uses native bash arithmetic.
⚠️ GPU Utilization → ✅ Fixed — all non-default GPUs are now utilized.
📋 2-shard cap → ✅ Reverted (experiment concluded).
📋 Commented MULTI_GPU_SKIP_REASON markers → ✅ Markers restored and properly documented.

Assessment

Clean resolution of the diagnostic cycle. The approach is sound: use all GPUs but exclude known-bad test files via a well-documented constant. The improved comments on MULTI_GPU_SKIP_REASON clearly explain the mechanism and removal criteria. No new issues introduced.

Update (commit 3d3f136): New commit adds a fix for build_simulation_context silently ignoring the device kwarg when sim_cfg is also provided.

Changes

simulation_context.py — device parameter default changed from "cuda:0" to None. When device is explicitly passed alongside sim_cfg, it now overrides sim_cfg.device. This fixes warp kernel device-mismatch errors on non-default GPUs.
Changelog entry added documenting the fix.
Test updates — both headless and non-headless test files split the test case to separately verify: (a) sim_cfg values are preserved when no device kwarg is given, and (b) explicit device kwarg wins when both are passed.
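The None-default override pattern described in this update can be sketched with a stand-in config class. `SimCfg` and `resolve_sim_cfg` are hypothetical names used only for illustration; they are not the real SimulationCfg or build_simulation_context.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass
class SimCfg:
    """Hypothetical stand-in for a simulation config with a device field."""

    device: str = "cuda:0"
    dt: float = 0.01


def resolve_sim_cfg(sim_cfg: Optional[SimCfg] = None, device: Optional[str] = None) -> SimCfg:
    """Return the effective config; an explicit device overrides sim_cfg.device."""
    cfg = sim_cfg if sim_cfg is not None else SimCfg()
    if device is not None:
        cfg = replace(cfg, device=device)  # copy, so the caller's cfg is untouched
    return cfg


user_cfg = SimCfg(device="cpu", dt=0.005)
preserved = resolve_sim_cfg(user_cfg)                   # (a) no device kwarg
overridden = resolve_sim_cfg(user_cfg, device="cuda:1")  # (b) explicit kwarg wins
```

The two demo calls correspond to the two test cases listed above: the config's own values survive when no device is passed, and an explicit device wins when both are given.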

Assessment

✅ Correct fix — The None default + explicit override pattern is clean and backward-compatible. Callers not passing device get SimulationCfg's default; callers passing it explicitly get what they asked for.

✅ Well-tested — Both branches (sim_cfg-only and device-override) are covered.

✅ Good documentation — The inline comment explains the rationale and the failure mode it prevents.

⚠️ Scope note — This commit fixes a real bug but is unrelated to the CI multi-GPU sharding work in the PR title. Consider whether this should be a separate PR for cleaner git history, though it is understandable if the bug was discovered during multi-GPU testing.

No blocking issues. LGTM.

Update (commit 3bd01c4): Removes MULTI_GPU_SKIP_REASON from both test_articulation.py files (isaaclab_newton and isaaclab_physx), fully re-enabling them in concurrent multi-GPU CI.

This is the logical conclusion of the diagnostic cycle: the build_simulation_context device-override fix from the previous commit likely resolved the warp kernel failures that were triggering Kit lifecycle issues under concurrent execution. With the root cause addressed, the opt-out markers are no longer needed.

Changes

Both files simply remove the 6-line MULTI_GPU_SKIP_REASON constant block:

# File-level opt-out from concurrent multi-GPU CI. The multi-GPU workflow's
# discover step skips any test file declaring this module-level constant.
# ...
MULTI_GPU_SKIP_REASON = "Kit lifecycle bug: ..."

Assessment

✅ Clean removal — No code changes beyond deleting the opt-out marker.

✅ Logical progression — The diagnostic cycle is complete: 2-shard cap (experiment) → device fix (root cause) → re-enable tests (validation).

⚠️ Risk note — If the Kit lifecycle issues resurface in CI, these markers can be quickly re-added. The workflow's MULTI_GPU_SKIP_REASON discovery mechanism remains in place.

Outstanding Pre-merge Items

| Item | Status | Notes |
| --- | --- | --- |
| `run_docker_tests: 'false'` | ⚠️ Still hardcoded | build.yaml: revert before final merge |
| 45-minute timeout | ℹ️ Diagnostic | Can tighten to 30 once validated |
| Cherry-picked #5881 | ℹ️ Noted in PR description | Will drop on rebase after #5881 merges |

No new issues introduced. LGTM. ✅

Update (commit 7d04e41): Adds SIGHUP signal handling and fixes _abort_signal_handle_callback to properly exit after cleanup.

Changes

app_launcher.py — Registers SIGHUP handler so child processes shut down cleanly when the parent shell (supervising sibling shards) exits. Rewrites the callback to call sys.exit(128 + signum) after app.close(), preventing Python from resuming with half-torn-down Kit state. Uses contextlib.suppress(Exception) for robustness in signal context.
Changelog entry — Documents both fixes clearly.
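The close-then-exit shape of the handler might look like the following. This is a sketch under assumptions: `make_abort_handler`, `install_sighup_handler`, and `_DemoApp` are hypothetical names, and the real callback in app_launcher.py may register more signals or do more cleanup.

```python
import contextlib
import signal
import sys


def make_abort_handler(app):
    """Build a signal handler that closes the app, then exits with 128 + signum."""

    def _handler(signum, frame):
        # Cleanup must not raise inside a signal handler.
        with contextlib.suppress(Exception):
            app.close()
        # Exit explicitly so Python does not resume with a half-closed app.
        sys.exit(128 + signum)

    return _handler


def install_sighup_handler(app):
    """Register the handler for SIGHUP (call from the main thread)."""
    signal.signal(signal.SIGHUP, make_abort_handler(app))


class _DemoApp:
    """Hypothetical stand-in for the simulation app."""

    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True


# Demo: invoke the handler directly instead of sending a real signal.
demo_app = _DemoApp()
try:
    make_abort_handler(demo_app)(signal.SIGHUP, None)
    exit_code = None
except SystemExit as exc:
    exit_code = exc.code
```

Exiting with 128 plus the signal number follows the usual shell convention, so a supervising script can tell a SIGHUP-triggered shutdown (129) from an ordinary test failure.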

Assessment

✅ Correct fix — SIGHUP cascading to child shards was the missing piece causing "Stage X already attached" failures. The explicit sys.exit() after cleanup is the right pattern for replaced signal dispositions.

✅ Clean implementation — Both contextlib and sys were already imported. Parameter rename from signal to signum avoids shadowing the module. Comments are thorough.

✅ No new issues introduced.

Outstanding Pre-merge Items

| Item | Status | Notes |
| --- | --- | --- |
| `run_docker_tests: 'false'` | ⚠️ Still hardcoded | build.yaml: revert before final merge |
| 45-minute timeout | ℹ️ Diagnostic | Can tighten to 30 once validated |

Update (commit 2b4530d): Two new diagnostic commits (7251b74 + 2b4530d). The latest restores 3 shards and replaces --gpus all with --gpus "device=$cuda" to give each container exclusive access to a single physical GPU. This eliminates the cross-GPU visibility that was triggering the SIGHUP cascade / "Stage already attached" pattern.

Assessment: Clean diagnostic change. The per-container GPU isolation via --gpus device=N is the correct approach — it mirrors MIG-level hardware isolation and prevents Kit processes from interfering with each other's GPU contexts. The explanatory comment is clear and well-reasoned.

No new issues introduced. Previous outstanding items unchanged:

| Item | Status | Notes |
| --- | --- | --- |
| `run_docker_tests: 'false'` | ⚠️ Still hardcoded | build.yaml: revert before final merge |
| 45-minute timeout | ℹ️ Diagnostic | Can tighten to 30 once validated |

Update (commit 6137b1b): Significant pivot — instead of per-container GPU isolation (--gpus device=N), the new approach eliminates Kit entirely from newton tests (kitless mode) and re-adds MULTI_GPU_SKIP_REASON for the physx test that still requires Kit.

Changes

Workflow — Reverts --gpus "device=$cuda" → --gpus all. The per-container isolation is no longer needed since newton tests no longer boot Kit.
schemas.py — New _create_fixed_joint_to_world() helper replaces the from omni.physx.scripts import utils import with pure pxr.UsdPhysics calls. Well-implemented: handles instance proxy / prototype prim climbing, unique joint naming, and correctly sets body0 (world) / body1 (articulation) with proper transforms via UsdGeom.XformCache.
Newton test files — test_articulation.py and test_rigid_object_collection.py remove AppLauncher boot and test_devices() import, replacing with hardcoded ["cuda:0", "cpu"] or ["cuda:0"]. This makes them fully kitless, avoiding the Kit lifecycle bug entirely.
PhysX test_articulation.py — Re-adds MULTI_GPU_SKIP_REASON to opt out of concurrent multi-GPU CI (this file still requires Kit).
Changelog entries — Properly added for the kitless fix.

Assessment

✅ Better root-cause solution — Rather than working around Kit concurrency bugs with container isolation, this removes the Kit dependency where it isn't needed. Kitless newton tests run faster (~30s saved per file) and are immune to Kit lifecycle issues.

✅ _create_fixed_joint_to_world() is well-written — Correctly mirrors the omni.physx.scripts.utils.createJoint single-selection Fixed branch. The prototype/instance climb, unique naming, and transform extraction are all correct.

✅ Clean test migration — Removing AppLauncher is safe for newton-only tests since SimulationContext gates Kit-specific paths on has_kit().

Outstanding Pre-merge Items

Item	Status	Notes
`run_docker_tests: 'false'`	⚠️ Still hardcoded	build.yaml — revert before final merge
45-minute timeout	ℹ️ Diagnostic	Can tighten to 30 once validated
`MULTI_GPU_SKIP_REASON` in physx test	ℹ️ Intentional	Stays until upstream Kit fix (IsaacLab #3475)

No new issues introduced. LGTM. ✅

Update (commit 6744a2d): Major direction change — reverts the kitless newton approach from the previous commit and instead adds robust hang diagnostics (py-spy + gdb stack capture) to the CI workflow.

Changes (vs `6137b1b`)

1. Workflow (`.github/workflows/test-multi-gpu-pytest.yaml`)

Adds --cap-add=SYS_PTRACE to the Docker run command, enabling py-spy/gdb to attach to hung Kit processes inside the container.
Adds py-spy to the in-container pip install list.

2. `tools/conftest.py` — New `_capture_hang_stacks()` Function

Captures Python (py-spy) and C++ (gdb) stack traces from all processes in the test's process group (capped at 8 pids) before SIGKILL erases them. Called on shutdown_hang, startup_hang, and timeout detection. Output is appended to the JUnit diagnostic report. Gracefully degrades when py-spy/gdb are unavailable.

Implementation is solid:

Enumerates process group via ps -o pid= -g <pgid>
Per-pid captures with sensible timeouts (10s py-spy, 20s gdb)
Truncates gdb output at 8KB to avoid flooding CI logs
Safe no-op when tools are missing
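
The capture flow listed above can be sketched roughly as follows. This is a minimal illustration and not the code in tools/conftest.py: the helper names (`list_group_pids`, `run_tool`, `capture_hang_stacks`) and the output layout are assumptions, and the cap here counts characters rather than bytes. The `ps`, `py-spy dump --pid` and `gdb -batch -ex "thread apply all bt" -p` invocations are the ones quoted in this PR.

```python
from __future__ import annotations

import shutil
import subprocess

MAX_PIDS = 8               # process-group cap mentioned in the review
GDB_OUTPUT_CAP = 8 * 1024  # keep roughly 8 KB of gdb output per pid


def list_group_pids(pgid: int) -> list[int]:
    """Return up to MAX_PIDS pids reported by ps for the group, or [] on any failure."""
    try:
        proc = subprocess.run(
            ["ps", "-o", "pid=", "-g", str(pgid)],
            capture_output=True, text=True, errors="replace", timeout=5, check=False,
        )
    except (OSError, subprocess.SubprocessError):
        return []
    return [int(tok) for tok in proc.stdout.split() if tok.isdigit()][:MAX_PIDS]


def run_tool(argv: list[str], timeout: float, cap: int | None = None) -> str:
    """Run one diagnostic tool; report problems inline instead of raising."""
    if shutil.which(argv[0]) is None:
        return f"[{argv[0]} not found on PATH; skipped]"
    try:
        proc = subprocess.run(
            argv, capture_output=True, text=True, errors="replace",
            timeout=timeout, check=False,
        )
    except subprocess.TimeoutExpired:
        return f"[{argv[0]} timed out after {timeout}s]"
    except OSError as exc:
        return f"[{argv[0]} failed to start: {exc}]"
    text = proc.stdout + proc.stderr
    if cap is not None and len(text) > cap:
        text = text[:cap] + "\n[truncated]"
    return text


def capture_hang_stacks(pgid: int, kill_reason: str) -> str:
    """Collect Python and native stacks for every pid in the group before it is killed."""
    lines = [f"=== hang stacks ({kill_reason}), pgid={pgid} ==="]
    pids = list_group_pids(pgid)
    if not pids:
        lines.append("[no pids found in process group]")
    for pid in pids:
        lines.append(f"--- py-spy dump, pid {pid} ---")
        lines.append(run_tool(["py-spy", "dump", "--pid", str(pid)], timeout=10))
        lines.append(f"--- gdb backtrace, pid {pid} ---")
        lines.append(run_tool(
            ["gdb", "-batch", "-ex", "thread apply all bt", "-p", str(pid)],
            timeout=20, cap=GDB_OUTPUT_CAP,
        ))
    return "\n".join(lines)
```

Returning a string rather than raising keeps the kill path unchanged: a caller can append the result to whatever diagnostic text it already attaches to the report, and a missing tool shows up as a one-line note.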

3. Reverts kitless newton changes

app_launcher.py — Reverts to simple self._app.close() callback (removes SIGHUP handler, removes contextlib.suppress + sys.exit(128+signum) pattern)
schemas.py — Removes the _create_fixed_joint_to_world() helper, reverts to from omni.physx.scripts import utils import
simulation_context.py — device param reverts to "cuda:0" default, removes the explicit-device-override logic
Newton test files — Restore AppLauncher boot, test_devices() usage, and MULTI_GPU_SKIP_REASON marker
Changelog entries for the reverted features are deleted

4. `MULTI_GPU_SKIP_REASON` in newton test_articulation.py

Re-added with documentation that it should be removed once PR #5883 (kitless newton conversion) lands separately.

Assessment

✅ Good separation of concerns — The kitless newton approach (which is a larger refactor) is being split into its own PR (#5883), while this PR focuses on what it says in the title: multi-GPU CI infrastructure (dynamic work-stealing + hang diagnostics).

✅ Hang diagnostics are well-implemented — _capture_hang_stacks() is robust: handles missing tools, process enumeration failures, and output size limits gracefully. The --cap-add=SYS_PTRACE addition is necessary and minimal.

✅ Clean revert — The AppLauncher signal handling reverts to the simpler pre-existing pattern. The build_simulation_context device override is removed (presumably moving to #5883 as well).

⚠️ Concern: AppLauncher signal handling regression — The previous commit introduced valuable hardening (SIGHUP handler, sys.exit(128+signum) to prevent post-handler resume, contextlib.suppress for robustness in signal context). Reverting this means the Kit half-torn-down issue persists. Presumably this will be addressed in #5883 or a dedicated fix, but worth tracking.

⚠️ build_simulation_context device-override revert — The device kwarg no longer overrides sim_cfg.device when both are passed. The original bug (warp kernel device-mismatch on non-default GPUs) may resurface in multi-GPU test scenarios. The test changes confirm this is intentional ("values don't get overridden"), but callers passing both kwargs will silently get sim_cfg's device.

Outstanding Pre-merge Items

Item	Status	Notes
`run_docker_tests: 'false'`	⚠️ Still hardcoded	build.yaml — revert before final merge
45-minute timeout	ℹ️ Diagnostic	Can tighten to 30 once validated
`MULTI_GPU_SKIP_REASON` in newton test	ℹ️ Temporary	Drops when #5883 lands
AppLauncher SIGHUP / exit hardening	⚠️ Reverted	Track for #5883 or separate PR
`build_simulation_context` device override	⚠️ Reverted	Track for #5883 or separate PR

Verdict

LGTM for the focused scope ✅ — The PR now cleanly addresses its stated goal: multi-GPU CI improvements with hang diagnostics. The kitless newton refactor is correctly being separated into #5883. The diagnostic stack capture is a strong addition that will make Kit shutdown hangs observable. The reverts are intentional scope reduction, not regressions (assuming the features land via #5883).

Update (commit 6f09e08): Adds ISAACLAB_PIN_KIT_GPU environment variable support — a clean, targeted fix for the Kit multi-GPU renderer interference that was causing the shutdown hangs.

Changes

1. `app_launcher.py` — New `ISAACLAB_PIN_KIT_GPU` env var handling

When ISAACLAB_PIN_KIT_GPU is set to a truthy value (anything not in {"", "0", "false", "no", "off"}), appends three Kit command-line flags:

--/renderer/multiGpu/enabled=False
--/renderer/multiGpu/autoEnable=False
--/renderer/multiGpu/maxGpuCount=1

This pins each Kit process to its assigned GPU only, preventing the shared cubric / PhysX-fabric GPU-interop context that was causing [Error] [omni.physx.plugin] Stage X already attached and SimulationApp.close hangs.

Implementation is clean: Placed logically after the _resolve_device_settings method's existing GPU assignment. The truthy-value check handles common falsy strings. Logger info message aids debugging.
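
As a rough illustration of the truthy check and flag append described above, the sketch below uses hypothetical helper names (`env_is_truthy`, `kit_args_with_pinning`); the stripping and lower-casing of the value is an assumption, not something confirmed by the diff. The falsy set and the three flags are taken from this review.

```python
import os

# Values treated as "off", as listed in the review.
FALSY_VALUES = {"", "0", "false", "no", "off"}

# Kit command-line flags appended when pinning is requested.
PIN_KIT_GPU_FLAGS = (
    "--/renderer/multiGpu/enabled=False",
    "--/renderer/multiGpu/autoEnable=False",
    "--/renderer/multiGpu/maxGpuCount=1",
)


def env_is_truthy(name, environ=None):
    """Return True unless the variable is unset or holds one of the falsy strings."""
    environ = os.environ if environ is None else environ
    # Assumption: surrounding whitespace and letter case are ignored.
    return environ.get(name, "").strip().lower() not in FALSY_VALUES


def kit_args_with_pinning(base_args, environ=None):
    """Return a copy of base_args, extended with the pin flags when the env var is truthy."""
    args = list(base_args)
    if env_is_truthy("ISAACLAB_PIN_KIT_GPU", environ):
        args.extend(PIN_KIT_GPU_FLAGS)
    return args
```

Passing `environ` explicitly keeps the helper testable without mutating the real process environment.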

2. Workflow (`.github/workflows/test-multi-gpu-pytest.yaml`)

Adds -e ISAACLAB_PIN_KIT_GPU=1 to the Docker run command. Simple, non-invasive.

3. `test_articulation.py` (isaaclab_physx)

Removes MULTI_GPU_SKIP_REASON — re-enabling the test in concurrent multi-GPU CI. With Kit pinned to a single GPU per process, the concurrency bug should no longer trigger.

4. Changelog entry (`jichuanh-mgpu-pin-kit-resources.rst`)

Well-written: documents the env var, explains the Kit renderer behavior it overrides, references the issue (#3475) and NVBug, and clarifies it's off by default.

Assessment

✅ Excellent root-cause fix — Instead of working around the problem (shard caps, container isolation, removing Kit entirely), this directly addresses the Kit renderer's multi-GPU enumeration that was causing cross-process interference. Kelly Guo's documented WAR is properly attributed.

✅ Opt-in design — Off by default, only activated in CI via env var. Single-GPU and user-facing rendering paths are completely unchanged. No risk to existing workflows.

✅ Minimal footprint — 4 lines of functional code in app_launcher.py + 1 env var in the workflow. Clean and auditable.

✅ Enables test re-enablement — The physx test_articulation.py can now run in concurrent multi-GPU CI since the underlying cause is addressed.

Minor Observations

Comment references internal person — The inline comment in app_launcher.py mentions "Kelly Guo's documented WAR (#omni-kit thread, 2024-2025)". Consider whether this internal reference is appropriate for an open-source codebase, or if just the NVBug reference suffices.
Newton test_articulation.py unchanged — The newton equivalent still has MULTI_GPU_SKIP_REASON (presumably waiting for #5883 kitless conversion). This is fine and consistent with the PR's scope.

Outstanding Pre-merge Items (Updated)

Item	Status	Notes
`run_docker_tests: 'false'`	⚠️ Still hardcoded	build.yaml — revert before final merge
45-minute timeout	ℹ️ Diagnostic	Can tighten to 30 once validated
Newton `MULTI_GPU_SKIP_REASON`	ℹ️ Expected	Drops when #5883 lands
AppLauncher SIGHUP / exit hardening	ℹ️ Tracked	For #5883 or separate PR
`build_simulation_context` device override	ℹ️ Tracked	For #5883 or separate PR

Verdict

LGTM ✅ — This is the cleanest solution in the PR's evolution. Rather than elaborate workarounds, it applies the known Kit WAR (disable multi-GPU renderer enumeration) via an opt-in env var. The implementation is minimal, well-documented, and correctly scoped. Ready for merge pending CI validation of the re-enabled physx test.

Most test callers pass both ``sim_cfg=`` and ``device=`` to :func:`isaaclab.sim.build_simulation_context`, implicitly expecting the ``device`` kwarg to win. The helper previously dropped the kwarg silently when ``sim_cfg`` was provided, causing warp kernel-launch device mismatches on non-default GPUs: the test fixture allocated ``env_ids`` on the requested device while the articulation's ``self.device`` resolved from the untouched ``sim_cfg`` default (``cuda:0``), and ``wp.launch(..., device=self.device)`` failed with::

    RuntimeError: Error launching kernel 'set_root_link_pose_to_sim_index', trying to launch on device='cuda:0', but input array for argument 'env_ids' is on device=cuda:2.

Change ``device``'s default to ``None`` (sentinel) and apply it as an override after sim_cfg construction in both branches. The one test that asserted the old "sim_cfg overrides everything" contract is updated to cover the new override semantics.
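
The sentinel-and-override pattern described in this commit message might look like the sketch below. `SimCfg` and `resolve_sim_cfg` are hypothetical stand-ins for illustration only; they are not the real `SimulationCfg` or `build_simulation_context` signatures.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SimCfg:
    """Hypothetical stand-in for a simulation config with a device field."""
    device: str = "cuda:0"
    dt: float = 0.01


def resolve_sim_cfg(
    sim_cfg: Optional[SimCfg] = None,
    device: Optional[str] = None,
    dt: float = 0.01,
) -> SimCfg:
    """Build or reuse a config, then let an explicit device kwarg win."""
    if sim_cfg is None:
        # Branch 1: no config supplied, construct one from the loose kwargs.
        sim_cfg = SimCfg(dt=dt)
    # Branch 2 falls through with the caller's config untouched so far.
    # None is the sentinel for "caller did not ask for a device".
    if device is not None:
        sim_cfg.device = device
    return sim_cfg
```

Note that this sketch mutates the caller's config in place; whether the real helper copies the config first is not stated in the commit message.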

Drop the MULTI_GPU_SKIP_REASON marker from both the newton and physx test_articulation variants so they participate in dynamic 3-shard multi-GPU pytest again. Pairs with the cherry-picked device-kwarg fix to validate whether the Kit lifecycle hang is exacerbated by the device-drift bug. If the multi-GPU pytest workflow now holds up consistently across re-runs, the upstream Kit issue may not require the file-level skip.

Two coupled bugs in :class:`isaaclab.app.AppLauncher`:

1. SIGHUP was unhandled. Kit launches with ``--/app/installSignalHandlers=0``, so when a controlling session leader exits (e.g. the parent shell that supervises sibling shards in multi-GPU CI), child Kit processes receive SIGHUP with default disposition: terminate. ``_atexit_close`` does not run, so ``SimulationApp.close`` is skipped and USD/PhysX state is left attached. The next sibling shard then trips ``[Error] [omni.physx.plugin] Stage X already attached`` and Kit shutdown subsequently hangs on the orphan's state. Register the same handler used for SIGTERM/SIGABRT/SIGSEGV.

2. ``_abort_signal_handle_callback`` swallowed the signal's terminate semantics. After calling ``self._app.close()`` it returned, so Python resumed execution past the signal as if nothing happened. The replaced OS-default disposition would have killed the process; the Python handler did not. Wrap ``_app.close()`` in ``contextlib.suppress(Exception)`` and call ``sys.exit(128 + signum)`` to preserve the conventional signal-exit encoding and actually terminate.
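
A minimal sketch of the handler shape described above, assuming a generic `app` object with a `close()` method. `make_abort_handler` and `install_abort_handlers` are illustrative names, not the AppLauncher methods.

```python
import contextlib
import signal
import sys


def make_abort_handler(app):
    """Return a signal handler that closes the app, then exits with 128 + signum."""
    def handler(signum, frame):
        # A failing close() must not prevent termination.
        with contextlib.suppress(Exception):
            app.close()
        # Without this, Python would resume whatever it was doing before the signal.
        sys.exit(128 + signum)
    return handler


def install_abort_handlers(app):
    """Register one handler for SIGHUP (where available) and the other abort signals."""
    handler = make_abort_handler(app)
    names = ("SIGHUP", "SIGTERM", "SIGABRT", "SIGSEGV")
    for name in names:
        sig = getattr(signal, name, None)  # SIGHUP does not exist on Windows
        if sig is not None:
            signal.signal(sig, handler)
    return handler
```

`sys.exit` raises `SystemExit`, which `contextlib.suppress(Exception)` does not swallow, so the exit code follows the usual shell convention (for example 143 for SIGTERM).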

Pin shard_count to min(available, 2) to test whether the Kit lifecycle hang (SIGHUP cascade + "Stage already attached" + 52s shutdown hang on test_articulation) only manifests at 3+ concurrent Kit processes. Local 3-MIG repro on Horde passes cleanly (hardware-isolated MIG slices); CI 3-shard on shared-GPU runners fails consistently. This commit narrows the failure window so the data tells us:

* 2-shard CI green and consistent -> 3+ is the concurrency threshold; isolation layer or CUDA_VISIBLE_DEVICES per-shard is the fix.
* 2-shard CI still flaky -> something other than process count is the trigger; deeper investigation needed.

Revert after the data is collected.

Two changes in one commit (paired diagnostic):

1. Restore the dynamic shard_count = N-1 computation; the 2-shard cap diagnostic is being superseded by this run.
2. Replace ``--gpus all`` with ``--gpus device=$cuda`` so each shard container sees only one physical GPU. Mirrors the hardware-level isolation that MIG provides on the Horde 3-shard local repro (which passes cleanly), and removes the cross-process GPU visibility that the multi-GPU CI runner currently allows.

Hypothesis: the SIGHUP cascade + "Stage already attached" pattern only fires when sibling Kit processes can see each other's GPUs and share host driver state. If this commit's CI is green, isolation is the fix and we make this permanent. Revert after the data is collected.

Paired with the previous ``--gpus device=$cuda`` isolation diagnostic. With per-shard GPU isolation, each container sees exactly one physical GPU and it appears as ``cuda:0`` inside the container. The previous ``ISAACLAB_TEST_DEVICES=$runtime_devices`` (e.g. ``"0001"`` for cuda:2) and ``ISAACLAB_SIM_DEVICE=cuda:$cuda`` (e.g. ``cuda:2``) tried to use indices the container can no longer see, so collection failed::

    ValueError: ISAACLAB_TEST_DEVICES='0001' names no device available on this host (available: ['cpu', 'cuda:0'])

Set both to ``cuda:0``/``01`` unconditionally. The work queue still distributes files across the 3 shards so each physical GPU exercises a different slice.
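
To make the failure concrete, here is one reading of the mask format that is consistent with the two examples in this message (``"0001"`` for cuda:2, ``"01"`` for cuda:0): position 0 stands for ``cpu`` and position i for ``cuda:(i-1)``. That layout is inferred, not confirmed, and `decode_device_mask` is a hypothetical helper rather than the project's parser.

```python
from __future__ import annotations


def decode_device_mask(mask: str, available: list[str]) -> list[str]:
    """Map a positional 0/1 mask to device names and keep those present on the host."""
    # Assumed slot layout: index 0 is cpu, index i (i >= 1) is cuda:(i-1).
    slots = ["cpu"] + [f"cuda:{i}" for i in range(max(len(mask) - 1, 0))]
    requested = [dev for bit, dev in zip(mask, slots) if bit == "1"]
    usable = [dev for dev in requested if dev in available]
    if not usable:
        raise ValueError(
            f"ISAACLAB_TEST_DEVICES={mask!r} names no device available on this host "
            f"(available: {available})"
        )
    return usable
```

Under this reading, a container started with ``--gpus device=2`` exposes only ``['cpu', 'cuda:0']``, so the host-side mask ``"0001"`` selects a slot that no longer exists, while ``"01"`` resolves to the container's lone GPU.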

Restore the MULTI_GPU_SKIP_REASON marker on the physx variant only. Newton test_articulation drops AppLauncher entirely via PR isaac-sim#5883, so it runs cleanly under concurrent multi-GPU. The physx variant must still boot Kit for omni.physics; under 3-shard concurrent CI runners (shared GPU visibility) Kit's shutdown hangs >52s, causing SIGHUP cascade across sibling shards and "Stage already attached" errors. Cross-linked upstream at IsaacLab isaac-sim#3475 / OMPE-43816 (deferred past Isaac Sim 5.0 per the engineering thread).

Bundles the kitless conversion of newton test_articulation + test_rigid_object_collection into the dynamic-sharding branch so the multi-GPU CI workflow actually exercises a non-Kit-booted newton test_articulation alongside the physx skip. Will rebase away when isaac-sim#5883 lands. Includes the universal schemas.py fix (``_create_fixed_joint_to_world`` replaces unguarded ``omni.physx.scripts.utils.createJoint``) and the .skip changelog fragments for the test-only packages.

Adds ``_capture_hang_stacks(pid, pgid, kill_reason)`` and calls it from the hang-detection path (startup_hang / shutdown_hang / timeout) before SIGKILL erases the evidence. Captures:

* ``py-spy dump --pid`` -> Python frames showing where Python code is parked inside ``app.close()`` or pytest teardown.
* ``gdb -batch -ex "thread apply all bt" -p`` -> C++ frames inside ``omniverse_kit`` / ``omni.physx.plugin`` / CUDA driver binaries. Critical because Kit core is closed source: without this we have no way to localize the hang in IsaacLab isaac-sim#3475 / OMPE-43816.

Walks the entire process group (capped at 8 pids) so any Kit extension helper child that's the actual culprit is also dumped. Each tool is optional: missing py-spy or gdb is reported inline rather than failing the diagnostic capture.

No behavioral change to passing runs. Output lands in the same ``pre_kill_diag`` block that already gets attached to the JUnit error report when a kill fires.

Two small additions to make ``tools/conftest.py``'s hang capture actually work in CI:

* ``--cap-add=SYS_PTRACE`` on the per-shard ``docker run``: required for ``py-spy dump`` and ``gdb -p`` to attach to the hung Kit process. Without it both tools come back as "Permission Denied" (verified locally on a synthetic hung subprocess).
* ``py-spy`` added to the in-container ``pip install`` list so the capture function can find it on PATH. ``gdb`` is already present in the ECR image.

The capture is gated by the existing ``shutdown_hang`` / ``startup_hang`` / ``timeout`` detection in conftest, so on green runs neither tool is invoked.

After dropping the cherry-pick of the kitless newton conversion to keep this PR scoped to CI infra, the newton variant of test_articulation once again boots Kit at module level and is subject to the same concurrent-Kit shutdown hang / SIGHUP cascade as the physx variant. Restore the ``MULTI_GPU_SKIP_REASON`` marker on the newton variant so the multi-GPU discover-step filter excludes it. The marker comment points at isaac-sim#5883, which removes the AppLauncher boot from this file and lets the kitless SimulationContext path carry the test. After isaac-sim#5883 lands and this PR rebases on develop, the marker can be dropped in the same commit that re-enables it. Both newton and physx test_articulation are now consistently skipped from multi-GPU; both still run in single-GPU CI.

The default ``apps/isaaclab.python.headless.kit`` sets ``renderer.multiGpu.enabled = true`` + ``renderer.multiGpu.autoEnable = true``, so each Kit process enumerates every visible GPU at startup. Under concurrent multi-GPU CI shards (``--gpus all`` per container, one Kit per non-default cuda device), that produces a shared cubric / PhysX-fabric GPU-interop context across sibling processes -- surfacing as ``[Error] [omni.physx.plugin] Stage X already attached`` mid-test and ``SimulationApp.close`` hanging >52s in teardown. Tracked upstream at IsaacLab isaac-sim#3475 / NVBug 5687364.

Kelly Guo's documented WAR (#omni-kit thread, 2024-2025): set ``renderer.multiGpu.enabled = false`` + ``maxGpuCount = 1`` so each Kit only touches its assigned GPU.

Adds opt-in ``ISAACLAB_PIN_KIT_GPU`` env var. When truthy, AppLauncher appends three flags to the Kit command line:

- ``--/renderer/multiGpu/enabled=False``
- ``--/renderer/multiGpu/autoEnable=False``
- ``--/renderer/multiGpu/maxGpuCount=1``

Off by default; single-GPU and user-facing rendering paths are unchanged. CI workflows that need bounded resource visibility set ``ISAACLAB_PIN_KIT_GPU=1`` on the runner.

Local validation: Blackwell hardware (current Horde) does not reproduce the upstream hang due to MIG topology limitations (only 3 torch-visible cuda devices), so the change is shipped as a CI A/B hypothesis test rather than a verified fix. The implementation is small, opt-in, and reversible.

…p marker

Wires the new ``ISAACLAB_PIN_KIT_GPU`` env var (from the cherry-picked mgpu-pin-kit-resources commit) into the per-shard ``docker run`` and re-enables physx test_articulation in the multi-GPU lane by dropping its MULTI_GPU_SKIP_REASON marker.

Direct CI A/B for Kelly Guo's documented WAR: if the upstream cubric / PhysX-fabric GPU-interop race on shared CUDA contexts is the trigger for the 52s shutdown_hang + SIGHUP cascade observed in run 26698100037, pinning each Kit to a single GPU should clear it. Three consecutive green runs on the same SHA would confirm.

Per the per-PR minimum-needed analysis:

- isaac-sim#5886 (bounded shutdown) is closed (audit verdict nice-to-have; isaac-sim#5933 prevents the hang upstream so the force-exit timer is moot). Reverts SIGHUP handler + ISAACLAB_FORCE_EXIT_TIMEOUT timer in AppLauncher; drops the workflow env var.
- isaac-sim#5883 (kitless newton) kept open as a separate PR but left out of this diagnostic bundle to test whether isaac-sim#5933 alone is enough for newton test_articulation (which calls build_simulation_context(sim_cfg=, device=) at line 2427, so still needs isaac-sim#5881 for the cross-device kwarg fix). Reverts the newton test_articulation kitless conversion and the schemas.py _create_fixed_joint_to_world helper.

Bundle now contains: isaac-sim#5823 + isaac-sim#5875 base + isaac-sim#5881 + isaac-sim#5933 + the JUnit XML path-collision fix in conftest. If green, confirms only 4 PRs are needed for multi-GPU CI green (with test_articulation un-gated).

hujc7 added 30 commits May 26, 2026 23:10

Install pytest in multi-GPU pytest workflow

d6b2934

pytest is not pulled in by --install none or by isaacsim[all,extscache]. Runner state was masking this; pin it explicitly.

Add changelog fragment for multi-GPU CI helpers

6eeda64

Run docker container as host uid:gid

c45604c

Image's default USER is isaaclab (uid 1000), which doesn't own the volume-mounted host workspace, so it can't ln -s _isaac_sim (perm denied) — falling back to PATH python3 which doesn't exist in the image, hence pytest exit 127.

Point HOME / XDG_CACHE_HOME at a writable tmp dir

6706690

Running container as host uid:gid means the image's default /root home is not writable, so Warp/numpy/pip cache writes hit PermissionError [Errno 13] '/root/.cache'. Mount a fresh tmp dir and point HOME + XDG_CACHE_HOME at it.

Revert "TEMP: force run_docker_tests=false while iterating multi-GPU CI"

5d29bb0

This reverts commit 665f0c3.

TEMP: re-apply run_docker_tests=false until PR 5823 lands

d6d69c4

Per ~/.claude/skills/pr/ci-iteration-shortcut.md. Keep the single-GPU Docker + Tests matrix disabled until iteration is over and the PR is ready to land. Revert as the last commit before merge.

Wrap long line in PhysX no-friction skip reason

c37b0f9

isaaclab-review-bot Bot reviewed May 30, 2026

Revert "TEMP: 2-shard cap workaround for Kit lifecycle bug"

4bc8b33

This reverts commit e7d1af7.

hujc7 mentioned this pull request May 30, 2026

[MGPU] Sim: honor device kwarg over sim_cfg.device in build_simulation_context #5881

Draft

hujc7 mentioned this pull request May 30, 2026

[MGPU] Tests: make newton-only tests kitless (drop AppLauncher boot) #5883

Draft

hujc7 mentioned this pull request May 31, 2026

[MGPU] App: bounded shutdown — SIGHUP handler + force-exit on hang #5886

Closed

hujc7 added 15 commits May 31, 2026 00:10

Revert "DIAGNOSTIC: address each shard's lone GPU as cuda:0 inside co…

02b6e2e

…ntainer" This reverts commit 5662e00.

Revert "DIAGNOSTIC: restore 3 shards + isolate GPUs via --gpus device=N"

098811d

This reverts commit 2b4530d.

Revert "DIAGNOSTIC: cap multi-GPU pytest to 2 shards"

7c68639

This reverts commit 7251b74.

Revert "Cherry-pick kitless newton tests (isaac-sim#5883) to validate…

5ef1e4e

… together" This reverts commit 6137b1b.

Revert "Handle SIGHUP and force exit in AppLauncher abort handler"

6fbe988

This reverts commit 7d04e41.

Revert "Honor device kwarg over sim_cfg.device in build_simulation_co…

5fc5289

…ntext" This reverts commit 3d3f136.

hujc7 changed the title ~~[CI] Cross-platform — Part 5: Dynamic work-stealing across multi-GPU shards~~ [CI] Multi-GPU pytest: dynamic work-stealing + shutdown-hang stack capture Jun 3, 2026

Add changelog entry for conftest stack capture

6744a2d

hujc7 mentioned this pull request Jun 3, 2026

[MGPU] App: pin Kit renderer to single GPU under ISAACLAB_PIN_KIT_GPU #5933

Draft

hujc7 added 2 commits June 3, 2026 02:52

hujc7 mentioned this pull request Jun 3, 2026

[DO-NOT-MERGE][MGPU][TEST] Integration bundle: device kwarg + kitless newton + bounded shutdown + Kit-pin #5934

Draft

[CI] Multi-GPU pytest: dynamic work-stealing + shutdown-hang stack capture #5875
hujc7 wants to merge 77 commits into isaac-sim:develop from hujc7:jichuanh/multi-gpu-dynamic-sharding
