[MGPU] App: pin Kit renderer to single GPU under ISAACLAB_PIN_KIT_GPU#5933
[MGPU] App: pin Kit renderer to single GPU under ISAACLAB_PIN_KIT_GPU#5933hujc7 wants to merge 1 commit into
Conversation
The default ``apps/isaaclab.python.headless.kit`` sets ``renderer.multiGpu.enabled = true`` + ``renderer.multiGpu.autoEnable = true``, so each Kit process enumerates every visible GPU at startup. Under concurrent multi-GPU CI shards (``--gpus all`` per container, one Kit per non-default cuda device), that produces a shared cubric / PhysX-fabric GPU-interop context across sibling processes -- surfacing as ``[Error] [omni.physx.plugin] Stage X already attached`` mid-test and ``SimulationApp.close`` hanging >52s in teardown. Tracked upstream at IsaacLab isaac-sim#3475 / NVBug 5687364. Kelly Guo's documented WAR (#omni-kit thread, 2024-2025): set ``renderer.multiGpu.enabled = false`` + ``maxGpuCount = 1`` so each Kit only touches its assigned GPU. Adds opt-in ``ISAACLAB_PIN_KIT_GPU`` env var. When truthy, AppLauncher appends three flags to the Kit command line: - ``--/renderer/multiGpu/enabled=False`` - ``--/renderer/multiGpu/autoEnable=False`` - ``--/renderer/multiGpu/maxGpuCount=1`` Off by default; single-GPU and user-facing rendering paths are unchanged. CI workflows that need bounded resource visibility set ``ISAACLAB_PIN_KIT_GPU=1`` on the runner. Local validation: Blackwell hardware (current Horde) does not reproduce the upstream hang due to MIG topology limitations (only 3 torch-visible cuda devices), so the change is shipped as a CI A/B hypothesis test rather than a verified fix. The implementation is small, opt-in, and reversible.
There was a problem hiding this comment.
🤖 IsaacLab Review Bot — PR #5933
[MGPU] App: pin Kit renderer to single GPU under ISAACLAB_PIN_KIT_GPU
✅ Overall Assessment: LGTM (minor suggestions)
This is a clean, well-scoped, opt-in workaround for a documented upstream Kit bug. The implementation is minimal, reversible, and placed correctly.
📋 Summary of Changes
| File | Change |
|---|---|
source/isaaclab/isaaclab/app/app_launcher.py |
+19 lines: env var check + 3 Kit CLI flags + log message |
source/isaaclab/changelog.d/jichuanh-mgpu-pin-kit-resources.rst |
+14 lines: changelog fragment |
👍 Strengths
- Opt-in by default — Zero risk to existing single-GPU or user-facing paths. The env var must be explicitly set.
- Correct placement — The pinning logic sits in
_resolve_device_settings()right afteractive_gpu/physics_gpuassignment, which is the logical place for renderer GPU configuration. - Robust truthiness parsing — The
not in {"", "0", "false", "no", "off"}pattern after.lower()covers common shell conventions. - Good logging —
logger.info(...)emits a clear signal when the pin is active. - Thorough PR description — Excellent traceability to upstream bugs (NVBug 5687364, #3475) and documented WAR source (Kelly Guo).
- Changelog present — Properly formatted RST fragment.
💡 Minor Suggestions (non-blocking)
-
Idempotency guard: If a user already passes
--/renderer/multiGpu/enabled=Falsemanually via CLI, this would append a duplicate. Consider checkingsys.argvfor existing overrides before appending:if not any("--/renderer/multiGpu/enabled" in arg for arg in sys.argv): sys.argv.append("--/renderer/multiGpu/enabled=False")
(Low priority — Kit likely handles duplicates gracefully by last-wins, but explicit > implicit.)
-
Consider logging the trigger source: The log message could include the env var value for debugging:
logger.info("ISAACLAB_PIN_KIT_GPU=%s: pinning Kit renderer to a single GPU", raw_val)
-
Test coverage: The PR description mentions smoke testing locally, but there's no unit test for the env var parsing logic. A small parametrized test (mocking
os.environand checkingsys.argvafter_resolve_device_settings) would be valuable for regression, especially since this interacts with CI infrastructure.
🔍 Potential Concerns
- CI validation pending — The PR body notes that the actual A/B test against the multi-GPU hang is still pending (cherry-pick onto #5875). This is appropriate for merge given it's opt-in, but the CI experiment should be tracked to confirm efficacy.
CI Status
- ✅ pre-commit, labeler, changelog check, build wheel: pass
- ⏳ Installation tests, Docker builds, docs: pending (expected for fresh push)
Verdict: Ship it. Clean opt-in workaround with proper documentation. The minor suggestions above are non-blocking improvements that could be addressed in a follow-up.
Per per-PR minimum-needed analysis: - isaac-sim#5886 (bounded shutdown) is closed (audit verdict nice-to-have; isaac-sim#5933 prevents the hang upstream so the force-exit timer is moot). Reverts SIGHUP handler + ISAACLAB_FORCE_EXIT_TIMEOUT timer in AppLauncher; drops the workflow env var. - isaac-sim#5883 (kitless newton) kept open as a separate PR but left out of this diagnostic bundle to test whether isaac-sim#5933 alone is enough for newton test_articulation (which calls build_simulation_context(sim_cfg=, device=) at line 2427, so still needs isaac-sim#5881 for the cross-device kwarg fix). Reverts the newton test_articulation kitless conversion and the schemas.py _create_fixed_joint_to_world helper. Bundle now contains: isaac-sim#5823 + isaac-sim#5875 base + isaac-sim#5881 + isaac-sim#5933 + the JUnit XML path-collision fix in conftest. If green, confirms only 4 PRs are needed for multi-GPU CI green (with test_articulation un-gated).
[MGPU] App: pin Kit renderer to single GPU under ISAACLAB_PIN_KIT_GPU
Summary
ISAACLAB_PIN_KIT_GPUenv var. When truthy,AppLauncherappends--/renderer/multiGpu/enabled=False,--/renderer/multiGpu/autoEnable=False,--/renderer/multiGpu/maxGpuCount=1to the Kit command line, pinning each Kit process to a single GPU.[Error] [omni.physx.plugin] Stage X already attachedmid-test andSimulationApp.closehanging >52s in teardown.1. Problem
The default
apps/isaaclab.python.headless.kitsets:So every Kit process enumerates every visible GPU at startup. AppLauncher pins
active_gpu+physics_gputo the requested device, butrenderer.multiGpu.enabledstays true.Under multi-GPU CI shards (
--gpus allper container, one Kit per non-default cuda device), three sibling Kit processes each scan all 4 visible GPUs. cubric's GPU-interop state and PhysX-fabric attach paths then race across processes on the same host. Direct evidence from CI run 26698100037:Same shape as NVBug 5687364 ("PhysX Fabric Manager SIGSEGV crash during multi-GPU distributed training on Kit 109", Gerald Gainant: "corrupted cuda context (crashed device) detected by cubric lib but occurring upstream in physics or fabric"). Tracked at #3475.
2. Fix
Kelly Guo's documented workaround from #omni-kit thread (Slack, 2024-2025):
Implementation: AppLauncher checks
ISAACLAB_PIN_KIT_GPU. When truthy (1,true,yes, etc.), three flags are appended tosys.argvbeforeSimulationAppboots:Off by default. Existing single-GPU CI and user code that wants multi-GPU rendering (rare for IsaacLab use cases) are untouched.
3. Validation
3.1 Local
Local Blackwell (RTX PRO 6000) cannot reproduce the upstream hang at the >=3 concurrent Kit threshold due to MIG topology limitations: only 3 torch-visible cuda devices (
cuda:0/cuda:1/cuda:2) are reachable inside one container, not the 4 needed to match CI's cuda:1/2/3 shard layout. Earlier audit's "3-MIG cross-GPU FAILS identically to CI" was on prior L40 hardware.Smoke test confirmed the env var parsing and Kit cmdline injection work correctly:
ISAACLAB_PIN_KIT_GPU={1,true,yes,TRUE}→ pin=True;={"", 0, false, no, off}→ pin=False.3.2 CI A/B (the actual validation)
Pending: cherry-pick this PR onto #5875's branch, add
-e ISAACLAB_PIN_KIT_GPU=1to the per-sharddocker run, temporarily drop oneMULTI_GPU_SKIP_REASONmarker, and observe whether the 52s shutdown hang + SIGHUP cascade clears.4. Non-scope
.kitfile; the change is opt-in only. Users running standalone Kit with multi-GPU rendering are unaffected.[MGPU] App: bounded shutdown): [MGPU] App: bounded shutdown — SIGHUP handler + force-exit on hang #5886 bounds how long the hang can stall teardown viaISAACLAB_FORCE_EXIT_TIMEOUT; this PR addresses the precondition that triggers the hang in the first place.