14 changes: 14 additions & 0 deletions source/isaaclab/changelog.d/jichuanh-mgpu-pin-kit-resources.rst
@@ -0,0 +1,14 @@
Added
^^^^^

* Added ``ISAACLAB_PIN_KIT_GPU`` env var for :class:`~isaaclab.app.AppLauncher`.
  When set to anything other than an empty string, ``0``, ``false``, ``no`` or ``off``
  (case-insensitive), appends ``--/renderer/multiGpu/enabled=False``,
  ``--/renderer/multiGpu/autoEnable=False`` and ``--/renderer/multiGpu/maxGpuCount=1``
  to the Kit command line so each Kit process touches only its assigned GPU
  (rather than enumerating every visible GPU at startup). Used by the multi-GPU
  CI workflow to keep sibling shards from sharing a cubric / PhysX-fabric
  GPU-interop context, which surfaces as ``[Error] [omni.physx.plugin] Stage X already attached``
  and ``SimulationApp.close`` hangs (see https://github.com/isaac-sim/IsaacLab/issues/3475
  and NVBug 5687364). Off by default; single-GPU and user-facing rendering
  paths are unchanged.
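For anyone who wants to try the flag outside CI, the entry above could be exercised roughly as follows. This is a sketch, not the CI workflow's actual launcher: `shard_env` is a hypothetical helper name, and the only detail taken from this PR is the `ISAACLAB_PIN_KIT_GPU` variable itself.

```python
import os


def shard_env(base_env=None):
    """Return a copy of an environment mapping with Kit GPU pinning enabled.

    Hypothetical helper for illustration; it is not part of Isaac Lab.
    """
    env = dict(os.environ if base_env is None else base_env)
    # Per this PR, any value outside {"", "0", "false", "no", "off"} enables pinning.
    env["ISAACLAB_PIN_KIT_GPU"] = "1"
    return env


base = {"PATH": "/usr/bin"}
pinned = shard_env(base)
print(pinned["ISAACLAB_PIN_KIT_GPU"])
```

The returned mapping can be passed as `env=` to `subprocess.run` when starting each shard's process. The variable has to be present before `AppLauncher` resolves its device settings, since that is where the change reads it.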
19 changes: 19 additions & 0 deletions source/isaaclab/isaaclab/app/app_launcher.py
@@ -1060,6 +1060,25 @@ def _resolve_device_settings(self, launcher_args: dict):
launcher_args["physics_gpu"] = self.device_id
launcher_args["active_gpu"] = self.device_id

# Pin Kit's renderer to a single GPU when ``ISAACLAB_PIN_KIT_GPU`` is
# truthy. The default ``apps/isaaclab.python.headless.kit`` sets
# ``renderer.multiGpu.enabled = true`` + ``renderer.multiGpu.autoEnable
# = true``, so each Kit process enumerates every visible GPU at
# startup. Under concurrent multi-GPU CI shards (``--gpus all`` per
# container, one Kit per non-default cuda device), that produces a
# shared cubric / PhysX-fabric GPU-interop context across sibling
# processes -- surfacing as ``[Error] [omni.physx.plugin] Stage X
# already attached`` mid-test and ``SimulationApp.close`` hanging
# >52s in teardown (see https://github.com/isaac-sim/IsaacLab/issues/3475
# and NVBug 5687364). Kelly Guo's documented WAR (#omni-kit thread,
# 2024-2025): set ``renderer.multiGpu.enabled = false`` + ``maxGpuCount
# = 1`` so each Kit only touches its assigned GPU.
if os.environ.get("ISAACLAB_PIN_KIT_GPU", "0").lower() not in {"", "0", "false", "no", "off"}:
    sys.argv.append("--/renderer/multiGpu/enabled=False")
    sys.argv.append("--/renderer/multiGpu/autoEnable=False")
    sys.argv.append("--/renderer/multiGpu/maxGpuCount=1")
    logger.info("ISAACLAB_PIN_KIT_GPU enabled: pinning Kit renderer to a single GPU")

# Defer importing torch until after SimulationApp starts. Importing
# torch can import NumPy/OpenBLAS, whose at-fork handlers can crash
# Kit's platform-info fork during startup.
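
The environment check in this hunk can be read as a small pure function. The sketch below mirrors it outside `AppLauncher` so the truthiness rule is easy to test in isolation. `pin_kit_gpu_args` and the two constant names are hypothetical, and the function works on a passed-in list instead of mutating `sys.argv`.

```python
# Values treated as "off", copied from the check in the hunk above.
FALSY_VALUES = {"", "0", "false", "no", "off"}

# Kit settings appended when pinning is enabled, copied from the hunk above.
PIN_FLAGS = [
    "--/renderer/multiGpu/enabled=False",
    "--/renderer/multiGpu/autoEnable=False",
    "--/renderer/multiGpu/maxGpuCount=1",
]


def pin_kit_gpu_args(argv, environ):
    """Return a new argument list with the pin flags appended when enabled.

    Hypothetical stand-in for the branch in ``_resolve_device_settings``.
    """
    # An unset variable falls back to "0", which is in the deny-list.
    value = environ.get("ISAACLAB_PIN_KIT_GPU", "0").lower()
    if value not in FALSY_VALUES:
        return list(argv) + PIN_FLAGS
    return list(argv)


print(pin_kit_gpu_args(["train.py"], {"ISAACLAB_PIN_KIT_GPU": "TRUE"}))
print(pin_kit_gpu_args(["train.py"], {"ISAACLAB_PIN_KIT_GPU": "Off"}))
print(pin_kit_gpu_args(["train.py"], {}))
```

Because the check is a deny-list, any string outside it enables pinning, including values such as `2` or `disabled`. The value is lowercased but not stripped, so a value with surrounding whitespace (for example `" 0"`) also counts as truthy.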