fix(gpu): select one CDI GPU by default for Docker and Podman #1477

@elezar

Description

Update Docker and Podman GPU sandbox defaults so --gpu prefers one CDI GPU device instead of defaulting to nvidia.com/gpu=all.

This is part of the GPU roadmap in #1444. --gpu requests the active driver's default GPU behavior; for GPU-enabled drivers, that default should inject or allocate one suitable GPU when the runtime supports individual device selection.

Context

Parent roadmap: #1444

Current local-container behavior maps a GPU request with no explicit gpu_device to nvidia.com/gpu=all through the shared CDI helper. That makes Docker and Podman inconsistent with Kubernetes and VM behavior, where a default GPU request maps to one GPU.

Docker has priority for implementation because OpenShell's Docker GPU path and CDI discovery are more mature today. Podman should be handled in the same task, but may require additional runtime support or an out-of-band CDI device discovery path. Upstream Podman behavior such as podman-container-tools/podman#28712 may be relevant.

Proposed Scope

  • Define local-container default GPU selection semantics for Docker and Podman.
  • Change Docker default --gpu behavior to prefer one CDI GPU device instead of nvidia.com/gpu=all.
  • Change Podman default --gpu behavior to prefer one CDI GPU device instead of nvidia.com/gpu=all.
  • Prefer runtime-reported CDI inventory when available.
  • Preserve explicit --gpu-device behavior as a driver-native advanced option.
  • Do not add multi-GPU count support in this task.
  • Do not require OpenShell-managed GPU assignment/exclusivity tracking in this task.

Target Behavior

Default GPU selection should use this order:

  1. If the runtime reports individual CDI GPU devices, select one individual device.
  2. If reliable CDI inventory is unavailable but individual device IDs are expected to work, fall back to nvidia.com/gpu=0.
  3. If the runtime/platform only reports or supports nvidia.com/gpu=all, such as some WSL2-based setups, use nvidia.com/gpu=all as a compatibility fallback.
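
The selection order above can be sketched as a small pure function. This is only an illustration, not the OpenShell implementation: the function name `select_default_gpu`, the `inventory` and `individual_ids_expected` parameters, and the choice to sort and take the first device are all assumptions made for the example. The issue only requires that one individual device is selected.

```python
from typing import Iterable, Optional

ALL_GPUS = "nvidia.com/gpu=all"
FIRST_GPU = "nvidia.com/gpu=0"
GPU_PREFIX = "nvidia.com/gpu="


def select_default_gpu(
    inventory: Optional[Iterable[str]],
    individual_ids_expected: bool = True,
) -> str:
    """Pick the CDI device for a --gpu request with no explicit device.

    inventory: CDI device names reported by the runtime, or None when no
        reliable inventory is available (hypothetical input shape).
    individual_ids_expected: hypothetical flag saying whether IDs such as
        nvidia.com/gpu=0 are expected to work on this platform.
    """
    if inventory is not None:
        gpus = [name for name in inventory if name.startswith(GPU_PREFIX)]
        # Sorting is an assumption, used only to make the choice deterministic.
        individual = sorted(name for name in gpus if name != ALL_GPUS)
        if individual:
            return individual[0]  # 1. runtime reports individual devices
        if ALL_GPUS in gpus:
            return ALL_GPUS  # 3. runtime only reports the "all" device
    # Assumption: an inventory with no GPU entries is treated like a
    # missing inventory, since the issue does not define this case.
    if individual_ids_expected:
        return FIRST_GPU  # 2. no reliable inventory, individual IDs expected
    return ALL_GPUS  # 3. compatibility fallback
```

For example, `select_default_gpu(["nvidia.com/gpu=all", "nvidia.com/gpu=0", "nvidia.com/gpu=1"])` returns `nvidia.com/gpu=0`, while an inventory that lists only `nvidia.com/gpu=all` returns `nvidia.com/gpu=all`.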

Additional behavior:

  • openshell sandbox create --gpu ... on Docker injects one CDI GPU device when individual device selection is available.
  • openshell sandbox create --gpu ... on Podman injects one CDI GPU device when individual device selection is available.
  • openshell sandbox create --gpu --gpu-device nvidia.com/gpu=0 ... continues to pass the explicit CDI device ID through.
  • The fallback to nvidia.com/gpu=all should be intentional and documented, not the default for platforms with individual device selection.
  • Non-zero gpu_count remains unsupported unless a driver explicitly implements count-based allocation.
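
One possible way to express the rules above is shown here, again as a sketch under stated assumptions. The `GpuRequest` type, its `enabled` field, and `resolve_cdi_devices` are hypothetical names; only `gpu_device` and `gpu_count` appear in this issue. The default device is passed in so that the selection policy stays separate from request handling.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class GpuRequest:
    """Hypothetical view of the GPU options on a sandbox create request."""
    enabled: bool = False             # --gpu was passed
    gpu_device: Optional[str] = None  # --gpu-device value, if any
    gpu_count: int = 0                # count-based allocation, unsupported here


def resolve_cdi_devices(request: GpuRequest, default_device: str) -> List[str]:
    """Return the CDI device IDs to hand to the container runtime."""
    if not request.enabled:
        return []
    if request.gpu_count != 0:
        # Non-zero gpu_count stays unsupported unless a driver implements it.
        raise ValueError("gpu_count is not supported by this driver")
    if request.gpu_device:
        # An explicit device ID is passed through unchanged.
        return [request.gpu_device]
    # No explicit device: use the driver's default selection.
    return [default_device]
```

With this shape, `--gpu --gpu-device nvidia.com/gpu=1` resolves to `["nvidia.com/gpu=1"]` regardless of the default, and a bare `--gpu` resolves to whatever single device the default selection produced.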

Out of Scope

This task fixes default GPU device selection cardinality. It does not require OpenShell to track active GPU assignments or prevent two OpenShell sandboxes from selecting the same default GPU.

If multiple sandboxes are created concurrently, selecting the same default fallback device is acceptable until a separate allocation/exclusivity task is implemented.

Open Questions

  • Where should CDI inventory discovery live: shared OpenShell core helper, driver-specific code, or both?
  • What should Podman use as the authoritative CDI device inventory source before runtime-level enumeration is reliable?
  • Should assignment/exclusivity tracking be added later at the driver level or as part of a broader resource allocation model?

Definition of Done

  • Docker default --gpu prefers one individual CDI GPU device when available.
  • Podman default --gpu prefers one individual CDI GPU device when available.
  • If reliable CDI inventory is unavailable and individual IDs are expected to work, default selection falls back to nvidia.com/gpu=0.
  • If individual selection is unavailable, nvidia.com/gpu=all remains available as a documented compatibility fallback.
  • Explicit --gpu-device pass-through behavior is preserved for Docker and Podman.
  • Tests cover individual-device default selection, fallback selection, and explicit device pass-through.
  • Docs describe Docker/Podman default GPU behavior, compatibility fallback behavior, and --gpu-device as an advanced driver-native option.

Metadata

Labels

  • state:agent-ready: Approved for agent implementation
  • state:pr-opened: PR has been opened for this issue

Projects

Status
In progress
