
fix(gpu): select single CDI GPU defaults #1675

Open: elezar wants to merge 1 commit into main from fix/1477-single-cdi-gpu-defaults-elezar


elezar (Member) commented Jun 2, 2026

🏗️ build-from-issue-agent

Summary

Implement driver-owned CDI GPU default selection for Docker and Podman. Bare --gpu requests now resolve to one NVIDIA CDI device from driver inventory, while explicit --gpu-device values pass through unchanged.
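The split between bare and explicit requests described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `GpuRequest`, `resolve`, and the `next` index parameter are hypothetical names, and the real drivers may validate or represent requests differently.

```rust
/// Hypothetical request shape; not taken from the OpenShell codebase.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum GpuRequest {
    /// Bare `--gpu`: the driver picks one device from its inventory.
    Default,
    /// `--gpu-device <name>`: the caller names the device(s) explicitly.
    Explicit(Vec<String>),
}

/// Resolve a request to the CDI device names handed to the runtime.
///
/// `inventory` is the driver's list of NVIDIA CDI device names and `next`
/// is the index of the next default pick (assumed to come from a cursor).
pub fn resolve(
    request: &GpuRequest,
    inventory: &[String],
    next: usize,
) -> Result<Vec<String>, String> {
    match request {
        // Explicit values pass through unchanged, without consulting inventory.
        GpuRequest::Explicit(devices) => Ok(devices.clone()),
        // A bare request resolves to exactly one device from inventory.
        GpuRequest::Default => {
            if inventory.is_empty() {
                return Err("no NVIDIA CDI devices in driver inventory".to_string());
            }
            Ok(vec![inventory[next % inventory.len()].clone()])
        }
    }
}
```

In this sketch an empty inventory is an error for bare requests only; whether the actual drivers fail or fall back in that case is not stated in the description.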

Related Issue

Closes #1477

Changes

  • crates/openshell-core/src/gpu.rs: Added normalized CDI GPU inventory, naming-family selection, and a concurrency-safe round-robin cursor.
  • crates/openshell-driver-docker/: Uses Docker DiscoveredDevices for NVIDIA CDI inventory, keeps CDISpecDirs as the CDI support gate, and peeks at the default during validation but consumes it only on create.
  • crates/openshell-driver-podman/: Maps local /dev/nvidiaN nodes to nvidia.com/gpu=N inventory and selects defaults through the same round-robin helper.
  • e2e/rust/tests/gpu_device_selection.rs: Updates default GPU expectations from all-GPU to selected-single-GPU semantics.
  • Docs and driver READMEs: Document Docker and Podman inventory behavior and Podman remote inventory limitations.
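For readers unfamiliar with the pattern, the round-robin cursor with peek and consume semantics, along with the Podman device-node mapping, might look something like the sketch below. All names here (`RoundRobinCursor`, `peek`, `consume`, `cdi_name_for_node`) are assumptions made for illustration; the actual helper in crates/openshell-core/src/gpu.rs may be structured differently.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Hypothetical round-robin cursor; names are illustrative only.
#[derive(Debug, Default)]
pub struct RoundRobinCursor {
    next: AtomicUsize,
}

impl RoundRobinCursor {
    /// Return the device the next `consume` would pick, without advancing.
    /// Under concurrency the answer may already be stale when it is used.
    pub fn peek<'a>(&self, inventory: &'a [String]) -> Option<&'a String> {
        if inventory.is_empty() {
            return None;
        }
        inventory.get(self.next.load(Ordering::Relaxed) % inventory.len())
    }

    /// Pick a device and advance the cursor in one atomic step, so
    /// concurrent callers each observe a distinct tick.
    pub fn consume<'a>(&self, inventory: &'a [String]) -> Option<&'a String> {
        if inventory.is_empty() {
            return None;
        }
        let tick = self.next.fetch_add(1, Ordering::Relaxed);
        inventory.get(tick % inventory.len())
    }
}

/// Map a device node path such as `/dev/nvidia0` to `nvidia.com/gpu=0`.
/// Returns `None` for non-indexed nodes such as `/dev/nvidiactl`.
pub fn cdi_name_for_node(path: &str) -> Option<String> {
    let index = path.strip_prefix("/dev/nvidia")?;
    if index.is_empty() || !index.bytes().all(|b| b.is_ascii_digit()) {
        return None;
    }
    Some(format!("nvidia.com/gpu={index}"))
}
```

The cursor only rotates through inventory; it does not track which devices are in use, which matches the "without allocation tracking" wording in the commit message. Validation can call `peek` repeatedly without skewing the rotation, while create calls `consume`.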

Deviations from Plan

None; implemented as planned. The GPU e2e tests could not be run locally because of environment limitations, so the modified GPU e2e target was compiled but not executed.

Testing

  • mise x -- cargo test -p openshell-core
  • mise x -- cargo test -p openshell-driver-docker
  • mise x -- cargo test -p openshell-driver-podman
  • mise x -- cargo test --manifest-path e2e/rust/Cargo.toml --features e2e-gpu --test gpu_device_selection --no-run
  • mise run pre-commit
  • GPU e2e lanes not run locally: Docker reports CDI inventory only for docker.com/gpu=webgpu, no nvidia.com/gpu=...; Podman is not installed; /dev/nvidia0 is absent.

Tests added:

  • Unit: Core GPU inventory/round-robin tests; Docker inventory/default selection tests; Podman local inventory/default selection tests.
  • Integration: N/A.
  • E2E: Updated GPU device selection expectations and compiled the GPU e2e target.

Checklist

  • Follows Conventional Commits
  • Commits are signed off (DCO)

Documentation updated:

  • docs/sandboxes/manage-sandboxes.mdx
  • docs/reference/sandbox-compute-drivers.mdx
  • crates/openshell-driver-docker/README.md
  • crates/openshell-driver-podman/README.md

Closes #1477

Add driver-owned CDI GPU inventory selection for Docker and Podman so bare GPU requests resolve to one default device without allocation tracking.

Signed-off-by: Evan Lezar <elezar@nvidia.com>
github-actions (Bot) commented Jun 2, 2026


Development

Successfully merging this pull request may close these issues.

fix(gpu): select one CDI GPU by default for Docker and Podman
