Description
Problem
When using envbox with GPU passthrough in Kubernetes (runtimeClassName: nvidia + nvidia.com/gpu resource limits), setting CODER_ADD_GPU=true and CODER_USR_LIB_DIR=/var/coder/usr/lib correctly passes through /dev/nvidia* device nodes to the inner container, but the automatic library detection via usrLibGPUs() does not mount the required NVIDIA libraries into the inner container.
As a result, nvidia-smi inside the inner container fails with:
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver.
Make sure that the latest NVIDIA driver is installed and running.
The outer container's nvidia-smi works fine.
Workaround
Manually specifying the library mounts via CODER_MOUNTS resolves the issue:
- name: CODER_MOUNTS
  value: "/var/coder/usr/lib/libcuda.so.1:/usr/lib/x86_64-linux-gnu/libcuda.so.1:ro,/var/coder/usr/lib/libnvidia-ptxjitcompiler.so.1:/usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.1:ro,/var/coder/usr/lib/libnvidia-ml.so.1:/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1:ro"

With both CODER_ADD_GPU=true (for device passthrough) and CODER_MOUNTS (for libraries), GPU passthrough works end-to-end without needing to manually recreate the inner container.
Environment
- envbox version: 0.6.5
- Kubernetes: runtimeClassName: nvidia with nvidia.com/gpu: "1" resource limits
- GPU: Tesla T4
- Host library path: /usr/lib64 mounted into the outer container at /var/coder/usr/lib
- Inner image tested: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda10.2
Pod Spec (relevant sections)
spec:
  runtimeClassName: nvidia
  containers:
    - image: ghcr.io/coder/envbox:0.6.5
      env:
        - name: CODER_INNER_IMAGE
          value: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda10.2
        - name: CODER_INNER_USERNAME
          value: root
        - name: CODER_ADD_GPU
          value: "true"
        - name: CODER_USR_LIB_DIR
          value: /var/coder/usr/lib
        - name: CODER_MOUNTS
          value: "/var/coder/usr/lib/libcuda.so.1:/usr/lib/x86_64-linux-gnu/libcuda.so.1:ro,/var/coder/usr/lib/libnvidia-ptxjitcompiler.so.1:/usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.1:ro,/var/coder/usr/lib/libnvidia-ml.so.1:/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1:ro"
      resources:
        limits:
          nvidia.com/gpu: "1"
      securityContext:
        privileged: true
      volumeMounts:
        - mountPath: /var/coder/usr/lib
          name: usr-lib
  volumes:
    - hostPath:
        path: /usr/lib64
        type: Directory
      name: usr-lib
Expected Behavior
When CODER_ADD_GPU=true and CODER_USR_LIB_DIR are set, the usrLibGPUs() function should automatically detect and mount the NVIDIA libraries from the specified directory into the inner container, without requiring manual CODER_MOUNTS.
Possible Cause
The usrLibGPUs() function walks the CODER_USR_LIB_DIR directory looking for files matching (?i)(libgl(e|sx|\.)|nvidia|vulkan|cuda) with .so extensions. When the host path is /usr/lib64 (common on RHEL/Amazon Linux), the library layout or symlink structure may differ from /usr/lib/x86_64-linux-gnu (Debian/Ubuntu), which is the path used in all integration tests. The symlink resolution in recursiveSymlinks() or the path remapping logic in the GPU bind mount code may not handle this correctly.
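The name-matching part of that pattern can be exercised in isolation. The snippet below compiles the regex quoted above verbatim and checks it against representative file names from both layouts (the RHEL-style versioned name is an example, not an exhaustive listing of /usr/lib64):

```go
package main

import (
	"fmt"
	"regexp"
)

// The detection pattern quoted in this report, compiled verbatim.
var pattern = regexp.MustCompile(`(?i)(libgl(e|sx|\.)|nvidia|vulkan|cuda)`)

func main() {
	for _, name := range []string{
		"libcuda.so.1",               // symlink name common to both layouts: matches
		"libnvidia-ml.so.535.129.03", // fully versioned RHEL-style name: matches
		"libc.so.6",                  // unrelated library: no match
	} {
		fmt.Printf("%-30s %v\n", name, pattern.MatchString(name))
	}
}
```

Since both the Debian-style and RHEL-style names match the pattern, the failure more plausibly lies in the symlink resolution or path remapping steps than in the regex itself, consistent with the hypothesis above.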
Additionally, in the Kubernetes runtimeClassName: nvidia path (vs Docker --runtime=nvidia --gpus=all), the NVIDIA device plugin may inject libraries differently than the NVIDIA container runtime does when invoked via Docker directly.
Created on behalf of @uzair-coder07