
GPU library auto-detection fails when host usr lib path is /usr/lib64, requiring manual CODER_MOUNTS #164

@blinkagent

Description

Problem

When using envbox with GPU passthrough in Kubernetes (runtimeClassName: nvidia plus an nvidia.com/gpu resource limit), setting CODER_ADD_GPU=true and CODER_USR_LIB_DIR=/var/coder/usr/lib correctly passes the /dev/nvidia* device nodes through to the inner container, but the automatic library detection in usrLibGPUs() does not mount the required NVIDIA libraries into the inner container.

As a result, nvidia-smi inside the inner container fails with:

NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver.
Make sure that the latest NVIDIA driver is installed and running.

The outer container's nvidia-smi works fine.

Workaround

Manually specifying the library mounts via CODER_MOUNTS resolves the issue:

- name: CODER_MOUNTS
  value: "/var/coder/usr/lib/libcuda.so.1:/usr/lib/x86_64-linux-gnu/libcuda.so.1:ro,/var/coder/usr/lib/libnvidia-ptxjitcompiler.so.1:/usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.1:ro,/var/coder/usr/lib/libnvidia-ml.so.1:/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1:ro"

With both CODER_ADD_GPU=true (for device passthrough) and CODER_MOUNTS (for libraries), GPU passthrough works end-to-end without needing to manually recreate the inner container.
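The CODER_MOUNTS value above follows envbox's comma-separated `src:dst:opts` format. A minimal Go sketch of assembling such a value from a library list (buildMounts is a hypothetical helper written for this issue, not part of envbox):

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// buildMounts joins "src:dst:ro" entries with commas, producing a
// CODER_MOUNTS-style value. Hypothetical helper, not envbox code.
func buildMounts(srcDir, dstDir string, libs []string) string {
	entries := make([]string, 0, len(libs))
	for _, lib := range libs {
		entries = append(entries, fmt.Sprintf("%s:%s:ro",
			filepath.Join(srcDir, lib), filepath.Join(dstDir, lib)))
	}
	return strings.Join(entries, ",")
}

func main() {
	libs := []string{
		"libcuda.so.1",
		"libnvidia-ptxjitcompiler.so.1",
		"libnvidia-ml.so.1",
	}
	fmt.Println(buildMounts("/var/coder/usr/lib", "/usr/lib/x86_64-linux-gnu", libs))
}
```

Running this reproduces the exact CODER_MOUNTS value used in the workaround above.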

Environment

  • envbox version: 0.6.5
  • Kubernetes: runtimeClassName: nvidia with nvidia.com/gpu: "1" resource limits
  • GPU: Tesla T4
  • Host library path: /usr/lib64 mounted into the outer container at /var/coder/usr/lib
  • Inner image tested: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda10.2

Pod Spec (relevant sections)

spec:
  runtimeClassName: nvidia
  containers:
  - image: ghcr.io/coder/envbox:0.6.5
    env:
    - name: CODER_INNER_IMAGE
      value: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda10.2
    - name: CODER_INNER_USERNAME
      value: root
    - name: CODER_ADD_GPU
      value: "true"
    - name: CODER_USR_LIB_DIR
      value: /var/coder/usr/lib
    - name: CODER_MOUNTS
      value: "/var/coder/usr/lib/libcuda.so.1:/usr/lib/x86_64-linux-gnu/libcuda.so.1:ro,/var/coder/usr/lib/libnvidia-ptxjitcompiler.so.1:/usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.1:ro,/var/coder/usr/lib/libnvidia-ml.so.1:/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1:ro"
    resources:
      limits:
        nvidia.com/gpu: "1"
    securityContext:
      privileged: true
    volumeMounts:
    - mountPath: /var/coder/usr/lib
      name: usr-lib
  volumes:
  - hostPath:
      path: /usr/lib64
      type: Directory
    name: usr-lib

Expected Behavior

When CODER_ADD_GPU=true and CODER_USR_LIB_DIR are set, the usrLibGPUs() function should automatically detect and mount the NVIDIA libraries from the specified directory into the inner container, without requiring manual CODER_MOUNTS.

Possible Cause

The usrLibGPUs() function walks the CODER_USR_LIB_DIR directory looking for files with a .so extension whose names match (?i)(libgl(e|sx|\.)|nvidia|vulkan|cuda). When the host path is /usr/lib64 (common on RHEL and Amazon Linux), the library layout or symlink structure may differ from /usr/lib/x86_64-linux-gnu (Debian/Ubuntu), which is the path used in all of the integration tests. The symlink resolution in recursiveSymlinks() or the path remapping logic in the GPU bind-mount code may not handle this layout correctly.

Additionally, in the Kubernetes runtimeClassName: nvidia path (vs Docker --runtime=nvidia --gpus=all), the NVIDIA device plugin may inject libraries differently than the NVIDIA container runtime does when invoked via Docker directly.


Created on behalf of @uzair-coder07
