Important Note: NVIDIA AI Enterprise customers can get support from NVIDIA Enterprise support. Please open a case here.
**Describe the bug**
The nvidia-operator-validator pod stays stuck in the Pending/Init state and fails to recognize NVIDIA B300 GPUs. The issue persists even after upgrading to GPU Operator v25.10.1. It appears the validator container does not have PCI device ID 3182 (Blackwell B300) in its internal device database.
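The failing lookup can be reproduced outside the validator. A minimal sketch, assuming NVIDIA's PCI vendor ID 10de and the Debian/Ubuntu pci.ids path; the validator resolves names from its own bundled database, so a hit or miss on the host copy is only illustrative:

```shell
# Device ID taken from the validator warning below; 10de is NVIDIA's PCI vendor ID.
DEV_ID="3182"

# Loose match against the host's PCI ID database (path is the Debian/Ubuntu
# default; RHEL-family systems use /usr/share/hwdata/pci.ids instead). A miss
# here mirrors the validator's "failed to find device with id '3182'" warning.
PCI_IDS="/usr/share/misc/pci.ids"
if [ -r "$PCI_IDS" ] && grep -qi "$DEV_ID" "$PCI_IDS"; then
    echo "device $DEV_ID present in $PCI_IDS"
else
    echo "device $DEV_ID not found in $PCI_IDS"
fi

# If pciutils is installed, list NVIDIA functions with numeric IDs to confirm
# what the kernel actually sees on the bus.
command -v lspci >/dev/null 2>&1 && lspci -nn -d 10de: || true
```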
**Environment information**
- GPU: NVIDIA B300 SXM6 AC (8 GPUs)
- PCI ID: 3182
- Driver Version: 580.105.08 (pre-installed on host)
- OS: Ubuntu 24.04.1 LTS
- Kubernetes Version: 1.33.9
- GPU Operator Version: v25.10.1
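With a pre-installed host driver (so the operator would typically be deployed with `driver.enabled=false`), it is worth confirming the host driver and device IDs directly. A sketch, assuming `nvidia-smi` is on the PATH of the GPU node:

```shell
# Query name, PCI device ID, and driver version for each GPU; falls back to a
# message when nvidia-smi is unavailable (e.g. when run off the GPU node).
OUT=$(command -v nvidia-smi >/dev/null 2>&1 \
    && nvidia-smi --query-gpu=name,pci.device_id,driver_version --format=csv \
    || echo "nvidia-smi not available on this host")
echo "$OUT"
```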
**Log Output**
The driver-validation init container repeatedly logs the following warnings:

```
time="2026-03-18T15:43:34Z" level=warning msg="unable to get device name: failed to find device with id '3182'\n"
time="2026-03-18T15:43:37Z" level=info msg="Creating link /host-dev-char/195:0 => /dev/nvidia0"
time="2026-03-18T15:43:37Z" level=warning msg="Could not create symlink: symlink /dev/nvidia0 /host-dev-char/195:0: file exists"
```
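The second warning is likely benign if the existing link already points at the right node. A quick check from the validator pod (or the host), assuming the `/host-dev-char` mount path shown in the log:

```shell
# Major:minor 195:0 corresponds to /dev/nvidia0; if the existing symlink
# already resolves there, the "file exists" warning can be ignored.
LINK="/host-dev-char/195:0"
if [ -L "$LINK" ]; then
    readlink "$LINK"    # expect /dev/nvidia0
else
    echo "$LINK not present (expected when run outside the validator pod)"
fi
```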
**Information to [attach](https://help.github.com/articles/file-attachments-on-issues-and-pull-requests/)** (optional if deemed irrelevant)
- [ ] kubernetes pods status: `kubectl get pods -n OPERATOR_NAMESPACE`
- [ ] kubernetes daemonset status: `kubectl get ds -n OPERATOR_NAMESPACE`
- [ ] If a pod/ds is in an error or pending state: `kubectl describe pod -n OPERATOR_NAMESPACE POD_NAME`
- [ ] If a pod/ds is in an error or pending state: `kubectl logs -n OPERATOR_NAMESPACE POD_NAME --all-containers`
- [ ] Output from running `nvidia-smi` from the driver container: `kubectl exec DRIVER_POD_NAME -n OPERATOR_NAMESPACE -c nvidia-driver-ctr -- nvidia-smi`
- [ ] containerd logs: `journalctl -u containerd > containerd.log`
Collecting a full debug bundle (optional):

```shell
curl -o must-gather.sh -L https://raw.githubusercontent.com/NVIDIA/gpu-operator/main/hack/must-gather.sh
chmod +x must-gather.sh
./must-gather.sh
```
**NOTE**: please refer to the [must-gather](https://raw.githubusercontent.com/NVIDIA/gpu-operator/main/hack/must-gather.sh) script for debug data collected.
This bundle can be submitted to us via email: **operator_feedback@nvidia.com**