
NVIDIADriver CR stuck NotReady after nodeSelector change #1661

@h323

Describe the bug
When editing the nodeSelector field of an NVIDIADriver custom resource (CR), the resource enters a permanent NotReady state if the change does not result in any driver pods being updated. The GPU Operator keeps reconciling without making progress, repeatedly logging that the object is not ready.

To Reproduce

  1. Deploy an NVIDIADriver custom resource with a specific nodeSelector.
  2. Wait until it reaches the Ready state.
  3. Edit the nodeSelector so that the selector changes but the set of matching nodes remains the same (for example, replace one matching label with another equivalent label; a patch sketch is shown below the additional scenario).
  4. Observe that the NVIDIADriver CR goes into NotReady and remains stuck.

Additional scenario:
If the nodeSelector is changed to target a subset of the original nodes, Kubernetes will remove pods from the non-matching nodes, but the remaining pods are not updated. The CR still becomes stuck in NotReady.
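To make step 3 concrete, here is a minimal Go sketch of such an edit using client-go's dynamic client. The group/version/resource (nvidia.com/v1alpha1, nvidiadrivers), the label keys, and the kubeconfig path are assumptions made for illustration; only the CR name "default" comes from the logs below, and the CR is treated as cluster-scoped, which matches the empty namespace in those log lines. A JSON merge patch with a null value removes the old label key while adding the new one, so if both labels select the same nodes, no driver pod is recreated.

package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Hypothetical sketch: load the local kubeconfig and build a dynamic client.
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client, err := dynamic.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	// Assumed GVR for the NVIDIADriver CRD.
	gvr := schema.GroupVersionResource{Group: "nvidia.com", Version: "v1alpha1", Resource: "nvidiadrivers"}

	// Swap one matching label for another equivalent one (hypothetical keys):
	// the merge patch removes the old key and adds the new one.
	patch := []byte(`{"spec":{"nodeSelector":{"example.com/gpu-pool-a":null,"example.com/gpu-pool-b":"true"}}}`)

	obj, err := client.Resource(gvr).Patch(context.TODO(), "default", types.MergePatchType, patch, metav1.PatchOptions{})
	if err != nil {
		panic(err)
	}
	fmt.Println("patched NVIDIADriver:", obj.GetName())
}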

Expected behavior
The GPU Operator should detect when a nodeSelector change does not result in pod updates and still mark the NVIDIADriver CR as Ready. It should handle scenarios where the DaemonSet spec changes but no pods need to be updated.
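One possible direction, sketched here purely as an assumption (this is not the operator's code, and the isDaemonSetReady helper is made up): if the driver DaemonSet uses an OnDelete update strategy, the readiness check could skip the UpdatedNumberScheduled comparison, since no pod is expected to roll automatically after a spec change.

// Hypothetical helper, not taken from the GPU Operator source.
package state

import appsv1 "k8s.io/api/apps/v1"

func isDaemonSetReady(ds *appsv1.DaemonSet) bool {
	st := ds.Status
	// Wait until the controller has observed the latest spec and has nodes to schedule on.
	if st.ObservedGeneration < ds.Generation || st.DesiredNumberScheduled == 0 {
		return false
	}
	allAvailable := st.DesiredNumberScheduled == st.NumberAvailable
	if ds.Spec.UpdateStrategy.Type == appsv1.OnDeleteDaemonSetStrategyType {
		// With OnDelete, pods are replaced only when deleted manually, so
		// UpdatedNumberScheduled can legitimately stay at 0 after a spec change.
		return allAvailable
	}
	return allAvailable && st.UpdatedNumberScheduled == st.NumberAvailable
}

Whether OnDelete is actually the strategy in use here is not confirmed by this issue; it is only consistent with the observation that pods are never recreated after the edit.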

Environment:

  • GPU Operator Version: v25.3.2
  • OS: Ubuntu 24.04
  • Kernel Version: 6.8.0-71-generic
  • Container Runtime Version: containerd 1.7.27
  • Kubernetes Distro and Version: Kubernetes v1.32.4

Logs / References
Excerpt from operator logs (looping indefinitely):

{"level":"info","ts":1756998659.7785275,"logger":"state.state-driver","msg":"Object is not ready","controller":"nvidia-driver-controller","object":{"name":"default"},"namespace":"","name":"default","reconcileID":"6b8b24a7-6f09-468e-b73c-e4757e847bb6","Kind:":"DaemonSet","Name":"nvidia-gpu-driver-ubuntu24.04-8df5fc75"}
{"level":"info","ts":1756998659.7785566,"msg":"Sync not Done for custom resource","controller":"nvidia-driver-controller","object":{"name":"default"},"namespace":"","name":"default","reconcileID":"6b8b24a7-6f09-468e-b73c-e4757e847bb6"}
{"level":"info","ts":1756998659.7786314,"msg":"NVIDIADriver instance is not ready","controller":"nvidia-driver-controller","object":{"name":"default"},"namespace":"","name":"default","reconcileID":"6b8b24a7-6f09-468e-b73c-e4757e847bb6"}

Readiness check code reference:
internal/state/state_skel.go#L439

if ds.Status.DesiredNumberScheduled != 0 && ds.Status.DesiredNumberScheduled == ds.Status.NumberAvailable &&
	ds.Status.UpdatedNumberScheduled == ds.Status.NumberAvailable {
	return true, nil
}

In this situation, ds.Status.UpdatedNumberScheduled remains 0 while NumberAvailable still equals DesiredNumberScheduled, so the check above can never return true.
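For illustration, a self-contained snippet (with made-up pod counts) showing why the condition quoted above cannot become true once UpdatedNumberScheduled is stuck at 0:

package main

import (
	"fmt"

	appsv1 "k8s.io/api/apps/v1"
)

func main() {
	st := appsv1.DaemonSetStatus{
		DesiredNumberScheduled: 2, // hypothetical node count
		NumberAvailable:        2, // all driver pods still running
		UpdatedNumberScheduled: 0, // no pod was ever rolled after the edit
	}
	ready := st.DesiredNumberScheduled != 0 &&
		st.DesiredNumberScheduled == st.NumberAvailable &&
		st.UpdatedNumberScheduled == st.NumberAvailable
	fmt.Println("ready:", ready) // prints "ready: false" on every reconcile
}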
