
NVIDIADriver CR stuck NotReady after nodeSelector change #1661

@h323

Describe the bug
When editing the nodeSelector field of an NVIDIADriver custom resource (CR), the resource enters a permanent NotReady state if the change does not result in any driver pods being updated. The GPU Operator keeps reconciling without making progress, repeatedly logging that the object is not ready.

To Reproduce

  1. Deploy an NVIDIADriver custom resource with a specific nodeSelector.
  2. Wait until it reaches the Ready state.
  3. Edit the nodeSelector so that the selector changes but the set of matching nodes remains the same (for example, replace one matching label with another equivalent label; a patch sketch is shown below the additional scenario).
  4. Observe that the NVIDIADriver CR goes into NotReady and remains stuck.

Additional scenario:
If the nodeSelector is changed to target a subset of the original nodes, Kubernetes will remove pods from the non-matching nodes, but the remaining pods are not updated. The CR still becomes stuck in NotReady.
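To make step 3 concrete, here is a minimal Go sketch of such an edit using client-go's dynamic client. The group/version/resource (nvidia.com/v1alpha1, nvidiadrivers), the label keys, and the kubeconfig path are assumptions made for illustration; only the CR name "default" comes from the logs below, and the CR is treated as cluster-scoped, which matches the empty namespace in those log lines. A JSON merge patch with a null value removes the old label key while adding the new one, so if both labels select the same nodes, no driver pod is recreated.

package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Hypothetical sketch: load the local kubeconfig and build a dynamic client.
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client, err := dynamic.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	// Assumed GVR for the NVIDIADriver CRD.
	gvr := schema.GroupVersionResource{Group: "nvidia.com", Version: "v1alpha1", Resource: "nvidiadrivers"}

	// Swap one matching label for another equivalent one (hypothetical keys):
	// the merge patch removes the old key and adds the new one.
	patch := []byte(`{"spec":{"nodeSelector":{"example.com/gpu-pool-a":null,"example.com/gpu-pool-b":"true"}}}`)

	obj, err := client.Resource(gvr).Patch(context.TODO(), "default", types.MergePatchType, patch, metav1.PatchOptions{})
	if err != nil {
		panic(err)
	}
	fmt.Println("patched NVIDIADriver:", obj.GetName())
}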

Expected behavior
The GPU Operator should detect when a nodeSelector change does not result in pod updates and still mark the NVIDIADriver CR as Ready. It should handle scenarios where the DaemonSet spec changes but no pods need to be updated.
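One possible direction, sketched here purely as an assumption (this is not the operator's code, and the isDaemonSetReady helper is made up): if the driver DaemonSet uses an OnDelete update strategy, the readiness check could skip the UpdatedNumberScheduled comparison, since no pod is expected to roll automatically after a spec change.

// Hypothetical helper, not taken from the GPU Operator source.
package state

import appsv1 "k8s.io/api/apps/v1"

func isDaemonSetReady(ds *appsv1.DaemonSet) bool {
	st := ds.Status
	// Wait until the controller has observed the latest spec and has nodes to schedule on.
	if st.ObservedGeneration < ds.Generation || st.DesiredNumberScheduled == 0 {
		return false
	}
	allAvailable := st.DesiredNumberScheduled == st.NumberAvailable
	if ds.Spec.UpdateStrategy.Type == appsv1.OnDeleteDaemonSetStrategyType {
		// With OnDelete, pods are replaced only when deleted manually, so
		// UpdatedNumberScheduled can legitimately stay at 0 after a spec change.
		return allAvailable
	}
	return allAvailable && st.UpdatedNumberScheduled == st.NumberAvailable
}

Whether OnDelete is actually the strategy in use here is not confirmed by this issue; it is only consistent with the observation that pods are never recreated after the edit.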

Environment:

  • GPU Operator Version: v25.3.2
  • OS: Ubuntu 24.04
  • Kernel Version: 6.8.0-71-generic
  • Container Runtime Version: containerd 1.7.27
  • Kubernetes Distro and Version: Kubernetes v1.32.4

Logs / References
Excerpt from operator logs (looping indefinitely):

{"level":"info","ts":1756998659.7785275,"logger":"state.state-driver","msg":"Object is not ready","controller":"nvidia-driver-controller","object":{"name":"default"},"namespace":"","name":"default","reconcileID":"6b8b24a7-6f09-468e-b73c-e4757e847bb6","Kind:":"DaemonSet","Name":"nvidia-gpu-driver-ubuntu24.04-8df5fc75"}
{"level":"info","ts":1756998659.7785566,"msg":"Sync not Done for custom resource","controller":"nvidia-driver-controller","object":{"name":"default"},"namespace":"","name":"default","reconcileID":"6b8b24a7-6f09-468e-b73c-e4757e847bb6"}
{"level":"info","ts":1756998659.7786314,"msg":"NVIDIADriver instance is not ready","controller":"nvidia-driver-controller","object":{"name":"default"},"namespace":"","name":"default","reconcileID":"6b8b24a7-6f09-468e-b73c-e4757e847bb6"}

Readiness check code reference:
internal/state/state_skel.go#L439

if ds.Status.DesiredNumberScheduled != 0 && ds.Status.DesiredNumberScheduled == ds.Status.NumberAvailable &&
	ds.Status.UpdatedNumberScheduled == ds.Status.NumberAvailable {
	return true, nil
}

In this situation, ds.Status.UpdatedNumberScheduled remains 0 while NumberAvailable still equals DesiredNumberScheduled, so the check above can never return true.
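For illustration, a self-contained snippet (with made-up pod counts) showing why the condition quoted above cannot become true once UpdatedNumberScheduled is stuck at 0:

package main

import (
	"fmt"

	appsv1 "k8s.io/api/apps/v1"
)

func main() {
	st := appsv1.DaemonSetStatus{
		DesiredNumberScheduled: 2, // hypothetical node count
		NumberAvailable:        2, // all driver pods still running
		UpdatedNumberScheduled: 0, // no pod was ever rolled after the edit
	}
	ready := st.DesiredNumberScheduled != 0 &&
		st.DesiredNumberScheduled == st.NumberAvailable &&
		st.UpdatedNumberScheduled == st.NumberAvailable
	fmt.Println("ready:", ready) // prints "ready: false" on every reconcile
}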
