Description
The GPUOperatorNodeDeploymentFailed alert fires a false positive when the device plugin is intentionally disabled in the ClusterPolicy (spec.devicePlugin.enabled: false). This is a valid configuration — for example, when GPU allocation is managed externally or via MIG partitioning through a third-party operator.
Steps to Reproduce
- Deploy the GPU Operator with the following ClusterPolicy configuration:
  ```yaml
  spec:
    devicePlugin:
      enabled: false
    nodeStatusExporter:
      enabled: true
    mig:
      strategy: mixed
    migManager:
      enabled: true
  ```
- Wait 30+ minutes
- Observe the GPUOperatorNodeDeploymentFailed alert firing
Root Cause
The nvidia-node-status-exporter exposes the metric gpu_operator_node_device_plugin_devices_total. When the device plugin is disabled, no device plugin pods run, so the metric reports 0. The alert rule

```
gpu_operator_node_device_plugin_devices_total == 0
```

fires after 30 minutes, even though the device plugin is intentionally disabled and the GPUs on the node are fully functional.
The exporter does not check the ClusterPolicy to determine whether the device plugin is supposed to be running.
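For reference, the observed firing behavior is consistent with a standard Prometheus alerting rule of roughly this shape. This is a hypothetical reconstruction: only the alert name and expression come from the report above; the group name, labels, and any annotations in the operator's bundled rules may differ.

```yaml
# Hypothetical reconstruction of the bundled rule -- group name and
# labels are illustrative; only alert name and expr are from the report.
groups:
  - name: gpu-operator
    rules:
      - alert: GPUOperatorNodeDeploymentFailed
        expr: gpu_operator_node_device_plugin_devices_total == 0
        for: 30m  # matches the ~30 minute delay before the alert fires
        labels:
          severity: warning
```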
Expected Behavior
The GPUOperatorNodeDeploymentFailed alert should not fire when devicePlugin.enabled: false is set in the ClusterPolicy. Possible fixes:
- The node-status-exporter checks the ClusterPolicy and skips device plugin validation when it is disabled
- The alert rule includes a condition that excludes nodes/clusters where the device plugin is intentionally disabled
- The node-status-exporter does not emit the gpu_operator_node_device_plugin_devices_total metric when the device plugin is disabled
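To illustrate the second option: if the exporter (or operator) additionally published a gauge reflecting spec.devicePlugin.enabled, the alert expression could gate on it. The metric name gpu_operator_device_plugin_enabled used below is purely hypothetical (no such metric exists today), and the label matching is illustrative.

```
# gpu_operator_device_plugin_enabled is a hypothetical gauge (1 = enabled);
# the on (node) matcher is illustrative and depends on the actual labels.
gpu_operator_node_device_plugin_devices_total == 0
  and on (node) gpu_operator_device_plugin_enabled == 1
```

With such a gate, the alert would stay silent on clusters where the device plugin is intentionally disabled, while still firing where it is enabled but failing.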
Workaround
Manually silence the GPUOperatorNodeDeploymentFailed alert in the cluster monitoring configuration.
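A silence created through the Alertmanager UI or amtool eventually expires. For a longer-lived workaround, one sketch, assuming a vanilla Alertmanager configuration, is to route the alert to a receiver with no notifiers (the receiver name and its placement in the routing tree are illustrative):

```yaml
# Sketch: drop the alert by routing it to a receiver with no notifiers.
# Merge into your existing Alertmanager route tree; names are illustrative.
route:
  routes:
    - matchers:
        - alertname = "GPUOperatorNodeDeploymentFailed"
      receiver: "null"
receivers:
  - name: "null"
```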