GPUOperatorNodeDeploymentFailed alert false positive when devicePlugin is disabled in ClusterPolicy #2237

@harche

Description

The GPUOperatorNodeDeploymentFailed alert fires as a false positive when the device plugin is intentionally disabled in the ClusterPolicy (spec.devicePlugin.enabled: false). This is a valid configuration, for example when GPU allocation is managed externally or via MIG partitioning through a third-party operator.

Steps to Reproduce

  1. Deploy the GPU Operator with the following ClusterPolicy configuration:
    spec:
      devicePlugin:
        enabled: false
      nodeStatusExporter:
        enabled: true
      mig:
        strategy: mixed
      migManager:
        enabled: true
  2. Wait 30+ minutes
  3. Observe the GPUOperatorNodeDeploymentFailed alert firing

Root Cause

The nvidia-node-status-exporter emits the metric gpu_operator_node_device_plugin_devices_total. When the device plugin is disabled, no device plugin pods run, so this metric reports 0. The alert rule:

gpu_operator_node_device_plugin_devices_total == 0

fires after 30 minutes, even though the device plugin is intentionally disabled and GPUs are fully functional.

The exporter does not check the ClusterPolicy to determine whether the device plugin is supposed to be running.
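
For reference, the behavior described above corresponds roughly to a Prometheus alerting rule of the following shape. This is a sketch reconstructed from the expression and the 30-minute delay quoted in this issue, not a copy of the rule shipped with the operator; the group name is an assumption.

```yaml
groups:
  - name: gpu-operator-node-alerts   # assumed group name, for illustration only
    rules:
      - alert: GPUOperatorNodeDeploymentFailed
        # Matches every series of the metric whose value is 0. Nothing here
        # depends on whether the device plugin is enabled in the ClusterPolicy.
        expr: gpu_operator_node_device_plugin_devices_total == 0
        # The condition must hold for 30 minutes before the alert fires.
        for: 30m
```

Because the expression only compares the metric to 0, a cluster with spec.devicePlugin.enabled: false satisfies it permanently.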

Expected Behavior

The GPUOperatorNodeDeploymentFailed alert should not fire when devicePlugin.enabled: false is set in the ClusterPolicy. Possible fixes:

  1. The node-status-exporter checks the ClusterPolicy and skips device plugin validation when it is disabled
  2. The alert rule includes a condition that excludes nodes/clusters where the device plugin is intentionally disabled
  3. The node-status-exporter does not emit the gpu_operator_node_device_plugin_devices_total metric when the device plugin is disabled
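
As an illustration of fix 2, the rule could exclude clusters where the device plugin is disabled. The sketch below assumes a metric named gpu_operator_device_plugin_enabled that reports 0 when the device plugin is disabled. That metric is hypothetical: it does not appear in this issue and would have to be added by the operator or exporter. The group name is also an assumption.

```yaml
groups:
  - name: gpu-operator-node-alerts   # assumed group name, for illustration only
    rules:
      - alert: GPUOperatorNodeDeploymentFailed
        # gpu_operator_device_plugin_enabled is a HYPOTHETICAL metric.
        # "unless on()" drops every left-hand result whenever the right-hand
        # side returns at least one series, i.e. whenever the hypothetical
        # metric reports that the device plugin is disabled.
        expr: |
          gpu_operator_node_device_plugin_devices_total == 0
          unless on() (gpu_operator_device_plugin_enabled == 0)
        for: 30m
```

Fix 3 would need no rule change: if the metric is not emitted, the comparison in the existing expression returns no series and the alert cannot fire.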

Workaround

Manually silence the GPUOperatorNodeDeploymentFailed alert in the cluster monitoring configuration.
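
One way to approximate this in a cluster where the Alertmanager configuration is editable is to route the alert to a receiver that has no notification integrations. Strictly speaking this suppresses notifications rather than creating a silence, and the alert still shows as firing in Prometheus. The receiver names below are assumptions, and the fragment would need to be merged into the existing routing tree rather than used as-is.

```yaml
route:
  receiver: default                  # assumed name of the existing default receiver
  routes:
    # Send only this alert to a receiver that notifies no one.
    - matchers:
        - alertname = "GPUOperatorNodeDeploymentFailed"
      receiver: "null"
receivers:
  - name: default
  - name: "null"                     # no integrations configured, so nothing is sent
```

This route should be removed once the device plugin is re-enabled, otherwise a genuine deployment failure would go unnoticed.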
