Skip device plugin alert when devicePlugin is disabled in ClusterPolicy #2238

Draft
harche wants to merge 1 commit into NVIDIA:main from harche:fix/device-plugin-disabled-alert-false-positive

Conversation


harche commented Mar 20, 2026

Summary

When devicePlugin.enabled is set to false in the ClusterPolicy, the nvidia-node-status-exporter still monitors gpu_operator_node_device_plugin_devices_total, which reports 0 (since no device plugin pods are running). This triggers a false positive GPUOperatorNodeDeploymentFailed alert after 30 minutes.

This is a valid configuration — for example, when GPU allocation is managed externally via MIG partitioning through a third-party operator.
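For context, the alert described above is of roughly this shape. This is a hedged sketch, not the actual rule in `assets/state-node-status-exporter/0800_prometheus_rule_openshift.yaml`: the alert name, metric name, `== 0` comparison, and 30-minute window come from this PR description; the label and annotation text are illustrative assumptions.

```yaml
# Sketch of the kind of PrometheusRule entry involved (field values
# other than the alert/metric names and thresholds are assumptions).
- alert: GPUOperatorNodeDeploymentFailed
  expr: gpu_operator_node_device_plugin_devices_total == 0
  for: 30m
  labels:
    severity: warning
  annotations:
    summary: No GPU devices advertised by the device plugin on this node
```

With `devicePlugin.enabled: false`, the metric legitimately reports 0 (or, after this PR, the -1 sentinel), so an unconditional `== 0` expression is what produces the false positive.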

Changes

  • controllers/object_controls.go: The operator now injects a DEVICE_PLUGIN_ENABLED env var into the node-status-exporter daemonset based on config.DevicePlugin.IsEnabled().
  • cmd/nvidia-validator/metrics.go: When DEVICE_PLUGIN_ENABLED=false, the exporter skips device plugin validation and sets the gauge to -1 (documented sentinel value), preventing the == 0 alert from firing.
  • assets/state-node-status-exporter/0800_prometheus_rule_openshift.yaml: Updated comment to document the behavior.
  • controllers/transforms_test.go: Unit tests for env var injection (enabled + disabled).
  • tests/e2e/helpers/clusterpolicy.go: EnableDevicePlugin/DisableDevicePlugin helpers.
  • tests/e2e/suites/clusterpolicy_test.go: E2E tests verifying env var propagation to node-status-exporter daemonset.

Test plan

  • Unit tests pass (TestTransformNodeStatusExporter)
  • Code compiles (go build ./cmd/nvidia-validator/ and go build ./controllers/)
  • E2E tests compile (go build ./tests/e2e/...)
  • Manual validation on hardware with MIG-capable GPUs and devicePlugin.enabled: false

Fixes: #2237

When devicePlugin.enabled is set to false in the ClusterPolicy, the
nvidia-node-status-exporter still monitors the device_plugin_devices_total
metric which reports 0 (since no device plugin pods are running). This
triggers a false positive GPUOperatorNodeDeploymentFailed alert.

Fix: The operator now injects a DEVICE_PLUGIN_ENABLED env var into the
node-status-exporter daemonset based on the ClusterPolicy. When set to
"false", the exporter skips device plugin validation entirely, so the
metric is never emitted and the alert does not fire.

Fixes: NVIDIA#2237

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

copy-pr-bot bot commented Mar 20, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.


harche commented Mar 20, 2026

Initial code review looks good and unit tests pass. Keeping this as a draft; hardware testing with MIG-capable GPUs and devicePlugin.enabled: false is pending before marking it ready for review.


Development

Successfully merging this pull request may close these issues.

GPUOperatorNodeDeploymentFailed alert false positive when devicePlugin is disabled in ClusterPolicy
