Skip to content

Conversation

@Joibel
Copy link
Member

@Joibel Joibel commented Nov 27, 2025

Fixes #12572

Motivation

A retryStrategy can restart pods, but sometimes this is not desired or hard to implement everywhere.
Add a new restart strategy which will recreate pods when they fail prior to starting main

Modifications

Adds automatic restart for pods that fail due to infrastructure issues before the main container starts. This handles transient failures like node evictions, disk pressure, or unexpected admission errors without requiring a retryStrategy.

When a pod fails before its main container enters Running state, the controller checks if the failure reason indicates an infrastructure issue. If so, the pod is deleted and the node is marked Pending to recreate it.

Restartable failure reasons:

  • Evicted (node pressure eviction)
  • NodeShutdown (graceful node shutdown)
  • NodeAffinity (node affinity/selector no longer matches)
  • UnexpectedAdmissionError

You must enable this in workflow-controller-configmap:

      failedPodRestart:
        enabled: true                        
        maxRestarts: 3                                                                                  

Added a metric to track this firing

  • Node status includes FailedPodRestarts counter
  • New pod_restarts_total metric with reason, condition, and namespace labels

Verification

New unit tests which are mostly pretty lightweight.
E2e tests for the metrics and the actual feature, which fakes pod eviction.

Documentation

Added a new doc page

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds automatic pod restart functionality to handle infrastructure failures that occur before the main container starts running. The feature automatically recreates pods that fail due to issues like node evictions, disk pressure, or admission errors, without requiring a retryStrategy configuration.

Key Changes

  • Introduced FailedPodRestartConfig with enabled, maxRestarts, and backoffSeconds settings in controller configuration
  • Added pod restart detection logic that checks if main container never entered Running state for restartable failure reasons (Evicted, NodeShutdown, NodeAffinity, UnexpectedAdmissionError)
  • Implemented pod_restarts_total metric with reason, condition, and namespace labels to track automatic restarts

Reviewed changes

Copilot reviewed 32 out of 33 changed files in this pull request and generated 17 comments.

Show a summary per file
File Description
workflow/controller/pod_restart.go Core logic for analyzing pods and determining if they qualify for automatic restart
workflow/controller/operator.go Integration point that detects failed pods and triggers restart by deleting pod and marking node as Pending
workflow/metrics/counter_pod_restart.go Metric recording implementation with condition extraction from pod status messages
workflow/metrics/counter_pod_restart_test.go Unit tests for condition extraction logic
workflow/controller/pod_restart_test.go Unit tests covering restart eligibility detection for various pod states
test/e2e/pod_restart_test.go E2E test simulating pod eviction and verifying successful restart and workflow completion
test/e2e/metrics_test.go E2E test validating pod restart metrics are correctly recorded
config/config.go Configuration structure with helper methods for feature enablement and settings
util/telemetry/ Telemetry infrastructure for the new pod_restarts_total metric
pkg/apis/workflow/v1alpha1/ API schema additions for FailedPodRestarts field in NodeStatus
sdks/python/client/ Python SDK updates for new NodeStatus field
sdks/java/client/ Java SDK updates for new NodeStatus field
docs/pod-restarts.md New documentation page explaining the feature, configuration, and usage
docs/metrics.md Documentation for the new pod_restarts_total metric
docs/workflow-controller-configmap.md Configuration reference for FailedPodRestartConfig
docs/retries.md Cross-reference note directing readers to pod restart feature for infrastructure failures

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

You can also share your feedback on Copilot code review for a chance to win a $100 gift card. Take the survey.

@Joibel Joibel marked this pull request as draft November 27, 2025 14:26
@Joibel Joibel force-pushed the pod-restart branch 2 times, most recently from 32da496 to b7c8e99 Compare November 27, 2025 14:59
claude and others added 10 commits December 4, 2025 15:34
Signed-off-by: Alan Clucas <alan@clucas.org>
Signed-off-by: Alan Clucas <alan@clucas.org>
Signed-off-by: Alan Clucas <alan@clucas.org>
Signed-off-by: Alan Clucas <alan@clucas.org>
Signed-off-by: Alan Clucas <alan@clucas.org>
Signed-off-by: Alan Clucas <alan@clucas.org>
Signed-off-by: Alan Clucas <alan@clucas.org>
Signed-off-by: Alan Clucas <alan@clucas.org>
Signed-off-by: Alan Clucas <alan@clucas.org>
Signed-off-by: Alan Clucas <alan@clucas.org>
@coderabbitai
Copy link

coderabbitai bot commented Dec 4, 2025

Important

Review skipped

Auto reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Note

Other AI code review bot(s) detected

CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

✨ Finishing touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Comment @coderabbitai help to get the list of available commands and usage tips.

@Joibel Joibel marked this pull request as ready for review December 4, 2025 17:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/controller Controller issues, panics

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Transient error? Pod was rejected: The node had condition: [DiskPressure].

2 participants