feat(controller): auto-restart pods that failed before starting #15086

Joibel · 2025-11-27T11:18:42Z

Motivation

A retryStrategy can restart pods, but that is not always desirable, and it is hard to apply to every template.
This PR adds a restart mechanism that recreates pods which fail before the main container starts.

Modifications

Adds automatic restart for pods that fail due to infrastructure issues before the main container starts. This handles transient failures like node evictions, disk pressure, or unexpected admission errors without requiring a retryStrategy.

When a pod fails before its main container enters Running state, the controller checks if the failure reason indicates an infrastructure issue. If so, the pod is deleted and the node is marked Pending to recreate it.

Restartable failure reasons:

Evicted (node pressure eviction)
NodeShutdown (graceful node shutdown)
NodeAffinity (node affinity/selector no longer matches)
UnexpectedAdmissionError
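
The eligibility check described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `podInfo` and `shouldRestart` are hypothetical names, the real controller inspects a Kubernetes pod object, and treating `maxRestarts` as a cap on restarts already performed for the node is an assumption.

```go
package main

// podInfo is a hypothetical, simplified view of a failed pod.
// The real controller works from the Kubernetes pod status instead.
type podInfo struct {
	Reason          string // failure reason reported on the pod, e.g. "Evicted"
	MainEverStarted bool   // true if the main container ever entered Running
	Restarts        int    // automatic restarts already performed (assumed counter)
}

// restartableReasons mirrors the list of reasons in the PR description.
var restartableReasons = map[string]bool{
	"Evicted":                  true,
	"NodeShutdown":             true,
	"NodeAffinity":             true,
	"UnexpectedAdmissionError": true,
}

// shouldRestart reports whether a failed pod qualifies for an automatic
// restart: the feature is enabled, main never started, the restart budget
// is not exhausted, and the reason is an infrastructure failure.
func shouldRestart(p podInfo, enabled bool, maxRestarts int) bool {
	if !enabled || p.MainEverStarted {
		return false
	}
	if p.Restarts >= maxRestarts {
		return false
	}
	return restartableReasons[p.Reason]
}
```

Under these assumptions, `shouldRestart(podInfo{Reason: "Evicted"}, true, 3)` is true, while the same pod with `MainEverStarted: true` is left to the normal failure handling (including any retryStrategy).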

You must enable this in workflow-controller-configmap:

      failedPodRestart:
        enabled: true                        
        maxRestarts: 3

Added tracking for when a restart fires:

Node status includes FailedPodRestarts counter
New pod_restarts_total metric with reason, condition, and namespace labels
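
How the `condition` label is populated is not spelled out here. One plausible approach, sketched below, is to parse it out of the pod status message. The message format ("The node had condition: [DiskPressure].") is an assumption about kubelet eviction messages, and `extractCondition`, `restartLabels`, and `labelsFor` are hypothetical names, not the PR's API.

```go
package main

import "regexp"

// conditionRe matches messages of the assumed form
// "The node had condition: [DiskPressure]." and captures the first
// condition inside the brackets.
var conditionRe = regexp.MustCompile(`The node had condition: \[(\w+)`)

// restartLabels holds hypothetical label values for pod_restarts_total.
type restartLabels struct {
	Reason    string
	Condition string
	Namespace string
}

// extractCondition returns the node condition named in the message,
// or an empty string when the message does not mention one.
func extractCondition(message string) string {
	m := conditionRe.FindStringSubmatch(message)
	if m == nil {
		return ""
	}
	return m[1]
}

// labelsFor assembles the three labels for one restart event.
func labelsFor(reason, message, namespace string) restartLabels {
	return restartLabels{
		Reason:    reason,
		Condition: extractCondition(message),
		Namespace: namespace,
	}
}
```

Leaving the condition empty for messages without one keeps the set of label values small, which matters for a counter keyed by labels.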

Verification

New unit tests, most of them lightweight.
E2E tests for the metrics and for the feature itself; the feature test fakes a pod eviction.

Documentation

Added a new doc page

Copilot

Pull request overview

This PR adds automatic pod restart functionality to handle infrastructure failures that occur before the main container starts running. The feature automatically recreates pods that fail due to issues like node evictions, disk pressure, or admission errors, without requiring a retryStrategy configuration.

Key Changes

Introduced FailedPodRestartConfig with enabled, maxRestarts, and backoffSeconds settings in controller configuration
Added pod restart detection logic that checks if main container never entered Running state for restartable failure reasons (Evicted, NodeShutdown, NodeAffinity, UnexpectedAdmissionError)
Implemented pod_restarts_total metric with reason, condition, and namespace labels to track automatic restarts
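
As a rough picture of the configuration shape named above, the sketch below shows a struct with the three settings and nil-safe helpers. Only the setting names `enabled`, `maxRestarts`, and `backoffSeconds` come from this review; the field types, method names, and the behaviour for a missing config block are assumptions.

```go
package main

import "time"

// FailedPodRestartConfig is a sketch of the controller setting.
// Field types are assumed; only the JSON names appear in the review.
type FailedPodRestartConfig struct {
	Enabled        bool  `json:"enabled,omitempty"`
	MaxRestarts    int32 `json:"maxRestarts,omitempty"`
	BackoffSeconds int32 `json:"backoffSeconds,omitempty"`
}

// IsEnabled is safe to call on a nil config, so callers need no nil check
// when the block is absent from the ConfigMap.
func (c *FailedPodRestartConfig) IsEnabled() bool {
	return c != nil && c.Enabled
}

// GetMaxRestarts returns the configured cap, or 0 when no config is set.
func (c *FailedPodRestartConfig) GetMaxRestarts() int32 {
	if c == nil {
		return 0
	}
	return c.MaxRestarts
}

// Backoff converts BackoffSeconds into a time.Duration.
func (c *FailedPodRestartConfig) Backoff() time.Duration {
	if c == nil {
		return 0
	}
	return time.Duration(c.BackoffSeconds) * time.Second
}
```

Nil-safe getters are a common Go pattern for optional config sections: a missing block behaves like a disabled feature rather than causing a nil dereference.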

Reviewed changes

Copilot reviewed 32 out of 33 changed files in this pull request and generated 17 comments.

| File | Description |
| --- | --- |
| `workflow/controller/pod_restart.go` | Core logic for analyzing pods and determining if they qualify for automatic restart |
| `workflow/controller/operator.go` | Integration point that detects failed pods and triggers restart by deleting pod and marking node as Pending |
| `workflow/metrics/counter_pod_restart.go` | Metric recording implementation with condition extraction from pod status messages |
| `workflow/metrics/counter_pod_restart_test.go` | Unit tests for condition extraction logic |
| `workflow/controller/pod_restart_test.go` | Unit tests covering restart eligibility detection for various pod states |
| `test/e2e/pod_restart_test.go` | E2E test simulating pod eviction and verifying successful restart and workflow completion |
| `test/e2e/metrics_test.go` | E2E test validating pod restart metrics are correctly recorded |
| `config/config.go` | Configuration structure with helper methods for feature enablement and settings |
| `util/telemetry/` | Telemetry infrastructure for the new pod_restarts_total metric |
| `pkg/apis/workflow/v1alpha1/` | API schema additions for FailedPodRestarts field in NodeStatus |
| `sdks/python/client/` | Python SDK updates for new NodeStatus field |
| `sdks/java/client/` | Java SDK updates for new NodeStatus field |
| `docs/pod-restarts.md` | New documentation page explaining the feature, configuration, and usage |
| `docs/metrics.md` | Documentation for the new pod_restarts_total metric |
| `docs/workflow-controller-configmap.md` | Configuration reference for FailedPodRestartConfig |
| `docs/retries.md` | Cross-reference note directing readers to pod restart feature for infrastructure failures |

docs/pod-restarts.md

sdks/python/client/argo_workflows/model/io_argoproj_workflow_v1alpha1_node_status.py

api/openapi-spec/swagger.json

docs/fields.md

docs/workflow-controller-configmap.md

sdks/python/client/docs/IoArgoprojWorkflowV1alpha1NodeStatus.md

docs/workflow-controller-configmap.md

config/config.go

pkg/apis/workflow/v1alpha1/workflow_types.go

api/jsonschema/schema.json

Signed-off-by: Alan Clucas <alan@clucas.org>

coderabbitai · 2025-12-04T15:38:51Z

Important

Review skipped

Auto reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Note

Other AI code review bot(s) detected

CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.


Joibel requested review from sarabala1979 and terrytangyuan as code owners November 27, 2025 11:18

Joibel added the area/controller Controller issues, panics label Nov 27, 2025

Joibel force-pushed the pod-restart branch from cdb7715 to 972dbfc Compare November 27, 2025 11:39

Joibel requested a review from Copilot November 27, 2025 11:59

Copilot started reviewing on behalf of Joibel November 27, 2025 12:00

Copilot finished reviewing on behalf of Joibel November 27, 2025 12:03

Copilot AI reviewed Nov 27, 2025

Joibel marked this pull request as draft November 27, 2025 14:26

Joibel force-pushed the pod-restart branch 2 times, most recently from 32da496 to b7c8e99 Compare November 27, 2025 14:59

claude and others added 10 commits December 4, 2025 15:34

feat(controller): auto-restart pods that failed before starting

9d59ae4

Signed-off-by: Alan Clucas <alan@clucas.org>

CI test

815345d

Signed-off-by: Alan Clucas <alan@clucas.org>

fix

b9c9c76

Signed-off-by: Alan Clucas <alan@clucas.org>

fix: track UUID

e212a8c

Signed-off-by: Alan Clucas <alan@clucas.org>

fix: codegen

7025bb8

Signed-off-by: Alan Clucas <alan@clucas.org>

fix: by rabbitai

a5f8f6d

Signed-off-by: Alan Clucas <alan@clucas.org>

fix: unit test

2417eb7

Signed-off-by: Alan Clucas <alan@clucas.org>

feat: delete by pod UID

3c1b9c1

Signed-off-by: Alan Clucas <alan@clucas.org>

fix: codegen

01549c7

Signed-off-by: Alan Clucas <alan@clucas.org>

fix: more codgen

e91137e

Signed-off-by: Alan Clucas <alan@clucas.org>

Joibel force-pushed the pod-restart branch from 549eac1 to e91137e Compare December 4, 2025 15:38

Joibel marked this pull request as ready for review December 4, 2025 17:19
