-
Notifications
You must be signed in to change notification settings - Fork 3.4k
feat(controller): auto-restart pods that failed before starting #15086
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR adds automatic pod restart functionality to handle infrastructure failures that occur before the main container starts running. The feature automatically recreates pods that fail due to issues like node evictions, disk pressure, or admission errors, without requiring a retryStrategy configuration.
Key Changes
- Introduced
FailedPodRestartConfigwithenabled,maxRestarts, andbackoffSecondssettings in controller configuration - Added pod restart detection logic that checks if main container never entered Running state for restartable failure reasons (Evicted, NodeShutdown, NodeAffinity, UnexpectedAdmissionError)
- Implemented
pod_restarts_totalmetric with reason, condition, and namespace labels to track automatic restarts
Reviewed changes
Copilot reviewed 32 out of 33 changed files in this pull request and generated 17 comments.
Show a summary per file
| File | Description |
|---|---|
workflow/controller/pod_restart.go |
Core logic for analyzing pods and determining if they qualify for automatic restart |
workflow/controller/operator.go |
Integration point that detects failed pods and triggers restart by deleting pod and marking node as Pending |
workflow/metrics/counter_pod_restart.go |
Metric recording implementation with condition extraction from pod status messages |
workflow/metrics/counter_pod_restart_test.go |
Unit tests for condition extraction logic |
workflow/controller/pod_restart_test.go |
Unit tests covering restart eligibility detection for various pod states |
test/e2e/pod_restart_test.go |
E2E test simulating pod eviction and verifying successful restart and workflow completion |
test/e2e/metrics_test.go |
E2E test validating pod restart metrics are correctly recorded |
config/config.go |
Configuration structure with helper methods for feature enablement and settings |
util/telemetry/ |
Telemetry infrastructure for the new pod_restarts_total metric |
pkg/apis/workflow/v1alpha1/ |
API schema additions for FailedPodRestarts field in NodeStatus |
sdks/python/client/ |
Python SDK updates for new NodeStatus field |
sdks/java/client/ |
Java SDK updates for new NodeStatus field |
docs/pod-restarts.md |
New documentation page explaining the feature, configuration, and usage |
docs/metrics.md |
Documentation for the new pod_restarts_total metric |
docs/workflow-controller-configmap.md |
Configuration reference for FailedPodRestartConfig |
docs/retries.md |
Cross-reference note directing readers to pod restart feature for infrastructure failures |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
You can also share your feedback on Copilot code review for a chance to win a $100 gift card. Take the survey.
sdks/python/client/argo_workflows/model/io_argoproj_workflow_v1alpha1_node_status.py
Outdated
Show resolved
Hide resolved
sdks/python/client/docs/IoArgoprojWorkflowV1alpha1NodeStatus.md
Outdated
Show resolved
Hide resolved
32da496 to
b7c8e99
Compare
Signed-off-by: Alan Clucas <alan@clucas.org>
Signed-off-by: Alan Clucas <alan@clucas.org>
Signed-off-by: Alan Clucas <alan@clucas.org>
Signed-off-by: Alan Clucas <alan@clucas.org>
Signed-off-by: Alan Clucas <alan@clucas.org>
Signed-off-by: Alan Clucas <alan@clucas.org>
Signed-off-by: Alan Clucas <alan@clucas.org>
Signed-off-by: Alan Clucas <alan@clucas.org>
|
Important Review skippedAuto reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the You can disable this status message by setting the Note Other AI code review bot(s) detectedCodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review. ✨ Finishing touches🧪 Generate unit tests (beta)
Comment |
Fixes #12572
Motivation
A retryStrategy can restart pods, but sometimes this is not desired or hard to implement everywhere.
Add a new restart strategy which will recreate pods when they fail prior to starting
mainModifications
Adds automatic restart for pods that fail due to infrastructure issues before the main container starts. This handles transient failures like node evictions, disk pressure, or unexpected admission errors without requiring a retryStrategy.
When a pod fails before its main container enters Running state, the controller checks if the failure reason indicates an infrastructure issue. If so, the pod is deleted and the node is marked Pending to recreate it.
Restartable failure reasons:
You must enable this in workflow-controller-configmap:
Added a metric to track this firing
FailedPodRestartscounterpod_restarts_totalmetric with reason, condition, and namespace labelsVerification
New unit tests which are mostly pretty lightweight.
E2e tests for the metrics and the actual feature, which fakes pod eviction.
Documentation
Added a new doc page