abandon evaluation of any new catalogsource image which is pathological #3766

grokspawn wants to merge 2 commits into operator-framework:master
Conversation
…y restrts Signed-off-by: grokspawn <jordan@nimblewidget.com>
```go
// podPathologicallyRestarting returns true if any container in the pod has restarts exceeding pathologicalRestartCountThreshold.
func podPathologicallyRestarting(pod *corev1.Pod) bool {
```
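The diff excerpt above cuts off after the function signature. A minimal, self-contained sketch of how such a restart-count check could be completed is shown below; the package name and the threshold value are assumptions for illustration, not necessarily the PR's actual code.

```go
package reconciler // illustrative package name

import corev1 "k8s.io/api/core/v1"

// pathologicalRestartCountThreshold is an assumed value, used here for illustration only.
const pathologicalRestartCountThreshold = 10

// podPathologicallyRestarting returns true if any container in the pod has
// restarts exceeding pathologicalRestartCountThreshold.
func podPathologicallyRestarting(pod *corev1.Pod) bool {
	for _, s := range pod.Status.ContainerStatuses {
		if s.RestartCount > pathologicalRestartCountThreshold {
			return true
		}
	}
	return false
}
```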
Should this also (or instead) check for CrashLoopBackOff?
There are cases like node reboots that cause the container restart counter to increment because something killed the containers, not because the containers were crashing.
My concern with doing that is that the threshold would be immediate. If the pod went CLBO, we would immediately destroy the pod and pull a new image at the next interval. In the past, we have been very careful to avoid creating a state where we would need to pull the image again, but if we're OK with that, I can make the change.
At any rate, I added that consideration in a new commit so we could talk about it. PTAL.
> pull a new image at the next interval
At the next interval, there are two possibilities:
- It's the same digest. If the pod is scheduled to a new node, then we'd re-pull that image to that new node. If the pod is scheduled on a node that had pulled that digest before, no pull necessary. Eventually, all the nodes could end up pulling the bad digest, but each node would only pull the bad digest once.
- It's a different digest. We want to pull this to get past the digest with the bad data.
```diff
 func podContainersArePathological(pod *corev1.Pod) bool {
 	for _, s := range pod.Status.ContainerStatuses {
-		if s.RestartCount > pathologicalRestartCountThreshold {
+		if s.RestartCount > pathologicalRestartCountThreshold || containerInCrashLoopBackOff(s) {
```
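The `containerInCrashLoopBackOff` helper referenced in the new condition is not shown in this excerpt. A plausible sketch, assuming it keys off the container's waiting-state reason as reported by the kubelet (and the `corev1` import from the earlier sketch), might look like this:

```go
// containerInCrashLoopBackOff reports whether the kubelet has put the container
// into CrashLoopBackOff, based on the waiting-state reason in its status.
func containerInCrashLoopBackOff(s corev1.ContainerStatus) bool {
	return s.State.Waiting != nil && s.State.Waiting.Reason == "CrashLoopBackOff"
}
```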
You could maybe make this && instead of || so that both conditions have to be true in order to kill it.
Assuming no restarts due to other reasons, how long does it take a new pod to get to 10 restarts when it is immediately crashing? The 'BackOff' part of CrashLoopBackOff means that we wait longer and longer between restarts.
Based on your review, I ended up dropping the restart-count-based approach entirely. The timing previously would have been roughly (ready-probe + liveness-probe) * threshold.
For my test scenario with a 5m pollInterval yesterday, we hit the threshold on my local machine at about 20 minutes.
Signed-off-by: grokspawn <jordan@nimblewidget.com>
force-pushed from 9985929 to 1098551
Description of the change:
This PR adds a pathological status check for when container status indicates CrashLoopBackOff.
If that status matches, OLM will abandon catalog evaluation for that image, dispose of the pod, and pull a new image when the `pollInterval` comes up again. The catalog operator will now emit a message reporting the pathological pod, and the offending pod will be deleted.
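A hedged sketch of the abandon-and-recreate flow described above, assuming a standard client-go clientset; the function and variable names are illustrative, and the operator's literal log line is not reproduced here:

```go
import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// abandonPathologicalPod deletes a catalog pod whose containers are pathological,
// so that a fresh image is pulled at the next pollInterval. Illustrative only.
func abandonPathologicalPod(ctx context.Context, client kubernetes.Interface, pod *corev1.Pod) error {
	if !podContainersArePathological(pod) {
		return nil
	}
	// The catalog operator also emits a message describing the pathological pod here.
	return client.CoreV1().Pods(pod.Namespace).Delete(ctx, pod.Name, metav1.DeleteOptions{})
}
```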
Motivation for the change:
In the case that a catalogsource defines `.spec.grpcPodConfig.extractContent`, it is possible for OLMv0 to get trapped in an evaluation loop if the catalogsource is not compatible with the on-cluster catalogsource service.
This is because the on-cluster catalog services which use `extractContent` define two initContainers and a service container. When those initContainers succeed, the pod status progresses to RUNNING regardless of the success/failure of the service container. If the service container fails, it will halt, and the pod will start being restarted by kube when it fails readiness/liveness probes. It will remain in RUNNING status, so OLM will requeue its evaluation without end.
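To make the failure mode concrete, here is a small sketch (building on the functions above; the name is illustrative) of the state described in the motivation: the pod phase is Running because the initContainers succeeded, yet the serve container is crash-looping, so a phase-only health check never notices the failure.

```go
// podLooksRunningButIsPathological illustrates why checking only pod phase is
// insufficient for extractContent catalogs: phase is Running while the serve
// container is stuck in CrashLoopBackOff.
func podLooksRunningButIsPathological(pod *corev1.Pod) bool {
	return pod.Status.Phase == corev1.PodRunning && podContainersArePathological(pod)
}
```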
Architectural changes:
Testing remarks:
Reviewer Checklist
- Docs updated or added to `/doc`
- Tests marked as `[FLAKE]` are truly flaky and have an issue