abandon evaluation of any new catalogsource image which is pathological #3766

grokspawn wants to merge 2 commits into operator-framework:master
Conversation
…y restrts Signed-off-by: grokspawn <jordan@nimblewidget.com>
```go
// podPathologicallyRestarting returns true if any container in the pod has restarts exceeding pathologicalRestartCountThreshold.
func podPathologicallyRestarting(pod *corev1.Pod) bool {
```
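The diff excerpt above cuts off after the function signature. A minimal, self-contained sketch of how such a restart-count check could be completed is shown below; the package name and the threshold value are assumptions for illustration, not necessarily the PR's actual code.

```go
package reconciler // illustrative package name

import corev1 "k8s.io/api/core/v1"

// pathologicalRestartCountThreshold is an assumed value, used here for illustration only.
const pathologicalRestartCountThreshold = 10

// podPathologicallyRestarting returns true if any container in the pod has
// restarts exceeding pathologicalRestartCountThreshold.
func podPathologicallyRestarting(pod *corev1.Pod) bool {
	for _, s := range pod.Status.ContainerStatuses {
		if s.RestartCount > pathologicalRestartCountThreshold {
			return true
		}
	}
	return false
}
```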
Should this also (or instead) check for CrashLoopBackOff?
There are cases like node reboots that cause the container restart counter to increment because something killed the containers, not because the containers were crashing.
My concern with doing that is that the threshold would be immediate. If the pod went CLBO, we would immediately destroy the pod and pull a new image at the next interval. In the past, we have been very careful to avoid creating a state where we would need to pull the image again, but if we're OK with that, I can make the change.
At any rate, I added that consideration in a new commit so we could talk about it. PTAL.
> pull a new image at the next interval
At the next interval, there are two possibilities:
- It's the same digest. If the pod is scheduled to a new node, then we'd re-pull that image to that new node. If the pod is scheduled on a node that had pulled that digest before, no pull necessary. Eventually, all the nodes could end up pulling the bad digest, but each node would only pull the bad digest once.
- It's a different digest. We want to pull this to get past the digest with the bad data.
```diff
 func podContainersArePathological(pod *corev1.Pod) bool {
 	for _, s := range pod.Status.ContainerStatuses {
-		if s.RestartCount > pathologicalRestartCountThreshold {
+		if s.RestartCount > pathologicalRestartCountThreshold || containerInCrashLoopBackOff(s) {
```
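The `containerInCrashLoopBackOff` helper referenced in the new condition is not shown in this excerpt. A plausible sketch, assuming it keys off the container's waiting-state reason as reported by the kubelet (and the `corev1` import from the earlier sketch), might look like this:

```go
// containerInCrashLoopBackOff reports whether the kubelet has put the container
// into CrashLoopBackOff, based on the waiting-state reason in its status.
func containerInCrashLoopBackOff(s corev1.ContainerStatus) bool {
	return s.State.Waiting != nil && s.State.Waiting.Reason == "CrashLoopBackOff"
}
```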
You could maybe make this && instead of || so that both conditions have to be true in order to kill it.
Assuming no restarts due to other reasons, how long does it take a new pod to get to 10 restarts when it is immediately crashing? The 'BackOff' part of CrashLoopBackOff means that we wait longer and longer between restarts.
Based on your review, I ended up dropping the restart-count-based approach entirely. The timing previously would have been roughly (ready-probe + liveness-probe) * threshold.
For my test scenario with a 5m pollInterval yesterday, we hit the threshold on my local machine at about 20 minutes.
Signed-off-by: grokspawn <jordan@nimblewidget.com>
force-pushed from 9985929 to 1098551
Description of the change:
This PR adds a pathological status check for when container status indicates CrashLoopBackOff.
If that status matches, OLM will abandon catalog evaluation for that image, dispose of the pod, and pull a new image when the `pollInterval` comes up again. The catalog operator will now emit a message reporting the pathological pod, and the offending pod will be deleted.
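A hedged sketch of the abandon-and-recreate flow described above, assuming a standard client-go clientset; the function and variable names are illustrative, and the operator's literal log line is not reproduced here:

```go
import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// abandonPathologicalPod deletes a catalog pod whose containers are pathological,
// so that a fresh image is pulled at the next pollInterval. Illustrative only.
func abandonPathologicalPod(ctx context.Context, client kubernetes.Interface, pod *corev1.Pod) error {
	if !podContainersArePathological(pod) {
		return nil
	}
	// The catalog operator also emits a message describing the pathological pod here.
	return client.CoreV1().Pods(pod.Namespace).Delete(ctx, pod.Name, metav1.DeleteOptions{})
}
```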
Motivation for the change:
In the case that a catalogsource defines `.spec.grpcPodConfig.extractContent`, it is possible for OLMv0 to get trapped in an evaluation loop if the catalogsource is not compatible with the on-cluster catalogsource service.
This is because the on-cluster catalog services which use `extractContent` define two initContainers and a service container. When those initContainers succeed, the pod status progresses to RUNNING regardless of the success/failure of the service container. If the service container fails, it will halt, and the pod will start being restarted by kube when it fails readiness/liveness probes. It will remain in RUNNING status, so OLM will requeue its evaluation without end.
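To make the failure mode concrete, here is a small sketch (building on the functions above; the name is illustrative) of the state described in the motivation: the pod phase is Running because the initContainers succeeded, yet the serve container is crash-looping, so a phase-only health check never notices the failure.

```go
// podLooksRunningButIsPathological illustrates why checking only pod phase is
// insufficient for extractContent catalogs: phase is Running while the serve
// container is stuck in CrashLoopBackOff.
func podLooksRunningButIsPathological(pod *corev1.Pod) bool {
	return pod.Status.Phase == corev1.PodRunning && podContainersArePathological(pod)
}
```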
Architectural changes:
Testing remarks:
Reviewer Checklist
- Docs updated or added to `/doc`
- Tests marked as `[FLAKE]` are truly flaky and have an issue