
abandon evaluation of any new catalogsource image which is pathological #3766

Open
grokspawn wants to merge 2 commits into operator-framework:master from grokspawn:fix-wedged-newcat-eval

Conversation

@grokspawn
Contributor

grokspawn commented Feb 11, 2026

Description of the change:

This PR adds a pathological-status check that fires when a container's status indicates CrashLoopBackOff.
If that status matches, OLM will abandon catalog evaluation for that image, dispose of the pod, and pull a new image when the pollInterval comes up again.

This will now cause the catalog operator to emit a message of the form

time="2026-02-11T22:30:17Z" level=info msg="catalog polling result: update pod poison-pill-9qvmp failed to start" UpdatePod=poison-pill-9qvmp

and the offending pod will be deleted.
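
As a sketch of the shape of this check (the names below are illustrative, not the exact helpers introduced by this PR), assuming k8s.io/api and client-go:

    package sketch

    import (
        "context"

        corev1 "k8s.io/api/core/v1"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/client-go/kubernetes"
    )

    // podIsPathological reports whether any container in the update pod is
    // currently waiting in CrashLoopBackOff.
    func podIsPathological(pod *corev1.Pod) bool {
        for _, s := range pod.Status.ContainerStatuses {
            if s.State.Waiting != nil && s.State.Waiting.Reason == "CrashLoopBackOff" {
                return true
            }
        }
        return false
    }

    // abandonUpdatePod deletes the pathological update pod; the catalog
    // operator will create a fresh pod (re-pulling the image if necessary)
    // when the pollInterval next fires.
    func abandonUpdatePod(ctx context.Context, client kubernetes.Interface, pod *corev1.Pod) error {
        return client.CoreV1().Pods(pod.Namespace).Delete(ctx, pod.Name, metav1.DeleteOptions{})
    }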

Motivation for the change:

In the case that a catalogsource defines .spec.grpcPodConfig.extractContent, it is possible for OLMv0 to get trapped in an evaluation loop if the catalogsource is not compatible with the on-cluster catalogsource service.

This is because the on-cluster catalog services which use extractContent define two initContainers and a service container. Once those initContainers succeed, the pod status progresses to Running regardless of the success/failure of the service container.

If the service container fails, it halts, and the kubelet repeatedly restarts it as it fails its readiness/liveness probes. The pod remains in Running status, so OLM requeues its evaluation without end.
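
To make the wedge concrete, here is a hand-constructed status of the kind described above (container names are representative of an extractContent catalog pod, not quoted from this PR); a check on pod.Status.Phase alone never sees anything other than Running:

    package sketch

    import corev1 "k8s.io/api/core/v1"

    // Both initContainers terminated successfully, so the pod phase is Running
    // even though the serving container is stuck in CrashLoopBackOff.
    var wedgedCatalogPodStatus = corev1.PodStatus{
        Phase: corev1.PodRunning,
        InitContainerStatuses: []corev1.ContainerStatus{
            {Name: "extract-utilities", State: corev1.ContainerState{Terminated: &corev1.ContainerStateTerminated{ExitCode: 0}}},
            {Name: "extract-content", State: corev1.ContainerState{Terminated: &corev1.ContainerStateTerminated{ExitCode: 0}}},
        },
        ContainerStatuses: []corev1.ContainerStatus{
            {
                Name:         "registry-server",
                Ready:        false,
                RestartCount: 7,
                State: corev1.ContainerState{
                    Waiting: &corev1.ContainerStateWaiting{Reason: "CrashLoopBackOff"},
                },
            },
        },
    }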

Architectural changes:

Testing remarks:

Reviewer Checklist

  • Implementation matches the proposed design, or proposal is updated to match implementation
  • Sufficient unit test coverage
  • Sufficient end-to-end test coverage
  • Bug fixes are accompanied by regression test(s)
  • e2e tests and flake fixes are accompanied by evidence of flake testing, e.g. executing the test 100(0) times
  • tech debt/todo is accompanied by issue link(s) in comments in the surrounding code
  • Tests are comprehensible, e.g. Ginkgo DSL is being used appropriately
  • Docs updated or added to /doc
  • Commit messages sensible and descriptive
  • Tests marked as [FLAKE] are truly flaky and have an issue
  • Code is properly formatted

…y restrts

Signed-off-by: grokspawn <jordan@nimblewidget.com>
@openshift-ci

openshift-ci bot commented Feb 11, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign tmshort for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

}

// podPathologicallyRestarting returns true if any container in the pod has restarts exceeding pathologicalRestartCountThreshold.
func podPathologicallyRestarting(pod *corev1.Pod) bool {
Member

Should this also (or instead) check for CrashLoopBackOff?

There are cases like node reboots that cause the container restart counter to increment because something killed the containers, not because the containers were crashing.

Contributor Author

My concern with doing that is that the threshold would be immediate. If the pod went CLBO, we would immediately destroy the pod and pull a new image at the next interval. In the past, we have been very careful to avoid creating a state where we would need to pull the image again, but if we're OK with that I can.

Contributor Author

At any rate, I added that consideration in a new commit so we could talk about it. PTAL.

Member

pull a new image at the next interval

At the next interval, there are two possibilities:

  1. It's the same digest. If the pod is scheduled to a new node, then we'd re-pull that image to that new node. If the pod is scheduled on a node that had pulled that digest before, no pull necessary. Eventually, all the nodes could end up pulling the bad digest, but each node would only pull the bad digest once.
  2. It's a different digest. We want to pull this to get past the digest with the bad data.

func podContainersArePathological(pod *corev1.Pod) bool {
    for _, s := range pod.Status.ContainerStatuses {
-       if s.RestartCount > pathologicalRestartCountThreshold {
+       if s.RestartCount > pathologicalRestartCountThreshold || containerInCrashLoopBackOff(s) {
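
For context, the helper called on the added line is not shown in this excerpt; a minimal sketch of what containerInCrashLoopBackOff could look like (assumed, not necessarily the PR's exact implementation; corev1 is k8s.io/api/core/v1):

    // containerInCrashLoopBackOff reports whether the container is currently
    // waiting with reason CrashLoopBackOff.
    func containerInCrashLoopBackOff(s corev1.ContainerStatus) bool {
        return s.State.Waiting != nil && s.State.Waiting.Reason == "CrashLoopBackOff"
    }
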
Member

You could maybe make this && instead of || so that both conditions have to be true in order to kill it.

Assuming no restarts due to other reasons, how long does it take a new pod to get to 10 restarts when it is immediately crashing? The 'Backoff' part of CrashLoopBackOff means that we wait longer and longer between restarts.
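
For a rough sense of scale (assuming kubelet's default crash-loop back-off of 10s, doubling per restart and capped at 5m, and a container that exits immediately): the back-offs before the first ten restarts sum to roughly 10 + 20 + 40 + 80 + 160 + 300 + 300 + 300 + 300 + 300 ≈ 1810s, i.e. on the order of 30 minutes before the restart counter reaches 10, not counting container run time or probe periods.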

Contributor Author

Based on your review, I ended up dropping the restart-count-based approach entirely. The timing previously would be ready-probe+liveness-probe * threshold.
For my test scenario with a 5m pollInterval yesterday, we hit the threshold on my local machine at about 20 minutes.

Signed-off-by: grokspawn <jordan@nimblewidget.com>
grokspawn force-pushed the fix-wedged-newcat-eval branch from 9985929 to 1098551 on February 12, 2026 15:56
grokspawn changed the title from "abandon evaluation of any new catalogsource image which pathologically restarts" to "abandon evaluation of any new catalogsource image which is pathological" on Feb 13, 2026
