fix: retry transient registry errors in opm QE by bandrade · Pull Request #1272 · openshift/operator-framework-olm

bandrade · 2026-03-30T18:29:53Z

What changed

add retry handling to tests-extension's shared opmcli.Output() helper
retry up to 3 times with a short backoff when opm hits transient registry pull failures
keep the existing command and assertion flow unchanged for QE specs

Why it changed

The OLMv0 opm QE coverage can fail on transient Quay pull errors while rendering catalog content. One observed failure was opm alpha render-graph quay.io/openshifttest/nginxolm-operator-index:v1 --use-http exiting after a 503 Service Unavailable response while fetching a blob.

Impact

This reduces test flakiness for external registry hiccups without masking persistent failures. Non-transient command errors still fail immediately.

Root cause

The shared opmcli helper executed image-backed opm commands only once, so temporary registry/network failures surfaced as hard test failures.

Validation

go test ./test/qe/util/opmcli

openshift-ci · 2026-03-30T18:29:58Z

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

coderabbitai · 2026-03-30T18:30:03Z

Walkthrough

Modified (*CLI).Output() to retry command execution up to 3 times for transient registry/network errors, sleeping 2s between attempts. Adjusted logging and error propagation, added isTransientRegistryError(), and imported errors and time.

Changes

Cohort / File(s)	Summary
Retry logic and error handling `tests-extension/test/qe/util/opmcli/opm_client.go`	Rewrote `(*CLI).Output()` to loop up to 3 attempts, detect transient registry/network error substrings (502/503/504, 429, TLS handshake timeout, connection resets, EOF variants) via new `isTransientRegistryError()`, sleep 2s between retries, log "Running…" only on first attempt, and change error handling to use `errors.As` for `exec.ExitError` and return trimmed output on success. Added `errors` and `time` imports.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

openshift-ci · 2026-03-30T18:30:30Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: bandrade
Once this PR has been reviewed and has the lgtm label, please assign oceanc80 for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

DOWNSTREAM_OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests-extension/test/qe/util/opmcli/opm_client.go`:
- Around line 223-242: The current isTransientRegistryError function uses a bare
"EOF" in the transientErrors slice which can match substrings (e.g., "GEOFFREY")
and cause false-positive retries; update the logic in isTransientRegistryError
(and the transientErrors slice) to remove the broad "EOF" entry and instead
detect EOF only as a standalone token or exact error (e.g., use a regexp with
word boundaries like `\bEOF\b` or check for output == "EOF" or lines trimmed
equal "EOF"), preserving other entries like "unexpected EOF"; ensure you
reference and update the transientErrors variable and the loop that scans output
so only genuine EOF errors trigger a retry.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: d66d2f34-7d6a-4ce6-bd66-98f36f80f588

📥 Commits

Reviewing files that changed from the base of the PR and between 6436231 and 00c118d.

📒 Files selected for processing (1)

tests-extension/test/qe/util/opmcli/opm_client.go

coderabbitai · 2026-03-30T19:53:41Z

tests-extension/test/qe/util/opmcli/opm_client.go

+func isTransientRegistryError(output string) bool {
+	transientErrors := []string{
+		"503 Service Unavailable",
+		"received unexpected HTTP status: 503",
+		"502 Bad Gateway",
+		"504 Gateway Timeout",
+		"429 Too Many Requests",
+		"TLS handshake timeout",
+		"i/o timeout",
+		"connection reset by peer",
+		"unexpected EOF",
+		"EOF",
 	}
-	// Execute command and capture combined output
-	out, err := cmd.CombinedOutput()
-	trimmed := strings.TrimSpace(string(out))
-	// Handle different types of execution results
-	switch exitErr := err.(type) {
-	case nil:
-		// Successful execution
-		c.stdout = bytes.NewBuffer(out)
-		return trimmed, nil
-	case *exec.ExitError:
-		// Command executed but returned non-zero exit code
-		e2e.Logf("Error running %v:\n%s", cmd, trimmed)
-		return trimmed, &ExitError{ExitError: exitErr, Cmd: c.execPath + " " + strings.Join(c.finalArgs, " "), StdErr: trimmed}
-	default:
-		// Fatal error preventing command execution
-		FatalErr(fmt.Errorf("unable to execute %q: %v", c.execPath, err))
-		// unreachable code
-		return "", nil
+	for _, errMsg := range transientErrors {
+		if strings.Contains(output, errMsg) {
+			return true
+		}
 	}
+	return false
 }


⚠️ Potential issue | 🟡 Minor

Overly broad "EOF" substring match may cause false positive retries.

The bare "EOF" pattern matches any string containing those characters (e.g., "GEOFFREY", "EOF_MARKER", or unrelated error messages). This could trigger retries on non-transient failures.

Consider using more specific patterns that reflect actual error messages from Go's standard library or container runtimes:

Proposed fix

transientErrors := []string{ "503 Service Unavailable", "received unexpected HTTP status: 503", "502 Bad Gateway", "504 Gateway Timeout", "429 Too Many Requests", "TLS handshake timeout", "i/o timeout", "connection reset by peer", - "unexpected EOF", - "EOF", + "unexpected EOF", + ": EOF", }

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

func isTransientRegistryError(output string) bool {

transientErrors := []string{

"503 Service Unavailable",

"received unexpected HTTP status: 503",

"502 Bad Gateway",

"504 Gateway Timeout",

"429 Too Many Requests",

"TLS handshake timeout",

"i/o timeout",

"connection reset by peer",

"unexpected EOF",

"EOF",

}

// Execute command and capture combined output

out, err := cmd.CombinedOutput()

trimmed := strings.TrimSpace(string(out))

// Handle different types of execution results

switch exitErr := err.(type) {

case nil:

// Successful execution

c.stdout = bytes.NewBuffer(out)

return trimmed, nil

case *exec.ExitError:

// Command executed but returned non-zero exit code

e2e.Logf("Error running %v:\n%s", cmd, trimmed)

return trimmed, &ExitError{ExitError: exitErr, Cmd: c.execPath + " " + strings.Join(c.finalArgs, " "), StdErr: trimmed}

default:

// Fatal error preventing command execution

FatalErr(fmt.Errorf("unable to execute %q: %v", c.execPath, err))

// unreachable code

return "", nil

for _, errMsg := range transientErrors {

if strings.Contains(output, errMsg) {

return true

}

}

return false

}

func isTransientRegistryError(output string) bool {

transientErrors := []string{

"503 Service Unavailable",

"received unexpected HTTP status: 503",

"502 Bad Gateway",

"504 Gateway Timeout",

"429 Too Many Requests",

"TLS handshake timeout",

"i/o timeout",

"connection reset by peer",

"unexpected EOF",

": EOF",

}

for _, errMsg := range transientErrors {

if strings.Contains(output, errMsg) {

return true

}

}

return false

}

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@tests-extension/test/qe/util/opmcli/opm_client.go` around lines 223 - 242, The current isTransientRegistryError function uses a bare "EOF" in the transientErrors slice which can match substrings (e.g., "GEOFFREY") and cause false-positive retries; update the logic in isTransientRegistryError (and the transientErrors slice) to remove the broad "EOF" entry and instead detect EOF only as a standalone token or exact error (e.g., use a regexp with word boundaries like `\bEOF\b` or check for output == "EOF" or lines trimmed equal "EOF"), preserving other entries like "unexpected EOF"; ensure you reference and update the transientErrors variable and the loop that scans output so only genuine EOF errors trigger a retry.

coderabbitai

Actionable comments posted: 1

♻️ Duplicate comments (1)

tests-extension/test/qe/util/opmcli/opm_client.go (1)
220-235: ⚠️ Potential issue | 🟠 Major

Don't match a bare EOF with strings.Contains.

Line 231 adds "EOF", and Line 234 will retry any output containing that token. That broadens retries beyond transient registry failures and can mask persistent errors.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests-extension/test/qe/util/opmcli/opm_client.go` around lines 220 - 235,
The isTransientRegistryError function currently treats any output containing the
bare token "EOF" as transient via the transientErrors slice; remove the broad
"EOF" entry (or replace it with a precise check) so only true transient cases
are matched — e.g. delete "EOF" from transientErrors and, if you still need to
treat an exact EOF-only output as transient, add an explicit check like
strings.TrimSpace(output) == "EOF" or a regex with word boundaries instead of
using strings.Contains; update the transientErrors slice and the matching logic
in isTransientRegistryError accordingly.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests-extension/test/qe/util/opmcli/opm_client.go`:
- Around line 162-178: The stdin buffer is reused across retries so
CombinedOutput() on a failed attempt leaves EOF for subsequent attempts; inside
the retry loop (before assigning cmd.Stdin/use of exec.Command) clone/reset
stdin for each attempt: if c.stdin != nil and is a bytes.Buffer (or supports
Bytes()), create a fresh bytes.NewBuffer(copyOfBytes) and assign that to
cmd.Stdin (instead of reusing c.stdin) so each retry gets the original payload;
update the retry loop around exec.Command creation (symbols: c.stdin,
exec.Command, CombinedOutput(), maxRetries) to perform this reset.

---

Duplicate comments:
In `@tests-extension/test/qe/util/opmcli/opm_client.go`:
- Around line 220-235: The isTransientRegistryError function currently treats
any output containing the bare token "EOF" as transient via the transientErrors
slice; remove the broad "EOF" entry (or replace it with a precise check) so only
true transient cases are matched — e.g. delete "EOF" from transientErrors and,
if you still need to treat an exact EOF-only output as transient, add an
explicit check like strings.TrimSpace(output) == "EOF" or a regex with word
boundaries instead of using strings.Contains; update the transientErrors slice
and the matching logic in isTransientRegistryError accordingly.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 9fc45e8b-16ec-413d-8efe-60869cb49fef

📥 Commits

Reviewing files that changed from the base of the PR and between 00c118d and 19fadf7.

📒 Files selected for processing (1)

tests-extension/test/qe/util/opmcli/opm_client.go

coderabbitai · 2026-03-30T21:10:16Z

tests-extension/test/qe/util/opmcli/opm_client.go

+	var lastErr error
+	var lastTrimmed string
+
+	for attempt := 1; attempt <= maxRetries; attempt++ {
+		// Create the command with the executable and arguments
+		cmd := exec.Command(c.execPath, c.finalArgs...)
+		// Set registry authentication file if configured
+		if c.podmanAuthfile != "" {
+			cmd.Env = append(os.Environ(), "REGISTRY_AUTH_FILE="+c.podmanAuthfile)
+		}
+		// Set working directory if specified
+		if c.ExecCommandPath != "" {
+			e2e.Logf("set exec command path is %s\n", c.ExecCommandPath)
+			cmd.Dir = c.ExecCommandPath
+		}
+		// Set stdin buffer
+		cmd.Stdin = c.stdin


⚠️ Potential issue | 🟠 Major

Reset stdin for each retry.

Lines 177-178 reuse the same *bytes.Buffer across attempts. CombinedOutput() consumes cmd.Stdin, so a retried command will see EOF/empty input instead of the original payload.

Suggested fix

var lastErr error var lastTrimmed string + var stdinBytes []byte + if c.stdin != nil { + stdinBytes = append([]byte(nil), c.stdin.Bytes()...) + } for attempt := 1; attempt <= maxRetries; attempt++ { // Create the command with the executable and arguments cmd := exec.Command(c.execPath, c.finalArgs...) @@ - // Set stdin buffer - cmd.Stdin = c.stdin + // Reset stdin for each attempt + if stdinBytes != nil { + cmd.Stdin = bytes.NewReader(stdinBytes) + }

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@tests-extension/test/qe/util/opmcli/opm_client.go` around lines 162 - 178, The stdin buffer is reused across retries so CombinedOutput() on a failed attempt leaves EOF for subsequent attempts; inside the retry loop (before assigning cmd.Stdin/use of exec.Command) clone/reset stdin for each attempt: if c.stdin != nil and is a bytes.Buffer (or supports Bytes()), create a fresh bytes.NewBuffer(copyOfBytes) and assign that to cmd.Stdin (instead of reusing c.stdin) so each retry gets the original payload; update the retry loop around exec.Command creation (symbols: c.stdin, exec.Command, CombinedOutput(), maxRetries) to perform this reset.

openshift-ci · 2026-03-31T00:59:26Z

@bandrade: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

kuiwang02 · 2026-03-31T01:16:06Z

/payload-job periodic-ci-openshift-operator-framework-olm-release-4.22-periodics-e2e-aws-ovn-extended-f1

openshift-ci · 2026-03-31T01:16:10Z

@kuiwang02: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

periodic-ci-openshift-operator-framework-olm-release-4.22-periodics-e2e-aws-ovn-extended-f1

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/323eba70-2c9f-11f1-8e90-76f2081bffa1-0

openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 30, 2026

bandrade marked this pull request as ready for review March 30, 2026 19:48

openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 30, 2026

openshift-ci bot requested review from bentito and dtfranz March 30, 2026 19:49

coderabbitai bot reviewed Mar 30, 2026

View reviewed changes

fix: retry transient registry errors in opm QE

19fadf7

bandrade force-pushed the feature/retry-opm-registry-pulls branch from 00c118d to 19fadf7 Compare March 30, 2026 21:01

coderabbitai bot reviewed Mar 30, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: retry transient registry errors in opm QE#1272

fix: retry transient registry errors in opm QE#1272
bandrade wants to merge 1 commit intoopenshift:mainfrom
bandrade:feature/retry-opm-registry-pulls

bandrade commented Mar 30, 2026

Uh oh!

openshift-ci bot commented Mar 30, 2026

Uh oh!

coderabbitai bot commented Mar 30, 2026 •

edited

Loading

Uh oh!

openshift-ci bot commented Mar 30, 2026

Uh oh!

coderabbitai bot left a comment

Uh oh!

coderabbitai bot Mar 30, 2026

Uh oh!

coderabbitai bot left a comment

Uh oh!

coderabbitai bot Mar 30, 2026

Uh oh!

openshift-ci bot commented Mar 31, 2026

Uh oh!

kuiwang02 commented Mar 31, 2026

Uh oh!

openshift-ci bot commented Mar 31, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

bandrade commented Mar 30, 2026

What changed

Why it changed

Impact

Root cause

Validation

Uh oh!

openshift-ci bot commented Mar 30, 2026

Uh oh!

coderabbitai bot commented Mar 30, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Uh oh!

openshift-ci bot commented Mar 30, 2026

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Mar 30, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Mar 30, 2026

Choose a reason for hiding this comment

Uh oh!

openshift-ci bot commented Mar 31, 2026

Uh oh!

kuiwang02 commented Mar 31, 2026

Uh oh!

openshift-ci bot commented Mar 31, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

coderabbitai bot commented Mar 30, 2026 •

edited

Loading