Skip to content

fix: retry transient registry errors in opm QE#1272

Open
bandrade wants to merge 1 commit intoopenshift:mainfrom
bandrade:feature/retry-opm-registry-pulls
Open

fix: retry transient registry errors in opm QE#1272
bandrade wants to merge 1 commit intoopenshift:mainfrom
bandrade:feature/retry-opm-registry-pulls

Conversation

@bandrade
Copy link
Copy Markdown
Contributor

What changed

  • add retry handling to tests-extension's shared opmcli.Output() helper
  • retry up to 3 times with a short backoff when opm hits transient registry pull failures
  • keep the existing command and assertion flow unchanged for QE specs

Why it changed

The OLMv0 opm QE coverage can fail on transient Quay pull errors while rendering catalog content. One observed failure was opm alpha render-graph quay.io/openshifttest/nginxolm-operator-index:v1 --use-http exiting after a 503 Service Unavailable response while fetching a blob.

Impact

This reduces test flakiness for external registry hiccups without masking persistent failures. Non-transient command errors still fail immediately.

Root cause

The shared opmcli helper executed image-backed opm commands only once, so temporary registry/network failures surfaced as hard test failures.

Validation

  • go test ./test/qe/util/opmcli

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 30, 2026
@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci bot commented Mar 30, 2026

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@coderabbitai
Copy link
Copy Markdown

coderabbitai bot commented Mar 30, 2026

Walkthrough

Modified (*CLI).Output() to retry command execution up to 3 times for transient registry/network errors, sleeping 2s between attempts. Adjusted logging and error propagation, added isTransientRegistryError(), and imported errors and time.

Changes

Cohort / File(s) Summary
Retry logic and error handling
tests-extension/test/qe/util/opmcli/opm_client.go
Rewrote (*CLI).Output() to loop up to 3 attempts, detect transient registry/network error substrings (502/503/504, 429, TLS handshake timeout, connection resets, EOF variants) via new isTransientRegistryError(), sleep 2s between retries, log "Running…" only on first attempt, and change error handling to use errors.As for exec.ExitError and return trimmed output on success. Added errors and time imports.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci bot commented Mar 30, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: bandrade
Once this PR has been reviewed and has the lgtm label, please assign oceanc80 for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@bandrade bandrade marked this pull request as ready for review March 30, 2026 19:48
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 30, 2026
@openshift-ci openshift-ci bot requested review from bentito and dtfranz March 30, 2026 19:49
Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests-extension/test/qe/util/opmcli/opm_client.go`:
- Around line 223-242: The current isTransientRegistryError function uses a bare
"EOF" in the transientErrors slice which can match substrings (e.g., "GEOFFREY")
and cause false-positive retries; update the logic in isTransientRegistryError
(and the transientErrors slice) to remove the broad "EOF" entry and instead
detect EOF only as a standalone token or exact error (e.g., use a regexp with
word boundaries like `\bEOF\b` or check for output == "EOF" or lines trimmed
equal "EOF"), preserving other entries like "unexpected EOF"; ensure you
reference and update the transientErrors variable and the loop that scans output
so only genuine EOF errors trigger a retry.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: d66d2f34-7d6a-4ce6-bd66-98f36f80f588

📥 Commits

Reviewing files that changed from the base of the PR and between 6436231 and 00c118d.

📒 Files selected for processing (1)
  • tests-extension/test/qe/util/opmcli/opm_client.go

Comment on lines +223 to 242
func isTransientRegistryError(output string) bool {
transientErrors := []string{
"503 Service Unavailable",
"received unexpected HTTP status: 503",
"502 Bad Gateway",
"504 Gateway Timeout",
"429 Too Many Requests",
"TLS handshake timeout",
"i/o timeout",
"connection reset by peer",
"unexpected EOF",
"EOF",
}
// Execute command and capture combined output
out, err := cmd.CombinedOutput()
trimmed := strings.TrimSpace(string(out))
// Handle different types of execution results
switch exitErr := err.(type) {
case nil:
// Successful execution
c.stdout = bytes.NewBuffer(out)
return trimmed, nil
case *exec.ExitError:
// Command executed but returned non-zero exit code
e2e.Logf("Error running %v:\n%s", cmd, trimmed)
return trimmed, &ExitError{ExitError: exitErr, Cmd: c.execPath + " " + strings.Join(c.finalArgs, " "), StdErr: trimmed}
default:
// Fatal error preventing command execution
FatalErr(fmt.Errorf("unable to execute %q: %v", c.execPath, err))
// unreachable code
return "", nil
for _, errMsg := range transientErrors {
if strings.Contains(output, errMsg) {
return true
}
}
return false
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Overly broad "EOF" substring match may cause false positive retries.

The bare "EOF" pattern matches any string containing those characters (e.g., "GEOFFREY", "EOF_MARKER", or unrelated error messages). This could trigger retries on non-transient failures.

Consider using more specific patterns that reflect actual error messages from Go's standard library or container runtimes:

Proposed fix
 	transientErrors := []string{
 		"503 Service Unavailable",
 		"received unexpected HTTP status: 503",
 		"502 Bad Gateway",
 		"504 Gateway Timeout",
 		"429 Too Many Requests",
 		"TLS handshake timeout",
 		"i/o timeout",
 		"connection reset by peer",
-		"unexpected EOF",
-		"EOF",
+		"unexpected EOF",
+		": EOF",
 	}
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
func isTransientRegistryError(output string) bool {
transientErrors := []string{
"503 Service Unavailable",
"received unexpected HTTP status: 503",
"502 Bad Gateway",
"504 Gateway Timeout",
"429 Too Many Requests",
"TLS handshake timeout",
"i/o timeout",
"connection reset by peer",
"unexpected EOF",
"EOF",
}
// Execute command and capture combined output
out, err := cmd.CombinedOutput()
trimmed := strings.TrimSpace(string(out))
// Handle different types of execution results
switch exitErr := err.(type) {
case nil:
// Successful execution
c.stdout = bytes.NewBuffer(out)
return trimmed, nil
case *exec.ExitError:
// Command executed but returned non-zero exit code
e2e.Logf("Error running %v:\n%s", cmd, trimmed)
return trimmed, &ExitError{ExitError: exitErr, Cmd: c.execPath + " " + strings.Join(c.finalArgs, " "), StdErr: trimmed}
default:
// Fatal error preventing command execution
FatalErr(fmt.Errorf("unable to execute %q: %v", c.execPath, err))
// unreachable code
return "", nil
for _, errMsg := range transientErrors {
if strings.Contains(output, errMsg) {
return true
}
}
return false
}
func isTransientRegistryError(output string) bool {
transientErrors := []string{
"503 Service Unavailable",
"received unexpected HTTP status: 503",
"502 Bad Gateway",
"504 Gateway Timeout",
"429 Too Many Requests",
"TLS handshake timeout",
"i/o timeout",
"connection reset by peer",
"unexpected EOF",
": EOF",
}
for _, errMsg := range transientErrors {
if strings.Contains(output, errMsg) {
return true
}
}
return false
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests-extension/test/qe/util/opmcli/opm_client.go` around lines 223 - 242,
The current isTransientRegistryError function uses a bare "EOF" in the
transientErrors slice which can match substrings (e.g., "GEOFFREY") and cause
false-positive retries; update the logic in isTransientRegistryError (and the
transientErrors slice) to remove the broad "EOF" entry and instead detect EOF
only as a standalone token or exact error (e.g., use a regexp with word
boundaries like `\bEOF\b` or check for output == "EOF" or lines trimmed equal
"EOF"), preserving other entries like "unexpected EOF"; ensure you reference and
update the transientErrors variable and the loop that scans output so only
genuine EOF errors trigger a retry.

@bandrade bandrade force-pushed the feature/retry-opm-registry-pulls branch from 00c118d to 19fadf7 Compare March 30, 2026 21:01
Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (1)
tests-extension/test/qe/util/opmcli/opm_client.go (1)

220-235: ⚠️ Potential issue | 🟠 Major

Don't match a bare EOF with strings.Contains.

Line 231 adds "EOF", and Line 234 will retry any output containing that token. That broadens retries beyond transient registry failures and can mask persistent errors.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests-extension/test/qe/util/opmcli/opm_client.go` around lines 220 - 235,
The isTransientRegistryError function currently treats any output containing the
bare token "EOF" as transient via the transientErrors slice; remove the broad
"EOF" entry (or replace it with a precise check) so only true transient cases
are matched — e.g. delete "EOF" from transientErrors and, if you still need to
treat an exact EOF-only output as transient, add an explicit check like
strings.TrimSpace(output) == "EOF" or a regex with word boundaries instead of
using strings.Contains; update the transientErrors slice and the matching logic
in isTransientRegistryError accordingly.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests-extension/test/qe/util/opmcli/opm_client.go`:
- Around line 162-178: The stdin buffer is reused across retries so
CombinedOutput() on a failed attempt leaves EOF for subsequent attempts; inside
the retry loop (before assigning cmd.Stdin/use of exec.Command) clone/reset
stdin for each attempt: if c.stdin != nil and is a bytes.Buffer (or supports
Bytes()), create a fresh bytes.NewBuffer(copyOfBytes) and assign that to
cmd.Stdin (instead of reusing c.stdin) so each retry gets the original payload;
update the retry loop around exec.Command creation (symbols: c.stdin,
exec.Command, CombinedOutput(), maxRetries) to perform this reset.

---

Duplicate comments:
In `@tests-extension/test/qe/util/opmcli/opm_client.go`:
- Around line 220-235: The isTransientRegistryError function currently treats
any output containing the bare token "EOF" as transient via the transientErrors
slice; remove the broad "EOF" entry (or replace it with a precise check) so only
true transient cases are matched — e.g. delete "EOF" from transientErrors and,
if you still need to treat an exact EOF-only output as transient, add an
explicit check like strings.TrimSpace(output) == "EOF" or a regex with word
boundaries instead of using strings.Contains; update the transientErrors slice
and the matching logic in isTransientRegistryError accordingly.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 9fc45e8b-16ec-413d-8efe-60869cb49fef

📥 Commits

Reviewing files that changed from the base of the PR and between 00c118d and 19fadf7.

📒 Files selected for processing (1)
  • tests-extension/test/qe/util/opmcli/opm_client.go

Comment on lines +162 to +178
var lastErr error
var lastTrimmed string

for attempt := 1; attempt <= maxRetries; attempt++ {
// Create the command with the executable and arguments
cmd := exec.Command(c.execPath, c.finalArgs...)
// Set registry authentication file if configured
if c.podmanAuthfile != "" {
cmd.Env = append(os.Environ(), "REGISTRY_AUTH_FILE="+c.podmanAuthfile)
}
// Set working directory if specified
if c.ExecCommandPath != "" {
e2e.Logf("set exec command path is %s\n", c.ExecCommandPath)
cmd.Dir = c.ExecCommandPath
}
// Set stdin buffer
cmd.Stdin = c.stdin
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Reset stdin for each retry.

Lines 177-178 reuse the same *bytes.Buffer across attempts. CombinedOutput() consumes cmd.Stdin, so a retried command will see EOF/empty input instead of the original payload.

Suggested fix
 	var lastErr error
 	var lastTrimmed string
+	var stdinBytes []byte
+	if c.stdin != nil {
+		stdinBytes = append([]byte(nil), c.stdin.Bytes()...)
+	}
 
 	for attempt := 1; attempt <= maxRetries; attempt++ {
 		// Create the command with the executable and arguments
 		cmd := exec.Command(c.execPath, c.finalArgs...)
@@
-		// Set stdin buffer
-		cmd.Stdin = c.stdin
+		// Reset stdin for each attempt
+		if stdinBytes != nil {
+			cmd.Stdin = bytes.NewReader(stdinBytes)
+		}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests-extension/test/qe/util/opmcli/opm_client.go` around lines 162 - 178,
The stdin buffer is reused across retries so CombinedOutput() on a failed
attempt leaves EOF for subsequent attempts; inside the retry loop (before
assigning cmd.Stdin/use of exec.Command) clone/reset stdin for each attempt: if
c.stdin != nil and is a bytes.Buffer (or supports Bytes()), create a fresh
bytes.NewBuffer(copyOfBytes) and assign that to cmd.Stdin (instead of reusing
c.stdin) so each retry gets the original payload; update the retry loop around
exec.Command creation (symbols: c.stdin, exec.Command, CombinedOutput(),
maxRetries) to perform this reset.

@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci bot commented Mar 31, 2026

@bandrade: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@kuiwang02
Copy link
Copy Markdown
Contributor

/payload-job periodic-ci-openshift-operator-framework-olm-release-4.22-periodics-e2e-aws-ovn-extended-f1

@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci bot commented Mar 31, 2026

@kuiwang02: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-operator-framework-olm-release-4.22-periodics-e2e-aws-ovn-extended-f1

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/323eba70-2c9f-11f1-8e90-76f2081bffa1-0

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants