fix: retry transient registry errors in opm QE#1272
fix: retry transient registry errors in opm QE#1272bandrade wants to merge 1 commit intoopenshift:mainfrom
Conversation
|
Skipping CI for Draft Pull Request. |
WalkthroughModified (*CLI).Output() to retry command execution up to 3 times for transient registry/network errors, sleeping 2s between attempts. Adjusted logging and error propagation, added isTransientRegistryError(), and imported errors and time. Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: bandrade The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tests-extension/test/qe/util/opmcli/opm_client.go`:
- Around line 223-242: The current isTransientRegistryError function uses a bare
"EOF" in the transientErrors slice which can match substrings (e.g., "GEOFFREY")
and cause false-positive retries; update the logic in isTransientRegistryError
(and the transientErrors slice) to remove the broad "EOF" entry and instead
detect EOF only as a standalone token or exact error (e.g., use a regexp with
word boundaries like `\bEOF\b` or check for output == "EOF" or lines trimmed
equal "EOF"), preserving other entries like "unexpected EOF"; ensure you
reference and update the transientErrors variable and the loop that scans output
so only genuine EOF errors trigger a retry.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: d66d2f34-7d6a-4ce6-bd66-98f36f80f588
📒 Files selected for processing (1)
tests-extension/test/qe/util/opmcli/opm_client.go
| func isTransientRegistryError(output string) bool { | ||
| transientErrors := []string{ | ||
| "503 Service Unavailable", | ||
| "received unexpected HTTP status: 503", | ||
| "502 Bad Gateway", | ||
| "504 Gateway Timeout", | ||
| "429 Too Many Requests", | ||
| "TLS handshake timeout", | ||
| "i/o timeout", | ||
| "connection reset by peer", | ||
| "unexpected EOF", | ||
| "EOF", | ||
| } | ||
| // Execute command and capture combined output | ||
| out, err := cmd.CombinedOutput() | ||
| trimmed := strings.TrimSpace(string(out)) | ||
| // Handle different types of execution results | ||
| switch exitErr := err.(type) { | ||
| case nil: | ||
| // Successful execution | ||
| c.stdout = bytes.NewBuffer(out) | ||
| return trimmed, nil | ||
| case *exec.ExitError: | ||
| // Command executed but returned non-zero exit code | ||
| e2e.Logf("Error running %v:\n%s", cmd, trimmed) | ||
| return trimmed, &ExitError{ExitError: exitErr, Cmd: c.execPath + " " + strings.Join(c.finalArgs, " "), StdErr: trimmed} | ||
| default: | ||
| // Fatal error preventing command execution | ||
| FatalErr(fmt.Errorf("unable to execute %q: %v", c.execPath, err)) | ||
| // unreachable code | ||
| return "", nil | ||
| for _, errMsg := range transientErrors { | ||
| if strings.Contains(output, errMsg) { | ||
| return true | ||
| } | ||
| } | ||
| return false | ||
| } |
There was a problem hiding this comment.
Overly broad "EOF" substring match may cause false positive retries.
The bare "EOF" pattern matches any string containing those characters (e.g., "GEOFFREY", "EOF_MARKER", or unrelated error messages). This could trigger retries on non-transient failures.
Consider using more specific patterns that reflect actual error messages from Go's standard library or container runtimes:
Proposed fix
transientErrors := []string{
"503 Service Unavailable",
"received unexpected HTTP status: 503",
"502 Bad Gateway",
"504 Gateway Timeout",
"429 Too Many Requests",
"TLS handshake timeout",
"i/o timeout",
"connection reset by peer",
- "unexpected EOF",
- "EOF",
+ "unexpected EOF",
+ ": EOF",
}📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| func isTransientRegistryError(output string) bool { | |
| transientErrors := []string{ | |
| "503 Service Unavailable", | |
| "received unexpected HTTP status: 503", | |
| "502 Bad Gateway", | |
| "504 Gateway Timeout", | |
| "429 Too Many Requests", | |
| "TLS handshake timeout", | |
| "i/o timeout", | |
| "connection reset by peer", | |
| "unexpected EOF", | |
| "EOF", | |
| } | |
| // Execute command and capture combined output | |
| out, err := cmd.CombinedOutput() | |
| trimmed := strings.TrimSpace(string(out)) | |
| // Handle different types of execution results | |
| switch exitErr := err.(type) { | |
| case nil: | |
| // Successful execution | |
| c.stdout = bytes.NewBuffer(out) | |
| return trimmed, nil | |
| case *exec.ExitError: | |
| // Command executed but returned non-zero exit code | |
| e2e.Logf("Error running %v:\n%s", cmd, trimmed) | |
| return trimmed, &ExitError{ExitError: exitErr, Cmd: c.execPath + " " + strings.Join(c.finalArgs, " "), StdErr: trimmed} | |
| default: | |
| // Fatal error preventing command execution | |
| FatalErr(fmt.Errorf("unable to execute %q: %v", c.execPath, err)) | |
| // unreachable code | |
| return "", nil | |
| for _, errMsg := range transientErrors { | |
| if strings.Contains(output, errMsg) { | |
| return true | |
| } | |
| } | |
| return false | |
| } | |
| func isTransientRegistryError(output string) bool { | |
| transientErrors := []string{ | |
| "503 Service Unavailable", | |
| "received unexpected HTTP status: 503", | |
| "502 Bad Gateway", | |
| "504 Gateway Timeout", | |
| "429 Too Many Requests", | |
| "TLS handshake timeout", | |
| "i/o timeout", | |
| "connection reset by peer", | |
| "unexpected EOF", | |
| ": EOF", | |
| } | |
| for _, errMsg := range transientErrors { | |
| if strings.Contains(output, errMsg) { | |
| return true | |
| } | |
| } | |
| return false | |
| } |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests-extension/test/qe/util/opmcli/opm_client.go` around lines 223 - 242,
The current isTransientRegistryError function uses a bare "EOF" in the
transientErrors slice which can match substrings (e.g., "GEOFFREY") and cause
false-positive retries; update the logic in isTransientRegistryError (and the
transientErrors slice) to remove the broad "EOF" entry and instead detect EOF
only as a standalone token or exact error (e.g., use a regexp with word
boundaries like `\bEOF\b` or check for output == "EOF" or lines trimmed equal
"EOF"), preserving other entries like "unexpected EOF"; ensure you reference and
update the transientErrors variable and the loop that scans output so only
genuine EOF errors trigger a retry.
00c118d to
19fadf7
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
♻️ Duplicate comments (1)
tests-extension/test/qe/util/opmcli/opm_client.go (1)
220-235:⚠️ Potential issue | 🟠 MajorDon't match a bare
EOFwithstrings.Contains.Line 231 adds
"EOF", and Line 234 will retry any output containing that token. That broadens retries beyond transient registry failures and can mask persistent errors.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests-extension/test/qe/util/opmcli/opm_client.go` around lines 220 - 235, The isTransientRegistryError function currently treats any output containing the bare token "EOF" as transient via the transientErrors slice; remove the broad "EOF" entry (or replace it with a precise check) so only true transient cases are matched — e.g. delete "EOF" from transientErrors and, if you still need to treat an exact EOF-only output as transient, add an explicit check like strings.TrimSpace(output) == "EOF" or a regex with word boundaries instead of using strings.Contains; update the transientErrors slice and the matching logic in isTransientRegistryError accordingly.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tests-extension/test/qe/util/opmcli/opm_client.go`:
- Around line 162-178: The stdin buffer is reused across retries so
CombinedOutput() on a failed attempt leaves EOF for subsequent attempts; inside
the retry loop (before assigning cmd.Stdin/use of exec.Command) clone/reset
stdin for each attempt: if c.stdin != nil and is a bytes.Buffer (or supports
Bytes()), create a fresh bytes.NewBuffer(copyOfBytes) and assign that to
cmd.Stdin (instead of reusing c.stdin) so each retry gets the original payload;
update the retry loop around exec.Command creation (symbols: c.stdin,
exec.Command, CombinedOutput(), maxRetries) to perform this reset.
---
Duplicate comments:
In `@tests-extension/test/qe/util/opmcli/opm_client.go`:
- Around line 220-235: The isTransientRegistryError function currently treats
any output containing the bare token "EOF" as transient via the transientErrors
slice; remove the broad "EOF" entry (or replace it with a precise check) so only
true transient cases are matched — e.g. delete "EOF" from transientErrors and,
if you still need to treat an exact EOF-only output as transient, add an
explicit check like strings.TrimSpace(output) == "EOF" or a regex with word
boundaries instead of using strings.Contains; update the transientErrors slice
and the matching logic in isTransientRegistryError accordingly.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 9fc45e8b-16ec-413d-8efe-60869cb49fef
📒 Files selected for processing (1)
tests-extension/test/qe/util/opmcli/opm_client.go
| var lastErr error | ||
| var lastTrimmed string | ||
|
|
||
| for attempt := 1; attempt <= maxRetries; attempt++ { | ||
| // Create the command with the executable and arguments | ||
| cmd := exec.Command(c.execPath, c.finalArgs...) | ||
| // Set registry authentication file if configured | ||
| if c.podmanAuthfile != "" { | ||
| cmd.Env = append(os.Environ(), "REGISTRY_AUTH_FILE="+c.podmanAuthfile) | ||
| } | ||
| // Set working directory if specified | ||
| if c.ExecCommandPath != "" { | ||
| e2e.Logf("set exec command path is %s\n", c.ExecCommandPath) | ||
| cmd.Dir = c.ExecCommandPath | ||
| } | ||
| // Set stdin buffer | ||
| cmd.Stdin = c.stdin |
There was a problem hiding this comment.
Reset stdin for each retry.
Lines 177-178 reuse the same *bytes.Buffer across attempts. CombinedOutput() consumes cmd.Stdin, so a retried command will see EOF/empty input instead of the original payload.
Suggested fix
var lastErr error
var lastTrimmed string
+ var stdinBytes []byte
+ if c.stdin != nil {
+ stdinBytes = append([]byte(nil), c.stdin.Bytes()...)
+ }
for attempt := 1; attempt <= maxRetries; attempt++ {
// Create the command with the executable and arguments
cmd := exec.Command(c.execPath, c.finalArgs...)
@@
- // Set stdin buffer
- cmd.Stdin = c.stdin
+ // Reset stdin for each attempt
+ if stdinBytes != nil {
+ cmd.Stdin = bytes.NewReader(stdinBytes)
+ }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests-extension/test/qe/util/opmcli/opm_client.go` around lines 162 - 178,
The stdin buffer is reused across retries so CombinedOutput() on a failed
attempt leaves EOF for subsequent attempts; inside the retry loop (before
assigning cmd.Stdin/use of exec.Command) clone/reset stdin for each attempt: if
c.stdin != nil and is a bytes.Buffer (or supports Bytes()), create a fresh
bytes.NewBuffer(copyOfBytes) and assign that to cmd.Stdin (instead of reusing
c.stdin) so each retry gets the original payload; update the retry loop around
exec.Command creation (symbols: c.stdin, exec.Command, CombinedOutput(),
maxRetries) to perform this reset.
|
@bandrade: all tests passed! Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
|
/payload-job periodic-ci-openshift-operator-framework-olm-release-4.22-periodics-e2e-aws-ovn-extended-f1 |
|
@kuiwang02: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/323eba70-2c9f-11f1-8e90-76f2081bffa1-0 |
What changed
tests-extension's sharedopmcli.Output()helperopmhits transient registry pull failuresWhy it changed
The OLMv0
opmQE coverage can fail on transient Quay pull errors while rendering catalog content. One observed failure wasopm alpha render-graph quay.io/openshifttest/nginxolm-operator-index:v1 --use-httpexiting after a503 Service Unavailableresponse while fetching a blob.Impact
This reduces test flakiness for external registry hiccups without masking persistent failures. Non-transient command errors still fail immediately.
Root cause
The shared
opmclihelper executed image-backedopmcommands only once, so temporary registry/network failures surfaced as hard test failures.Validation
go test ./test/qe/util/opmcli