test(e2e): classify OpenClaw live switch timeouts by cv · Pull Request #4173 · NVIDIA/NemoClaw

cv · 2026-05-25T07:18:47Z

Summary

The latest nightly flake sweep shows openclaw-inference-switch-e2e repeatedly passing route/config/hash assertions, then failing during post-switch live requests when inference.local or the OpenClaw agent turn times out. This PR mirrors the Hermes stabilization by keeping route/config regressions blocking while classifying explicit post-switch live timeout/5xx probes as transient skips.

Changes

Capture HTTP status for the post-switch OpenClaw inference.local probe.
Track transient state structurally from curl exit 28 or HTTP 502/503/504.
Convert post-switch inference.local transient exhaustion to SKIP after route/config/session checks have passed.
Convert OpenClaw agent command timeout (exit 124) to SKIP after route/config/session checks have passed.
Preserve FAIL for wrong-content responses, unexpected HTTP statuses, and all route/config/hash/session regressions.
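
The classification rules listed above can be sketched roughly as follows. This is an illustration only, not the code from the PR: `classify_live_probe` and `classify_agent_turn` are hypothetical names, and the sketch assumes the probe's curl exit code and HTTP status have already been captured. Exit code 28 is curl's "operation timed out" code, and 124 is the exit code GNU `timeout` returns when the wrapped command times out.

```shell
#!/usr/bin/env bash
# Hypothetical helper (name not from the PR): map a live probe's curl exit
# code and HTTP status to a coarse outcome.
classify_live_probe() {
  local rc="$1" http_code="$2"
  if [ "$rc" -eq 28 ]; then
    echo "transient"            # curl exit 28: operation timed out
    return 0
  fi
  case "$http_code" in
    502|503|504) echo "transient" ;;  # gateway/upstream errors treated as transient
    200)         echo "ok" ;;         # status is acceptable; content is checked separately
    *)           echo "fail" ;;       # unexpected status stays a hard failure
  esac
}

# Hypothetical helper (name not from the PR): classify the agent turn command
# by its exit code.
classify_agent_turn() {
  case "$1" in
    0)   echo "ok" ;;
    124) echo "skip" ;;         # timeout(1) reports a timed-out command as 124
    *)   echo "fail" ;;
  esac
}
```

In this sketch a "transient" or "skip" result would only be turned into SKIP after the route/config/session checks have passed, as the description states; a 200 response with wrong content would still be a FAIL.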

Type of Change

Code change (feature, bug fix, or refactor)
Code change with doc updates
Doc only (prose changes, no code sample modifications)
Doc only (includes code sample changes)

Verification

npx prek run --all-files passes
npm test passes
Tests added or updated for new or changed behavior
No secrets, API keys, or credentials committed
Docs updated for user-facing behavior changes
make docs builds without warnings (doc changes only)
Doc pages follow the style guide (doc changes only)
New doc pages include SPDX header and frontmatter (new pages only)

Signed-off-by: Carlos Villela <cvillela@nvidia.com>

Summary by CodeRabbit

Tests
- Improved test resilience by implementing transient failure detection for HTTP status codes (502/503/504), distinguishing temporary from permanent errors
- Enhanced timeout handling to properly classify timeout scenarios in endpoint tests
- Added utilities for more robust HTTP response parsing and status classification

copy-pr-bot · 2026-05-25T07:18:51Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

coderabbitai · 2026-05-25T07:19:00Z

📝 Walkthrough

Walkthrough

This PR enhances OpenClaw inference test reliability by distinguishing transient failures (timeouts, HTTP 502/503/504) from permanent failures. New HTTP response parsing helpers classify transient conditions, sandbox inference checks now skip on transient failures instead of failing, and agent turn checks skip on command timeout.

Changes

Transient HTTP Error Classification and Timeout Handling

Layer / File(s)	Summary
HTTP response parsing and transient classification helpers `test/e2e/test-openclaw-inference-switch.sh`	Three utility functions classify transient HTTP codes (502/503/504) and extract HTTP status and body from combined curl response strings.
Sandbox inference transient failure tracking `test/e2e/test-openclaw-inference-switch.sh`	Extended `check_sandbox_inference` to run a remote curl wrapper that separates response body from HTTP status, track transient state across retries using the new helpers, and change outcomes to skip for transient failures or fail for non-transient failures.
Agent turn command timeout handling `test/e2e/test-openclaw-inference-switch.sh`	Modified `check_openclaw_agent_turn` to record skip outcomes for SSH command timeouts (exit code 124) while preserving pass/fail behavior for other results.
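
For illustration, the three helpers summarized in the table might look something like the sketch below. The function names appear in the review comments on this PR, but the bodies here are a guess, not the merged code. The sketch assumes the combined response string was produced by a curl call along the lines of `curl -sS -w '\n%{http_code}' ...`, so the HTTP status sits on the last line and the body is everything before it.

```shell
#!/usr/bin/env bash
# Sketch only: bodies are assumptions, not the merged implementation.

# Succeeds (exit 0) when the status is one of the transient codes named in the PR.
is_transient_live_http_code() {
  case "$1" in
    502|503|504) return 0 ;;
    *)           return 1 ;;
  esac
}

# Prints the last line of the combined response, assumed to be the HTTP status.
http_status_from_response() {
  printf '%s\n' "$1" | tail -n 1
}

# Prints everything except the last line, assumed to be the response body.
http_body_from_response() {
  printf '%s\n' "$1" | sed '$d'
}
```

Under that assumed format, a response of `{"ok":true}` followed by a line containing `503` would yield the status `503` and the body `{"ok":true}`, and `is_transient_live_http_code 503` would succeed.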

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Possibly related PRs

NVIDIA/NemoClaw#4158: Introduces the same curl response parsing helpers and transient failure classification logic in a separate inference script, using identical HTTP status and timeout detection patterns to change outcomes from fail to skip.
NVIDIA/NemoClaw#4154: Adds transient HTTP status classification (502/503/504) and refactors error handling to skip on transient conditions instead of fail in e2e retry logic.

Suggested labels

v0.0.51

Poem

🐰 Through timeouts and status codes we hop,
Transient troubles now gracefully skip,
No more false failures when servers take rest,
Just a gentle "skip" for the load test.
Smart inference waits, the fleet's on its way! 🚀

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title clearly and specifically summarizes the main change—improving OpenClaw test resilience by classifying live switch timeouts as transient rather than failures.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.


github-actions · 2026-05-25T07:19:44Z

E2E Advisor Recommendation

Required E2E: None
Optional E2E: openclaw-inference-switch-e2e

Dispatch hint: openclaw-inference-switch-e2e

Workflow run

Full advisor summary

E2E Recommendation Advisor

Base: origin/main
Head: HEAD
Confidence: high

Required E2E

None.

Optional E2E

openclaw-inference-switch-e2e (medium): Optional self-validation of the modified E2E script. This job is the direct workflow consumer of test/e2e/test-openclaw-inference-switch.sh and would catch shell, HTTP parsing, retry/skip, and live OpenClaw inference-switch assertion regressions introduced by the test change.

New E2E recommendations

None.

Dispatch hint

Workflow: nightly-e2e.yaml
jobs input: openclaw-inference-switch-e2e

github-actions · 2026-05-25T07:19:45Z

E2E Scenario Advisor Recommendation

Required scenario E2E: None
Optional scenario E2E: None

Workflow run

Full scenario advisor summary

E2E Scenario Advisor

Base: origin/main
Head: HEAD
Confidence: high

Required scenario E2E

None. No scenario workflow, scenario metadata, scenario runtime, or validation-suite files changed.

Optional scenario E2E

None.

Relevant changed files

None.

github-actions · 2026-05-25T07:20:01Z

PR Review Advisor

Findings: 0 needs attention, 1 worth checking, 1 nice idea
Top item: Mixed live-probe failures can be reported as SKIP

Review findings

🛠️ Needs attention

None.

🔎 Worth checking

Mixed live-probe failures can be reported as SKIP (test/e2e/test-openclaw-inference-switch.sh:218): `check_sandbox_inference` resets `transient=0` on every retry and only inspects the final attempt after the loop. If earlier attempts fail for a non-transient reason such as malformed JSON, wrong content, or an unexpected HTTP status, and the last attempt is a timeout or 502/503/504, the test reports SKIP. That can hide intermittent correctness regressions rather than only classifying an exhaustion of explicit transient failures as skipped.
- Recommendation: Track whether all exhausted attempts were transient, or fail immediately/at summary if any non-transient response was observed. For example, maintain `saw_non_transient=1` for wrong-content, parse, and unexpected-status failures and only SKIP when no non-transient attempts occurred.
- Evidence: The loop initializes `transient=0` per attempt, sets it for `rc == 28` or transient HTTP status, and after all attempts checks only `[ "$transient" -eq 1 ]` before calling `skip`; `last_fail` is also overwritten by the last attempt.
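
A minimal sketch of the recommended tracking, under stated assumptions: `summarize_attempts` is a hypothetical function, each argument encodes one attempt as `curl_rc:http_code`, and a 200 status is treated as a pass for brevity (the real script also validates response content, which this sketch leaves out).

```shell
#!/usr/bin/env bash
# Hypothetical sketch: decide PASS/SKIP/FAIL from all attempts, not just the last.
# Each argument is one attempt encoded as "curl_rc:http_code" (assumed format).
summarize_attempts() {
  local saw_transient=0 saw_non_transient=0 attempt rc code
  for attempt in "$@"; do
    rc="${attempt%%:*}"
    code="${attempt##*:}"
    if [ "$rc" -eq 0 ] && [ "$code" = "200" ]; then
      echo "PASS"               # content validation omitted in this sketch
      return 0
    fi
    if [ "$rc" -eq 28 ]; then
      saw_transient=1           # curl timeout
    elif [ "$rc" -eq 0 ]; then
      case "$code" in
        502|503|504) saw_transient=1 ;;
        *)           saw_non_transient=1 ;;  # unexpected HTTP status
      esac
    else
      saw_non_transient=1       # any other curl failure
    fi
  done
  # SKIP only when every exhausted attempt was transient.
  if [ "$saw_transient" -eq 1 ] && [ "$saw_non_transient" -eq 0 ]; then
    echo "SKIP"
  else
    echo "FAIL"
  fi
}
```

With this shape, `summarize_attempts 0:500 0:500 28:000` reports FAIL even though the final attempt timed out, while `summarize_attempts 28:000 0:503 0:504` reports SKIP.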

🌱 Nice ideas

Same E2E script has overlapping active PRs (test/e2e/test-openclaw-inference-switch.sh:1): The patched file still exists and this change applies directly, but trusted drift evidence shows two open PRs also modifying `test/e2e/test-openclaw-inference-switch.sh` (fix(e2e): handle top-level payloads in openclaw agent JSON output #4030 and fix(e2e): pass --no-verify to inference set in switch E2E tests #4143). Their changes may interact with this retry/skip classification logic.
- Recommendation: Before landing, reconcile this change with the active same-file E2E work so the final script consistently handles JSON output, `inference set` options, and transient live-probe classification.
- Evidence: Trusted context lists open PR overlaps fix(e2e): handle top-level payloads in openclaw agent JSON output #4030 and fix(e2e): pass --no-verify to inference set in switch E2E tests #4143 with `sameFiles: ["test/e2e/test-openclaw-inference-switch.sh"]`; recent history also shows ongoing stabilization of this file.

Workflow run details

This is an automated advisory review. A human maintainer must make the final merge decision.

coderabbitai

🧹 Nitpick comments (2)

test/e2e/test-openclaw-inference-switch.sh (2)
47-60: ⚡ Quick win

Consider extracting shared HTTP response helpers to a library.

These three functions are identical to test-hermes-inference-switch.sh (lines 47-61). Extracting them to test/e2e/lib/http-response-helpers.sh would reduce duplication and ensure consistent transient classification across all inference-switch E2Es.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/e2e/test-openclaw-inference-switch.sh` around lines 47 - 60, Extract the
three helper functions is_transient_live_http_code, http_status_from_response,
and http_body_from_response into a new shared file
test/e2e/lib/http-response-helpers.sh; replace the duplicated definitions in
test/e2e/test-openclaw-inference-switch.sh and
test/e2e/test-hermes-inference-switch.sh by sourcing that new file (e.g., .
"$(dirname "$0")/lib/http-response-helpers.sh" or similar), and ensure the
functions' behavior and names remain unchanged so transient classification is
consistent across both E2E scripts.
248-281: 💤 Low value

Last-attempt-wins transient detection is by design.

The transient flag resets on each attempt (line 248), so only the final attempt determines whether the outcome is SKIP or FAIL. This means if the first two attempts fail with non-transient errors but the third times out, the test will skip. This behavior aligns with the PR objective of treating post-switch live timeouts as non-blocking, but it could theoretically mask degradation patterns where permanent failures evolve into timeouts.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/e2e/test-openclaw-inference-switch.sh` around lines 248 - 281, The
transient flag is overwritten each loop so only the last attempt decides SKIP vs
FAIL; preserve non-transient failures across attempts by adding a persistent
marker (e.g., non_transient_seen) or make transient sticky. Update the loop
around where transient, last_fail and attempt are set (variables transient,
last_fail, attempt and function is_transient_live_http_code) so that: when a
non-transient error is observed (HTTP != 200 and not
is_transient_live_http_code, or a nonzero curl rc other than 28), set
non_transient_seen=1 (or do
not clear transient once set); when a transient condition is observed set
transient=1 but do not overwrite a previously-recorded non_transient_seen; after
the loop decide SKIP only if transient==1 and non_transient_seen is unset,
otherwise FAIL using the earliest/most relevant last_fail recorded. This ensures
any non-transient failure across attempts forces FAIL while still allowing final
transient timeouts to be SKIP when no non-transient failure occurred.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 592c722f-364b-43dd-8f25-b4a4466e690a

📥 Commits

Reviewing files that changed from the base of the PR and between 50c208b and 82560f9.

📒 Files selected for processing (1)

test/e2e/test-openclaw-inference-switch.sh

github-actions · 2026-05-25T07:33:29Z

Selective E2E Results — ✅ All requested jobs passed

Run: 26388552457
Target ref: 82560f9a1a74ebdd1520c4bcf06afdebf971bd97
Workflow ref: main
Requested jobs: openclaw-inference-switch-e2e
Summary: 1 passed, 0 failed, 0 skipped

Job	Result
openclaw-inference-switch-e2e	✅ success

test(e2e): classify OpenClaw live switch timeouts

82560f9

cv self-assigned this May 25, 2026

coderabbitai Bot reviewed May 25, 2026

cv added the v0.0.51 Release target label May 25, 2026

cv merged commit cab6f8c into main May 25, 2026
22 checks passed

cv deleted the fix/openclaw-inference-switch-live-timeouts branch May 27, 2026 21:16
