test(e2e): retry inference switch verification by cv · Pull Request #4152 · NVIDIA/NemoClaw

cv · 2026-05-24T07:27:41Z

Summary

The nightly flake sweep showed openclaw-inference-switch-e2e and hermes-inference-switch-e2e as high-frequency failures, mostly due to transient live endpoint verification timeouts during inference set. This PR retries transient verification failures and falls back to --no-verify only after repeated transient failures, while keeping the later route/config/live-request assertions as the real correctness gate.

Changes

Add shared inference-switch retry helpers in test/e2e/lib/inference-switch-retry.sh.
Validate NEMOCLAW_SWITCH_SET_ATTEMPTS as a positive integer before retrying.
Retry verified nemoclaw inference set / nemohermes inference set attempts when failures look transient.
After repeated transient verification failures, retry once with --no-verify; subsequent route, config, sandbox inference, and agent/API request checks still validate the switched route.
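
A minimal sketch of the attempt-count validation described above. The function name `validate_switch_attempts` and the error text are illustrative assumptions, not names taken from the helper; the default of 3 follows the value used in the review suggestion later in this thread.

```bash
#!/usr/bin/env bash
# Hypothetical helper: the real check in inference-switch-retry.sh may be
# named and worded differently.
validate_switch_attempts() {
  local attempts="$1"
  # Accept only digit strings with a non-zero leading digit.
  # This rejects "", "0", "-1", "3x", "1.5", and leading zeros such as "03".
  if [[ ! "$attempts" =~ ^[1-9][0-9]*$ ]]; then
    echo "NEMOCLAW_SWITCH_SET_ATTEMPTS must be a positive integer, got: '${attempts}'" >&2
    return 1
  fi
}

# Validate once, before entering any retry loop.
validate_switch_attempts "${NEMOCLAW_SWITCH_SET_ATTEMPTS:-3}"
```

Checking the value up front means a typo in the environment variable fails with a clear message instead of producing an empty or malformed loop range.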

Type of Change

Code change (feature, bug fix, or refactor)
Code change with doc updates
Doc only (prose changes, no code sample modifications)
Doc only (includes code sample changes)

Verification

npx prek run --all-files passes
npm test passes
Tests added or updated for new or changed behavior
No secrets, API keys, or credentials committed
Docs updated for user-facing behavior changes
make docs builds without warnings (doc changes only)
Doc pages follow the style guide (doc changes only)
New doc pages include SPDX header and frontmatter (new pages only)

Signed-off-by: Carlos Villela <cvillela@nvidia.com>

copy-pr-bot · 2026-05-24T07:27:44Z

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

coderabbitai · 2026-05-24T07:27:46Z

📝 Walkthrough

Walkthrough

A new shared Bash E2E helper library detects transient inference-switch failures (timeouts, connection errors, DNS issues, 5xx responses) and retries failed commands up to a configurable limit with linear backoff. Both Hermes and OpenClaw test scripts now integrate this helper, replacing direct inference set invocations with a retry wrapper during Phase 3.
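
The linear backoff mentioned above can be sketched as follows. This is an illustration only: `linear_backoff_seconds` and `BACKOFF_STEP_SECONDS` are hypothetical names, and the 5-second step is an assumed value rather than one confirmed by the helper.

```bash
#!/usr/bin/env bash
# Linear backoff: the wait before the next attempt grows by a fixed step.
# BACKOFF_STEP_SECONDS is a hypothetical knob, not a variable from the PR.
linear_backoff_seconds() {
  local attempt="$1"
  local step="${BACKOFF_STEP_SECONDS:-5}"
  echo $((attempt * step))
}

# Print the wait schedule for three attempts; there is no wait after the
# final attempt because nothing follows it.
attempts=3
for attempt in $(seq 1 "$attempts"); do
  if ((attempt < attempts)); then
    echo "attempt ${attempt} failed: sleep $(linear_backoff_seconds "$attempt")s"
  fi
done
```

Compared with exponential backoff, a linear schedule keeps the total wait small for a low attempt count, which suits a test job with a fixed timeout.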

Changes

Inference switch retry resilience

Layer / File(s)	Summary
Transient failure detection and retry helper `test/e2e/lib/inference-switch-retry.sh`	New helper module implements regex-based transient failure classification (timeouts, connection/reset/DNS errors, 5xx responses), configurable retry loop with linear backoff, optional logging via existing `info` function, and `--no-verify` fallback for final transient failures.
Hermes inference switch retry integration `test/e2e/test-hermes-inference-switch.sh`	Sources the retry helper and replaces direct `nemohermes inference set` invocation in Phase 3 with `run_inference_set_with_retry`, maintaining existing exit code capture and post-switch assertions.
OpenClaw inference switch retry integration `test/e2e/test-openclaw-inference-switch.sh`	Sources the retry helper and replaces direct `nemoclaw inference set` invocation in Phase 3 with `run_inference_set_with_retry`, maintaining existing exit code capture and route/config assertions.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Suggested labels

E2E, CI/CD, status: rfr

Suggested reviewers

jyaunches

Poem

🐰 A hop, a retry, with wisdom so sweet,
Transient troubles now beat a retreat!
Backoff and fallback, through timeouts we race,
Inference switches now land with such grace. ✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 7.69% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title directly and accurately describes the main change: adding retry logic for inference switch verification in E2E tests.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

Comment @coderabbitai help to get the list of available commands and usage tips.

github-actions · 2026-05-24T07:28:35Z

E2E Advisor Recommendation

Required E2E: None
Optional E2E: hermes-inference-switch-e2e, openclaw-inference-switch-e2e

Dispatch hint: hermes-inference-switch-e2e,openclaw-inference-switch-e2e

Workflow run

Full advisor summary

E2E Recommendation Advisor

Base: origin/main
Head: HEAD
Confidence: high

Required E2E

None.

Optional E2E

hermes-inference-switch-e2e (high; live cloud inference and sandbox install, timeout 60 minutes): Optional confidence check for the changed Hermes inference-switch E2E script and new shared retry helper; validates the helper works in the real workflow while preserving final route/config/live inference assertions.
openclaw-inference-switch-e2e (high; live cloud inference and sandbox install, timeout 45 minutes): Optional confidence check for the changed OpenClaw inference-switch E2E script and new shared retry helper; validates retry/fallback behavior around nemoclaw inference set plus live sandbox and agent requests after the switch.

New E2E recommendations

None.

Dispatch hint

Workflow: .github/workflows/nightly-e2e.yaml
jobs input: hermes-inference-switch-e2e,openclaw-inference-switch-e2e

github-actions · 2026-05-24T07:28:36Z

E2E Scenario Advisor Recommendation

Required scenario E2E: None
Optional scenario E2E: None

Workflow run

Full scenario advisor summary

E2E Scenario Advisor

Base: origin/main
Head: HEAD
Confidence: high

Required scenario E2E

None. No scenario workflow, scenario metadata, scenario runtime, or validation-suite files changed.

Optional scenario E2E

None.

Relevant changed files

None.

github-actions · 2026-05-24T07:29:27Z

PR Review Advisor

Findings: 0 needs attention, 1 worth checking, 0 nice ideas
Since last review: 1 prior item resolved, 1 still applies, 0 new items found

Review findings

🛠️ Needs attention

None.

🔎 Worth checking

Coordinate overlapping inference-switch changes: The changed files still exist and this patch applies cleanly, but trusted drift context reports active open PRs touching the same inference-switch E2E scripts. Some overlapping work appears to use a different strategy, such as passing --no-verify directly, while this PR retries verified switches before falling back only after transient failures. Independent landing could overwrite, duplicate, or unintentionally combine inference-switch behavior.
- Recommendation: Before landing, rebase against or reconcile the same-file changes from the overlapping PRs so the final retry/fallback behavior is intentional.
- Evidence: Previous advisor overlap finding still applies. Trusted overlap context lists PR fix(e2e): pass --no-verify to inference set in switch E2E tests #4143 on both inference-switch scripts, PRs fix(e2e): skip redundant inference verify in hermes-inference-switch #4109 and fix(e2e): widen routing retry window in full-e2e security-posture test #4110 on test/e2e/test-hermes-inference-switch.sh, and PR fix(e2e): handle top-level payloads in openclaw agent JSON output #4030 on test/e2e/test-openclaw-inference-switch.sh. Drift evidence confirms the changed files still exist.

🌱 Nice ideas

None.

Workflow run details

This is an automated advisory review. A human maintainer must make the final merge decision.

coderabbitai

🧹 Nitpick comments (1)

test/e2e/test-hermes-inference-switch.sh (1)

47-77: ⚡ Quick win

Consider extracting retry helpers to a shared library.

The is_transient_inference_set_failure() and run_inference_set_with_retry() functions are duplicated between this file and test/e2e/test-openclaw-inference-switch.sh (lines 47-77 in both). The only difference is the inference set command invocation (nemohermes vs nemoclaw).

Extracting these to a shared library (e.g., test/e2e/lib/inference-switch-retry.sh) with a parameterized command would eliminate duplication and make future maintenance easier.

♻️ Example extraction approach

In test/e2e/lib/inference-switch-retry.sh:

is_transient_inference_set_failure() {
  grep -qiE 'timed? out|timeout|ETIMEDOUT|ECONNRESET|EAI_AGAIN|ENOTFOUND|502|503|504|temporar' <<<"$1"
}

run_inference_set_with_retry() {
  local cmd="$1"
  shift
  local attempt rc output fallback_output
  local attempts="${NEMOCLAW_SWITCH_SET_ATTEMPTS:-3}"
  for attempt in $(seq 1 "$attempts"); do
    output=$("$cmd" "$@" 2>&1)
    rc=$?
    # ... rest of logic
  done
}

Then source and invoke with:

run_inference_set_with_retry nemohermes inference set --provider "$SWITCH_PROVIDER" --model "$SWITCH_MODEL"
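
One way the elided loop body could be filled in, as a sketch under stated assumptions rather than the merged implementation. The classifier regex is copied from the suggestion above; `SWITCH_RETRY_STEP_SECONDS` is a hypothetical knob, and appending `--no-verify` as the last argument assumes the CLI accepts the flag in that position. The attempt count is assumed to have been validated as a positive integer already.

```bash
#!/usr/bin/env bash
is_transient_inference_set_failure() {
  grep -qiE 'timed? out|timeout|ETIMEDOUT|ECONNRESET|EAI_AGAIN|ENOTFOUND|502|503|504|temporar' <<<"$1"
}

# Usage: run_inference_set_with_retry <command> [args...]
run_inference_set_with_retry() {
  local attempts="${NEMOCLAW_SWITCH_SET_ATTEMPTS:-3}"
  local step="${SWITCH_RETRY_STEP_SECONDS:-5}" # hypothetical backoff step
  local attempt rc=0 output=""
  for ((attempt = 1; attempt <= attempts; attempt++)); do
    rc=0
    output=$("$@" 2>&1) || rc=$?
    if ((rc == 0)); then
      printf '%s\n' "$output"
      return 0
    fi
    # A non-transient failure is a real error: surface it without retrying.
    if ! is_transient_inference_set_failure "$output"; then
      printf '%s\n' "$output"
      return "$rc"
    fi
    # Linear backoff between verified attempts, none after the last one.
    if ((attempt < attempts)); then
      sleep $((attempt * step))
    fi
  done
  # Every verified attempt failed transiently: try once more without verification.
  rc=0
  output=$("$@" --no-verify 2>&1) || rc=$?
  printf '%s\n' "$output"
  return "$rc"
}
```

Passing the whole command as `"$@"` avoids a separate `cmd` parameter and keeps argument quoting intact. Note that the bare `502|503|504` alternatives match those digits anywhere in the output, so a stricter pattern may be worth considering if false positives show up.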

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/e2e/test-hermes-inference-switch.sh` around lines 47 - 77, Extract the
duplicated functions is_transient_inference_set_failure and
run_inference_set_with_retry into a shared script (e.g.,
test/e2e/lib/inference-switch-retry.sh), make run_inference_set_with_retry
accept the full command and its args (e.g., run_inference_set_with_retry
nemohermes inference set --provider "$SWITCH_PROVIDER" --model "$SWITCH_MODEL"),
preserve the existing behavior including using NEMOCLAW_SWITCH_SET_ATTEMPTS and
the fallback invocation with --no-verify, and update both
test/e2e/test-hermes-inference-switch.sh and
test/e2e/test-openclaw-inference-switch.sh to source the new lib and call the
parameterized run_inference_set_with_retry instead of their local copies of
nemohermes/nemoclaw-specific implementations.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: a928bf82-f3ef-412e-b2c4-efad71a08cbe

📥 Commits

Reviewing files that changed from the base of the PR and between 68e5126 and 26fdb76.

📒 Files selected for processing (2)

test/e2e/test-hermes-inference-switch.sh
test/e2e/test-openclaw-inference-switch.sh

cv · 2026-05-24T07:58:52Z

Addressed feedback:

Extracted the duplicated retry/fallback logic into test/e2e/lib/inference-switch-retry.sh and sourced it from both OpenClaw and Hermes inference-switch E2Es.
Added validation for NEMOCLAW_SWITCH_SET_ATTEMPTS so invalid values fail explicitly before retrying.
Rebased/merged current main after fix(docker): retry gosu release download #4150 landed; PR diff is now limited to the inference-switch E2E scripts plus the shared helper.
The remaining overlap advisory is understood: this PR intentionally keeps the safer behavior of verified retries first, then a single --no-verify fallback only for transient verification failures. That should supersede the older draft PRs that pass --no-verify directly.

Validation:

bash -n for the changed shell scripts and helper.
Local helper smoke tests for invalid attempt count, successful retry, and --no-verify fallback.
npx prek run --all-files passed.
npm test passed on rerun.

Optional selective E2E is running for hermes-inference-switch-e2e,openclaw-inference-switch-e2e: https://github.com/NVIDIA/NemoClaw/actions/runs/26355753234

github-actions · 2026-05-24T08:11:02Z

Selective E2E Results — ✅ All requested jobs passed

Run: 26355753234
Target ref: c4383674b3d5719b8ae2cf08f1b31628b3f79ff0
Workflow ref: main
Requested jobs: hermes-inference-switch-e2e,openclaw-inference-switch-e2e
Summary: 2 passed, 0 failed, 0 skipped

Job	Result
hermes-inference-switch-e2e	✅ success
openclaw-inference-switch-e2e	✅ success

cv added 3 commits May 24, 2026 00:04

fix(docker): retry gosu release download

e450d09

fix(docker): harden gosu curl download

68e5126

test(e2e): retry inference switch verification

26fdb76

cv self-assigned this May 24, 2026

coderabbitai Bot reviewed May 24, 2026

cv mentioned this pull request May 24, 2026

test(e2e): rely on Kimi trajectory acceptance #4153

Merged

12 tasks

cv added the v0.0.51 Release target label May 24, 2026

cv changed the base branch from fix/retry-gosu-download to main May 24, 2026 07:44

cv marked this pull request as ready for review May 24, 2026 07:44

cv added 2 commits May 24, 2026 00:46

Merge remote-tracking branch 'origin/main' into draft/inference-switch-retry-fallback

95f9304

test(e2e): share inference switch retry helper

c438367

cv merged commit 51efc4f into main May 24, 2026
31 checks passed

This was referenced May 25, 2026

nightly-e2e: hermes-inference-switch-e2e fails — inference verify timeout on z-ai/glm-5.1 #4111

Closed

fix(e2e): skip redundant inference verify in hermes-inference-switch #4109

Closed

jyaunches mentioned this pull request May 26, 2026

test(e2e): migrate Hermes feature coverage to scenario suites #3811

Closed

cv deleted the draft/inference-switch-retry-fallback branch May 27, 2026 21:16
