
test(e2e): retry inference switch verification #4152

Merged
cv merged 5 commits into main from draft/inference-switch-retry-fallback
May 24, 2026


Collaborator

@cv cv commented May 24, 2026

Summary

The nightly flake sweep showed openclaw-inference-switch-e2e and hermes-inference-switch-e2e as high-frequency failures, mostly due to transient live endpoint verification timeouts during inference set. This PR retries transient verification failures and falls back to --no-verify only after repeated transient failures, while keeping the later route/config/live-request assertions as the real correctness gate.

Changes

  • Add shared inference-switch retry helpers in test/e2e/lib/inference-switch-retry.sh.
  • Validate NEMOCLAW_SWITCH_SET_ATTEMPTS as a positive integer before retrying.
  • Retry verified nemoclaw inference set / nemohermes inference set attempts when failures look transient.
  • After repeated transient verification failures, retry once with --no-verify; subsequent route, config, sandbox inference, and agent/API request checks still validate the switched route.
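The attempt-count validation described in the list above could look roughly like the following sketch. The function name `validate_switch_attempts` and the exit code are assumptions made for illustration; only the `NEMOCLAW_SWITCH_SET_ATTEMPTS` variable comes from this PR, and the default of 3 is taken from the review suggestion further down.

```shell
#!/usr/bin/env bash

# Hypothetical validator: accepts only a decimal integer >= 1.
# The name and the return code are illustrative, not the PR's actual code.
validate_switch_attempts() {
  local attempts="${1:-}"
  # Reject empty strings, zero, negatives, leading zeros, and non-digits.
  if [[ ! "$attempts" =~ ^[1-9][0-9]*$ ]]; then
    echo "NEMOCLAW_SWITCH_SET_ATTEMPTS must be a positive integer, got: '${attempts}'" >&2
    return 2
  fi
}
```

A caller would typically run `validate_switch_attempts "${NEMOCLAW_SWITCH_SET_ATTEMPTS:-3}" || exit 1` before entering the retry loop, so that a typo in the environment fails fast instead of silently skipping all attempts.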

Type of Change

  • Code change (feature, bug fix, or refactor)
  • Code change with doc updates
  • Doc only (prose changes, no code sample modifications)
  • Doc only (includes code sample changes)

Verification

  • npx prek run --all-files passes
  • npm test passes
  • Tests added or updated for new or changed behavior
  • No secrets, API keys, or credentials committed
  • Docs updated for user-facing behavior changes
  • make docs builds without warnings (doc changes only)
  • Doc pages follow the style guide (doc changes only)
  • New doc pages include SPDX header and frontmatter (new pages only)

Signed-off-by: Carlos Villela <cvillela@nvidia.com>

@cv cv self-assigned this May 24, 2026

copy-pr-bot Bot commented May 24, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

Contributor

coderabbitai Bot commented May 24, 2026

📝 Walkthrough

Walkthrough

A new shared Bash E2E helper library detects transient inference-switch failures (timeouts, connection errors, DNS issues, 5xx responses) and retries failed commands up to a configurable limit with linear backoff. Both Hermes and OpenClaw test scripts now integrate this helper, replacing direct inference set invocations with a retry wrapper during Phase 3.
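To make the "configurable limit with linear backoff" part concrete, here is a minimal sketch of such a loop. It shows only the backoff schedule and deliberately leaves out transient-failure classification and the `--no-verify` fallback. The function name, argument order, and step size are assumptions, not taken from the helper in this PR.

```shell
#!/usr/bin/env bash

# Run a command up to $1 times, sleeping attempt * $2 seconds between tries.
# Both the function name and the argument order are illustrative.
retry_with_linear_backoff() {
  local attempts="$1" step_seconds="$2"
  shift 2
  local attempt rc=0
  for ((attempt = 1; attempt <= attempts; attempt++)); do
    if "$@"; then
      return 0
    else
      rc=$?
    fi
    # Linear backoff: wait 1*step, then 2*step, and so on, but never
    # after the final attempt.
    if ((attempt < attempts)); then
      sleep "$((attempt * step_seconds))"
    fi
  done
  # Propagate the exit code of the last failed attempt.
  return "$rc"
}
```

With a step of 5 seconds and three attempts, the waits would be 5 s and 10 s, for example `retry_with_linear_backoff 3 5 nemoclaw inference set --provider "$SWITCH_PROVIDER" --model "$SWITCH_MODEL"`.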

Changes

Inference switch retry resilience

| Layer / File(s) | Summary |
| --- | --- |
| Transient failure detection and retry helper: `test/e2e/lib/inference-switch-retry.sh` | New helper module implements regex-based transient failure classification (timeouts, connection/reset/DNS errors, 5xx responses), configurable retry loop with linear backoff, optional logging via existing info function, and --no-verify fallback for final transient failures. |
| Hermes inference switch retry integration: `test/e2e/test-hermes-inference-switch.sh` | Sources the retry helper and replaces direct nemohermes inference set invocation in Phase 3 with run_inference_set_with_retry, maintaining existing exit code capture and post-switch assertions. |
| OpenClaw inference switch retry integration: `test/e2e/test-openclaw-inference-switch.sh` | Sources the retry helper and replaces direct nemoclaw inference set invocation in Phase 3 with run_inference_set_with_retry, maintaining existing exit code capture and route/config assertions. |
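The regex-based classification mentioned in the summary can be sketched as below. The pattern is illustrative: it reuses the error tokens from the review suggestion later in this thread and, as an assumption of this sketch, requires the 5xx codes to stand apart from other digits so that something like a port number is not mistaken for an HTTP status. The merged helper's actual pattern may differ.

```shell
#!/usr/bin/env bash

# Return 0 when the captured command output looks like a transient failure.
# Pattern and function name are illustrative, not the PR's exact code.
is_transient_failure() {
  local output="$1"
  # Timeouts, connection resets, DNS errors, "temporary ..." messages,
  # and 502/503/504 not embedded in a longer number.
  local pattern='timed? out|timeout|ETIMEDOUT|ECONNRESET|EAI_AGAIN|ENOTFOUND|temporar|(^|[^0-9])(502|503|504)([^0-9]|$)'
  grep -qiE "$pattern" <<<"$output"
}
```

Keeping the classifier separate from the retry loop means a hard failure such as a bad credential is reported immediately rather than being retried.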

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Suggested labels

E2E, CI/CD, status: rfr

Suggested reviewers

  • jyaunches

Poem

🐰 A hop, a retry, with wisdom so sweet,
Transient troubles now beat a retreat!
Backoff and fallback, through timeouts we race,
Inference switches now land with such grace. ✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 7.69% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title directly and accurately describes the main change: adding retry logic for inference switch verification in E2E tests. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch draft/inference-switch-retry-fallback

Comment @coderabbitai help to get the list of available commands and usage tips.

Contributor

github-actions Bot commented May 24, 2026

E2E Advisor Recommendation

Required E2E: None
Optional E2E: hermes-inference-switch-e2e, openclaw-inference-switch-e2e

Dispatch hint: hermes-inference-switch-e2e,openclaw-inference-switch-e2e

Workflow run

Full advisor summary

E2E Recommendation Advisor

Base: origin/main
Head: HEAD
Confidence: high

Required E2E

  • None.

Optional E2E

  • hermes-inference-switch-e2e (high; live cloud inference and sandbox install, timeout 60 minutes): Optional confidence check for the changed Hermes inference-switch E2E script and new shared retry helper; validates the helper works in the real workflow while preserving final route/config/live inference assertions.
  • openclaw-inference-switch-e2e (high; live cloud inference and sandbox install, timeout 45 minutes): Optional confidence check for the changed OpenClaw inference-switch E2E script and new shared retry helper; validates retry/fallback behavior around nemoclaw inference set plus live sandbox and agent requests after the switch.

New E2E recommendations

  • None.

Dispatch hint

  • Workflow: .github/workflows/nightly-e2e.yaml
  • jobs input: hermes-inference-switch-e2e,openclaw-inference-switch-e2e

Contributor

github-actions Bot commented May 24, 2026

E2E Scenario Advisor Recommendation

Required scenario E2E: None
Optional scenario E2E: None

Workflow run

Full scenario advisor summary

E2E Scenario Advisor

Base: origin/main
Head: HEAD
Confidence: high

Required scenario E2E

  • None. No scenario workflow, scenario metadata, scenario runtime, or validation-suite files changed.

Optional scenario E2E

  • None.

Relevant changed files

  • None.

Contributor

github-actions Bot commented May 24, 2026

PR Review Advisor

Findings: 0 needs attention, 1 worth checking, 0 nice ideas
Since last review: 1 prior item resolved, 1 still applies, 0 new items found

Review findings

🛠️ Needs attention

  • None.

🔎 Worth checking

🌱 Nice ideas

  • None.
Since last review details

Current findings:

Workflow run details

This is an automated advisory review. A human maintainer must make the final merge decision.

Contributor

@coderabbitai coderabbitai Bot left a comment


🧹 Nitpick comments (1)
test/e2e/test-hermes-inference-switch.sh (1)

47-77: ⚡ Quick win

Consider extracting retry helpers to a shared library.

The is_transient_inference_set_failure() and run_inference_set_with_retry() functions are duplicated between this file and test/e2e/test-openclaw-inference-switch.sh (lines 47-77 in both). The only difference is the inference set command invocation (nemohermes vs nemoclaw).

Extracting these to a shared library (e.g., test/e2e/lib/inference-switch-retry.sh) with a parameterized command would eliminate duplication and make future maintenance easier.

♻️ Example extraction approach

In test/e2e/lib/inference-switch-retry.sh:

is_transient_inference_set_failure() {
  grep -qiE 'timed? out|timeout|ETIMEDOUT|ECONNRESET|EAI_AGAIN|ENOTFOUND|502|503|504|temporar' <<<"$1"
}

run_inference_set_with_retry() {
  local cmd="$1"
  shift
  local attempt rc output fallback_output
  local attempts="${NEMOCLAW_SWITCH_SET_ATTEMPTS:-3}"
  for attempt in $(seq 1 "$attempts"); do
    output=$("$cmd" "$@" 2>&1)
    rc=$?
    # ... rest of logic
  done
}

Then source and invoke with:

run_inference_set_with_retry nemohermes inference set --provider "$SWITCH_PROVIDER" --model "$SWITCH_MODEL"
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/e2e/test-hermes-inference-switch.sh` around lines 47 - 77, Extract the
duplicated functions is_transient_inference_set_failure and
run_inference_set_with_retry into a shared script (e.g.,
test/e2e/lib/inference-switch-retry.sh), make run_inference_set_with_retry
accept the full command and its args (e.g., run_inference_set_with_retry
nemohermes inference set --provider "$SWITCH_PROVIDER" --model "$SWITCH_MODEL"),
preserve the existing behavior including using NEMOCLAW_SWITCH_SET_ATTEMPTS and
the fallback invocation with --no-verify, and update both
test/e2e/test-hermes-inference-switch.sh and
test/e2e/test-openclaw-inference-switch.sh to source the new lib and call the
parameterized run_inference_set_with_retry instead of their local copies of
nemohermes/nemoclaw-specific implementations.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@test/e2e/test-hermes-inference-switch.sh`:
- Around line 47-77: Extract the duplicated functions
is_transient_inference_set_failure and run_inference_set_with_retry into a
shared script (e.g., test/e2e/lib/inference-switch-retry.sh), make
run_inference_set_with_retry accept the full command and its args (e.g.,
run_inference_set_with_retry nemohermes inference set --provider
"$SWITCH_PROVIDER" --model "$SWITCH_MODEL"), preserve the existing behavior
including using NEMOCLAW_SWITCH_SET_ATTEMPTS and the fallback invocation with
--no-verify, and update both test/e2e/test-hermes-inference-switch.sh and
test/e2e/test-openclaw-inference-switch.sh to source the new lib and call the
parameterized run_inference_set_with_retry instead of their local copies of
nemohermes/nemoclaw-specific implementations.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: a928bf82-f3ef-412e-b2c4-efad71a08cbe

📥 Commits

Reviewing files that changed from the base of the PR and between 68e5126 and 26fdb76.

📒 Files selected for processing (2)
  • test/e2e/test-hermes-inference-switch.sh
  • test/e2e/test-openclaw-inference-switch.sh

@cv cv added the v0.0.51 Release target label May 24, 2026
@cv cv changed the base branch from fix/retry-gosu-download to main May 24, 2026 07:44
@cv cv marked this pull request as ready for review May 24, 2026 07:44
Collaborator Author

cv commented May 24, 2026

Addressed feedback:

  • Extracted the duplicated retry/fallback logic into test/e2e/lib/inference-switch-retry.sh and sourced it from both OpenClaw and Hermes inference-switch E2Es.
  • Added validation for NEMOCLAW_SWITCH_SET_ATTEMPTS so invalid values fail explicitly before retrying.
  • Rebased/merged current main after "fix(docker): retry gosu release download" (#4150) landed; PR diff is now limited to the inference-switch E2E scripts plus the shared helper.
  • The remaining overlap advisory is understood: this PR intentionally keeps the safer behavior of verified retries first, then a single --no-verify fallback only for transient verification failures. That should supersede the older draft PRs that pass --no-verify directly.

Validation:

  • bash -n for the changed shell scripts and helper.
  • Local helper smoke tests for invalid attempt count, successful retry, and --no-verify fallback.
  • npx prek run --all-files passed.
  • npm test passed on rerun.
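A local smoke test for the `--no-verify` fallback can be approximated with a stub CLI, along the lines of the sketch below. Everything here is hypothetical: `run_with_fallback` is a compact stand-in for the shared helper (without backoff or logging), and `fake_cli` simulates a command whose verification always times out. It is meant to show the shape of such a test, not the tests that were actually run for this PR.

```shell
#!/usr/bin/env bash

# Stand-in for the shared helper (hypothetical; not the PR's actual code):
# verified attempts first, then a single --no-verify attempt if every
# verified attempt failed with a transient-looking message.
run_with_fallback() {
  local attempts="$1"
  shift
  local attempt output rc=0
  for ((attempt = 1; attempt <= attempts; attempt++)); do
    if output="$("$@" 2>&1)"; then
      printf '%s\n' "$output"
      return 0
    else
      rc=$?
    fi
    # Non-transient failures are returned immediately, without fallback.
    if ! grep -qiE 'timed? out|timeout' <<<"$output"; then
      printf '%s\n' "$output" >&2
      return "$rc"
    fi
  done
  "$@" --no-verify
}

# Stub CLI: records each invocation in $CALL_LOG and fails with a timeout
# message unless --no-verify is passed.
fake_cli() {
  printf '%s\n' "$*" >>"$CALL_LOG"
  case " $* " in
    *" --no-verify "*)
      echo "route switched (unverified)"
      return 0
      ;;
  esac
  echo "verification timed out" >&2
  return 1
}

smoke_test_fallback() {
  CALL_LOG="$(mktemp)"
  local out
  out="$(run_with_fallback 2 fake_cli inference set)" || return 1
  # Expect two verified attempts followed by exactly one fallback attempt.
  [[ "$(wc -l <"$CALL_LOG")" -eq 3 ]] || return 1
  [[ "$(tail -n 1 "$CALL_LOG")" == "inference set --no-verify" ]] || return 1
  [[ "$out" == "route switched (unverified)" ]] || return 1
  rm -f "$CALL_LOG"
}
```

The stub writes to a log file rather than a shell variable because the helper runs it inside a command substitution, where variable updates would be lost.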

Optional selective E2E is running for hermes-inference-switch-e2e,openclaw-inference-switch-e2e: https://github.com/NVIDIA/NemoClaw/actions/runs/26355753234

@github-actions
Contributor

Selective E2E Results — ✅ All requested jobs passed

Run: 26355753234
Target ref: c4383674b3d5719b8ae2cf08f1b31628b3f79ff0
Workflow ref: main
Requested jobs: hermes-inference-switch-e2e,openclaw-inference-switch-e2e
Summary: 2 passed, 0 failed, 0 skipped

| Job | Result |
| --- | --- |
| hermes-inference-switch-e2e | ✅ success |
| openclaw-inference-switch-e2e | ✅ success |

@cv cv merged commit 51efc4f into main May 24, 2026
31 checks passed
@cv cv deleted the draft/inference-switch-retry-fallback branch May 27, 2026 21:16

Labels

v0.0.51 Release target


2 participants