Skip to content

feat: add --prompt-retries for automatic retry on transient errors#142

Closed
lupuletic wants to merge 4 commits intoopenclaw:mainfrom
lupuletic:fix/prompt-retry-on-transient-errors
Closed

feat: add --prompt-retries for automatic retry on transient errors#142
lupuletic wants to merge 4 commits intoopenclaw:mainfrom
lupuletic:fix/prompt-retry-on-transient-errors

Conversation

@lupuletic
Copy link
Copy Markdown

Summary

  • Add --prompt-retries <count> CLI flag to automatically retry failed prompt turns on transient ACP errors
  • Add isRetryablePromptError() to classify transient vs permanent errors — retries only on ACP -32603 (internal error) and -32700 (parse error)
  • Implement exponential backoff: 1s, 2s, 4s, 8s, capped at 10s
  • Never retry auth, permission, timeout, or session-not-found errors
  • Skip retry when the agent process crashed during the prompt
  • Default: 0 retries (preserving existing behavior)

Closes #137

How it works

When a model API returns a transient error (e.g. HTTP 400/500), the ACP agent wraps it as a JSON-RPC -32603 internal error. With --prompt-retries 3, acpx will:

  1. Detect the error is transient via isRetryablePromptError()
  2. Wait with exponential backoff (1s → 2s → 4s, capped at 10s)
  3. Re-send the prompt to the still-running agent
  4. Log each retry attempt to stderr
# Enable 3 retries for flaky model APIs
acpx --prompt-retries 3 claude prompt "Hello"

# Also works with exec mode
acpx --prompt-retries 2 claude exec "Analyze this code"

Test plan

  • 13 new unit tests for isRetryablePromptError() covering all error categories
  • Full test suite passing (263 tests, 0 failures)
  • TypeScript type check clean
  • Linting + formatting clean
  • Pre-commit hooks pass

Note: pnpm run check fails at the repo-wide coverage gate due to pre-existing baseline coverage totals unrelated to this change (same issue noted in PR #128).

AI-assisted PR

  • Marked as AI-assisted
  • Fully tested locally (pnpm build && pnpm test && pnpm lint)
  • I understand what the code does

🤖 Generated with Claude Code

lupuleticatalin and others added 4 commits March 15, 2026 21:35
When model APIs return transient errors (HTTP 400/500 surfacing as ACP
internal errors), acpx now supports automatic retry with exponential
backoff via the --prompt-retries flag.

- Add isRetryablePromptError() to classify transient vs permanent errors
- Retry only on ACP -32603 (internal) and -32700 (parse) errors
- Never retry auth, permission, timeout, or session-not-found errors
- Exponential backoff: 1s, 2s, 4s, 8s, capped at 10s
- Skip retry if agent process crashed during prompt
- Flow --prompt-retries through CLI → queue owner → prompt execution
- Default: 0 (no retries, preserving existing behavior)

Closes openclaw#137

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@dutifulbob
Copy link
Copy Markdown
Collaborator

Triage result

Human attention: ⚠️ Required
Recommendation: 🏁 escalate to a human
Human decision needed: ready for human landing decision

Quick read

This PR adds an opt-in --prompt-retries mechanism so transient ACP/model failures can recover without manually resending the prompt. The solution shape is good, targeted validation passed, blocking Codex findings were fixed, the final conflict gate is clean, and CI is green after approving the blocked workflow rerun.

Intent

Add a configurable automatic retry mechanism for acpx prompt and exec turns so transient ACP internal/model API failures can recover automatically instead of aborting the turn and forcing a manual retry.

Why

Issue #137 reports transient 400/500-style model/API failures surfacing as ACP internal errors and interrupting prompt/session flow even when retrying would often recover. This PR addresses that by adding retry classification plus exponential backoff in the main prompt paths while excluding clearly permanent failures.

Codex review

GitHub Codex review data for the current head was empty; the local Codex review was established and used as the primary review evidence.

Blocking issues were fixed before handoff:

  • Prevented retries from replaying a prompt after that attempt had already produced side effects or streamed partial output.
  • Preserved --json-strict behavior by suppressing retry notices on stderr in strict JSON retry flows.

Remaining Codex notes are non-blocking follow-ups in queued session mode:

  • Per-command --prompt-retries is still sticky to an already-running warm queue owner.
  • Deferred cancels during retry backoff are not reapplied before the next queued retry attempt.

CI/CD

CI status: green for the current head.

Approval-blocked workflow handling:

  • Approved GitHub Actions run 23687720407 when it surfaced as action_required.
  • Watched rerun attempt 2 complete successfully for head 916b8d00efbbe1e86cf7b759640676208ec74ca7.

Passing jobs:

  • scope
  • Format
  • Lint
  • Typecheck
  • Build
  • Test
  • Conformance Smoke

Other notes:

  • Docs was skipped by scope detection.
  • Initial conflict check was clean.
  • Final conflict gate is clean.

Recommendation

Ready for human landing decision.

If you are comfortable landing with the remaining non-blocking queued-session follow-ups deferred, this PR looks safe to continue: intent is correct, blocking review issues were addressed, targeted validation passed, and CI is green.

@osolmaz osolmaz closed this in #196 Mar 29, 2026
osolmaz added a commit that referenced this pull request Mar 29, 2026
)

* feat: add --prompt-retries flag for automatic retry on transient errors

When model APIs return transient errors (HTTP 400/500 surfacing as ACP
internal errors), acpx now supports automatic retry with exponential
backoff via the --prompt-retries flag.

- Add isRetryablePromptError() to classify transient vs permanent errors
- Retry only on ACP -32603 (internal) and -32700 (parse) errors
- Never retry auth, permission, timeout, or session-not-found errors
- Exponential backoff: 1s, 2s, 4s, 8s, capped at 10s
- Skip retry if agent process crashed during prompt
- Flow --prompt-retries through CLI → queue owner → prompt execution
- Default: 0 (no retries, preserving existing behavior)

Closes #137

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Fix prompt retry replay after partial output

* Honor strict JSON during prompt retries

* docs(changelog): note prompt retry support (#142)

---------

Co-authored-by: Catalin Lupuleti <catalin.lupuleti@gmail.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: bob <dutifulbob@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

It is necessary to automatically retry the function when the model api returns 400.

3 participants