Skip to content

fix(model): retry inference probe on transient network errors#552

Open
bussyjd wants to merge 1 commit into
mainfrom
fix/probe-retry-network-errors
Open

fix(model): retry inference probe on transient network errors#552
bussyjd wants to merge 1 commit into
mainfrom
fix/probe-retry-network-errors

Conversation

@bussyjd
Copy link
Copy Markdown
Collaborator

@bussyjd bussyjd commented May 25, 2026

Summary

`obol model setup custom` validates a candidate LLM endpoint by POSTing a 1-token chat completion. The probe was one-shot: any `client.Do` error (DNS flake, TCP reset, momentary route loss) failed the whole validation.

This hardens `internal/model/model.go::ValidateCustomEndpoint` by adding a bounded retry on Go-level network errors only.

Bug repro

release-smoke flow-04 step 2 against silvermesh on 2026-05-25:

```
✗ endpoint validation failed: inference probe failed —
cannot reach http://silvermesh.v1337.lan:8081/v1/chat/completions:
Post ...: cannot reach
```

The exact same POST returned HTTP 200 minutes later from the same host. Not a code bug on either side — a transient route flake from the Mac host. The strict one-shot probe turned that flake into a release-gate failure.

Fix

Bounded retry around `client.Do` (3 attempts, 250ms · 1s · 4s backoff). Retry only on Go-level network errors. Non-2xx HTTP responses are real upstream signals and still fail fast.

Why retry only on net errors

Signal Meaning Action
`client.Do()` returns Go error DNS / TCP / TLS flake retry
HTTP 4xx bad model / auth / shape fail-fast (config bug)
HTTP 5xx upstream broken fail-fast (operator action)
HTTP 200 + empty choices mlx-lm-shape server existing error message
HTTP 200 + ≥1 choice real success

Footprint

  • `internal/model/model.go` — +25 LoC. Adds package-level `probeBackoffSleep` (overridable for tests) and the retry loop. Per-attempt: clone the request, re-attach the payload body (single-use), call `client.Do`.
  • `internal/model/model_test.go` — +3 tests:
    • `TestValidateCustomEndpoint_RetriesOnNetworkError` — table-driven: 0/1/2 transient errors then OK (within budget), 3 errors (budget exhausted, fails)
    • `TestValidateCustomEndpoint_NoRetryOnNon2xx` — 401/404/503 all fail-fast on 1 POST
    • `TestValidateCustomEndpoint_NoRetryOnInvalidResponseBody` — malformed JSON fails-fast on 1 POST

Tests stub `probeBackoffSleep` to no-op so they don't wait. Transient errors are simulated via `panic(http.ErrAbortHandler)` in the test server which makes `client.Do` return a Go-level error (same shape as a real network flake).

Test plan

  • `go build ./...` clean
  • `go test ./internal/model/...` green
  • Reviewer: run release-smoke flow-04 against an LLM endpoint that's reachable but occasionally drops connections (e.g. via toxiproxy or just network flakiness) — should now retry rather than fail outright

Notes

  • Retry budget is hardcoded (3 / 250ms / 1s / 4s). No CLI flag, no env var. If a future caller needs different policy, add an explicit ValidateCustomEndpointWithOptions variant.
  • Reachability check at the top of the function already has fallback paths (`/models` → `/health` → `/`), no extra retry needed there.
  • The buy.py path in flow-08 is a different code path and is NOT touched by this PR.

obol model setup custom validates a candidate LLM endpoint by POSTing a
1-token chat completion. The probe was one-shot: any client.Do error
(DNS flake, TCP reset, momentary route loss) failed the whole
validation, surfacing in release-smoke flow-04 step 2 as:

  ✗ endpoint validation failed: inference probe failed —
    cannot reach http://silvermesh.v1337.lan:8081/v1/chat/completions:
    Post ...: cannot reach

Reproduced 2026-05-25 — the exact same POST returned HTTP 200 minutes
later from the same host. No code bug on either side, just a transient
route flake.

Add a bounded retry around client.Do (3 attempts, 250ms · 1s · 4s
backoff). Retry ONLY on Go-level network errors. Non-2xx HTTP responses
are real upstream signals (4xx = config bug, 5xx = upstream broken) and
still fail fast — retry won't help.

Tests inject a no-op sleep via package-level probeBackoffSleep var.
Three new tests cover the retry table, non-2xx no-retry, and invalid
response body no-retry.
@bussyjd bussyjd mentioned this pull request May 25, 2026
10 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant