fix(model): retry inference probe on transient network errors #552
Open
bussyjd wants to merge 1 commit into
obol model setup custom validates a candidate LLM endpoint by POSTing a
1-token chat completion. The probe was one-shot: any client.Do error
(DNS flake, TCP reset, momentary route loss) failed the whole
validation, surfacing in release-smoke flow-04 step 2 as:
✗ endpoint validation failed: inference probe failed —
cannot reach http://silvermesh.v1337.lan:8081/v1/chat/completions:
Post ...: cannot reach
Reproduced 2026-05-25 — the exact same POST returned HTTP 200 minutes
later from the same host. No code bug on either side, just a transient
route flake.
Add a bounded retry around client.Do (3 attempts, 250ms · 1s · 4s
backoff). Retry ONLY on Go-level network errors. Non-2xx HTTP responses
are real upstream signals (4xx = config bug, 5xx = upstream broken) and
still fail fast — retry won't help.
Tests inject a no-op sleep via package-level probeBackoffSleep var.
Three new tests cover the retry table, non-2xx no-retry, and invalid
response body no-retry.
Summary
`obol model setup custom` validates a candidate LLM endpoint by POSTing a 1-token chat completion. The probe was one-shot: any `client.Do` error (DNS flake, TCP reset, momentary route loss) failed the whole validation.
This hardens `internal/model/model.go::ValidateCustomEndpoint` by adding a bounded retry on Go-level network errors only.
Bug repro
release-smoke flow-04 step 2 against silvermesh on 2026-05-25:
```
✗ endpoint validation failed: inference probe failed —
cannot reach http://silvermesh.v1337.lan:8081/v1/chat/completions:
Post ...: cannot reach
```
The exact same POST returned HTTP 200 minutes later from the same host. Not a code bug on either side — a transient route flake from the Mac host. The strict one-shot probe turned that flake into a release-gate failure.
Fix
Bounded retry around `client.Do` (3 attempts, 250ms · 1s · 4s backoff). Retry only on Go-level network errors. Non-2xx HTTP responses are real upstream signals and still fail fast.
Why retry only on net errors
Footprint
Tests stub `probeBackoffSleep` with a no-op so they don't wait. Transient errors are simulated via `panic(http.ErrAbortHandler)` in the test server, which makes `client.Do` return a Go-level error (same shape as a real network flake).
Test plan
Notes