fix(endpoint): retry EndpointJob.wait() on transient httpx errors by deanq · Pull Request #340 · runpod/flash

deanq · 2026-05-25T03:06:39Z

Summary

EndpointJob.wait() previously called self.status() with zero exception handling, so a single transient httpx.RemoteProtocolError (or any transport/timeout failure) on the Runpod /v2/{id}/status/{job_id} poll aborted the whole wait — even though the underlying job was still healthy. Cold starts (model download, vLLM compile, CUDA graph capture) make this very visible: one dropped poll fails a five-minute wait that was nearly complete.
The polling loop now catches httpx.TransportError and httpx.TimeoutException, logs at debug, applies the existing exponential backoff, and continues. It re-raises only when the user-supplied timeout deadline is exceeded (still TimeoutError), or when _POLL_MAX_CONSECUTIVE_ERRORS (5) consecutive failures hit — so genuinely dead endpoints still fail loud. The counter resets on any successful poll.
httpx.HTTPStatusError (4xx auth/config bugs from raise_for_status) is intentionally NOT caught — it propagates immediately.
The user-space _wait_resilient workaround in flash-examples/02_ml_inference/02_vllm_chat/vllm_chat.py is now obsolete; cleanup of that file is intentionally out of scope for this PR.
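The behavior described above can be sketched as a standalone loop. This is a minimal illustration, not the code in `src/runpod_flash/endpoint.py`: `wait_for_job`, `TransientPollError`, `FatalPollError`, the terminal status set, the backoff factor and the interval cap are all hypothetical names and values, and the guard-then-sleep-then-poll order is an assumption. The stand-in exceptions let the sketch run without httpx installed.

```python
import asyncio
import logging
import time

log = logging.getLogger(__name__)

POLL_MAX_CONSECUTIVE_ERRORS = 5  # threshold stated in this PR
POLL_INITIAL_INTERVAL = 0.25     # default named in this PR's follow-up commit
POLL_BACKOFF_FACTOR = 2.0        # assumed for illustration
POLL_MAX_INTERVAL = 5.0          # assumed for illustration
TERMINAL_STATUSES = {"COMPLETED", "FAILED", "CANCELLED", "TIMED_OUT"}  # assumed set


class TransientPollError(Exception):
    """Stand-in for httpx.TransportError / httpx.TimeoutException."""


class FatalPollError(Exception):
    """Stand-in for httpx.HTTPStatusError; the loop never catches it."""


async def wait_for_job(poll_status, timeout=None,
                       initial_interval=POLL_INITIAL_INTERVAL):
    """Poll until a terminal status, tolerating bursts of transient errors."""
    deadline = None if timeout is None else time.monotonic() + timeout
    interval = initial_interval
    consecutive_errors = 0
    while True:
        # Pre-sleep guard: a blown deadline surfaces as TimeoutError,
        # not as whichever transport error happened last.
        if deadline is not None and time.monotonic() + interval > deadline:
            raise TimeoutError("job did not finish before the deadline")
        await asyncio.sleep(interval)
        try:
            status = await poll_status()
        except TransientPollError as exc:
            consecutive_errors += 1
            log.debug("transient poll error %d: %r", consecutive_errors, exc)
            if consecutive_errors >= POLL_MAX_CONSECUTIVE_ERRORS:
                raise  # the endpoint looks dead, so fail loud
        else:
            consecutive_errors = 0  # any successful poll resets the counter
            if status in TERMINAL_STATUSES:
                return status
        # Backoff applies after failed and non-terminal polls alike.
        interval = min(interval * POLL_BACKOFF_FACTOR, POLL_MAX_INTERVAL)
```

In real code the `except` clause would name the httpx classes instead of the stand-in. In httpx, `TimeoutException` and `RemoteProtocolError` both derive from `TransportError`, while `HTTPStatusError` does not, which is why a 401 or 404 from `raise_for_status` passes straight through a handler like this one.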

Test plan

make quality-check passes (all tests + lint/format, coverage 85.45%).
New unit tests in tests/unit/test_endpoint_client.py::TestEndpointJobWaitTransientErrors:
- transient RemoteProtocolError once, then COMPLETED — wait() returns normally (2 polls).
- persistent RemoteProtocolError — wait() re-raises after _POLL_MAX_CONSECUTIVE_ERRORS polls.
- error / success / error burst / success / COMPLETED — counter resets, wait() completes.
- HTTPStatusError(401) is NOT swallowed; re-raised on first call.
- RemoteProtocolError forever + timeout=0.1 — wait() raises TimeoutError, not the httpx error.
Manual smoke against an endpoint with cold workers (vLLM): await job.wait() survives mid-poll TCP drops instead of aborting.
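For readers without the test file open, the counter-reset and persistent-failure cases can be approximated as below. This is a sketch under assumptions: `wait` is a compact stand-in for `EndpointJob.wait()` with no timeout or backoff, `FakeTransportError` stands in for `httpx.RemoteProtocolError`, and the helper names are hypothetical. The actual tests in `tests/unit/test_endpoint_client.py` may be structured differently.

```python
import asyncio
from unittest.mock import AsyncMock

MAX_CONSECUTIVE_ERRORS = 5  # mirrors _POLL_MAX_CONSECUTIVE_ERRORS from this PR


class FakeTransportError(Exception):
    """Hypothetical stand-in for httpx.RemoteProtocolError."""


async def wait(status, interval=0.001):
    """Compact stand-in for the polling loop; not the SDK's code."""
    errors = 0
    while True:
        try:
            state = await status()
        except FakeTransportError:
            errors += 1
            if errors >= MAX_CONSECUTIVE_ERRORS:
                raise
        else:
            errors = 0  # a successful poll resets the counter
            if state == "COMPLETED":
                return state
        await asyncio.sleep(interval)


def run_counter_reset_case():
    err = FakeTransportError("peer closed connection")
    # error / success / error burst (4, below the threshold) / success / COMPLETED
    status = AsyncMock(side_effect=[
        err, "IN_PROGRESS", err, err, err, err, "IN_PROGRESS", "COMPLETED",
    ])
    result = asyncio.run(wait(status))
    return result, status.call_count


def run_persistent_failure_case():
    # Every poll fails, so the loop should give up at the threshold.
    status = AsyncMock(side_effect=FakeTransportError("peer closed connection"))
    try:
        asyncio.run(wait(status))
    except FakeTransportError:
        return status.call_count
    return None
```

The reset case contains six errors in total, but never more than four in a row, so it completes. The persistent case stops after exactly `MAX_CONSECUTIVE_ERRORS` polls.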

EndpointJob.wait() previously aborted on a single httpx.RemoteProtocolError (or any other transient transport/timeout failure) raised by the Runpod /v2/{id}/status/{job_id} poll, even though the underlying job was still healthy. Multi-minute cold starts amplify this: one dropped poll fails a five-minute wait that was nearly complete.

Catch httpx.TransportError and httpx.TimeoutException inside the polling loop, log at debug, apply the existing exponential backoff, and continue. Re-raise only when:
- the user-supplied timeout deadline is exceeded (TimeoutError), or
- _POLL_MAX_CONSECUTIVE_ERRORS (5) consecutive failures hit, so dead endpoints still fail loud.

The counter resets on any successful poll.

httpx.HTTPStatusError (4xx auth/config bugs) is intentionally NOT caught; it propagates immediately.

Refs AE-3154.

promptless · 2026-05-25T03:11:37Z

Promptless prepared a documentation update related to this change.

Triggered by runpod/flash PR #340

Documents that the Flash SDK's EndpointJob.wait() method now automatically retries transient network errors (connection drops, timeouts, protocol errors) with exponential backoff, making it resilient during cold starts. HTTP status errors like 401 and 404 fail immediately without retry.

Review: Document EndpointJob.wait() retry behavior for transient errors

Copilot

Pull request overview

This PR improves the resilience of EndpointJob.wait() polling by tolerating transient httpx transport/timeouts during job status checks, rather than aborting the entire wait on a single dropped connection.

Changes:

- Add transient-error retry handling to EndpointJob.wait() with exponential backoff and a maximum consecutive-error threshold.
- Introduce _POLL_MAX_CONSECUTIVE_ERRORS to cap tolerated consecutive transient failures.
- Add unit tests covering transient error retry, threshold behavior, counter reset, and HTTPStatusError propagation.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| `src/runpod_flash/endpoint.py` | Adds retry/backoff logic in `EndpointJob.wait()` for transient `httpx` transport/timeout errors with a consecutive-error threshold. |
| `tests/unit/test_endpoint_client.py` | Adds unit tests validating retry behavior and error propagation for `EndpointJob.wait()`. |

The previous test passed `timeout=0.1` against the default `_POLL_INITIAL_INTERVAL=0.25`, so `wait()` raised `TimeoutError` from its pre-sleep deadline guard before ever calling `status()`. The retry path was never exercised; the test only validated the pre-sleep guard.

Apply the `fast_poll` fixture and lift the consecutive-error threshold above the number of retries the deadline allows, so multiple httpx errors are actually suppressed before the deadline trips. Assert `_api_get.call_count >= 2` to lock in that the retry path runs.

Surfaced by Copilot review on AE-3154.
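The pitfall is easy to reproduce in isolation. The sketch below assumes the loop shape this commit message implies (deadline guard, then sleep, then poll) and omits backoff for brevity. `wait`, `FakeTransportError` and `polls_before_timeout` are hypothetical names, and a short constant interval plays the role of the `fast_poll` fixture.

```python
import asyncio
import time
from unittest.mock import AsyncMock


class FakeTransportError(Exception):
    """Hypothetical stand-in for httpx.RemoteProtocolError."""


async def wait(status, timeout, interval, max_errors):
    """Assumed loop shape: deadline guard, then sleep, then poll."""
    deadline = time.monotonic() + timeout
    errors = 0
    while True:
        # If the next sleep would overshoot the deadline, give up now.
        if time.monotonic() + interval > deadline:
            raise TimeoutError("deadline exceeded")
        await asyncio.sleep(interval)
        try:
            if await status() == "COMPLETED":
                return "COMPLETED"
            errors = 0
        except FakeTransportError:
            errors += 1
            if errors >= max_errors:
                raise


def polls_before_timeout(interval, max_errors):
    """Return how many polls ran before wait() raised TimeoutError."""
    status = AsyncMock(side_effect=FakeTransportError("dropped"))
    try:
        asyncio.run(wait(status, timeout=0.1, interval=interval,
                         max_errors=max_errors))
    except TimeoutError:
        return status.call_count
    raise AssertionError("expected TimeoutError")
```

With `interval=0.25` the guard fires on the first iteration and the poll count is zero, so a test that only asserts `TimeoutError` passes without touching the retry path. With a short interval and a threshold high enough not to trip first, several errors are swallowed before the deadline, which is what the `call_count >= 2` assertion pins down.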

deanq requested a review from Copilot May 25, 2026 03:09

Copilot started reviewing on behalf of deanq May 25, 2026 03:09

Copilot AI reviewed May 25, 2026

Comment thread tests/unit/test_endpoint_client.py Outdated

deanq requested review from KAJdev, jhcipar and runpod-Henrik May 25, 2026 04:27

deanq wants to merge 2 commits into main from deanq/ae-3154-endpointjob-wait-retry
