fix(endpoint): retry EndpointJob.wait() on transient httpx errors #340

Open
deanq wants to merge 2 commits into main from deanq/ae-3154-endpointjob-wait-retry

Conversation

@deanq
Member

@deanq deanq commented May 25, 2026

Summary

  • EndpointJob.wait() previously called self.status() with zero exception handling, so a single transient httpx.RemoteProtocolError (or any transport/timeout failure) on the Runpod /v2/{id}/status/{job_id} poll aborted the whole wait — even though the underlying job was still healthy. Cold starts (model download, vLLM compile, CUDA graph capture) make this very visible: one dropped poll fails a five-minute wait that was nearly complete.
  • The polling loop now catches httpx.TransportError and httpx.TimeoutException, logs at debug, applies the existing exponential backoff, and continues. It re-raises only when the user-supplied timeout deadline is exceeded (still TimeoutError), or when _POLL_MAX_CONSECUTIVE_ERRORS (5) consecutive failures hit — so genuinely dead endpoints still fail loud. The counter resets on any successful poll.
  • httpx.HTTPStatusError (4xx auth/config bugs from raise_for_status) is intentionally NOT caught — it propagates immediately.
  • The user-space _wait_resilient workaround in flash-examples/02_ml_inference/02_vllm_chat/vllm_chat.py is now obsolete; cleanup of that file is intentionally out of scope for this PR.
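
A minimal sketch of the loop described above, not the actual diff. The exception classes are local stand-ins for their httpx namesakes so the block runs without httpx. The backoff factor, interval cap, terminal status names, guard-then-sleep-then-poll ordering, and the `wait_sketch` / `scripted_poll` helpers are all assumptions made for illustration.

```python
import asyncio
import time


# Stand-ins so the sketch runs without httpx installed. In httpx itself,
# TimeoutException is a subclass of TransportError.
class TransportError(Exception):
    pass


class TimeoutException(TransportError):
    pass


class HTTPStatusError(Exception):
    pass


_POLL_MAX_CONSECUTIVE_ERRORS = 5  # value stated in this PR
_POLL_INITIAL_INTERVAL = 0.25     # value quoted in the review thread
_POLL_BACKOFF_FACTOR = 2.0        # assumed
_POLL_MAX_INTERVAL = 5.0          # assumed
_TERMINAL = {"COMPLETED", "FAILED", "CANCELLED"}  # assumed status names


async def wait_sketch(poll, timeout=None, interval=_POLL_INITIAL_INTERVAL,
                      max_errors=_POLL_MAX_CONSECUTIVE_ERRORS):
    """Poll until a terminal status, tolerating transient transport errors."""
    deadline = None if timeout is None else time.monotonic() + timeout
    consecutive_errors = 0
    while True:
        # Pre-sleep guard: give up if the next sleep would pass the deadline.
        if deadline is not None and time.monotonic() + interval > deadline:
            raise TimeoutError("job did not reach a terminal state in time")
        await asyncio.sleep(interval)
        try:
            status = await poll()
        except (TransportError, TimeoutException):
            # Transient failure: count it and keep polling with backoff.
            consecutive_errors += 1
            if consecutive_errors >= max_errors:
                raise  # endpoint looks dead, fail loud
        else:
            consecutive_errors = 0  # any successful poll resets the counter
            if status in _TERMINAL:
                return status
        # HTTPStatusError is not caught above, so it propagates immediately.
        interval = min(interval * _POLL_BACKOFF_FACTOR, _POLL_MAX_INTERVAL)


def scripted_poll(outcomes):
    """Return (poll, calls). poll replays outcomes; the last one repeats."""
    calls = []
    remaining = list(outcomes)

    async def poll():
        calls.append(1)
        outcome = remaining.pop(0) if len(remaining) > 1 else remaining[0]
        if isinstance(outcome, Exception):
            raise outcome
        return outcome

    return poll, calls
```

With this shape, `asyncio.run(wait_sketch(poll, interval=0.001))` against a poll that drops once and then reports `COMPLETED` returns normally after two polls, while a poll that always raises `HTTPStatusError` fails on the first call.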

Refs AE-3154.

Test plan

  • make quality-check passes (all tests + lint/format, coverage 85.45%).
  • New unit tests in tests/unit/test_endpoint_client.py::TestEndpointJobWaitTransientErrors:
    • transient RemoteProtocolError once, then COMPLETED → wait() returns normally (2 polls).
    • persistent RemoteProtocolError → wait() re-raises after _POLL_MAX_CONSECUTIVE_ERRORS polls.
    • error / success / error burst / success / COMPLETED — counter resets, wait() completes.
    • HTTPStatusError(401) is NOT swallowed; re-raised on first call.
    • RemoteProtocolError forever + timeout=0.1 → wait() raises TimeoutError, not the httpx error.
  • Manual smoke against an endpoint with cold workers (vLLM): await job.wait() survives mid-poll TCP drops instead of aborting.
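
For reviewers who want to see the shape of the counter-reset case, here is a hedged sketch of how such a test can script poll outcomes with `unittest.mock.AsyncMock`. It is not the code in tests/unit/test_endpoint_client.py: `FakeTransportError` and `poll_until_completed` are hypothetical stand-ins, and the `IN_PROGRESS` status name is illustrative.

```python
import asyncio
from unittest.mock import AsyncMock


class FakeTransportError(Exception):
    """Hypothetical stand-in for httpx.RemoteProtocolError."""


MAX_CONSECUTIVE_ERRORS = 5  # mirrors _POLL_MAX_CONSECUTIVE_ERRORS in this PR


async def poll_until_completed(status):
    """Simplified stand-in for wait(): no sleep, backoff, or deadline."""
    consecutive_errors = 0
    while True:
        try:
            result = await status()
        except FakeTransportError:
            consecutive_errors += 1
            if consecutive_errors >= MAX_CONSECUTIVE_ERRORS:
                raise
            continue
        consecutive_errors = 0  # any successful poll resets the counter
        if result == "COMPLETED":
            return result


def test_counter_resets_between_error_bursts():
    err = FakeTransportError("connection dropped")
    # error / success / error burst / success / COMPLETED.
    # Five errors in total, but never more than four in a row.
    status = AsyncMock(side_effect=[
        err, "IN_PROGRESS", err, err, err, err, "IN_PROGRESS", "COMPLETED",
    ])
    assert asyncio.run(poll_until_completed(status)) == "COMPLETED"
    assert status.call_count == 8
```

The scripted sequence contains five errors overall, so the test only passes if the counter is reset by the intervening successful polls.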

EndpointJob.wait() previously aborted on a single httpx.RemoteProtocolError
(or any other transient transport/timeout failure) raised by the Runpod
/v2/{id}/status/{job_id} poll, even though the underlying job was still
healthy. Multi-minute cold starts amplify this: one dropped poll fails a
five-minute wait that was nearly complete.

Catch httpx.TransportError and httpx.TimeoutException inside the polling
loop, log at debug, apply the existing exponential backoff, and continue.
Re-raise only when:
  - the user-supplied timeout deadline is exceeded (TimeoutError), or
  - _POLL_MAX_CONSECUTIVE_ERRORS (5) consecutive failures hit, so dead
    endpoints still fail loud.

The counter resets on any successful poll. httpx.HTTPStatusError (4xx
auth/config bugs) is intentionally NOT caught — it propagates immediately.

Refs AE-3154.
@promptless

promptless Bot commented May 25, 2026

Promptless prepared a documentation update related to this change.

Triggered by runpod/flash PR #340

Documents that the Flash SDK's EndpointJob.wait() method now automatically retries transient network errors (connection drops, timeouts, protocol errors) with exponential backoff, making it resilient during cold starts. HTTP status errors like 401 and 404 fail immediately without retry.

Review: Document EndpointJob.wait() retry behavior for transient errors

Contributor

Copilot AI left a comment

Pull request overview

This PR improves the resilience of EndpointJob.wait() polling by tolerating transient httpx transport and timeout errors during job status checks, rather than aborting the entire wait on a single dropped connection.

Changes:

  • Add transient-error retry handling to EndpointJob.wait() with exponential backoff and a maximum consecutive-error threshold.
  • Introduce _POLL_MAX_CONSECUTIVE_ERRORS to cap tolerated consecutive transient failures.
  • Add unit tests covering transient error retry, threshold behavior, counter reset, and HTTPStatusError propagation.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

File | Description
src/runpod_flash/endpoint.py | Adds retry/backoff logic in EndpointJob.wait() for transient httpx transport/timeout errors with a consecutive-error threshold.
tests/unit/test_endpoint_client.py | Adds unit tests validating retry behavior and error propagation for EndpointJob.wait().

Comment thread tests/unit/test_endpoint_client.py Outdated
The previous test passed `timeout=0.1` against the default
`_POLL_INITIAL_INTERVAL=0.25`, so `wait()` raised `TimeoutError` from
its pre-sleep deadline guard before ever calling `status()`. The retry
path was never exercised — the test only validated the pre-sleep guard.

Apply the `fast_poll` fixture and lift the consecutive-error threshold
above the number of retries the deadline allows, so multiple httpx
errors are actually suppressed before the deadline trips. Assert
`_api_get.call_count >= 2` to lock in that the retry path runs.

Surfaced by Copilot review on AE-3154.
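
The arithmetic behind this fix can be checked with a virtual clock. The sketch below assumes a guard-then-sleep-then-poll loop whose guard compares the upcoming sleep against the remaining time, a backoff factor of 2, and polls that take no time. `polls_before_deadline` is a hypothetical helper, and the 1 ms interval is a placeholder because the `fast_poll` values are not shown in this thread.

```python
def polls_before_deadline(timeout, initial_interval, backoff=2.0):
    """Count status() calls a guard-then-sleep-then-poll loop makes
    before the pre-sleep deadline guard raises TimeoutError.

    Uses a virtual clock and assumes each poll itself takes no time.
    """
    elapsed = 0.0
    interval = initial_interval
    polls = 0
    while elapsed + interval <= timeout:  # pre-sleep guard passes
        elapsed += interval               # sleep
        polls += 1                        # status() call
        interval *= backoff               # exponential backoff (assumed x2)
    return polls


# Old test: the first sleep (0.25 s) already overshoots the 0.1 s deadline,
# so the guard fires before any poll.
print(polls_before_deadline(0.1, 0.25))   # 0

# Hypothetical fast interval of 1 ms: sleeps of 1, 2, 4, 8, 16, 32 ms fit.
print(polls_before_deadline(0.1, 0.001))  # 6
```

Under these assumptions a 1 ms interval allows six polls, more than the default threshold of 5, so the httpx error would be re-raised before the deadline unless the threshold is lifted as this commit does.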
@deanq deanq requested review from KAJdev, jhcipar and runpod-Henrik May 25, 2026 04:27