Skip to content

fix(httpserver): quiet client-disconnect log path, return 499#1288

Open
sufubao wants to merge 7 commits intoModelTC:mainfrom
sufubao:httpserver-abort-on-disconnect
Open

fix(httpserver): quiet client-disconnect log path, return 499#1288
sufubao wants to merge 7 commits intoModelTC:mainfrom
sufubao:httpserver-abort-on-disconnect

Conversation

@sufubao
Copy link
Copy Markdown
Collaborator

@sufubao sufubao commented May 6, 2026

Summary

  • Add ClientDisconnected exception (lightllm/utils/error_utils.py) and raise it from the polling loops in both httpserver/manager.py and httpserver_for_pd_master/manager.py instead of generic Exception("... disconnected").
  • Catch ClientDisconnected ahead of the generic handler in HttpServerManager.generate: log a single WARNING line and re-raise without calling self.abort again (the disconnect path already aborted).
  • Catch ClientDisconnected in the FastAPI route handlers (/generate, /generate_stream, /v1/chat/completions, /v1/completions, /v1/messages) and return HTTP 499 instead of falling through to the generic Exception handler that emits 417 + a traceback.
  • Catch ClientDisconnected in the SSE wrapper (api_openai._safe_stream_wrapper) and exit quietly so a closed client doesn't trigger a stream-tail traceback.

Context

The underlying root cause — BaseHTTPMiddleware's access-log decorator swallowing http.disconnect so Request.is_disconnected() never flipped — was already fixed upstream in #1282 (the ASGI-class _AccessLogMiddleware). Once disconnect detection actually fires, the existing code path raises a generic Exception that falls through every layer's logger; this PR is the cleanup that follows from that fix.

Server logs (before vs after)

Before

WARNING 05-06 02:51:38 [manager.py:767] aborted group_request_id 3912
ERROR 05-06 02:51:38 [manager.py:449] group_request_id: 3912 has exception req_id 3912 disconnected
WARNING 05-06 02:51:38 [manager.py:767] aborted group_request_id 3912
ERROR 05-06 02:51:38 [api_http.py:248] An error occurred: req_id 3912 disconnected
ERROR 05-06 02:51:38 [api_http.py:248] Traceback (most recent call last):
ERROR 05-06 02:51:38 [api_http.py:248]   File "/lightllm/lightllm/server/api_http.py", line 241, in generate
ERROR 05-06 02:51:38 [api_http.py:248]     return await g_objs.g_generate_func(request, g_objs.httpserver_manager)
ERROR 05-06 02:51:38 [api_http.py:248]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 05-06 02:51:38 [api_http.py:248]   File "/lightllm/lightllm/server/api_lightllm.py", line 56, in lightllm_generate
ERROR 05-06 02:51:38 [api_http.py:248]     async for sub_req_id, request_output, metadata, finish_status in results_generator:
ERROR 05-06 02:51:38 [api_http.py:248]   File "/lightllm/lightllm/server/httpserver/manager.py", line 457, in generate
ERROR 05-06 02:51:38 [api_http.py:248]     raise e
ERROR 05-06 02:51:38 [api_http.py:248]   File "/lightllm/lightllm/server/httpserver/manager.py", line 440, in generate
ERROR 05-06 02:51:38 [api_http.py:248]     async for sub_req_id, request_output, metadata, finish_status in results_generator:
ERROR 05-06 02:51:38 [api_http.py:248]   File "/lightllm/lightllm/server/httpserver/manager.py", line 667, in _wait_to_token_package
ERROR 05-06 02:51:38 [api_http.py:248]     raise Exception(f"req_id {group_request_id} disconnected")
ERROR 05-06 02:51:38 [api_http.py:248] Exception: req_id 3912 disconnected

After

WARNING 05-06 02:57:24 [manager.py:774] aborted group_request_id 1792
WARNING 05-06 02:57:24 [manager.py:453] group_request_id: 1792 client disconnected
INFO 05-06 02:57:24 [api_http.py:149] POST /generate 499

Test plan

  • python -c "import ast; [ast.parse(open(f).read()) for f in (...)]" passes on all five touched files.
  • Manual: hit /generate with a long-running prompt, close the client mid-stream, confirm server logs match the "After" block above and that no traceback is emitted.
  • Manual: same drill against a streaming endpoint (/v1/chat/completions with stream=true) — confirm SSE wrapper exits without dumping a traceback.
  • Manual: PD-master mode — disconnect during prefill/decode and confirm pd-master manager raises ClientDisconnected instead of Exception.

Introduce ClientDisconnected exception class and use it instead of a
generic Exception when the polling loop sees the client gone or another
module aborted the request. Special-case it in HttpServerManager.generate
(log a one-liner, skip the duplicate self.abort) and in the FastAPI route
handlers + SSE wrapper (return 499, swallow at the stream tail). This
collapses the prior aborted-request log from a multi-line Python
traceback to a single WARNING line.
Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a custom ClientDisconnected exception to handle client disconnects gracefully across the server. It updates the HTTP API handlers to return a 499 status code and modifies the server managers to catch this exception for quiet logging, replacing generic exceptions previously used for disconnects. The review feedback correctly identifies that in HttpServerManagerForPDMaster, the abort() method should be explicitly called before raising ClientDisconnected to ensure immediate resource release on remote nodes, maintaining consistency with the HttpServerManager implementation.

… debug

Address review feedback on ModelTC#1288:
- HttpServerManagerForPDMaster: call self.abort(p_node, d_node) before raising
  ClientDisconnected at every detection site so remote P/D nodes release
  resources immediately, matching HttpServerManager's behavior.
- Add except ClientDisconnected branch in HttpServerManagerForPDMaster.generate
  so disconnects no longer fall through the generic BaseException handler that
  logs ERROR + traceback.
- Add logger.debug(..., exc_info=True) in both managers' ClientDisconnected
  handlers so ops can pinpoint which phase tripped the disconnect when DEBUG
  logging is enabled.
@ModelTC ModelTC deleted a comment from gemini-code-assist Bot May 6, 2026
@ModelTC ModelTC deleted a comment from gemini-code-assist Bot May 6, 2026
@ModelTC ModelTC deleted a comment from gemini-code-assist Bot May 6, 2026
@ModelTC ModelTC deleted a comment from gemini-code-assist Bot May 6, 2026
@ModelTC ModelTC deleted a comment from gemini-code-assist Bot May 6, 2026
sufubao added 5 commits May 6, 2026 17:22
…vent waits

Address further review feedback on ModelTC#1288:
- api_http.get_score: add except ClientDisconnected branch before the generic
  handler so a client disconnect returns 499 instead of 417, matching the
  /generate behavior. lightllm_get_score also drives httpserver_manager.generate
  so it can raise ClientDisconnected the same way.
- HttpServerManagerForPDMaster: add _wait_event_with_disconnect_check helper
  that waits on an asyncio.Event with the original 60s budget but polls
  request.is_disconnected() every 1s, aborting P/D nodes immediately on
  disconnect. Use it at the three asyncio.wait_for(event.wait(), timeout=60)
  sites (fetch_stream upkv wait, fetch_nixl_stream np-prompt-ids wait, and
  fetch_nixl_stream upkv wait). Previously a disconnect during these waits
  could leave PD nodes holding work for up to 60s.
…sconnects

req_status.aborted can be set by internal module failures (e.g. the visual
proxy marks shm_req.is_aborted = True on exceptions, which the recycle loop
converts to req_status.aborted), not just by client disconnects. The earlier
quiet-disconnect refactor mapped that branch to ClientDisconnected, which
caused real server-side failures to be swallowed as 499 in the API handlers.

Restore the pre-refactor behavior for that branch: raise a generic Exception
so it falls through HttpServerManager.generate's existing handler (logs ERROR
+ releases multimodal resources) and surfaces to the client as 417 instead
of 499. ClientDisconnected is now reserved for request.is_disconnected()
paths only; tighten its docstring to reflect that.
… preload

Previously fetch_resource raised a generic Exception on client disconnect
during URL download, which AudioItem.preload / ImageItem.preload then wrapped
into ValueError. That happened before HttpServerManager.generate's polling
loop, so a real request.is_disconnected() during multimodal download took the
noisy error path (logger.error + 400/417) instead of the new quiet
ClientDisconnected/499 contract.

- multimodal_utils.fetch_resource: raise ClientDisconnected on disconnect.
- multimodal_params.{AudioItem,ImageItem}.preload: re-raise ClientDisconnected
  before the generic except branch so it isn't wrapped as ValueError.
- api_http /tokens: add except ClientDisconnected -> 499 ahead of the generic
  handler (verify_and_preload runs there too).
- error_utils.ClientDisconnected: make group_request_id optional (default
  None) so call sites without that context, like fetch_resource, can raise it.
The exception message and PR commit history (39a4e80) carry the same
rationale; the comment was redundant per review feedback.
针对 review 中 Q4/Q5 的疑问加注释,把两处修复的旧 bug 写清楚:
- _wait_event_with_disconnect_check: 解释原先 60s wait_for 的盲区,
  以及现在按 poll_interval 切片轮询断连后立刻 abort + raise
  ClientDisconnected 的语义,最坏占资源时间从 60s 降到 ~1s。
- fetch_stream 第一处 abort + raise 站点(作为同模式的代表性入口):
  解释原先 raise Exception 只让 master 退出、P/D 节点继续占资源的
  bug,以及为什么必须先 self.abort(p_node, d_node) 发 ABORT_REQ
  再 raise ClientDisconnected。
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant