fix(httpserver): quiet client-disconnect log path, return 499#1288
Open
sufubao wants to merge 7 commits intoModelTC:mainfrom
Open
fix(httpserver): quiet client-disconnect log path, return 499#1288sufubao wants to merge 7 commits intoModelTC:mainfrom
sufubao wants to merge 7 commits intoModelTC:mainfrom
Conversation
Introduce ClientDisconnected exception class and use it instead of a generic Exception when the polling loop sees the client gone or another module aborted the request. Special-case it in HttpServerManager.generate (log a one-liner, skip the duplicate self.abort) and in the FastAPI route handlers + SSE wrapper (return 499, swallow at the stream tail). This collapses the prior aborted-request log from a multi-line Python traceback to a single WARNING line.
Contributor
There was a problem hiding this comment.
Code Review
This pull request introduces a custom ClientDisconnected exception to handle client disconnects gracefully across the server. It updates the HTTP API handlers to return a 499 status code and modifies the server managers to catch this exception for quiet logging, replacing generic exceptions previously used for disconnects. The review feedback correctly identifies that in HttpServerManagerForPDMaster, the abort() method should be explicitly called before raising ClientDisconnected to ensure immediate resource release on remote nodes, maintaining consistency with the HttpServerManager implementation.
… debug Address review feedback on ModelTC#1288: - HttpServerManagerForPDMaster: call self.abort(p_node, d_node) before raising ClientDisconnected at every detection site so remote P/D nodes release resources immediately, matching HttpServerManager's behavior. - Add except ClientDisconnected branch in HttpServerManagerForPDMaster.generate so disconnects no longer fall through the generic BaseException handler that logs ERROR + traceback. - Add logger.debug(..., exc_info=True) in both managers' ClientDisconnected handlers so ops can pinpoint which phase tripped the disconnect when DEBUG logging is enabled.
…vent waits Address further review feedback on ModelTC#1288: - api_http.get_score: add except ClientDisconnected branch before the generic handler so a client disconnect returns 499 instead of 417, matching the /generate behavior. lightllm_get_score also drives httpserver_manager.generate so it can raise ClientDisconnected the same way. - HttpServerManagerForPDMaster: add _wait_event_with_disconnect_check helper that waits on an asyncio.Event with the original 60s budget but polls request.is_disconnected() every 1s, aborting P/D nodes immediately on disconnect. Use it at the three asyncio.wait_for(event.wait(), timeout=60) sites (fetch_stream upkv wait, fetch_nixl_stream np-prompt-ids wait, and fetch_nixl_stream upkv wait). Previously a disconnect during these waits could leave PD nodes holding work for up to 60s.
…sconnects req_status.aborted can be set by internal module failures (e.g. the visual proxy marks shm_req.is_aborted = True on exceptions, which the recycle loop converts to req_status.aborted), not just by client disconnects. The earlier quiet-disconnect refactor mapped that branch to ClientDisconnected, which caused real server-side failures to be swallowed as 499 in the API handlers. Restore the pre-refactor behavior for that branch: raise a generic Exception so it falls through HttpServerManager.generate's existing handler (logs ERROR + releases multimodal resources) and surfaces to the client as 417 instead of 499. ClientDisconnected is now reserved for request.is_disconnected() paths only; tighten its docstring to reflect that.
… preload
Previously fetch_resource raised a generic Exception on client disconnect
during URL download, which AudioItem.preload / ImageItem.preload then wrapped
into ValueError. That happened before HttpServerManager.generate's polling
loop, so a real request.is_disconnected() during multimodal download took the
noisy error path (logger.error + 400/417) instead of the new quiet
ClientDisconnected/499 contract.
- multimodal_utils.fetch_resource: raise ClientDisconnected on disconnect.
- multimodal_params.{AudioItem,ImageItem}.preload: re-raise ClientDisconnected
before the generic except branch so it isn't wrapped as ValueError.
- api_http /tokens: add except ClientDisconnected -> 499 ahead of the generic
handler (verify_and_preload runs there too).
- error_utils.ClientDisconnected: make group_request_id optional (default
None) so call sites without that context, like fetch_resource, can raise it.
The exception message and PR commit history (39a4e80) carry the same rationale; the comment was redundant per review feedback.
针对 review 中 Q4/Q5 的疑问加注释,把两处修复的旧 bug 写清楚: - _wait_event_with_disconnect_check: 解释原先 60s wait_for 的盲区, 以及现在按 poll_interval 切片轮询断连后立刻 abort + raise ClientDisconnected 的语义,最坏占资源时间从 60s 降到 ~1s。 - fetch_stream 第一处 abort + raise 站点(作为同模式的代表性入口): 解释原先 raise Exception 只让 master 退出、P/D 节点继续占资源的 bug,以及为什么必须先 self.abort(p_node, d_node) 发 ABORT_REQ 再 raise ClientDisconnected。
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
ClientDisconnectedexception (lightllm/utils/error_utils.py) and raise it from the polling loops in bothhttpserver/manager.pyandhttpserver_for_pd_master/manager.pyinstead of genericException("... disconnected").ClientDisconnectedahead of the generic handler inHttpServerManager.generate: log a single WARNING line and re-raise without callingself.abortagain (the disconnect path already aborted).ClientDisconnectedin the FastAPI route handlers (/generate,/generate_stream,/v1/chat/completions,/v1/completions,/v1/messages) and return HTTP 499 instead of falling through to the genericExceptionhandler that emits 417 + a traceback.ClientDisconnectedin the SSE wrapper (api_openai._safe_stream_wrapper) and exit quietly so a closed client doesn't trigger a stream-tail traceback.Context
The underlying root cause —
BaseHTTPMiddleware's access-log decorator swallowinghttp.disconnectsoRequest.is_disconnected()never flipped — was already fixed upstream in #1282 (the ASGI-class_AccessLogMiddleware). Once disconnect detection actually fires, the existing code path raises a genericExceptionthat falls through every layer's logger; this PR is the cleanup that follows from that fix.Server logs (before vs after)
Before
After
Test plan
python -c "import ast; [ast.parse(open(f).read()) for f in (...)]"passes on all five touched files./generatewith a long-running prompt, close the client mid-stream, confirm server logs match the "After" block above and that no traceback is emitted./v1/chat/completionswithstream=true) — confirm SSE wrapper exits without dumping a traceback.ClientDisconnectedinstead ofException.