Logging colorization + access middleware cleanup + windowed cache stats #1289
sufubao wants to merge 7 commits into ModelTC:main from …
Conversation
…request logs

- Add ANSI color codes to log level names (TTY only; plain in files)
- Introduce SystemStatusReporter with windowed prefix-cache hit rate alongside the global rate, plus a more compact status line
- Drop gunicorn --access-logfile flags (FastAPI middleware now handles it)
- Remove duplicate _ACCESS_LOG_STATUS_COLORS declaration in api_http.py
- Downgrade noisy per-request / per-batch progress logs from INFO to DEBUG
- Fix flake8 F841 (unused exception variable) in detokenization manager
… 400

Reject prompts whose character length exceeds max_req_total_len * 8 before tokenization, so a long string can no longer reach the tokenizer and stall the loop. The raised ValueError is caught one level up: log it at WARNING, release any held multimodal resources, abort the in-flight group request, and re-raise so the API layer (which already maps ValueError to HTTP 400) returns a graceful error to the client instead of a 500.
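A hedged sketch of that precheck; the function and constant names below are illustrative, not LightLLM's actual ones:

```python
# Illustrative precheck (names are hypothetical): runs before tokenization
# so an oversized string never reaches the tokenizer.
MAX_CHARS_PER_TOKEN = 8  # heuristic upper bound on characters per token

def check_prompt_length(prompt: str, max_req_total_len: int) -> None:
    limit = max_req_total_len * MAX_CHARS_PER_TOKEN
    if len(prompt) > limit:
        # Caught one level up: logged at WARNING, multimodal resources
        # released, the group request aborted, then re-raised so the API
        # layer maps it to HTTP 400.
        raise ValueError(
            f"prompt is {len(prompt)} chars, exceeding the limit of {limit} "
            f"({max_req_total_len} * {MAX_CHARS_PER_TOKEN})"
        )
```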
Code Review
This pull request introduces a SystemStatusReporter to provide detailed periodic metrics including TPS, request counts, KV cache utilization, and prefix cache hit rates. It also refines logging by implementing ANSI color support, adjusting log levels from info to debug for high-frequency events, and removing redundant access logs. Additionally, a prompt length validation check was added to the HTTP server, and CUDA graph capturing now features progress bars with memory usage information. Feedback identifies a potential UnboundLocalError in the HTTP server's error handling and notes a missing update to router_statics in the router manager, which is critical for scheduling accuracy.
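The ANSI color support typically hinges on a TTY check so files stay clean; a minimal sketch (illustrative, not LightLLM's exact formatter):

```python
import logging
import sys

# ANSI escape codes per level; applied only when stderr is a TTY,
# so redirected/file logs stay free of escape sequences.
_LEVEL_COLORS = {
    "DEBUG": "\033[36m",    # cyan
    "INFO": "\033[32m",     # green
    "WARNING": "\033[33m",  # yellow
    "ERROR": "\033[31m",    # red
}
_RESET = "\033[0m"

class ColorFormatter(logging.Formatter):
    def format(self, record: logging.LogRecord) -> str:
        original = record.levelname
        if sys.stderr.isatty():
            color = _LEVEL_COLORS.get(original, "")
            record.levelname = f"{color}{original}{_RESET}"
        try:
            return super().format(record)
        finally:
            # Restore so other handlers (e.g. a file handler) see plain text.
            record.levelname = original
```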
```python
self.status_reporter.on_request_completed(
    input_len=req.input_len,
    output_len=req.shm_cur_output_len,
    cache_len=req.prompt_cache_len,
    mtp_accepted=req.mtp_accepted_token_num,
)
```
The router_statics.update() call was removed from Batch.filter_out_finished_req but not added here. This prevents the EMA of output lengths from being updated, which is critical for accurate scheduling decisions. You should call it when a request completes to maintain correct system load estimation.
Suggested change:

```diff
 self.status_reporter.on_request_completed(
     input_len=req.input_len,
     output_len=req.shm_cur_output_len,
     cache_len=req.prompt_cache_len,
     mtp_accepted=req.mtp_accepted_token_num,
 )
+self.router_statics.update(req.shm_cur_output_len)
```
Good catch — confirmed regression. Upstream's Batch.filter_out_finished_req(shm_req_manager, router_statics) called router_statics.update(req.candetoken_out_len) per finished req, and the refactor that hoisted the loop into _filter_reqs_from_running_batch only added status_reporter.on_request_completed(...) and dropped the EMA update. Since the schedulers (chunked_prefill/impl.py, beam_impl.py, impl_for_pd_decode.py, impl_for_nixl_pd.py, …) read self.router.router_statics.ema_req_out_len for KV-budget estimation, leaving it stale was a real correctness hit.
Restored in 3218df5 — kept req.candetoken_out_len (matching upstream's exact value) for the EMA update, alongside the new on_request_completed call which uses shm_cur_output_len for the windowed/global throughput counters.
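For reference, a simplified sketch of the restored loop shape (the `is_aborted` guard comes from the follow-up fix discussed below):

```python
# Inside Router._filter_reqs_from_running_batch (simplified sketch).
for req in finished_reqs:
    self.status_reporter.on_request_completed(
        input_len=req.input_len,
        output_len=req.shm_cur_output_len,  # feeds windowed/global TPS counters
        cache_len=req.prompt_cache_len,
        mtp_accepted=req.mtp_accepted_token_num,
    )
    if not req.is_aborted:
        # Matches upstream: feed candetoken_out_len into the EMA the
        # schedulers read for KV-budget estimation; skipping aborted reqs
        # keeps short partial outputs from biasing it.
        self.router_statics.update(req.candetoken_out_len)
```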
The try/except keeps the detokenization loop alive when a single recv_obj is malformed, but the failed request currently hangs the originating client until it disconnects. Note the missing piece: publish an ERROR finish_status back to httpserver so the client gets a 500 right away.
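A sketch of the suggested shape; `decode_one` and `mark_request_failed` are hypothetical stand-ins for the real detokenization and finish-status plumbing:

```python
import logging

logger = logging.getLogger(__name__)

def decode_one(recv_obj):  # hypothetical: detokenize and forward to httpserver
    ...

def mark_request_failed(recv_obj):  # hypothetical: publish an ERROR finish_status
    ...

def handle_recv_objs(recv_objs):
    for recv_obj in recv_objs:
        try:
            decode_one(recv_obj)
        except Exception:
            # Keeps the loop alive when a single recv_obj is malformed...
            logger.exception("failed to detokenize recv_obj")
            # ...and publishes the failure so httpserver can return a 500
            # immediately instead of leaving the client hanging.
            mark_request_failed(recv_obj)
```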
The earlier refactor that moved the finished-req loop out of Batch.filter_out_finished_req into Router._filter_reqs_from_running_batch forgot to keep the router_statics.update(candetoken_out_len) call, freezing ema_req_out_len at its initial value. Multiple schedulers (chunked_prefill, beam, pd_decode, nixl_pd) read that EMA for KV-budget estimation, so leaving it stale degraded scheduling accuracy.
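For context, `ema_req_out_len` is a plain exponential moving average; a minimal sketch with an illustrative smoothing factor:

```python
class RouterStatics:
    """Illustrative EMA of finished-request output lengths."""

    def __init__(self, alpha: float = 0.1, initial: float = 512.0):
        self.alpha = alpha
        self.ema_req_out_len = initial  # schedulers read this for KV budgets

    def update(self, out_len: int) -> None:
        # Without this call per finished request, the EMA freezes at
        # `initial`, which is exactly the regression described above.
        self.ema_req_out_len = (
            self.alpha * out_len + (1 - self.alpha) * self.ema_req_out_len
        )
```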
Two correctness fixes flagged in PR review:

1. count_output_tokens(len(running_batch.reqs)) once per router loop is wrong: the router loop polls on schedule_time_interval, decoupled from inference, so this overcounts when the loop is faster than decode and undercounts when slower, and it includes paused / prefill-only reqs. Track shm_cur_output_len per request and accumulate the delta each tick (with a tail settlement when the req is filtered out, so we don't lose its last tokens to the post-final-tick window).

2. on_request_completed() and router_statics.update() now both run for aborted requests, whose candetoken_out_len is a short partial value. Restore the prior `if not req.is_aborted` guard so disconnects don't bias the output-length EMA used by KV-budget estimators.
Move the per-running-req shm_cur_output_len delta tracking from the router tick (~33 Hz) into SystemStatusReporter.maybe_print, which only runs once per log_stats_interval (>= 5s). The reporter now owns the per-req snapshot dict and exposes discard_req(req) for tail settlement when a req leaves the running batch, so the router loop's hot path no longer walks the batch every schedule cycle. Output TPS accuracy is unchanged: still based on real shm_cur_output_len deltas, with tail tokens settled at completion.
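A simplified sketch of that design; field names like `request_id` and the interval handling are illustrative:

```python
import time

class SystemStatusReporter:
    def __init__(self, log_stats_interval: float = 5.0):
        self.interval = log_stats_interval
        self._last_print = time.monotonic()
        self._out_len_snapshot = {}  # req id -> last seen shm_cur_output_len
        self._window_tokens = 0

    def maybe_print(self, running_reqs):
        # Deltas are accumulated only when a print is due, so the router's
        # ~33 Hz hot path never walks the batch.
        now = time.monotonic()
        if now - self._last_print < self.interval:
            return
        for req in running_reqs:
            prev = self._out_len_snapshot.get(req.request_id, 0)
            self._window_tokens += req.shm_cur_output_len - prev
            self._out_len_snapshot[req.request_id] = req.shm_cur_output_len
        tps = self._window_tokens / (now - self._last_print)
        print(f"output TPS: {tps:.1f}")  # stand-in for the compact status line
        self._window_tokens = 0
        self._last_print = now

    def discard_req(self, req):
        # Tail settlement: credit tokens generated since the last tick so a
        # req leaving the running batch doesn't lose its final tokens.
        prev = self._out_len_snapshot.pop(req.request_id, 0)
        self._window_tokens += req.shm_cur_output_len - prev
```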
Code review

No issues found. Checked for bugs and CLAUDE.md compliance.

🤖 Generated with Claude Code
Code review (correction)

Previous comment missed one P1. Found 1 issue:

LightLLM/lightllm/server/detokenization/manager.py, lines 78 to 88 in a40264c

For reference, the can-release contract:

LightLLM/lightllm/server/core/objs/req.py, lines 100 to 108 in a40264c
LightLLM/lightllm/server/core/objs/req.py, lines 294 to 309 in a40264c

🤖 Generated with Claude Code
Code review (additional)

Found 1 more issue:

config_server_start removes the gunicorn flag:
LightLLM/lightllm/server/api_start.py, lines 601 to 613 in a40264c

config server FastAPI app has no middleware:
LightLLM/lightllm/server/config_server/api_http.py, lines 18 to 25 in a40264c

middleware is only attached to the main app:
LightLLM/lightllm/server/api_http.py, lines 149 to 153 in a40264c

🤖 Generated with Claude Code
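A sketch of what closing that gap could look like; `AccessLogMiddleware` below is a minimal stand-in for the class-based middleware already attached to the main app, not its actual implementation:

```python
import time

from fastapi import FastAPI
from starlette.middleware.base import BaseHTTPMiddleware
from starlette.requests import Request

class AccessLogMiddleware(BaseHTTPMiddleware):
    """Minimal stand-in for the main app's access-log middleware."""

    async def dispatch(self, request: Request, call_next):
        start = time.monotonic()
        response = await call_next(request)
        elapsed_ms = (time.monotonic() - start) * 1000
        client = request.client.host if request.client else "-"
        print(f'{client} "{request.method} {request.url.path}" '
              f"{response.status_code} {elapsed_ms:.1f}ms")
        return response

app = FastAPI()
app.add_middleware(AccessLogMiddleware)  # the piece missing on the config server app
```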
Code review

Found 1 issue (re-review of …):

LightLLM/lightllm/server/detokenization/manager.py, lines 69 to 74 in 5ef655a
LightLLM/lightllm/server/core/objs/req.py, lines 104 to 110 in 5ef655a
LightLLM/lightllm/server/core/objs/req.py, lines 297 to 311 in 5ef655a

🤖 Generated with Claude Code
Summary
Two small, independent improvements layered on top of the recent HTTP API consolidation (#1282):
- A `SystemStatusReporter` that prints a compact status line with a windowed prefix-cache hit rate alongside the global one (sketched below); removal of the gunicorn `--access-logfile` flags (the FastAPI access-log middleware introduced in "feat(api): consolidate HTTP API endpoints and fixes" (#1282) already covers this); removal of a duplicate `_ACCESS_LOG_STATUS_COLORS` declaration in `api_http.py`; and a sweep of noisy per-request / per-batch progress logs from INFO down to DEBUG. Also clears one flake8 F841 in the detokenization manager.
- A prompt-size guard: reject prompts longer than `max_req_total_len * 8` chars before tokenization, and route the resulting `ValueError` through a dedicated `except` branch in `HttpServerManager.generate` that logs at WARNING, releases multimodal resources, aborts the in-flight group request, and re-raises so the API layer (which already maps `ValueError` → 400) returns a clean error instead of stalling the tokenizer or surfacing a 500.

No behavior changes outside of log levels, status-line formatting, and the new prompt size cap. The class-based access-log middleware and the OpenAI-style error envelope from earlier internal commits are intentionally not re-applied here; they already landed in #1282.
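As referenced in the summary, a minimal sketch of tracking a windowed hit rate alongside the global one (deque-based; the class and window size are illustrative):

```python
from collections import deque

class PrefixCacheStats:
    """Illustrative global + sliding-window prefix-cache hit rate."""

    def __init__(self, window: int = 1000):
        self.total_tokens = 0
        self.total_hits = 0
        self.recent = deque(maxlen=window)  # (input_len, cache_len) per request

    def on_request(self, input_len: int, cache_len: int) -> None:
        self.total_tokens += input_len
        self.total_hits += cache_len
        self.recent.append((input_len, cache_len))

    def rates(self) -> tuple[float, float]:
        global_rate = self.total_hits / max(self.total_tokens, 1)
        w_tokens = sum(i for i, _ in self.recent)
        w_hits = sum(c for _, c in self.recent)
        return global_rate, w_hits / max(w_tokens, 1)
```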
Test plan
- `python -m py_compile` on every touched file (run locally; passes)
- Watch the new status line print on a running server, and verify `--disable_log_stats` suppresses it
- Send a prompt longer than `max_req_total_len * 8` chars to `/generate` and `/v1/chat/completions` and verify a 400 with the OpenAI-style error envelope (no 500, no hang)