Logging colorization + access middleware cleanup + windowed cache stats #1289
sufubao wants to merge 7 commits into ModelTC:main from …
Conversation
…request logs

- Add ANSI color codes to log level names (TTY only; plain in files)
- Introduce SystemStatusReporter with windowed prefix-cache hit rate alongside the global rate, plus a more compact status line
- Drop gunicorn --access-logfile flags (FastAPI middleware now handles it)
- Remove duplicate _ACCESS_LOG_STATUS_COLORS declaration in api_http.py
- Downgrade noisy per-request / per-batch progress logs from INFO to DEBUG
- Fix flake8 F841 (unused exception variable) in detokenization manager
… 400

Reject prompts whose character length exceeds max_req_total_len * 8 before tokenization, so a long string can no longer reach the tokenizer and stall the loop. The raised ValueError is caught one level up: log it at WARNING, release any held multimodal resources, abort the in-flight group request, and re-raise so the API layer (which already maps ValueError to HTTP 400) returns a graceful error to the client instead of a 500.
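A hedged sketch of that precheck; the function and constant names below are illustrative, not LightLLM's actual ones:

```python
# Illustrative precheck (names are hypothetical): runs before tokenization
# so an oversized string never reaches the tokenizer.
MAX_CHARS_PER_TOKEN = 8  # heuristic upper bound on characters per token

def check_prompt_length(prompt: str, max_req_total_len: int) -> None:
    limit = max_req_total_len * MAX_CHARS_PER_TOKEN
    if len(prompt) > limit:
        # Caught one level up: logged at WARNING, multimodal resources
        # released, the group request aborted, then re-raised so the API
        # layer maps it to HTTP 400.
        raise ValueError(
            f"prompt is {len(prompt)} chars, exceeding the limit of {limit} "
            f"({max_req_total_len} * {MAX_CHARS_PER_TOKEN})"
        )
```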
Code Review
This pull request introduces a SystemStatusReporter to provide detailed periodic metrics including TPS, request counts, KV cache utilization, and prefix cache hit rates. It also refines logging by implementing ANSI color support, adjusting log levels from info to debug for high-frequency events, and removing redundant access logs. Additionally, a prompt length validation check was added to the HTTP server, and CUDA graph capturing now features progress bars with memory usage information. Feedback identifies a potential UnboundLocalError in the HTTP server's error handling and notes a missing update to router_statics in the router manager, which is critical for scheduling accuracy.
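The ANSI color support typically hinges on a TTY check so files stay clean; a minimal sketch (illustrative, not LightLLM's exact formatter):

```python
import logging
import sys

# ANSI escape codes per level; applied only when stderr is a TTY,
# so redirected/file logs stay free of escape sequences.
_LEVEL_COLORS = {
    "DEBUG": "\033[36m",    # cyan
    "INFO": "\033[32m",     # green
    "WARNING": "\033[33m",  # yellow
    "ERROR": "\033[31m",    # red
}
_RESET = "\033[0m"

class ColorFormatter(logging.Formatter):
    def format(self, record: logging.LogRecord) -> str:
        original = record.levelname
        if sys.stderr.isatty():
            color = _LEVEL_COLORS.get(original, "")
            record.levelname = f"{color}{original}{_RESET}"
        try:
            return super().format(record)
        finally:
            # Restore so other handlers (e.g. a file handler) see plain text.
            record.levelname = original
```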
```python
self.status_reporter.on_request_completed(
    input_len=req.input_len,
    output_len=req.shm_cur_output_len,
    cache_len=req.prompt_cache_len,
    mtp_accepted=req.mtp_accepted_token_num,
)
```
The router_statics.update() call was removed from Batch.filter_out_finished_req but not added here. This prevents the EMA of output lengths from being updated, which is critical for accurate scheduling decisions. You should call it when a request completes to maintain correct system load estimation.
Suggested change:

```diff
 self.status_reporter.on_request_completed(
     input_len=req.input_len,
     output_len=req.shm_cur_output_len,
     cache_len=req.prompt_cache_len,
     mtp_accepted=req.mtp_accepted_token_num,
 )
+self.router_statics.update(req.shm_cur_output_len)
```
Good catch — confirmed regression. Upstream's Batch.filter_out_finished_req(shm_req_manager, router_statics) called router_statics.update(req.candetoken_out_len) per finished req, and the refactor that hoisted the loop into _filter_reqs_from_running_batch only added status_reporter.on_request_completed(...) and dropped the EMA update. Since the schedulers (chunked_prefill/impl.py, beam_impl.py, impl_for_pd_decode.py, impl_for_nixl_pd.py, …) read self.router.router_statics.ema_req_out_len for KV-budget estimation, leaving it stale was a real correctness hit.
Restored in 3218df5 — kept req.candetoken_out_len (matching upstream's exact value) for the EMA update, alongside the new on_request_completed call which uses shm_cur_output_len for the windowed/global throughput counters.
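For reference, a simplified sketch of the restored loop shape (the `is_aborted` guard comes from the follow-up fix discussed below):

```python
# Inside Router._filter_reqs_from_running_batch (simplified sketch).
for req in finished_reqs:
    self.status_reporter.on_request_completed(
        input_len=req.input_len,
        output_len=req.shm_cur_output_len,  # feeds windowed/global TPS counters
        cache_len=req.prompt_cache_len,
        mtp_accepted=req.mtp_accepted_token_num,
    )
    if not req.is_aborted:
        # Matches upstream: feed candetoken_out_len into the EMA the
        # schedulers read for KV-budget estimation; skipping aborted reqs
        # keeps short partial outputs from biasing it.
        self.router_statics.update(req.candetoken_out_len)
```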
The try/except keeps the detokenization loop alive when a single recv_obj is malformed, but the failed request currently hangs the originating client until it disconnects. Note the missing piece: publish an ERROR finish_status back to httpserver so the client gets a 500 right away.
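A sketch of the suggested shape; `decode_one` and `mark_request_failed` are hypothetical stand-ins for the real detokenization and finish-status plumbing:

```python
import logging

logger = logging.getLogger(__name__)

def decode_one(recv_obj):  # hypothetical: detokenize and forward to httpserver
    ...

def mark_request_failed(recv_obj):  # hypothetical: publish an ERROR finish_status
    ...

def handle_recv_objs(recv_objs):
    for recv_obj in recv_objs:
        try:
            decode_one(recv_obj)
        except Exception:
            # Keeps the loop alive when a single recv_obj is malformed...
            logger.exception("failed to detokenize recv_obj")
            # ...and publishes the failure so httpserver can return a 500
            # immediately instead of leaving the client hanging.
            mark_request_failed(recv_obj)
```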
The earlier refactor that moved the finished-req loop out of Batch.filter_out_finished_req into Router._filter_reqs_from_running_batch forgot to keep the router_statics.update(candetoken_out_len) call, freezing ema_req_out_len at its initial value. Multiple schedulers (chunked_prefill, beam, pd_decode, nixl_pd) read that EMA for KV-budget estimation, so leaving it stale degraded scheduling accuracy.
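For context, `ema_req_out_len` is a plain exponential moving average; a minimal sketch with an illustrative smoothing factor:

```python
class RouterStatics:
    """Illustrative EMA of finished-request output lengths."""

    def __init__(self, alpha: float = 0.1, initial: float = 512.0):
        self.alpha = alpha
        self.ema_req_out_len = initial  # schedulers read this for KV budgets

    def update(self, out_len: int) -> None:
        # Without this call per finished request, the EMA freezes at
        # `initial`, which is exactly the regression described above.
        self.ema_req_out_len = (
            self.alpha * out_len + (1 - self.alpha) * self.ema_req_out_len
        )
```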
Two correctness fixes flagged in PR review:

1. count_output_tokens(len(running_batch.reqs)) once per router loop is wrong: the router loop polls on schedule_time_interval, decoupled from inference, so this overcounts when the loop is faster than decode and undercounts when slower, and it includes paused / prefill-only reqs. Track shm_cur_output_len per request and accumulate the delta each tick (with a tail settlement when the req is filtered out, so we don't lose its last tokens to the post-final-tick window).

2. on_request_completed() and router_statics.update() now both run for aborted requests, whose candetoken_out_len is a short partial value. Restore the prior `if not req.is_aborted` guard so disconnects don't bias the output-length EMA used by KV-budget estimators.
Move the per-running-req shm_cur_output_len delta tracking from the router tick (~33 Hz) into SystemStatusReporter.maybe_print, which only runs once per log_stats_interval (>= 5s). The reporter now owns the per-req snapshot dict and exposes discard_req(req) for tail settlement when a req leaves the running batch, so the router loop's hot path no longer walks the batch every schedule cycle. Output TPS accuracy is unchanged: still based on real shm_cur_output_len deltas, with tail tokens settled at completion.
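A simplified sketch of that design; field names like `request_id` and the interval handling are illustrative:

```python
import time

class SystemStatusReporter:
    def __init__(self, log_stats_interval: float = 5.0):
        self.interval = log_stats_interval
        self._last_print = time.monotonic()
        self._out_len_snapshot = {}  # req id -> last seen shm_cur_output_len
        self._window_tokens = 0

    def maybe_print(self, running_reqs):
        # Deltas are accumulated only when a print is due, so the router's
        # ~33 Hz hot path never walks the batch.
        now = time.monotonic()
        if now - self._last_print < self.interval:
            return
        for req in running_reqs:
            prev = self._out_len_snapshot.get(req.request_id, 0)
            self._window_tokens += req.shm_cur_output_len - prev
            self._out_len_snapshot[req.request_id] = req.shm_cur_output_len
        tps = self._window_tokens / (now - self._last_print)
        print(f"output TPS: {tps:.1f}")  # stand-in for the compact status line
        self._window_tokens = 0
        self._last_print = now

    def discard_req(self, req):
        # Tail settlement: credit tokens generated since the last tick so a
        # req leaving the running batch doesn't lose its final tokens.
        prev = self._out_len_snapshot.pop(req.request_id, 0)
        self._window_tokens += req.shm_cur_output_len - prev
```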
Code review

No issues found. Checked for bugs and CLAUDE.md compliance.

🤖 Generated with Claude Code
Code review (correction)

Previous comment missed one P1. Found 1 issue:

LightLLM/lightllm/server/detokenization/manager.py, lines 78 to 88 in a40264c

For reference, the can-release contract:

LightLLM/lightllm/server/core/objs/req.py, lines 100 to 108 in a40264c
LightLLM/lightllm/server/core/objs/req.py, lines 294 to 309 in a40264c

🤖 Generated with Claude Code
Code review (additional)

Found 1 more issue:

config_server_start removes the gunicorn flag:
LightLLM/lightllm/server/api_start.py, lines 601 to 613 in a40264c

config server FastAPI app has no middleware:
LightLLM/lightllm/server/config_server/api_http.py, lines 18 to 25 in a40264c

middleware is only attached to the main app:
LightLLM/lightllm/server/api_http.py, lines 149 to 153 in a40264c

🤖 Generated with Claude Code
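A sketch of what closing that gap could look like; `AccessLogMiddleware` below is a minimal stand-in for the class-based middleware already attached to the main app, not its actual implementation:

```python
import time

from fastapi import FastAPI
from starlette.middleware.base import BaseHTTPMiddleware
from starlette.requests import Request

class AccessLogMiddleware(BaseHTTPMiddleware):
    """Minimal stand-in for the main app's access-log middleware."""

    async def dispatch(self, request: Request, call_next):
        start = time.monotonic()
        response = await call_next(request)
        elapsed_ms = (time.monotonic() - start) * 1000
        client = request.client.host if request.client else "-"
        print(f'{client} "{request.method} {request.url.path}" '
              f"{response.status_code} {elapsed_ms:.1f}ms")
        return response

app = FastAPI()
app.add_middleware(AccessLogMiddleware)  # the piece missing on the config server app
```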
Code review

Found 1 issue (re-review of …):

LightLLM/lightllm/server/detokenization/manager.py, lines 69 to 74 in 5ef655a
LightLLM/lightllm/server/core/objs/req.py, lines 104 to 110 in 5ef655a
LightLLM/lightllm/server/core/objs/req.py, lines 297 to 311 in 5ef655a

🤖 Generated with Claude Code
Summary
Two small, independent improvements layered on top of the recent HTTP API consolidation (#1282):
- A `SystemStatusReporter` that prints a compact status line with a windowed prefix-cache hit rate alongside the global one (sketched below); removal of the gunicorn `--access-logfile` flags (the FastAPI access-log middleware introduced in "feat(api): consolidate HTTP API endpoints and fixes" (#1282) already covers this); removal of a duplicate `_ACCESS_LOG_STATUS_COLORS` declaration in `api_http.py`; and a sweep of noisy per-request / per-batch progress logs from INFO down to DEBUG. Also clears one flake8 F841 in the detokenization manager.
- A prompt-size guard: reject prompts longer than `max_req_total_len * 8` chars before tokenization, and route the resulting `ValueError` through a dedicated `except` branch in `HttpServerManager.generate` that logs at WARNING, releases multimodal resources, aborts the in-flight group request, and re-raises so the API layer (which already maps `ValueError` → 400) returns a clean error instead of stalling the tokenizer or surfacing a 500.

No behavior changes outside of log levels, status-line formatting, and the new prompt size cap. The class-based access-log middleware and the OpenAI-style error envelope from earlier internal commits are intentionally not re-applied here; they already landed in #1282.
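As referenced in the summary, a minimal sketch of tracking a windowed hit rate alongside the global one (deque-based; the class and window size are illustrative):

```python
from collections import deque

class PrefixCacheStats:
    """Illustrative global + sliding-window prefix-cache hit rate."""

    def __init__(self, window: int = 1000):
        self.total_tokens = 0
        self.total_hits = 0
        self.recent = deque(maxlen=window)  # (input_len, cache_len) per request

    def on_request(self, input_len: int, cache_len: int) -> None:
        self.total_tokens += input_len
        self.total_hits += cache_len
        self.recent.append((input_len, cache_len))

    def rates(self) -> tuple[float, float]:
        global_rate = self.total_hits / max(self.total_tokens, 1)
        w_tokens = sum(i for i, _ in self.recent)
        w_hits = sum(c for _, c in self.recent)
        return global_rate, w_hits / max(w_tokens, 1)
```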
Test plan
- `python -m py_compile` on every touched file (run locally; passes)
- Watch the new status line print on a running server, and verify `--disable_log_stats` suppresses it
- Send a prompt longer than `max_req_total_len * 8` chars to `/generate` and `/v1/chat/completions` and verify a 400 with the OpenAI-style error envelope (no 500, no hang)