[None][fix] bypass tokenizer in kvcache router when there is only one server#14030
[None][fix] bypass tokenizer in kvcache router when there is only one server#14030reasonsolo wants to merge 1 commit into
Conversation
📝 WalkthroughWalkthroughThe PR adds early-exit logic to ChangesEarly-exit optimization for server candidate selection
Estimated code review effort🎯 2 (Simple) | ⏱️ ~8 minutes 🚥 Pre-merge checks | ✅ 3 | ❌ 2❌ Failed checks (2 warnings)
✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Tip 💬 Introducing Slack Agent: The best way for teams to turn conversations into code.Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.
Built for teams:
One agent for your entire SDLC. Right inside Slack. Comment |
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
tensorrt_llm/serve/router.py (2)
780-792:⚠️ Potential issue | 🟠 Major | ⚡ Quick winAvoid stale server selection in single-server fast path.
The single-server branch builds
serversunder lock, releases it, then re-locks at Line 790 to register. A concurrent server-list update can make that selected server stale before_register_request, causing inconsistent routing or lookup failures.Proposed fix
async def get_next_server( self, request: OpenAIRequest, exclude_server: Optional[str] = None) -> tuple[str, dict]: - async with self._lock: - servers = [ - server for server in self._server_state.keys() - if server != exclude_server - ] - if len(servers) == 0: - raise RuntimeError( - f"No available servers after excluding {exclude_server}") - if len(servers) == 1: - server = servers[0] - async with self._lock: - await self._register_request(server, request) - return server, { - "block_hashes": [], - "token_lists": [], - "matches": [], - "server_info": self._server_info.get(server, {}), - } + async with self._lock: + servers = [ + server for server in self._server_state.keys() + if server != exclude_server + ] + if len(servers) == 0: + raise RuntimeError( + f"No available servers after excluding {exclude_server}") + if len(servers) == 1: + server = servers[0] + await self._register_request(server, request) + return server, { + "block_hashes": [], + "token_lists": [], + "matches": [], + "server_info": self._server_info.get(server, {}), + }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tensorrt_llm/serve/router.py` around lines 780 - 792, The single-server fast path can pick `server` from `servers` while holding `self._lock` but then release the lock and later re-acquire it only to call `await self._register_request(server, request)`, which allows the server to be removed concurrently and become stale; fix by holding the lock across selection and registration or by re-checking the selected server under `self._lock` before calling `_register_request`: after re-acquiring `self._lock` verify `server in self._server_state` (or recompute `servers` and re-select) and raise or retry if it’s gone, ensuring `_register_request` is only called for a currently-registered server.
1-1:⚠️ Potential issue | 🟠 Major | ⚡ Quick winAdd/update the required NVIDIA copyright header for this modified file.
This file is modified but does not include the required source header with current modification year.
As per coding guidelines:
**/*.{cpp,h,hpp,cu,cuh,py}: All C++, Python, and other source files must contain NVIDIA copyright header with current modification year.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tensorrt_llm/serve/router.py` at line 1, Add the required NVIDIA source header to the top of the modified Python file tensorrt_llm/serve/router.py (before any imports like the existing "import asyncio"); insert the NVIDIA copyright/license block with the current modification year and the exact header format used in other project files, replacing or adding the header if missing, and ensure the header is the first lines of the file so automated license checks recognize it.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@tensorrt_llm/serve/router.py`:
- Around line 780-792: The single-server fast path can pick `server` from
`servers` while holding `self._lock` but then release the lock and later
re-acquire it only to call `await self._register_request(server, request)`,
which allows the server to be removed concurrently and become stale; fix by
holding the lock across selection and registration or by re-checking the
selected server under `self._lock` before calling `_register_request`: after
re-acquiring `self._lock` verify `server in self._server_state` (or recompute
`servers` and re-select) and raise or retry if it’s gone, ensuring
`_register_request` is only called for a currently-registered server.
- Line 1: Add the required NVIDIA source header to the top of the modified
Python file tensorrt_llm/serve/router.py (before any imports like the existing
"import asyncio"); insert the NVIDIA copyright/license block with the current
modification year and the exact header format used in other project files,
replacing or adding the header if missing, and ensure the header is the first
lines of the file so automated license checks recognize it.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 190a67ff-50b3-4438-b71c-1693a73c446e
📒 Files selected for processing (1)
tensorrt_llm/serve/router.py
c50040d to
3fa893c
Compare
Signed-off-by: Lizhi Zhou <1432185+reasonsolo@users.noreply.github.com>
3fa893c to
8ec09f9
Compare
|
/bot run --disable-fail-fast |
|
PR_Github #47886 [ run ] triggered by Bot. Commit: |
Note: Usually this should not be a problem, since kvcache aware router is not supposed to be used in only one P/D worker cases.
Summary by CodeRabbit
Description
Test Coverage
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
To see a list of available CI bot commands, please comment
/bot help.