[None][fix] bypass tokenizer in kvcache router when there is only one server by reasonsolo · Pull Request #14030 · NVIDIA/TensorRT-LLM

reasonsolo · 2026-05-12T05:18:56Z

Note: Usually this should not be a problem, since kvcache aware router is not supposed to be used in only one P/D worker cases.

Summary by CodeRabbit

Bug Fixes
- Enhanced request routing reliability with improved error handling that explicitly detects and properly reports when no suitable candidate servers are available for request processing.
- Optimized server selection logic for single-server scenarios, streamlining the request processing path to reduce unnecessary computational steps and improve overall performance.

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

coderabbitai · 2026-05-12T05:21:54Z

📝 Walkthrough

Walkthrough

The PR adds early-exit logic to KvCacheAwareRouter.get_next_server to optimize request routing. When server filtering leaves zero candidates, a RuntimeError is raised. When exactly one candidate remains, the request is registered immediately and returned with that server, bypassing the expensive KV-hash scoring workflow.

Changes

Early-exit optimization for server candidate selection

Layer / File(s)	Summary
Early-exit optimization and error handling in get_next_server `tensorrt_llm/serve/router.py`	When the server candidate list is empty after filtering, `RuntimeError` is raised. When exactly one candidate remains, the request is immediately registered and returned with that server, empty `block_hashes`/`token_lists`/`matches`, and `server_info`, bypassing KV-hash scoring.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (2 warnings)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.
Description check	⚠️ Warning	The PR description is largely incomplete. It lacks key information including PR title, issue description, test coverage details, and proper explanation of the changes.	Add a proper PR title following the template (e.g., [TRTLLM-XXXX][fix] or [None][fix]), provide a clear description explaining what the change does and why it's needed, and specify relevant test coverage that validates the single-server router behavior.

✅ Passed checks (3 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly summarizes the main change: bypassing the tokenizer in the kvcache router when only one server is available.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

Tip

💬 Introducing Slack Agent: The best way for teams to turn conversations into code.

Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.

Generate code and open pull requests
Plan features and break down work
Investigate incidents and troubleshoot customer tickets together
Automate recurring tasks and respond to alerts with triggers
Summarize progress and report instantly

Built for teams:

Shared memory across your entire org—no repeating context
Per-thread sandboxes to safely plan and execute work
Governance built-in—scoped access, auditability, and budget controls

One agent for your entire SDLC. Right inside Slack.

👉 Get started

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)

tensorrt_llm/serve/router.py (2)

780-792: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Avoid stale server selection in single-server fast path.

The single-server branch builds servers under lock, releases it, then re-locks at Line 790 to register. A concurrent server-list update can make that selected server stale before _register_request, causing inconsistent routing or lookup failures.

Proposed fix

     async def get_next_server(
             self,
             request: OpenAIRequest,
             exclude_server: Optional[str] = None) -> tuple[str, dict]:
-        async with self._lock:
-            servers = [
-                server for server in self._server_state.keys()
-                if server != exclude_server
-            ]
-        if len(servers) == 0:
-            raise RuntimeError(
-                f"No available servers after excluding {exclude_server}")
-        if len(servers) == 1:
-            server = servers[0]
-            async with self._lock:
-                await self._register_request(server, request)
-            return server, {
-                "block_hashes": [],
-                "token_lists": [],
-                "matches": [],
-                "server_info": self._server_info.get(server, {}),
-            }
+        async with self._lock:
+            servers = [
+                server for server in self._server_state.keys()
+                if server != exclude_server
+            ]
+            if len(servers) == 0:
+                raise RuntimeError(
+                    f"No available servers after excluding {exclude_server}")
+            if len(servers) == 1:
+                server = servers[0]
+                await self._register_request(server, request)
+                return server, {
+                    "block_hashes": [],
+                    "token_lists": [],
+                    "matches": [],
+                    "server_info": self._server_info.get(server, {}),
+                }

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tensorrt_llm/serve/router.py` around lines 780 - 792, The single-server fast
path can pick `server` from `servers` while holding `self._lock` but then
release the lock and later re-acquire it only to call `await
self._register_request(server, request)`, which allows the server to be removed
concurrently and become stale; fix by holding the lock across selection and
registration or by re-checking the selected server under `self._lock` before
calling `_register_request`: after re-acquiring `self._lock` verify `server in
self._server_state` (or recompute `servers` and re-select) and raise or retry if
it’s gone, ensuring `_register_request` is only called for a
currently-registered server.

1-1: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Add/update the required NVIDIA copyright header for this modified file.

This file is modified but does not include the required source header with current modification year.

As per coding guidelines: **/*.{cpp,h,hpp,cu,cuh,py}: All C++, Python, and other source files must contain NVIDIA copyright header with current modification year.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tensorrt_llm/serve/router.py` at line 1, Add the required NVIDIA source
header to the top of the modified Python file tensorrt_llm/serve/router.py
(before any imports like the existing "import asyncio"); insert the NVIDIA
copyright/license block with the current modification year and the exact header
format used in other project files, replacing or adding the header if missing,
and ensure the header is the first lines of the file so automated license checks
recognize it.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@tensorrt_llm/serve/router.py`:
- Around line 780-792: The single-server fast path can pick `server` from
`servers` while holding `self._lock` but then release the lock and later
re-acquire it only to call `await self._register_request(server, request)`,
which allows the server to be removed concurrently and become stale; fix by
holding the lock across selection and registration or by re-checking the
selected server under `self._lock` before calling `_register_request`: after
re-acquiring `self._lock` verify `server in self._server_state` (or recompute
`servers` and re-select) and raise or retry if it’s gone, ensuring
`_register_request` is only called for a currently-registered server.
- Line 1: Add the required NVIDIA source header to the top of the modified
Python file tensorrt_llm/serve/router.py (before any imports like the existing
"import asyncio"); insert the NVIDIA copyright/license block with the current
modification year and the exact header format used in other project files,
replacing or adding the header if missing, and ensure the header is the first
lines of the file so automated license checks recognize it.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 190a67ff-50b3-4438-b71c-1693a73c446e

📥 Commits

Reviewing files that changed from the base of the PR and between bd39d26 and c50040d.

📒 Files selected for processing (1)

tensorrt_llm/serve/router.py

Signed-off-by: Lizhi Zhou <1432185+reasonsolo@users.noreply.github.com>

reasonsolo · 2026-05-12T05:41:04Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-05-12T05:46:40Z

PR_Github #47886 [ run ] triggered by Bot. Commit: 8ec09f9 Link to invocation

reasonsolo requested a review from a team as a code owner May 12, 2026 05:18

reasonsolo requested a review from mikeiovine May 12, 2026 05:18

github-actions Bot assigned reasonsolo May 12, 2026

coderabbitai Bot reviewed May 12, 2026

View reviewed changes

reasonsolo force-pushed the bypass-kvcache-singleserver branch from c50040d to 3fa893c Compare May 12, 2026 05:22

bypass tokenizer in kvcache router when there is only one server

8ec09f9

Signed-off-by: Lizhi Zhou <1432185+reasonsolo@users.noreply.github.com>

reasonsolo force-pushed the bypass-kvcache-singleserver branch from 3fa893c to 8ec09f9 Compare May 12, 2026 05:37

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[None][fix] bypass tokenizer in kvcache router when there is only one server#14030

[None][fix] bypass tokenizer in kvcache router when there is only one server#14030
reasonsolo wants to merge 1 commit into
NVIDIA:mainfrom
reasonsolo:bypass-kvcache-singleserver

reasonsolo commented May 12, 2026 •

edited

Loading

Uh oh!

coderabbitai Bot commented May 12, 2026 •

edited

Loading

Walkthrough

Changes

Estimated code review effort

❌ Failed checks (2 warnings)

Uh oh!

coderabbitai Bot left a comment

Uh oh!

reasonsolo commented May 12, 2026

Uh oh!

tensorrt-cicd commented May 12, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

reasonsolo commented May 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Description

Test Coverage

PR Checklist

GitHub Bot Help

Uh oh!

coderabbitai Bot commented May 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

❌ Failed checks (2 warnings)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

reasonsolo commented May 12, 2026

Uh oh!

tensorrt-cicd commented May 12, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

reasonsolo commented May 12, 2026 •

edited

Loading

coderabbitai Bot commented May 12, 2026 •

edited

Loading