[TRTLLM-12627][ci] Narrow tensorrt_llm/serve/ MGPU trigger to disagg-only files#14022
QiJune wants to merge 1 commit into
Conversation
Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
📝 Walkthrough

This PR updates the Jenkins CI pipeline's multi-GPU test trigger detection to list eight serve-related Python modules from the tensorrt_llm package explicitly. Changes to the cluster storage, auto-scaling, metadata server, OpenAI client, server, and router implementations will continue to automatically trigger multi-GPU test execution.

Changes: Multi-GPU Test Trigger Scope
Estimated code review effort: 🎯 1 (Trivial) | ⏱️ ~2 minutes
🚥 Pre-merge checks: ✅ 5 passed (5 checks)
🧹 Nitpick comments (1)
jenkins/L0_MergeRequest.groovy (1)
941-948: ⚡ Quick win: add an inline maintenance note for this curated serve allowlist.
Given this is intentionally selective, a short comment here about “update this list when new disagg/multi-process serve files are added” would reduce future missed-trigger risk.
💡 Suggested diff
```diff
 "tensorrt_llm/parameter.py",
+// Keep this serve allowlist in sync with disagg/multi-process integration.
+// If a new serve file participates in disagg flows, add it here.
 "tensorrt_llm/serve/cluster_storage.py",
 "tensorrt_llm/serve/disagg_auto_scaling.py",
 "tensorrt_llm/serve/metadata_server.py",
```

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `jenkins/L0_MergeRequest.groovy` around lines 941-948, add a short inline maintenance comment immediately above this curated serve allowlist (the block containing "tensorrt_llm/serve/cluster_storage.py", "tensorrt_llm/serve/disagg_auto_scaling.py", "tensorrt_llm/serve/metadata_server.py", "tensorrt_llm/serve/openai_client.py", "tensorrt_llm/serve/openai_disagg_server.py", "tensorrt_llm/serve/openai_disagg_service.py", "tensorrt_llm/serve/openai_server.py", "tensorrt_llm/serve/router.py") indicating that the list is intentionally selective and must be updated whenever new disagg or multi-process serve files are added, to avoid missed CI triggers; keep the note brief and actionable so reviewers and future editors see it when maintaining the allowlist.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 9e888c85-0c80-450f-94a3-bf41721e7063
📒 Files selected for processing (1)
jenkins/L0_MergeRequest.groovy
Summary
- Replace the `tensorrt_llm/serve/` directory blanket in `getMultiGpuFileChanged`'s `relatedFileList` with the 8 files under `serve/` that have active disagg / multi-process integration (verified by `from tensorrt_llm.llmapi.disagg_utils import ...`).
- Drop the remaining files under `serve/`, none of which have multi-rank semantics.
- Keep list: `openai_server.py`, `openai_disagg_server.py`, `openai_disagg_service.py`, `router.py`, `openai_client.py`, `cluster_storage.py`, `disagg_auto_scaling.py`, `metadata_server.py`.

Why

`tensorrt_llm/serve/` is the first broad pattern in the multi-GPU rule by 12-week hit volume: 53 commits / 19 solo-via-serve. Of those 19 solo commits, only 3 touched genuinely multi-process code paths (2× `openai_client.py` "fix lost requests", 1× `router.py` threading optimization). The other 16 were tool parsers, benchmark clients, multimodal API, harmony parsers, and PNG encoding: all single-process work that does not need MGPU coverage.

`openai_server.py` is in the keep list because, despite being mostly HTTP-layer code, it owns the disagg cluster worker lifecycle (register / heartbeat / deregister via `DisaggClusterWorker`). Dropping it would create a real coverage gap for disagg cluster regressions.

Measured impact (12 weeks, 1017 commits)

All 3 historical disagg-relevant solo commits still trigger MGPU under the narrow set (they touch `router.py` or `openai_client.py`). Zero measured loss of true-positive coverage.

Coverage preservation argument
A pattern is removable when (a) the dropped files lack multi-rank semantics, and (b) any commit that would have legitimately needed MGPU coverage still matches via some other pattern or the narrow serve subset.
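This criterion can be sketched as a toy check. The snippet below is a simplified Python model of the trigger decision, not the actual Groovy logic in `jenkins/L0_MergeRequest.groovy`; the names `NARROW_SERVE_ALLOWLIST` and `mgpu_triggered` are made up for illustration.

```python
# Hypothetical, simplified model of the serve/ portion of the multi-GPU trigger.
# The real check lives in getMultiGpuFileChanged (Groovy); names here are illustrative.

NARROW_SERVE_ALLOWLIST = {
    "tensorrt_llm/serve/cluster_storage.py",
    "tensorrt_llm/serve/disagg_auto_scaling.py",
    "tensorrt_llm/serve/metadata_server.py",
    "tensorrt_llm/serve/openai_client.py",
    "tensorrt_llm/serve/openai_disagg_server.py",
    "tensorrt_llm/serve/openai_disagg_service.py",
    "tensorrt_llm/serve/openai_server.py",
    "tensorrt_llm/serve/router.py",
}

def mgpu_triggered(changed_files, use_blanket_prefix):
    """Return True if the serve-related rule would schedule multi-GPU tests."""
    if use_blanket_prefix:
        # Old behavior: any file under the serve/ directory matches.
        return any(f.startswith("tensorrt_llm/serve/") for f in changed_files)
    # New behavior: only the curated disagg / multi-process files match.
    return any(f in NARROW_SERVE_ALLOWLIST for f in changed_files)

# Criterion (b): the historical disagg-relevant solo commits still trigger.
assert mgpu_triggered(["tensorrt_llm/serve/openai_client.py"], use_blanket_prefix=False)
assert mgpu_triggered(["tensorrt_llm/serve/router.py"], use_blanket_prefix=False)

# A tool-parser-only commit matched the blanket prefix but not the narrow list.
tool_parser_commit = ["tensorrt_llm/serve/tool_parser/foo_parser.py"]
assert mgpu_triggered(tool_parser_commit, use_blanket_prefix=True)
assert not mgpu_triggered(tool_parser_commit, use_blanket_prefix=False)
```

The same model makes the false-positive reduction concrete: single-process serve files stop scheduling MGPU runs while every file in the curated set still does.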
Patterns and files dropped from the trigger:
- `tool_parser/` (16 files)
- `scripts/` (7 files)
- `openai_protocol.py`, `chat_utils.py`, `harmony_adapter.py`, `postprocess_handlers.py`, `resource_governor.py`, `visual_gen_utils.py`

Caveats
- If a new multi-process file is added under `serve/`, it will not auto-trigger MGPU until added to this list. The blanket prefix today is "automatic" but indiscriminate; the narrow list trades that automation for false-positive reduction. All current disagg / cluster files (verified by `disagg_utils` import grep) are in the keep list.
- `perf_metrics.py` also imports `ServerRole` from `disagg_utils` but uses it only as a metric label, with no orchestration. Left out intentionally to maintain the "active multi-process integration" criterion.

Summary by CodeRabbit
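The `disagg_utils` import grep mentioned above can be approximated with a short script. This is an illustrative sketch, not tooling from the repository; the demo builds a synthetic directory standing in for `tensorrt_llm/serve/`, and `files_with_disagg_import` is a hypothetical helper name.

```python
# Illustrative sketch of the "disagg_utils import grep" verification: scan
# serve/ sources for an active disagg_utils import. Paths and contents below
# are synthetic; against a real checkout, point serve_dir at tensorrt_llm/serve/.
import pathlib
import tempfile

MARKER = "from tensorrt_llm.llmapi.disagg_utils import"

def files_with_disagg_import(serve_dir):
    """Return the names of serve/ files whose source contains a disagg_utils import."""
    return sorted(
        p.name
        for p in pathlib.Path(serve_dir).glob("*.py")
        if MARKER in p.read_text()
    )

# Synthetic demo tree standing in for tensorrt_llm/serve/.
with tempfile.TemporaryDirectory() as d:
    root = pathlib.Path(d)
    (root / "router.py").write_text(f"{MARKER} ServerRole\n")
    (root / "harmony_adapter.py").write_text("# single-process parser, no disagg\n")
    print(files_with_disagg_import(root))  # only router.py carries the marker
```

A check like this could run periodically to flag serve/ files that gained a `disagg_utils` import without being added to the allowlist, which addresses the missed-trigger caveat directly.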
Description
Test Coverage
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
To see a list of available CI bot commands, please comment
/bot help.