
[TRTLLM-12627][ci] Narrow tensorrt_llm/serve/ MGPU trigger to disagg-only files#14022

Open
QiJune wants to merge 1 commit into NVIDIA:main from QiJune:serve-mg

Conversation

@QiJune
Collaborator

@QiJune QiJune commented May 12, 2026

Summary

  • Replace the blanket tensorrt_llm/serve/ directory pattern in getMultiGpuFileChanged's relatedFileList with the 8 files under serve/ that have active disagg / multi-process integration (verified by grepping for from tensorrt_llm.llmapi.disagg_utils import ...).
  • Drops the trigger for tool parsers, benchmark scripts, and multimodal / protocol / response-handler code under serve/, none of which has multi-rank semantics.
  • Files kept: openai_server.py, openai_disagg_server.py, openai_disagg_service.py, router.py, openai_client.py, cluster_storage.py, disagg_auto_scaling.py, metadata_server.py.

Why

tensorrt_llm/serve/ is the highest-volume broad pattern in the multi-GPU rule over the last 12 weeks: 53 commits hit it, 19 of them solo-via-serve (no other pattern matched). Of those 19 solo commits, only 3 touched genuinely multi-process code paths (two openai_client.py "fix lost requests" fixes and one router.py threading optimization). The other 16 were tool parsers, benchmark clients, multimodal API, harmony parsers, and PNG encoding: all single-process work that does not need MGPU coverage.

openai_server.py is in the keep list because, despite being mostly HTTP-layer code, it owns the disagg cluster worker lifecycle (register / heartbeat / deregister via DisaggClusterWorker). Dropping it would create a real coverage gap for disagg cluster regressions.
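Schematically, that lifecycle looks like the sketch below. This is illustrative only: the class name, the dict-backed storage, and the method names are stand-ins, not the actual DisaggClusterWorker API.

```python
import threading
import time


class ClusterWorkerSketch:
    """Illustrative register/heartbeat/deregister lifecycle, loosely modeled
    on what openai_server.py owns for disagg clusters (not the real API)."""

    def __init__(self, storage: dict, worker_id: str, interval: float = 5.0):
        self.storage = storage          # stand-in for cluster storage
        self.worker_id = worker_id
        self.interval = interval
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._heartbeat_loop, daemon=True)

    def start(self) -> None:
        self.storage[self.worker_id] = time.time()       # register
        self._thread.start()

    def _heartbeat_loop(self) -> None:
        while not self._stop.wait(self.interval):
            self.storage[self.worker_id] = time.time()   # heartbeat

    def stop(self) -> None:
        self._stop.set()
        self._thread.join()
        self.storage.pop(self.worker_id, None)           # deregister
```

If openai_server.py were dropped from the trigger list, a regression in any of these three steps could merge without MGPU coverage.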

Measured impact (12 weeks, 1017 commits)

| Metric                 | Before | After  | Δ        |
|------------------------|--------|--------|----------|
| Solo-via-serve commits | 19     | 10     | −9       |
| Total MGPU triggers    | 346    | 337    | −9       |
| MGPU hit rate          | 34.0 % | 33.1 % | −0.88 pp |
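The deltas above are straightforward to sanity-check against the 1017-commit window:

```python
# Sanity-check the measured-impact numbers (12 weeks, 1017 commits).
TOTAL_COMMITS = 1017
before_triggers, after_triggers = 346, 337
before_solo, after_solo = 19, 10

# Both deltas come from the same 9 dropped solo-via-serve commits.
assert before_triggers - after_triggers == 9
assert before_solo - after_solo == 9

rate_before = 100 * before_triggers / TOTAL_COMMITS  # ~34.0
rate_after = 100 * after_triggers / TOTAL_COMMITS    # ~33.1
print(f"hit rate: {rate_before:.1f}% -> {rate_after:.1f}% "
      f"({rate_after - rate_before:+.2f} pp)")
```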

All 3 historical disagg-relevant solo commits still trigger MGPU under the narrow set (they touch router.py or openai_client.py). Zero measured loss of true-positive coverage.

Coverage preservation argument

A pattern is removable when (a) the dropped files lack multi-rank semantics, and (b) any commit that would have legitimately needed MGPU coverage still matches via some other pattern or the narrow serve subset.
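Expressed as code, the removability criterion is a predicate over the trigger patterns and the historical commits that genuinely needed MGPU coverage. A minimal sketch with hypothetical helper names (the real matching lives in getMultiGpuFileChanged in jenkins/L0_MergeRequest.groovy):

```python
from fnmatch import fnmatch


def matches_any(path: str, patterns: list[str]) -> bool:
    """True if a changed file hits any MGPU trigger pattern (path or glob)."""
    return any(fnmatch(path, p) for p in patterns)


def pattern_is_removable(pattern: str,
                         all_patterns: list[str],
                         mgpu_relevant_commits: list[list[str]]) -> bool:
    """Removable iff every commit that legitimately needed MGPU coverage
    still matches via the remaining patterns."""
    remaining = [p for p in all_patterns if p != pattern]
    return all(
        any(matches_any(f, remaining) for f in files)
        for files in mgpu_relevant_commits
    )


# The 3 historical disagg-relevant solo commits touched router.py or
# openai_client.py, both of which stay in the narrow keep list.
narrow = ["tensorrt_llm/serve/router.py", "tensorrt_llm/serve/openai_client.py"]
patterns = ["tensorrt_llm/serve/*"] + narrow
relevant = [["tensorrt_llm/serve/openai_client.py"],
            ["tensorrt_llm/serve/openai_client.py"],
            ["tensorrt_llm/serve/router.py"]]
print(pattern_is_removable("tensorrt_llm/serve/*", patterns, relevant))  # True
```

Dropped files no longer match: matches_any("tensorrt_llm/serve/tool_parser/x.py", narrow) is False, which is exactly the false-positive reduction being claimed.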

| Dropped sub-bucket | 12w hits | 12w solo | Why no MGPU |
|--------------------|----------|----------|-------------|
| tool_parser/ (16 files) | 9 | 3 | Pure string parsing |
| scripts/ (7 files) | 7 | 2 | Client-side benchmark scripts |
| Server API non-disagg (e.g., openai_protocol.py, chat_utils.py, harmony_adapter.py, postprocess_handlers.py, resource_governor.py, visual_gen_utils.py) | 14 | 3 | Single-process API / response handling |

Caveats

  • Future risk: if a maintainer adds a new disagg-related file under serve/, it will not auto-trigger MGPU until added to this list. The blanket prefix today is "automatic" but indiscriminate; the narrow list trades that automation for false-positive reduction. All current disagg / cluster files (verified by disagg_utils import grep) are in the keep list.
  • perf_metrics.py also imports ServerRole from disagg_utils but uses it only as a metric label — no orchestration. Left out intentionally to maintain the "active multi-process integration" criterion.
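The disagg_utils-import grep mentioned above can be reproduced with a short script. Here disagg_importers is a hypothetical helper, and the directory argument is an assumption about the repo layout:

```python
import re
from pathlib import Path

# Matches "from tensorrt_llm.llmapi.disagg_utils import ..." at any indent.
IMPORT_RE = re.compile(r"from\s+tensorrt_llm\.llmapi\.disagg_utils\s+import")


def disagg_importers(serve_dir: str) -> list[str]:
    """Names of serve/ modules that import disagg_utils (the keep-list grep)."""
    return sorted(
        p.name for p in Path(serve_dir).glob("*.py")
        if IMPORT_RE.search(p.read_text(encoding="utf-8", errors="ignore"))
    )
```

Note the results still need manual triage: perf_metrics.py would show up in this grep even though its ServerRole import is only a metric label, which is why it is excluded from the keep list above.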

Summary by CodeRabbit

  • Chores
    • Updated CI/CD pipeline configuration to expand multi-GPU testing triggers for additional serving module files.

Review Change Stack

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
@QiJune QiJune requested review from a team as code owners May 12, 2026 02:56
@QiJune QiJune requested review from dpitman-nvda and yiqingy0 May 12, 2026 02:56
@coderabbitai
Contributor

coderabbitai Bot commented May 12, 2026

📝 Walkthrough

Walkthrough

This PR extends the Jenkins CI pipeline's multi-GPU test trigger detection to include eight additional serve-related Python modules from the tensorrt_llm package. Changes to cluster storage, auto-scaling, metadata, OpenAI client, server, and router implementations will now automatically trigger multi-GPU test execution.

Changes

Multi-GPU Test Trigger Expansion

| Layer / File(s) | Summary |
|-----------------|---------|
| Multi-GPU test trigger files (jenkins/L0_MergeRequest.groovy) | The getMultiGpuFileChanged function's relatedFileList is expanded to recognize eight additional tensorrt_llm/serve/*.py modules (cluster_storage.py, disagg_auto_scaling.py, metadata_server.py, openai_client.py, openai_disagg_server.py, openai_disagg_service.py, openai_server.py, router.py) as multi-GPU related, broadening when the pipeline forces multi-GPU test execution based on PR file diffs. |

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~2 minutes

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
|------------|--------|-------------|
| Title check | ✅ Passed | The title clearly and specifically describes the main change: narrowing the MGPU trigger scope for tensorrt_llm/serve/ files to disagg-only files, with proper ticket reference and CI type prefix. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Description check | ✅ Passed | PR description comprehensively explains the change, rationale, and impact with detailed analysis and caveats. |


Contributor

@coderabbitai coderabbitai Bot left a comment


🧹 Nitpick comments (1)
jenkins/L0_MergeRequest.groovy (1)

941-948: ⚡ Quick win

Add an inline maintenance note for this curated serve allowlist.

Given this is intentionally selective, a short comment here about “update this list when new disagg/multi-process serve files are added” would reduce future missed-trigger risk.

💡 Suggested diff
         "tensorrt_llm/parameter.py",
+        // Keep this serve allowlist in sync with disagg/multi-process integration.
+        // If a new serve file participates in disagg flows, add it here.
         "tensorrt_llm/serve/cluster_storage.py",
         "tensorrt_llm/serve/disagg_auto_scaling.py",
         "tensorrt_llm/serve/metadata_server.py",
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@jenkins/L0_MergeRequest.groovy` around lines 941 - 948, Add a short inline
maintenance comment immediately above this curated serve allowlist (the block
containing "tensorrt_llm/serve/cluster_storage.py",
"tensorrt_llm/serve/disagg_auto_scaling.py",
"tensorrt_llm/serve/metadata_server.py", "tensorrt_llm/serve/openai_client.py",
"tensorrt_llm/serve/openai_disagg_server.py",
"tensorrt_llm/serve/openai_disagg_service.py",
"tensorrt_llm/serve/openai_server.py", "tensorrt_llm/serve/router.py")
indicating that the list is intentionally selective and must be updated whenever
new disagg or multi-process serve files are added to avoid missed CI triggers;
keep the note brief and actionable so reviewers and future editors see it when
maintaining the allowlist.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 9e888c85-0c80-450f-94a3-bf41721e7063

📥 Commits

Reviewing files that changed from the base of the PR and between 64260ba and dcdb288.

📒 Files selected for processing (1)
  • jenkins/L0_MergeRequest.groovy

Collaborator

@JunyiXu-nv JunyiXu-nv left a comment


LGTM

