[TRTLLM-12440][feat] Add GMS-only weight sharing support by chienchunhung · Pull Request #13926 · NVIDIA/TensorRT-LLM

chienchunhung · 2026-05-09T00:27:19Z

Summary

Adds LoadFormat.GMS so multiple TRT-LLM instances on the same node can zero-copy share model weights via the GPU Memory Service (GMS) pool. The first instance loads weights as the writer (RW); subsequent peers materialize them read-only (RO) without disk I/O or per-instance copies.
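The writer/reader split described above can be illustrated with a toy, in-process pool. This is only a sketch: `ToyPool`, `attach`, and `fake_disk_load` are hypothetical names, and a `bytearray` behind a `memoryview` stands in for GPU memory. None of it is the GMS API.

```python
from typing import Callable, Dict, List, Tuple


class ToyPool:
    """In-process stand-in for a shared weight pool. Hypothetical; not the GMS API."""

    def __init__(self) -> None:
        self._buffers: Dict[str, bytearray] = {}

    def attach(self, tag: str, load: Callable[[], bytes]) -> Tuple[str, memoryview]:
        """Return (role, view). The first caller for a tag becomes the writer."""
        if tag not in self._buffers:
            # Writer path (RW): pay the load cost once and publish the buffer.
            self._buffers[tag] = bytearray(load())
            return "RW", memoryview(self._buffers[tag])
        # Reader path (RO): read-only view over the same bytes, no copy, no load.
        return "RO", memoryview(self._buffers[tag]).toreadonly()


loads: List[int] = []


def fake_disk_load() -> bytes:
    """Pretend to read weights from disk and count how often that happens."""
    loads.append(1)
    return b"\x01\x02\x03\x04"


pool = ToyPool()
writer_role, writer_view = pool.attach("demo-model", fake_disk_load)
reader_role, reader_view = pool.attach("demo-model", fake_disk_load)

# A write through the writer's view is visible to the reader: same memory.
writer_view[0] = 9
print(writer_role, reader_role, len(loads), reader_view[0])  # prints: RW RO 1 9
```

The second `attach` call never touches the loader, which is the property the summary describes: one load, many read-only consumers.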

Scope

In scope:

LoadFormat.GMS enum value and nested GmsConfig (socket_path, mode, tag) on TorchLlmArgs.
New GMSBackend adapter under tensorrt_llm/_torch/memory/, lazily imported only when GMS is selected.
ModelLoader GMS branch with explicit RW / RO / unexpected-state handling, plus a guard that refuses to commit an unpopulated model to the pool.
Cleanup hook invoked from PyTorchModelEngine.__del__.
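The config surface and the loader branch listed above could look roughly like the following. This is a sketch under assumptions, not the code in this PR: `GmsConfigSketch`, `Granted`, and `load_via_pool` are hypothetical names, the allowed `mode` values and the defaults are guesses, and a plain dict stands in for the pool.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Dict

Weights = Dict[str, bytes]


class Granted(Enum):
    """State handed back by the pool. Names are assumptions for this sketch."""
    RW = "rw"
    RO = "ro"
    UNKNOWN = "unknown"


@dataclass(frozen=True)
class GmsConfigSketch:
    """Carries the three fields named in the PR; defaults and modes are guesses."""
    socket_path: str
    mode: str = "auto"
    tag: str = "weights"

    def __post_init__(self) -> None:
        if not self.socket_path:
            raise ValueError("socket_path must be non-empty")
        if self.mode not in {"auto", "rw", "ro"}:
            raise ValueError(f"unsupported mode: {self.mode!r}")


def load_via_pool(cfg: GmsConfigSketch, granted: Granted,
                  pool: Dict[str, Weights],
                  load_from_disk: Callable[[], Weights]) -> Weights:
    """Dispatch on the granted state: RW, RO, or anything unexpected."""
    if granted is Granted.RW:
        weights = load_from_disk()
        # Guard: never publish an empty model for peers to map.
        if not weights:
            raise RuntimeError("refusing to commit an unpopulated model")
        pool[cfg.tag] = weights
        return weights
    if granted is Granted.RO:
        if cfg.tag not in pool:
            raise RuntimeError(f"no committed weights for tag {cfg.tag!r}")
        return pool[cfg.tag]  # same object the writer committed, no reload
    raise RuntimeError(f"unexpected pool state: {granted}")


cfg = GmsConfigSketch(socket_path="/tmp/gms.sock", tag="demo")
shared: Dict[str, Weights] = {}
written = load_via_pool(cfg, Granted.RW, shared, lambda: {"layer0": b"\x00"})
mapped = load_via_pool(cfg, Granted.RO, shared, dict)
```

Raising on the unexpected state, rather than falling back to a disk load, keeps a misbehaving pool from silently turning into per-instance copies.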

Adjacent (sub-PR, ~15 lines): remove the redundant MXCheckpointLoader.p2p_succeeded property; is_weights_preloaded() (the abstract hook from BaseCheckpointLoader) is now the single accessor. Tests updated accordingly.
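The shape of that cleanup, with one abstract hook as the only accessor, can be sketched as below. The class names, the private flag, and `record_transfer` are hypothetical, and the docstring's reading of what the hook means is an assumption.

```python
from abc import ABC, abstractmethod


class BaseLoaderSketch(ABC):
    """Stand-in for the abstract base; only the hook discussed here is modeled."""

    @abstractmethod
    def is_weights_preloaded(self) -> bool:
        """Assumed meaning: True when weights are already resident."""


class MXLoaderSketch(BaseLoaderSketch):
    def __init__(self) -> None:
        self._transfer_ok = False  # hypothetical private flag

    def record_transfer(self, ok: bool) -> None:
        """Remember whether the peer-to-peer transfer succeeded."""
        self._transfer_ok = ok

    def is_weights_preloaded(self) -> bool:
        # Single accessor: no parallel `p2p_succeeded` property to keep in sync.
        return self._transfer_ok


loader = MXLoaderSketch()
before = loader.is_weights_preloaded()
loader.record_transfer(True)
after = loader.is_weights_preloaded()
```

Callers and tests then query one method regardless of which loader they hold.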

Out of scope:

Networking and MX integration. The MX checkpoint format, already merged in #13531 ([TRTLLM-11851][feat] Add MX-only P2P checkpoint loading support for TRTLLM), composes orthogonally with LoadFormat.GMS.
Packaging the gpu_memory_service dependency. Same OSS-allowlist concern as MX; users install it manually for now.

Test Coverage

New unit test files (mock-based, CPU CI):

tests/unittest/llmapi/test_gms_args.py — GmsConfig validation and LoadFormat.GMS Pydantic surface.
tests/unittest/_torch/memory/test_gms_backend.py — GMSBackend lifecycle and helpers.
tests/unittest/_torch/pyexecutor/test_model_loader_gms.py — ModelLoader GMS RW / RO / failure / edge-case branches.
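One common way to exercise a lazily imported optional dependency on CPU-only CI is to plant a fake module in `sys.modules`. The sketch below assumes that approach; `open_backend` and the `connect` entry point are hypothetical, and the tests in this PR may be structured differently.

```python
import importlib
import sys
import types
from unittest import mock


def open_backend(socket_path: str):
    """Import the optional dependency only when the GMS path is taken."""
    gms = importlib.import_module("gpu_memory_service")
    # `connect` is a hypothetical entry point used only for this sketch.
    return gms.connect(socket_path)


# Build a fake module so the test needs neither a GPU nor the real package.
fake_gms = types.ModuleType("gpu_memory_service")
fake_gms.connect = mock.Mock(return_value="fake-client")

# patch.dict restores sys.modules when the block exits.
with mock.patch.dict(sys.modules, {"gpu_memory_service": fake_gms}):
    client = open_backend("/tmp/gms.sock")

fake_gms.connect.assert_called_once_with("/tmp/gms.sock")
```

Because the import happens inside the function, code paths that never select GMS do not require the package to be installed.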

Updated:

tests/unittest/_torch/models/checkpoints/mx/test_mx_checkpoint_loader.py — six call sites switched to is_weights_preloaded() after removing the redundant p2p_succeeded property.

PR Checklist

Please review the following before submitting your PR:

PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

Signed-off-by: Chien-Chun Hung <2679986+chienchunhung@users.noreply.github.com>

chienchunhung · 2026-05-09T00:29:31Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-05-09T00:35:24Z

PR_Github #47456 [ run ] triggered by Bot. Commit: 03ddff2 Link to invocation

[TRTLLM-12440][feat] Add GMS-only weight sharing support

03ddff2

Signed-off-by: Chien-Chun Hung <2679986+chienchunhung@users.noreply.github.com>

github-actions Bot assigned chienchunhung May 9, 2026

chienchunhung mentioned this pull request May 9, 2026

[None][test] Add checkpoint_format / load_format keys to test_features_contract #13933 (open, 2 tasks)

chienchunhung wants to merge 1 commit into NVIDIA:main from chienchunhung:trtllm-12440-gms-only
