
[None][doc] Add guide for integrating custom kernels in PyTorch backend#13917

Open
chang-l wants to merge 1 commit into NVIDIA:main from chang-l:doc/custom-kernels-guide

Conversation

@chang-l
Collaborator

@chang-l chang-l commented May 8, 2026

Summary

Adds a developer guide at docs/source/torch/adding_custom_kernels.md that walks through adding or modifying custom GPU kernels and exposing them through the TensorRT-LLM PyTorch backend. Covers source/JIT integration paths only — the trtllm-gen / pre-built CUBIN path is intentionally out of scope.

The guide covers three flavors:

  • CUDA C++ kernels under cpp/tensorrt_llm/kernels/ with Torch op bindings under cpp/tensorrt_llm/thop/ (the TORCH_LIBRARY_FRAGMENT / TORCH_LIBRARY_IMPL pattern).
  • CuTe DSL JIT kernels under tensorrt_llm/_torch/cute_dsl_kernels/ (with Blackwell variants under blackwell/).
  • cuTile JIT kernels under tensorrt_llm/_torch/cuda_tile_kernels/.

It includes a worked example based on indexer_k_cache_scatter_op (introduced in #8960), a testing checklist, and a list of common mistakes.
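For orientation, the two-step registration pattern named above (TORCH_LIBRARY_FRAGMENT declares a schema once; TORCH_LIBRARY_IMPL binds a backend-specific kernel) can be modeled in a few lines of plain Python. This is an illustrative toy, not the actual PyTorch C++ API, and the `scale` op and its schema string are invented for the example:

```python
# Toy model of the TORCH_LIBRARY_FRAGMENT / TORCH_LIBRARY_IMPL split:
# declare an op's schema once, then register per-backend implementations
# under separate dispatch keys. The real pattern lives in C++ under
# cpp/tensorrt_llm/thop/.

class OpLibrary:
    """Toy stand-in for the trtllm op namespace."""

    def __init__(self):
        self.schemas = {}  # op name -> schema string
        self.impls = {}    # (op name, backend) -> callable

    def define(self, name, schema):
        # Mirrors TORCH_LIBRARY_FRAGMENT(trtllm, m): declare the schema.
        self.schemas[name] = schema

    def impl(self, name, backend, fn):
        # Mirrors TORCH_LIBRARY_IMPL(trtllm, CUDA, m): bind one backend.
        if name not in self.schemas:
            raise KeyError(f"op '{name}' has no schema; define it first")
        self.impls[(name, backend)] = fn

    def call(self, name, backend, *args):
        return self.impls[(name, backend)](*args)

trtllm = OpLibrary()
trtllm.define("scale", "scale(Tensor x, float s) -> Tensor")
trtllm.impl("scale", "CPU", lambda x, s: [v * s for v in x])
print(trtllm.call("scale", "CPU", [1.0, 2.0], 3.0))  # [3.0, 6.0]
```

The split matters because the same schema can later gain additional dispatch-key registrations (e.g. a Meta/fake implementation) without re-declaring the op.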

Description

The PyTorch backend already has many custom-op integrations across CUDA, CuTe DSL, and cuTile kernels, but there is currently no single contributor-facing entry point that explains how to add one. New contributors typically rediscover the same patterns by reading existing code. This guide consolidates those patterns into one place so that an external contributor can:

  • pick the right flavor for their kernel,
  • locate the right directory and binding pattern in the repo,
  • understand the registration/fake-op contract, and
  • ship a kernel with adequate test coverage.
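The registration/fake-op contract mentioned above boils down to: the fake (meta) implementation must reproduce the real op's shape/dtype arithmetic without touching any data, so tracing and compilation can infer output metadata. A minimal plain-Python model of that contract (not the actual torch.library API; 2-D matmul is chosen only as a familiar shape rule):

```python
def real_matmul(a, b):
    # The "real" kernel stand-in: actually computes values.
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def fake_matmul_shape(a_shape, b_shape):
    # The fake/meta counterpart: the same shape arithmetic as the real
    # op, but it only returns output metadata and never reads data.
    m, k = a_shape
    k2, n = b_shape
    if k != k2:
        raise ValueError("inner dimensions must match")
    return (m, n)

# The contract: the fake shape must agree with the real op's output.
out = real_matmul([[1, 2], [3, 4]], [[5], [6]])
assert (len(out), len(out[0])) == fake_matmul_shape((2, 2), (2, 1))
```

A fake implementation that diverges from the real shape rule typically surfaces as silent shape mismatches under torch.compile rather than an immediate error, which is why the guide treats it as part of the registration contract.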

The guide is grounded in real, in-tree examples (no invented APIs):

  • cpp/tensorrt_llm/kernels/IndexerKCacheScatter.{h,cu} and cpp/tensorrt_llm/thop/IndexerKCacheScatterOp.cpp for the CUDA C++ path.
  • tensorrt_llm/_torch/cute_dsl_kernels/argmax.py and the Blackwell GEMM kernels for the CuTe DSL path.
  • tensorrt_llm/_torch/cuda_tile_kernels/rms_norm.py and tensorrt_llm/_torch/custom_ops/cuda_tile_custom_ops.py for the cuTile path.
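The conditional-import pattern referenced for the cuTile path (optional JIT toolkit imported under try/except, with an exported availability flag) can be sketched as follows. `some_jit_toolkit` is a placeholder module name, not a real dependency, and the error message is invented for illustration:

```python
# Sketch of the availability-flag pattern for optional JIT backends:
# import the toolkit if present, export a flag, and fail with a clear
# error (rather than an ImportError at call time) when it is absent.
try:
    import some_jit_toolkit  # placeholder for the CuTe DSL / cuTile imports
    HAS_JIT_TOOLKIT = True
except ImportError:
    some_jit_toolkit = None
    HAS_JIT_TOOLKIT = False

def rms_norm(x):
    # Wrapper exposed unconditionally; gated on the flag at call time.
    if not HAS_JIT_TOOLKIT:
        raise RuntimeError("JIT toolkit not installed; rms_norm unavailable")
    return some_jit_toolkit.rms_norm(x)
```

The same flag doubles as a pytest skip condition, so unit tests degrade gracefully on machines without the optional toolkit.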

The new doc is placed next to the other PyTorch backend developer docs (adding_new_model.md, attention.md, arch_overview.md, kv_cache_manager.md, scheduler.md). It is not wired into docs/source/index.rst — none of the existing torch/*.md files are. Wiring the whole torch/ set into the toctree (or moving this one under developer-guide/) is left as a follow-up for maintainers if they want it indexed.

Test Coverage

Documentation-only change; no code paths exercised. Verified that:

  • Every file path / link in the guide resolves in the current repo (IndexerKCacheScatter.{h,cu}, IndexerKCacheScatterOp.cpp, dsa.py::_update_k_cache, cute_dsl_utils.py, cuda_tile_utils.py, cpp_custom_ops.py::_register_fake, test_indexer_k_cache_scatter_custom_op, etc.).
  • API references (TORCH_LIBRARY_FRAGMENT(trtllm, m), TORCH_LIBRARY_IMPL(trtllm, CUDA, m), @torch.library.custom_op("trtllm::..."), register_fake) match the actual usage in the repo.
  • Pre-commit hooks (codespell, EOF newline, etc.) all pass.

PR Checklist

  • PR description clearly explains what and why.
  • PR follows TRT-LLM coding guidelines.
  • Pre-commit hooks pass locally.
  • No code changes; CODEOWNERS updates not required.
  • Documentation updated as part of this PR (it is the PR).

Summary by CodeRabbit

Release Notes

  • Documentation
    • Added comprehensive guide on integrating custom GPU kernels through the PyTorch backend, including implementation approaches, end-to-end integration requirements, concrete examples, and troubleshooting guidance.

New developer guide at docs/source/torch/adding_custom_kernels.md that
walks through adding or modifying custom GPU kernels and exposing them
through the PyTorch backend. Covers source/JIT integration paths:

- CUDA C++ kernels under cpp/tensorrt_llm/kernels/ with Torch op
  bindings under cpp/tensorrt_llm/thop/ (TORCH_LIBRARY_FRAGMENT /
  TORCH_LIBRARY_IMPL pattern).
- CuTe DSL JIT kernels under tensorrt_llm/_torch/cute_dsl_kernels/.
- cuTile JIT kernels under tensorrt_llm/_torch/cuda_tile_kernels/.

Includes a worked example based on indexer_k_cache_scatter_op
(introduced in NVIDIA#8960), a testing checklist, and a list of common
mistakes. The trtllm-gen / pre-built CUBIN integration path is out
of scope for this guide.

Signed-off-by: Chang Liu <9713593+chang-l@users.noreply.github.com>
@chang-l chang-l requested a review from a team as a code owner May 8, 2026 19:46
@chang-l chang-l requested review from nv-guomingz and venkywonka May 8, 2026 19:46
@coderabbitai
Contributor

coderabbitai Bot commented May 8, 2026

Review Change Stack
No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 86fe0d41-6101-4986-8e26-6a633f730b45

📥 Commits

Reviewing files that changed from the base of the PR and between f8572ab and 45814fa.

📒 Files selected for processing (1)
  • docs/source/torch/adding_custom_kernels.md

📝 Walkthrough

Walkthrough

A new documentation guide for TensorRT-LLM developers explains how to integrate custom GPU kernels as torch.ops.trtllm operations. It covers both CUDA C++ and JIT kernel paths (CuTe/cuTile), includes a concrete indexer_k_cache_scatter_op walkthrough, and provides testing and validation checklists with common mistake avoidance.

Changes

Custom Kernel Integration Guide

All changes are in docs/source/torch/adding_custom_kernels.md:

  • Title and Overview: introduces the TensorRT-LLM custom kernel integration scope and lists the major sections in a table of contents.
  • Conceptual Foundation: defines custom kernels as GPU code wrapped by Torch custom ops exposed via torch.ops.trtllm.<name> and identifies four required deliverables: kernel, C++ binding, Python integration call site, and unit tests.
  • CUDA C++ Implementation Path: details CUDA C++ kernel placement, C++ Torch op binding structure and registration, Python calling conventions, fake/meta registration for shape inference, and unit test location patterns.
  • JIT Kernel Implementation Path: details CuTe DSL and cuTile JIT kernel locations, availability flags, skeleton patterns, Torch custom op wrapping with conditional imports, and runtime caveats including compute capability gating, DLPack/CUDA Graph stream handling, and compile-cache keying.
  • Concrete Walkthrough: end-to-end example of indexer_k_cache_scatter_op demonstrating kernel declaration, C++ binding/registration, PyTorch integration call site, and unit test structure.
  • Testing and Validation: specifies minimum unit-test coverage for shapes, dtypes, devices, and contiguity; numerical correctness guidance; optional performance sanity checks; hardware availability skipping; and LLM_MODELS_ROOT setup instructions.
  • Common Mistakes Reference: a checklist of seven common pitfalls, including missing op import/registration, contiguity assumptions, missing CMake entries, schema mismatches, missing binding checks, missing fake/meta registration, and incorrect JIT cache keys.
🎯 Review effort: 2 (Simple) | ⏱️ Estimated review time: ~12 minutes

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
  • Title check ✅ Passed: the title clearly and concisely describes the primary change, adding a developer guide for integrating custom kernels in the PyTorch backend.
  • Description check ✅ Passed: the PR description comprehensively covers the purpose, scope, implementation details, testing approach, and verification steps, fully satisfying the template requirements.
  • Docstring Coverage ✅ Passed: no functions found in the changed files, so the docstring coverage check was skipped.
  • Linked Issues check ✅ Passed: skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check ✅ Passed: skipped because no linked issues were found for this pull request.





