
[None][doc] Add guide for integrating custom kernels in PyTorch backend#13917

Open
chang-l wants to merge 1 commit into NVIDIA:main from chang-l:doc/custom-kernels-guide

Conversation

@chang-l
Collaborator

@chang-l chang-l commented May 8, 2026

Summary

Adds a developer guide at docs/source/torch/adding_custom_kernels.md that walks through adding or modifying custom GPU kernels and exposing them through the TensorRT-LLM PyTorch backend. Covers source/JIT integration paths only — the trtllm-gen / pre-built CUBIN path is intentionally out of scope.

The guide covers three flavors:

  • CUDA C++ kernels under cpp/tensorrt_llm/kernels/ with Torch op bindings under cpp/tensorrt_llm/thop/ (the TORCH_LIBRARY_FRAGMENT / TORCH_LIBRARY_IMPL pattern).
  • CuTe DSL JIT kernels under tensorrt_llm/_torch/cute_dsl_kernels/ (with Blackwell variants under blackwell/).
  • cuTile JIT kernels under tensorrt_llm/_torch/cuda_tile_kernels/.

It includes a worked example based on indexer_k_cache_scatter_op (introduced in #8960), a testing checklist, and a list of common mistakes.
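For orientation, the two-step registration pattern named above (TORCH_LIBRARY_FRAGMENT declares a schema once; TORCH_LIBRARY_IMPL binds a backend-specific kernel) can be modeled in a few lines of plain Python. This is an illustrative toy, not the actual PyTorch C++ API, and the `scale` op and its schema string are invented for the example:

```python
# Toy model of the TORCH_LIBRARY_FRAGMENT / TORCH_LIBRARY_IMPL split:
# declare an op's schema once, then register per-backend implementations
# under separate dispatch keys. The real pattern lives in C++ under
# cpp/tensorrt_llm/thop/.

class OpLibrary:
    """Toy stand-in for the trtllm op namespace."""

    def __init__(self):
        self.schemas = {}  # op name -> schema string
        self.impls = {}    # (op name, backend) -> callable

    def define(self, name, schema):
        # Mirrors TORCH_LIBRARY_FRAGMENT(trtllm, m): declare the schema.
        self.schemas[name] = schema

    def impl(self, name, backend, fn):
        # Mirrors TORCH_LIBRARY_IMPL(trtllm, CUDA, m): bind one backend.
        if name not in self.schemas:
            raise KeyError(f"op '{name}' has no schema; define it first")
        self.impls[(name, backend)] = fn

    def call(self, name, backend, *args):
        return self.impls[(name, backend)](*args)

trtllm = OpLibrary()
trtllm.define("scale", "scale(Tensor x, float s) -> Tensor")
trtllm.impl("scale", "CPU", lambda x, s: [v * s for v in x])
print(trtllm.call("scale", "CPU", [1.0, 2.0], 3.0))  # [3.0, 6.0]
```

The split matters because the same schema can later gain additional dispatch-key registrations (e.g. a Meta/fake implementation) without re-declaring the op.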

Description

The PyTorch backend already has many custom-op integrations across CUDA, CuTe DSL, and cuTile kernels, but there is currently no single contributor-facing entry point that explains how to add one. New contributors typically rediscover the same patterns by reading existing code. This guide consolidates those patterns into one place so that an external contributor can:

  • pick the right flavor for their kernel,
  • locate the right directory and binding pattern in the repo,
  • understand the registration/fake-op contract, and
  • ship a kernel with adequate test coverage.
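The registration/fake-op contract mentioned above boils down to: the fake (meta) implementation must reproduce the real op's shape/dtype arithmetic without touching any data, so tracing and compilation can infer output metadata. A minimal plain-Python model of that contract (not the actual torch.library API; 2-D matmul is chosen only as a familiar shape rule):

```python
def real_matmul(a, b):
    # The "real" kernel stand-in: actually computes values.
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def fake_matmul_shape(a_shape, b_shape):
    # The fake/meta counterpart: the same shape arithmetic as the real
    # op, but it only returns output metadata and never reads data.
    m, k = a_shape
    k2, n = b_shape
    if k != k2:
        raise ValueError("inner dimensions must match")
    return (m, n)

# The contract: the fake shape must agree with the real op's output.
out = real_matmul([[1, 2], [3, 4]], [[5], [6]])
assert (len(out), len(out[0])) == fake_matmul_shape((2, 2), (2, 1))
```

A fake implementation that diverges from the real shape rule typically surfaces as silent shape mismatches under torch.compile rather than an immediate error, which is why the guide treats it as part of the registration contract.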

The guide is grounded in real, in-tree examples (no invented APIs):

  • cpp/tensorrt_llm/kernels/IndexerKCacheScatter.{h,cu} and cpp/tensorrt_llm/thop/IndexerKCacheScatterOp.cpp for the CUDA C++ path.
  • tensorrt_llm/_torch/cute_dsl_kernels/argmax.py and the Blackwell GEMM kernels for the CuTe DSL path.
  • tensorrt_llm/_torch/cuda_tile_kernels/rms_norm.py and tensorrt_llm/_torch/custom_ops/cuda_tile_custom_ops.py for the cuTile path.
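The conditional-import pattern referenced for the cuTile path (optional JIT toolkit imported under try/except, with an exported availability flag) can be sketched as follows. `some_jit_toolkit` is a placeholder module name, not a real dependency, and the error message is invented for illustration:

```python
# Sketch of the availability-flag pattern for optional JIT backends:
# import the toolkit if present, export a flag, and fail with a clear
# error (rather than an ImportError at call time) when it is absent.
try:
    import some_jit_toolkit  # placeholder for the CuTe DSL / cuTile imports
    HAS_JIT_TOOLKIT = True
except ImportError:
    some_jit_toolkit = None
    HAS_JIT_TOOLKIT = False

def rms_norm(x):
    # Wrapper exposed unconditionally; gated on the flag at call time.
    if not HAS_JIT_TOOLKIT:
        raise RuntimeError("JIT toolkit not installed; rms_norm unavailable")
    return some_jit_toolkit.rms_norm(x)
```

The same flag doubles as a pytest skip condition, so unit tests degrade gracefully on machines without the optional toolkit.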

The new doc is placed next to the other PyTorch backend developer docs (adding_new_model.md, attention.md, arch_overview.md, kv_cache_manager.md, scheduler.md). It is not wired into docs/source/index.rst — none of the existing torch/*.md files are. Wiring the whole torch/ set into the toctree (or moving this one under developer-guide/) is left as a follow-up for maintainers if they want it indexed.

Test Coverage

Documentation-only change; no code paths exercised. Verified that:

  • Every file path / link in the guide resolves in the current repo (IndexerKCacheScatter.{h,cu}, IndexerKCacheScatterOp.cpp, dsa.py::_update_k_cache, cute_dsl_utils.py, cuda_tile_utils.py, cpp_custom_ops.py::_register_fake, test_indexer_k_cache_scatter_custom_op, etc.).
  • API references (TORCH_LIBRARY_FRAGMENT(trtllm, m), TORCH_LIBRARY_IMPL(trtllm, CUDA, m), @torch.library.custom_op("trtllm::..."), register_fake) match the actual usage in the repo.
  • Pre-commit hooks (codespell, EOF newline, etc.) all pass.

PR Checklist

  • PR description clearly explains what and why.
  • PR follows TRT-LLM coding guidelines.
  • Pre-commit hooks pass locally.
  • No code changes; CODEOWNERS updates not required.
  • Documentation updated as part of this PR (it is the PR).

Summary by CodeRabbit

Release Notes

  • Documentation
    • Added comprehensive guide on integrating custom GPU kernels through the PyTorch backend, including implementation approaches, end-to-end integration requirements, concrete examples, and troubleshooting guidance.

New developer guide at docs/source/torch/adding_custom_kernels.md that
walks through adding or modifying custom GPU kernels and exposing them
through the PyTorch backend. Covers source/JIT integration paths:

- CUDA C++ kernels under cpp/tensorrt_llm/kernels/ with Torch op
  bindings under cpp/tensorrt_llm/thop/ (TORCH_LIBRARY_FRAGMENT /
  TORCH_LIBRARY_IMPL pattern).
- CuTe DSL JIT kernels under tensorrt_llm/_torch/cute_dsl_kernels/.
- cuTile JIT kernels under tensorrt_llm/_torch/cuda_tile_kernels/.

Includes a worked example based on indexer_k_cache_scatter_op
(introduced in NVIDIA#8960), a testing checklist, and a list of common
mistakes. The trtllm-gen / pre-built CUBIN integration path is out
of scope for this guide.

Signed-off-by: Chang Liu <9713593+chang-l@users.noreply.github.com>
@chang-l chang-l requested a review from a team as a code owner May 8, 2026 19:46
@chang-l chang-l requested review from nv-guomingz and venkywonka May 8, 2026 19:46
@coderabbitai
Contributor

coderabbitai Bot commented May 8, 2026

Review Change Stack
No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 86fe0d41-6101-4986-8e26-6a633f730b45

📥 Commits

Reviewing files that changed from the base of the PR and between f8572ab and 45814fa.

📒 Files selected for processing (1)
  • docs/source/torch/adding_custom_kernels.md

📝 Walkthrough

Walkthrough

A new documentation guide for TensorRT-LLM developers explains how to integrate custom GPU kernels as torch.ops.trtllm operations. It covers both CUDA C++ and JIT kernel paths (CuTe/cuTile), includes a concrete indexer_k_cache_scatter_op walkthrough, and provides testing and validation checklists with common mistake avoidance.

Changes

Custom Kernel Integration Guide

All changes are in docs/source/torch/adding_custom_kernels.md:

  • Title and Overview: introduces the TensorRT-LLM custom kernel integration scope and lists the major sections in a table of contents.
  • Conceptual Foundation: defines custom kernels as GPU code wrapped by Torch custom ops exposed via torch.ops.trtllm.<name> and identifies four required deliverables: kernel, C++ binding, Python integration call site, and unit tests.
  • CUDA C++ Implementation Path: details CUDA C++ kernel placement, C++ Torch op binding structure and registration, Python calling conventions, fake/meta registration for shape inference, and unit test location patterns.
  • JIT Kernel Implementation Path: details CuTe DSL and cuTile JIT kernel locations, availability flags, skeleton patterns, Torch custom op wrapping with conditional imports, and runtime caveats including compute capability gating, DLPack/CUDA Graph stream handling, and compile-cache keying.
  • Concrete Walkthrough: end-to-end example of indexer_k_cache_scatter_op demonstrating kernel declaration, C++ binding/registration, PyTorch integration call site, and unit test structure.
  • Testing and Validation: specifies minimum unit-test coverage for shapes, dtypes, devices, and contiguity; numerical correctness guidance; optional performance sanity checks; hardware availability skipping; and LLM_MODELS_ROOT setup instructions.
  • Common Mistakes Reference: a checklist of seven common pitfalls, including missing op import/registration, contiguity assumptions, missing CMake entries, schema mismatches, missing binding checks, missing fake/meta registration, and incorrect JIT cache keys.
🎯 Review effort: 2 (Simple) | ⏱️ Estimated review time: ~12 minutes

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
  • Title check ✅ Passed: the title clearly and concisely describes the primary change, adding a developer guide for integrating custom kernels in the PyTorch backend.
  • Description check ✅ Passed: the PR description comprehensively covers the purpose, scope, implementation details, testing approach, and verification steps, fully satisfying the template requirements.
  • Docstring Coverage ✅ Passed: no functions found in the changed files, so the docstring coverage check was skipped.
  • Linked Issues check ✅ Passed: skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check ✅ Passed: skipped because no linked issues were found for this pull request.





