[None][doc] Add guide for integrating custom kernels in PyTorch backend #13917
chang-l wants to merge 1 commit into NVIDIA:main
New developer guide at docs/source/torch/adding_custom_kernels.md that walks through adding or modifying custom GPU kernels and exposing them through the PyTorch backend. Covers source/JIT integration paths:

- CUDA C++ kernels under cpp/tensorrt_llm/kernels/ with Torch op bindings under cpp/tensorrt_llm/thop/ (TORCH_LIBRARY_FRAGMENT / TORCH_LIBRARY_IMPL pattern).
- CuTe DSL JIT kernels under tensorrt_llm/_torch/cute_dsl_kernels/.
- cuTile JIT kernels under tensorrt_llm/_torch/cuda_tile_kernels/.

Includes a worked example based on indexer_k_cache_scatter_op (introduced in NVIDIA#8960), a testing checklist, and a list of common mistakes. The trtllm-gen / pre-built CUBIN integration path is out of scope for this guide.

Signed-off-by: Chang Liu <9713593+chang-l@users.noreply.github.com>
Summary
Adds a developer guide at `docs/source/torch/adding_custom_kernels.md` that walks through adding or modifying custom GPU kernels and exposing them through the TensorRT-LLM PyTorch backend. Covers source/JIT integration paths only — the trtllm-gen / pre-built CUBIN path is intentionally out of scope.

The guide covers three flavors:

- CUDA C++ kernels under `cpp/tensorrt_llm/kernels/` with Torch op bindings under `cpp/tensorrt_llm/thop/` (the `TORCH_LIBRARY_FRAGMENT` / `TORCH_LIBRARY_IMPL` pattern).
- CuTe DSL JIT kernels under `tensorrt_llm/_torch/cute_dsl_kernels/` (with Blackwell variants under `blackwell/`).
- cuTile JIT kernels under `tensorrt_llm/_torch/cuda_tile_kernels/`.

It includes a worked example based on `indexer_k_cache_scatter_op` (introduced in #8960), a testing checklist, and a list of common mistakes.
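For orientation, here is a minimal, hedged sketch of the Python-side registration pattern the two JIT flavors share: a `@torch.library.custom_op` in the `trtllm` namespace plus a fake implementation. The op name `my_rms_norm` and its body are illustrative placeholders, not identifiers from the guide:

```python
import torch

# Illustrative sketch only: "my_rms_norm" is a made-up op, not one of the
# kernels from the guide. The registration shape, however, follows the
# @torch.library.custom_op("trtllm::...") + fake-impl pattern.
@torch.library.custom_op("trtllm::my_rms_norm", mutates_args=())
def my_rms_norm(x: torch.Tensor, weight: torch.Tensor, eps: float) -> torch.Tensor:
    # A real implementation would launch a CuTe DSL or cuTile JIT kernel;
    # a plain PyTorch fallback keeps this sketch self-contained.
    var = x.float().pow(2).mean(-1, keepdim=True)
    return (x.float() * torch.rsqrt(var + eps)).to(x.dtype) * weight

@my_rms_norm.register_fake
def _(x, weight, eps):
    # Shape/dtype-only "fake" so the op is traceable by torch.compile
    # and usable with meta tensors, without running the kernel.
    return torch.empty_like(x)
```

Once registered, the op is called as `torch.ops.trtllm.my_rms_norm(x, w, 1e-6)`; the fake is what lets the compile and fake-tensor machinery treat it as an opaque op.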
Description
The PyTorch backend already has many custom-op integrations across CUDA, CuTe DSL, and cuTile kernels, but there is currently no single contributor-facing entry point that explains how to add one. New contributors typically rediscover the same patterns by reading existing code. This guide consolidates those patterns into one place so that an external contributor can add a kernel and expose it through the backend end to end.
The guide is grounded in real, in-tree examples (no invented APIs):
- `cpp/tensorrt_llm/kernels/IndexerKCacheScatter.{h,cu}` and `cpp/tensorrt_llm/thop/IndexerKCacheScatterOp.cpp` for the CUDA C++ path.
- `tensorrt_llm/_torch/cute_dsl_kernels/argmax.py` and the Blackwell GEMM kernels for the CuTe DSL path.
- `tensorrt_llm/_torch/cuda_tile_kernels/rms_norm.py` and `tensorrt_llm/_torch/custom_ops/cuda_tile_custom_ops.py` for the cuTile path.

The new doc is placed next to the other PyTorch backend developer docs (`adding_new_model.md`, `attention.md`, `arch_overview.md`, `kv_cache_manager.md`, `scheduler.md`). It is not wired into `docs/source/index.rst` — none of the existing `torch/*.md` files are. Wiring the whole `torch/` set into the toctree (or moving this one under `developer-guide/`) is left as a follow-up for maintainers if they want it indexed.

Test Coverage
Documentation-only change; no code paths exercised. Verified that:

- All file paths and symbols cited in the guide exist in the tree (`IndexerKCacheScatter.{h,cu}`, `IndexerKCacheScatterOp.cpp`, `dsa.py::_update_k_cache`, `cute_dsl_utils.py`, `cuda_tile_utils.py`, `cpp_custom_ops.py::_register_fake`, `test_indexer_k_cache_scatter_custom_op`, etc.).
- The registration patterns shown (`TORCH_LIBRARY_FRAGMENT(trtllm, m)`, `TORCH_LIBRARY_IMPL(trtllm, CUDA, m)`, `@torch.library.custom_op("trtllm::...")`, `register_fake`) match the actual usage in the repo; a sketch of that C++-schema/Python-fake split appears below.
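To make the second point concrete, here is a self-contained, hedged sketch of the split between a C++-defined schema and a Python fake. The op `scatter_example`, its schema, and the CUDA body are invented for illustration, and `torch.library.define` stands in for the C++ `TORCH_LIBRARY_FRAGMENT(trtllm, m)` block so the snippet runs on its own:

```python
import torch

# Self-contained sketch of the cpp_custom_ops.py-style pattern: the schema
# normally comes from the C++ TORCH_LIBRARY_FRAGMENT(trtllm, m) block, so
# torch.library.define stands in for it here. All names are placeholders.
torch.library.define(
    "trtllm::scatter_example",
    "(Tensor(a!) k_cache, Tensor k, Tensor slots) -> ()",
)

@torch.library.impl("trtllm::scatter_example", "CUDA")
def _scatter_example_cuda(k_cache, k, slots):
    # Stand-in for the real CUDA kernel launch bound under
    # cpp/tensorrt_llm/thop/ via TORCH_LIBRARY_IMPL(trtllm, CUDA, m).
    k_cache.index_copy_(0, slots, k)

@torch.library.register_fake("trtllm::scatter_example")
def _(k_cache, k, slots):
    # The op mutates k_cache in place and returns nothing, so the fake
    # does no work; it only teaches the dispatcher the output signature.
    return None
```

`torch.library.opcheck` can then be run against such an op to validate the schema and fake registrations together.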
PR Checklist