
[Performance] Add compile integration for Triton RNN kernels #3740

Open
vmoens wants to merge 8 commits into pytorch:main from vmoens:rnn-cuda-kernel-2

Conversation

@vmoens (Collaborator) commented May 12, 2026

Summary

  • Wrap the low-level GRU/LSTM Triton forward and backward launches in torch.library.custom_op so that torch.compile sees opaque, traceable operators (a minimal sketch follows this list).
  • Register fake/meta kernels and conservative vmap rules for the custom ops.
  • Extend the RNN tests with CUDA coverage for compiled forward/backward and for vmap-vs-loop parity.
  • Extend the reset-backend benchmark with a --cudagraph option and document observed compile/cudagraph behavior.
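
A minimal sketch of the custom_op/fake-kernel pattern from the first two bullets is below. The operator name, signature, and body are illustrative stand-ins, not the actual code: the real operators wrap the GRU/LSTM Triton launches in torchrl/modules/tensordict_module/_rnn_triton.py.

```python
import torch


# Stand-in operator: the real ops wrap the Triton forward/backward launches.
# Assumed toy shapes: x is (T, B, H), h0 is (B, H).
@torch.library.custom_op("torchrl_demo::gru_fwd", mutates_args=())
def gru_fwd_demo(x: torch.Tensor, h0: torch.Tensor) -> torch.Tensor:
    # In the real operator this body launches the Triton forward kernel;
    # torch.compile records the call as a single opaque node rather than
    # tracing into the kernel-launch machinery.
    return x + h0.unsqueeze(0)  # placeholder computation, not the RNN math


@gru_fwd_demo.register_fake
def _(x, h0):
    # Fake/meta kernel: describe output shape/dtype only, with no device work,
    # so FakeTensor tracing (and hence torch.compile) can propagate metadata
    # through the op without a CUDA/Triton build.
    return torch.empty_like(x)
```

With a fake kernel registered, code that calls the op can typically be captured by ``torch.compile(..., fullgraph=True)`` without graph breaks, which is the compatibility win described under Notes below.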

Testing

  • git diff --cached --check before commit
  • python -m py_compile benchmarks/bench_gru_reset_backends.py torchrl/modules/tensordict_module/_rnn_triton.py torchrl/modules/tensordict_module/rnn.py test/test_tensordictmodules.py
  • PYTHONPATH=. pytest test/test_tensordictmodules.py -k "triton and (gru_module or lstm_module or custom_op_compile or custom_op_vmap)"
    • Local result: 2 passed, 29 skipped, 260 deselected; CUDA/Triton tests skipped on this machine.

Notes

  • The vmap registration intentionally uses map semantics and launches one Triton call per mapped slice (see the sketch after this list).
  • The Triton custom op remains opaque to compile; the main win is compatibility/fullgraph capture rather than fusing through the kernel body.
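
A rough sketch of such a map-semantics vmap rule, continuing the hypothetical ``gru_fwd_demo`` operator from the Summary above and assuming a PyTorch recent enough to provide ``torch.library.register_vmap`` (the real registrations in ``_rnn_triton.py`` do the equivalent over their own argument lists):

```python
import torch


@torch.library.register_vmap("torchrl_demo::gru_fwd")
def _(info, in_dims, x, h0):
    # Map semantics: peel the mapped dimension off, call the opaque op once per
    # slice (one Triton launch each in the real kernels), then restack.
    x_dim, h0_dim = in_dims
    xs = x.movedim(x_dim, 0) if x_dim is not None else [x] * info.batch_size
    hs = h0.movedim(h0_dim, 0) if h0_dim is not None else [h0] * info.batch_size
    outs = [gru_fwd_demo(xi, hi) for xi, hi in zip(xs, hs)]
    return torch.stack(outs), 0  # output batched along dim 0
```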

pytorch-bot Bot commented May 12, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/rl/3740

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEV

There is 1 currently active SEV. If your PR is affected, please view it below:

❌ 4 New Failures

As of commit 5e750e8 with merge base 3df2f4a:

NEW FAILURES - The following jobs have failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

meta-cla bot added the CLA Signed label on May 12, 2026
github-actions bot added the Performance, Benchmarks, Modules, and Integrations/torch_geometric labels on May 12, 2026
vmoens added 7 commits May 12, 2026 14:26
- Extract the validate/flatten/fallback/unflatten scaffolding into
  ``_vmap_backward_via_flatten``; GRU and LSTM backward vmap rules only
  need a per-op ``_invoke`` closure that unpacks the flattened args,
  rebuilds ``shapes`` from them, and calls the impl (a rough sketch of this pattern follows below).
- Add a comment in ``_gru_backward_impl`` / ``_lstm_backward_impl``
  explaining why the vmap path uses ``bmm`` for per-V weight reductions
  but keeps ``dgates_x_flat`` flat for the shared-weight ``dx`` matmul.
- Mark the ``B % V`` check as a defensive guardrail.
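
A loose sketch of the shared scaffolding and per-op closure split described above; the argument handling and shape bookkeeping are simplified stand-ins for what ``_rnn_triton.py`` actually does:

```python
import torch


def _vmap_backward_via_flatten_sketch(info, in_dims, args, invoke):
    # Shared scaffolding: normalise batch dims, flatten the vmap dim into the
    # batch dim, hand off to the per-op closure, then unflatten the result.
    flat_args = []
    for arg, dim in zip(args, in_dims):
        if dim is None:
            flat_args.append(arg)  # shared (unbatched) tensor, e.g. weights
        else:
            moved = arg.movedim(dim, 0)            # (V, B, ...)
            flat_args.append(moved.flatten(0, 1))  # (V * B, ...)
    # The per-op ``_invoke`` closure rebuilds ``shapes`` from the flattened
    # args and calls the GRU/LSTM backward impl.
    flat_out = invoke(*flat_args)
    return flat_out.unflatten(0, (info.batch_size, -1)), 0
```
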
Adds ``test_*_module_scan_vs_triton_under_vmap`` for both GRU and LSTM.
The scan backend goes through standard PyTorch op dispatch and has no
custom vmap rule, so it serves as a ground-truth reference for our
hand-rolled flatten/unflatten path in the triton custom_op. Covers both
forward and ``vmap(grad(loss))`` against the same shared-weight inputs.
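
The shape of that parity check, heavily simplified (``run_rnn`` is a hypothetical helper standing in for the GRU/LSTM module calls the real tests make):

```python
import torch
from torch.func import grad, vmap


def check_scan_vs_triton_under_vmap(run_rnn, x, h0):
    # Forward parity under vmap: scan (plain op dispatch) is the reference.
    y_scan = vmap(lambda xi: run_rnn("scan", xi, h0))(x)
    y_triton = vmap(lambda xi: run_rnn("triton", xi, h0))(x)
    torch.testing.assert_close(y_triton, y_scan)

    # vmap(grad(loss)) parity over the same shared-weight inputs.
    g_scan = vmap(grad(lambda xi: run_rnn("scan", xi, h0).sum()))(x)
    g_triton = vmap(grad(lambda xi: run_rnn("triton", xi, h0).sum()))(x)
    torch.testing.assert_close(g_triton, g_scan)
```
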
The ``custom_op`` family (``torch.library.custom_op`` / ``register_fake`` /
``register_autograd``) is the only autograd entry point we ship now; the
``_GRUFn`` / ``_LSTMFn`` ``autograd.Function`` mirrors only ran on
PyTorch < 2.4 builds, where the backend never advanced past prototype
anyway. ``_check_triton_available`` now also requires the custom_op API
so older PyTorch / Triton routes cleanly to scan/pad. Top-level
``gru_triton`` / ``lstm_triton`` raise a descriptive ``RuntimeError`` if
called when the backend is unavailable.

Net -149 LoC from the PR diff.
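
An illustrative gate in the spirit of the stricter ``_check_triton_available`` (the actual checks and error messages live in ``_rnn_triton.py``):

```python
import torch


def _triton_backend_available_sketch() -> bool:
    # CUDA, Triton, and the custom_op API all have to be present; otherwise
    # callers fall back to the scan/pad backends, and ``gru_triton`` /
    # ``lstm_triton`` raise a descriptive RuntimeError when forced.
    if not torch.cuda.is_available():
        return False
    try:
        import triton  # noqa: F401
    except ImportError:
        return False
    # ``custom_op`` / ``register_fake`` / ``register_autograd`` are the
    # PyTorch >= 2.4 entry points the backend now requires.
    return all(
        hasattr(torch.library, attr)
        for attr in ("custom_op", "register_fake", "register_autograd")
    )
```
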
# Conflicts:
#	torchrl/modules/tensordict_module/_rnn_triton.py
#	torchrl/modules/tensordict_module/rnn.py
PyTorch 2.13 nightlies ship ``torch.library.register_autograd`` in a state
where the auto-generated ``autograd.Function`` lacks ``setup_context``,
breaking ``vmap(grad(custom_op_call(...)))`` with:

    RuntimeError: ... must override the setup_context staticmethod ...

The same nightlies also assert ``False != True`` inside
``torch._higher_order_ops.scan`` when called through ``vmap(grad(...))``.
Both failures are upstream, not bugs in this PR.

Probe once at collection by trying a tiny ``vmap(grad(gru_triton(...)))``
call; skip the four affected tests when the probe fails. Forward-only
``vmap`` coverage in the same tests remains unconditional.
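
A simplified version of that probe-and-skip pattern; ``tiny_loss`` and ``example_input`` are placeholders for the minimal ``gru_triton`` loss and CUDA input the real probe builds:

```python
import torch
from torch.func import grad, vmap


def _vmap_grad_probe_passes(tiny_loss, example_input) -> bool:
    # Run a tiny vmap(grad(...)) once; both upstream failures described above
    # (missing setup_context -> RuntimeError, scan assert -> AssertionError)
    # surface here, and the affected tests get skipped.
    try:
        vmap(grad(tiny_loss))(example_input)
        return True
    except (RuntimeError, AssertionError):
        return False


# At collection time the test file runs the probe once and applies something
# like ``pytest.mark.skipif(not probe_ok, reason=...)`` to the four affected
# tests; forward-only vmap coverage stays unconditional.
```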