Skip to content

[WIP] Fix triton RNN kernel#3785

Open
vmoens wants to merge 9 commits into
mainfrom
fix-triton-rnn
Open

[WIP] Fix triton RNN kernel#3785
vmoens wants to merge 9 commits into
mainfrom
fix-triton-rnn

Conversation

@vmoens
Copy link
Copy Markdown
Collaborator

@vmoens vmoens commented May 21, 2026

No description provided.

@pytorch-bot
Copy link
Copy Markdown

pytorch-bot Bot commented May 21, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/rl/3785

Note: Links to docs will display an error until the docs builds have been completed.

❌ 8 New Failures, 2 Unrelated Failures

As of commit 79a2eb2 with merge base d596150 (image):

NEW FAILURES - The following jobs have failed:

BROKEN TRUNK - The following jobs failed but were present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla meta-cla Bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label May 21, 2026
@github-actions
Copy link
Copy Markdown
Contributor

ghost commented May 21, 2026

⚠️ PR Title Label Error

Unknown or invalid prefix [WIP].

Current title: [WIP] Fix triton RNN kernel

Supported Prefixes (case-sensitive)

Your PR title must start with exactly one of these prefixes:

Prefix Label Applied Example
[BugFix] BugFix [BugFix] Fix memory leak in collector
[Feature] Feature [Feature] Add new optimizer
[Doc] or [Docs] Documentation [Doc] Update installation guide
[Refactor] Refactoring [Refactor] Clean up module imports
[CI] CI [CI] Fix workflow permissions
[Test] or [Tests] Tests [Tests] Add unit tests for buffer
[Environment] or [Environments] Environments [Environments] Add Gymnasium support
[Data] Data [Data] Fix replay buffer sampling
[Performance] or [Perf] Performance [Performance] Optimize tensor ops
[BC-Breaking] bc breaking [BC-Breaking] Remove deprecated API
[Deprecation] Deprecation [Deprecation] Mark old function
[Quality] Quality [Quality] Fix typos and add codespell

Note: Common variations like singular/plural are supported (e.g., [Doc] or [Docs]).

4 similar comments
@github-actions
Copy link
Copy Markdown
Contributor

ghost commented May 22, 2026

⚠️ PR Title Label Error

Unknown or invalid prefix [WIP].

Current title: [WIP] Fix triton RNN kernel

Supported Prefixes (case-sensitive)

Your PR title must start with exactly one of these prefixes:

Prefix Label Applied Example
[BugFix] BugFix [BugFix] Fix memory leak in collector
[Feature] Feature [Feature] Add new optimizer
[Doc] or [Docs] Documentation [Doc] Update installation guide
[Refactor] Refactoring [Refactor] Clean up module imports
[CI] CI [CI] Fix workflow permissions
[Test] or [Tests] Tests [Tests] Add unit tests for buffer
[Environment] or [Environments] Environments [Environments] Add Gymnasium support
[Data] Data [Data] Fix replay buffer sampling
[Performance] or [Perf] Performance [Performance] Optimize tensor ops
[BC-Breaking] bc breaking [BC-Breaking] Remove deprecated API
[Deprecation] Deprecation [Deprecation] Mark old function
[Quality] Quality [Quality] Fix typos and add codespell

Note: Common variations like singular/plural are supported (e.g., [Doc] or [Docs]).

@github-actions
Copy link
Copy Markdown
Contributor

ghost commented May 22, 2026

⚠️ PR Title Label Error

Unknown or invalid prefix [WIP].

Current title: [WIP] Fix triton RNN kernel

Supported Prefixes (case-sensitive)

Your PR title must start with exactly one of these prefixes:

Prefix Label Applied Example
[BugFix] BugFix [BugFix] Fix memory leak in collector
[Feature] Feature [Feature] Add new optimizer
[Doc] or [Docs] Documentation [Doc] Update installation guide
[Refactor] Refactoring [Refactor] Clean up module imports
[CI] CI [CI] Fix workflow permissions
[Test] or [Tests] Tests [Tests] Add unit tests for buffer
[Environment] or [Environments] Environments [Environments] Add Gymnasium support
[Data] Data [Data] Fix replay buffer sampling
[Performance] or [Perf] Performance [Performance] Optimize tensor ops
[BC-Breaking] bc breaking [BC-Breaking] Remove deprecated API
[Deprecation] Deprecation [Deprecation] Mark old function
[Quality] Quality [Quality] Fix typos and add codespell

Note: Common variations like singular/plural are supported (e.g., [Doc] or [Docs]).

@github-actions
Copy link
Copy Markdown
Contributor

ghost commented May 23, 2026

⚠️ PR Title Label Error

Unknown or invalid prefix [WIP].

Current title: [WIP] Fix triton RNN kernel

Supported Prefixes (case-sensitive)

Your PR title must start with exactly one of these prefixes:

Prefix Label Applied Example
[BugFix] BugFix [BugFix] Fix memory leak in collector
[Feature] Feature [Feature] Add new optimizer
[Doc] or [Docs] Documentation [Doc] Update installation guide
[Refactor] Refactoring [Refactor] Clean up module imports
[CI] CI [CI] Fix workflow permissions
[Test] or [Tests] Tests [Tests] Add unit tests for buffer
[Environment] or [Environments] Environments [Environments] Add Gymnasium support
[Data] Data [Data] Fix replay buffer sampling
[Performance] or [Perf] Performance [Performance] Optimize tensor ops
[BC-Breaking] bc breaking [BC-Breaking] Remove deprecated API
[Deprecation] Deprecation [Deprecation] Mark old function
[Quality] Quality [Quality] Fix typos and add codespell

Note: Common variations like singular/plural are supported (e.g., [Doc] or [Docs]).

@github-actions
Copy link
Copy Markdown
Contributor

ghost commented May 24, 2026

⚠️ PR Title Label Error

Unknown or invalid prefix [WIP].

Current title: [WIP] Fix triton RNN kernel

Supported Prefixes (case-sensitive)

Your PR title must start with exactly one of these prefixes:

Prefix Label Applied Example
[BugFix] BugFix [BugFix] Fix memory leak in collector
[Feature] Feature [Feature] Add new optimizer
[Doc] or [Docs] Documentation [Doc] Update installation guide
[Refactor] Refactoring [Refactor] Clean up module imports
[CI] CI [CI] Fix workflow permissions
[Test] or [Tests] Tests [Tests] Add unit tests for buffer
[Environment] or [Environments] Environments [Environments] Add Gymnasium support
[Data] Data [Data] Fix replay buffer sampling
[Performance] or [Perf] Performance [Performance] Optimize tensor ops
[BC-Breaking] bc breaking [BC-Breaking] Remove deprecated API
[Deprecation] Deprecation [Deprecation] Mark old function
[Quality] Quality [Quality] Fix typos and add codespell

Note: Common variations like singular/plural are supported (e.g., [Doc] or [Docs]).

vmoens and others added 4 commits May 24, 2026 09:09
Inductor's Triton 3.6 frontend returned ``None`` when typing the 4-term
inline cell update ``c = f * c + i * g`` in the LSTM forward kernel
(observed inside torch.compile graphs), surfacing as
``AttributeError("'NoneType' object has no attribute 'type'")`` at the
``i``-token position. Rename the single-letter gate variables and split
the cell update into two adds so each subexpression is typed
independently.
Adds an optional ``log_env_packages`` flag (default ``True``) to
:class:`~torchrl.record.loggers.WandbLogger` that snapshots the Python
runtime, installed package versions, and editable source locations into
``wandb.config["env"]`` at run creation. Makes wandb runs comparable
across environments without manual book-keeping; covers the pip-installed
distribution set and surfaces editable installs via PEP 610
``direct_url.json``. Plumbed through the Hydra
``WandbLoggerConfig`` and exercised by ``test_logs_env_packages``.
Adds two off-by-default flags to the Isaac Lab recurrent PPO example:

- ``--compile-gae``: wraps the ``GAE`` advantage module with
  :func:`torch.compile`, mirroring the existing ``--compile-update``
  switch and exercising the value path under ``torch.compile`` when the
  scan/triton recurrent backends are used.
- ``--env-compile``: forwards ``compile=True`` through ``make_env`` to
  the ``TransformedEnv`` constructor, opting into the recently added
  env-side ``step_and_maybe_reset`` compilation.

Both flags also flip ``torch._dynamo.config.capture_scalar_outputs`` so
the existing recurrent-state branching paths trace cleanly.
@github-actions
Copy link
Copy Markdown
Contributor

ghost commented May 24, 2026

⚠️ PR Title Label Error

Unknown or invalid prefix [WIP].

Current title: [WIP] Fix triton RNN kernel

Supported Prefixes (case-sensitive)

Your PR title must start with exactly one of these prefixes:

Prefix Label Applied Example
[BugFix] BugFix [BugFix] Fix memory leak in collector
[Feature] Feature [Feature] Add new optimizer
[Doc] or [Docs] Documentation [Doc] Update installation guide
[Refactor] Refactoring [Refactor] Clean up module imports
[CI] CI [CI] Fix workflow permissions
[Test] or [Tests] Tests [Tests] Add unit tests for buffer
[Environment] or [Environments] Environments [Environments] Add Gymnasium support
[Data] Data [Data] Fix replay buffer sampling
[Performance] or [Perf] Performance [Performance] Optimize tensor ops
[BC-Breaking] bc breaking [BC-Breaking] Remove deprecated API
[Deprecation] Deprecation [Deprecation] Mark old function
[Quality] Quality [Quality] Fix typos and add codespell

Note: Common variations like singular/plural are supported (e.g., [Doc] or [Docs]).

Compiling ``TransformedEnv.step_and_maybe_reset`` via the
``compile=True`` env constructor kwarg crashed with
``RecursionError`` while dynamo traced over the env. The trigger was
``nn.Module.__dir__`` reading ``self._parameters`` /
``_buffers`` / ``_modules`` through the descriptor protocol; when
those reads fell through ``__getattr__``, the existing fallback hit
``"base_env" in self.__dir__()``, which re-entered ``__getattr__`` for
``_parameters`` and infinite-looped under the tracer.

Short-circuit any attribute starting with ``_`` (already done for
``__``-prefixed names) before the ``__dir__``-based base-env lookup so
``nn.Module``'s own internals raise ``AttributeError`` immediately.
Forwarding private attributes to ``base_env`` was never the intent; the
public-attribute path is unchanged.

Add a regression test that calls ``__getattr__`` directly on the
nn.Module internals and asserts ``base_env`` is still present in
``dir(env)``.
@github-actions
Copy link
Copy Markdown
Contributor

ghost commented May 25, 2026

⚠️ PR Title Label Error

Unknown or invalid prefix [WIP].

Current title: [WIP] Fix triton RNN kernel

Supported Prefixes (case-sensitive)

Your PR title must start with exactly one of these prefixes:

Prefix Label Applied Example
[BugFix] BugFix [BugFix] Fix memory leak in collector
[Feature] Feature [Feature] Add new optimizer
[Doc] or [Docs] Documentation [Doc] Update installation guide
[Refactor] Refactoring [Refactor] Clean up module imports
[CI] CI [CI] Fix workflow permissions
[Test] or [Tests] Tests [Tests] Add unit tests for buffer
[Environment] or [Environments] Environments [Environments] Add Gymnasium support
[Data] Data [Data] Fix replay buffer sampling
[Performance] or [Perf] Performance [Performance] Optimize tensor ops
[BC-Breaking] bc breaking [BC-Breaking] Remove deprecated API
[Deprecation] Deprecation [Deprecation] Mark old function
[Quality] Quality [Quality] Fix typos and add codespell

Note: Common variations like singular/plural are supported (e.g., [Doc] or [Docs]).

Replace boolean --env-compile with --env-compile-warmup N (default 0,
disabled). When N>0, pass compile_env={"warmup": N} to make_env so the
first N step_and_maybe_reset calls run eagerly and populate
Transform.parent / done_keys / obs_keys caches before torch.compile
traces.

This avoids the failure mode where dynamo wraps a half-initialised
TransformedEnv built by Compose._rebuild_up_to during the compiled
trace and trips an AttributeError on _parameters in nn.Module
machinery.

Co-authored-by: Cursor <cursoragent@cursor.com>
@github-actions
Copy link
Copy Markdown
Contributor

ghost commented May 25, 2026

⚠️ PR Title Label Error

Unknown or invalid prefix [WIP].

Current title: [WIP] Fix triton RNN kernel

Supported Prefixes (case-sensitive)

Your PR title must start with exactly one of these prefixes:

Prefix Label Applied Example
[BugFix] BugFix [BugFix] Fix memory leak in collector
[Feature] Feature [Feature] Add new optimizer
[Doc] or [Docs] Documentation [Doc] Update installation guide
[Refactor] Refactoring [Refactor] Clean up module imports
[CI] CI [CI] Fix workflow permissions
[Test] or [Tests] Tests [Tests] Add unit tests for buffer
[Environment] or [Environments] Environments [Environments] Add Gymnasium support
[Data] Data [Data] Fix replay buffer sampling
[Performance] or [Perf] Performance [Performance] Optimize tensor ops
[BC-Breaking] bc breaking [BC-Breaking] Remove deprecated API
[Deprecation] Deprecation [Deprecation] Mark old function
[Quality] Quality [Quality] Fix typos and add codespell

Note: Common variations like singular/plural are supported (e.g., [Doc] or [Docs]).

Avoid passing a TensorDict keys view directly through starred arguments in
TransformedEnv._reset. Dynamo cannot unpack that user-defined iterable when
tracing the compiled reset path, so turn it into a plain list before calling
select.

Co-authored-by: Cursor <cursoragent@cursor.com>
@github-actions
Copy link
Copy Markdown
Contributor

ghost commented May 25, 2026

⚠️ PR Title Label Error

Unknown or invalid prefix [WIP].

Current title: [WIP] Fix triton RNN kernel

Supported Prefixes (case-sensitive)

Your PR title must start with exactly one of these prefixes:

Prefix Label Applied Example
[BugFix] BugFix [BugFix] Fix memory leak in collector
[Feature] Feature [Feature] Add new optimizer
[Doc] or [Docs] Documentation [Doc] Update installation guide
[Refactor] Refactoring [Refactor] Clean up module imports
[CI] CI [CI] Fix workflow permissions
[Test] or [Tests] Tests [Tests] Add unit tests for buffer
[Environment] or [Environments] Environments [Environments] Add Gymnasium support
[Data] Data [Data] Fix replay buffer sampling
[Performance] or [Perf] Performance [Performance] Optimize tensor ops
[BC-Breaking] bc breaking [BC-Breaking] Remove deprecated API
[Deprecation] Deprecation [Deprecation] Mark old function
[Quality] Quality [Quality] Fix typos and add codespell

Note: Common variations like singular/plural are supported (e.g., [Doc] or [Docs]).

Plumb a configurable `tl.dot` precision through the triton GRU/LSTM kernels
and align the wrapper-side `F.linear` and `dW` matmuls. Adds a new
``recurrent_matmul_precision`` kwarg on ``LSTMModule``/``GRUModule`` with
modes ``"ieee"``, ``"tf32"``, ``"tf32x3"`` and ``"auto"``; a global setter
``set_recurrent_matmul_precision`` and a ``TORCHRL_RNN_PRECISION`` env var;
and an ``"auto"`` resolution that falls back to
``torch.get_float32_matmul_precision`` with ``"high"`` mapped to ``"tf32x3"``.

This fixes the silent precision mismatch where the recurrent matmul ran
TF32 (Triton default) while the input matmul ran IEEE FP32 (cuBLAS default
with ``allow_tf32=False``), which produced a small but systematic gradient
bias in long-T RL rollouts and pushed the triton training curve below the
scan curve on long PPO runs.

Tests:
- 9 new CPU tests for the global setter, env var, validation, resolution
  precedence and the cuBLAS context manager (flip + restore + exception
  safety).
- 6 new GPU-only parametrized backward parity tests
  (``ieee`` / ``tf32`` / ``tf32x3``) on ``LSTMModule`` and ``GRUModule``.
- 1 new GPU-only test that the per-module kwarg beats the process-global
  override.

Authored with Claude.

Co-authored-by: Cursor <cursoragent@cursor.com>
@github-actions
Copy link
Copy Markdown
Contributor

ghost commented May 26, 2026

⚠️ PR Title Label Error

Unknown or invalid prefix [WIP].

Current title: [WIP] Fix triton RNN kernel

Supported Prefixes (case-sensitive)

Your PR title must start with exactly one of these prefixes:

Prefix Label Applied Example
[BugFix] BugFix [BugFix] Fix memory leak in collector
[Feature] Feature [Feature] Add new optimizer
[Doc] or [Docs] Documentation [Doc] Update installation guide
[Refactor] Refactoring [Refactor] Clean up module imports
[CI] CI [CI] Fix workflow permissions
[Test] or [Tests] Tests [Tests] Add unit tests for buffer
[Environment] or [Environments] Environments [Environments] Add Gymnasium support
[Data] Data [Data] Fix replay buffer sampling
[Performance] or [Perf] Performance [Performance] Optimize tensor ops
[BC-Breaking] bc breaking [BC-Breaking] Remove deprecated API
[Deprecation] Deprecation [Deprecation] Mark old function
[Quality] Quality [Quality] Fix typos and add codespell

Note: Common variations like singular/plural are supported (e.g., [Doc] or [Docs]).

The GPU CI runner uses ``-m gpu`` to filter pytest collection, which means
every test gated only by ``skipif(not _has_triton, ...)`` was being
deselected on GPU runners and skipped on CPU runners — i.e. not covered by
CI at all. This is what allowed the triton recurrent matmul precision bug
to ship unnoticed.

Add ``@pytest.mark.gpu`` to all 15 triton-gated tests in
``test/modules/test_rnn.py`` (existing ones plus the precision-control
tests added in the previous commit) so they get picked up by the
``tests-gpu`` job, shard 3.

After this change pytest reports 40 GPU-marked tests vs 187 CPU tests in
this file.

Authored with Claude.

Co-authored-by: Cursor <cursoragent@cursor.com>
@github-actions
Copy link
Copy Markdown
Contributor

ghost commented May 26, 2026

⚠️ PR Title Label Error

Unknown or invalid prefix [WIP].

Current title: [WIP] Fix triton RNN kernel

Supported Prefixes (case-sensitive)

Your PR title must start with exactly one of these prefixes:

Prefix Label Applied Example
[BugFix] BugFix [BugFix] Fix memory leak in collector
[Feature] Feature [Feature] Add new optimizer
[Doc] or [Docs] Documentation [Doc] Update installation guide
[Refactor] Refactoring [Refactor] Clean up module imports
[CI] CI [CI] Fix workflow permissions
[Test] or [Tests] Tests [Tests] Add unit tests for buffer
[Environment] or [Environments] Environments [Environments] Add Gymnasium support
[Data] Data [Data] Fix replay buffer sampling
[Performance] or [Perf] Performance [Performance] Optimize tensor ops
[BC-Breaking] bc breaking [BC-Breaking] Remove deprecated API
[Deprecation] Deprecation [Deprecation] Mark old function
[Quality] Quality [Quality] Fix typos and add codespell

Note: Common variations like singular/plural are supported (e.g., [Doc] or [Docs]).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. Examples Integrations/torch_geometric Integrations Modules Record Trainers Transforms

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant