[CI] Fix olddeps / opt-deps / gym smoke tests broken by #3738 and #3704 #3739

Merged: vmoens merged 2 commits into main from fix-old-opt-gym on May 12, 2026

vmoens (Collaborator) commented May 12, 2026

Summary

Three CI jobs are currently red on every PR and on main:

  • unittests-gym — every gym/gymnasium smoke test errors at import with
    ModuleNotFoundError: No module named 'torch._higher_order_ops'. Caused by
    #3738 ([Feature][Performance] Triton backend for GRU / LSTM with
    intermediate resets), which added
    _has_torch_scan = importlib.util.find_spec("torch._higher_order_ops.scan")
    to rnn.py. The job installs torch 2.0.1, where find_spec eagerly imports
    the missing parent package and raises instead of returning None.
  • tests-olddeps — 3660 identical
    TypeError: ProbabilisticTensorDictModule.__init__() got an unexpected keyword argument 'generator'.
    #3704 ([Feature] Forward generator kwarg through ProbabilisticActor)
    unconditionally forwards generator= from SafeProbabilisticModule to
    the tensordict parent class, but the olddeps job was pinned to
    tensordict>=0.12.0,<0.13.0 (introduced in #3266, [BugFix] Fix old pytorch
    dependencies, to guard against tensordict main dropping older Pythons),
    and that release line does not include the generator kwarg added in
    tensordict #1689.
  • tests-optdeps — long-standing flake in
    test/test_distributions.py::TestDelta::test_tanhdelta_inv_ones: 4M unseeded
    float32 randn values occasionally exceed
    atanh(1 - finfo.resolution) ≈ 7.25, so SafeTanhTransform saturates and
    the inv∘fwd roundtrip can't recover the input.
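
The find_spec failure mode in the first bullet can be reproduced without torch. A minimal sketch, using a deliberately nonexistent package name (`no_such_pkg_for_demo` is made up for illustration) and a hypothetical `probe` helper:

```python
import importlib.util


def probe(name):
    """Return True/False when find_spec answers, or the exception name if it raises."""
    try:
        return importlib.util.find_spec(name) is not None
    except ModuleNotFoundError as err:
        return type(err).__name__


# Top-level name that does not exist: find_spec quietly returns None.
top_level = probe("no_such_pkg_for_demo")

# Dotted name whose parent does not exist: find_spec first imports the
# parent package, and that import raises ModuleNotFoundError.
dotted = probe("no_such_pkg_for_demo.scan")

print(top_level, dotted)
```

This is the same shape as probing `torch._higher_order_ops.scan` on a torch build that has no `torch._higher_order_ops` package.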

Changes

  • torchrl/modules/tensordict_module/rnn.py — gate _has_torch_scan on
    version.parse(torch.__version__) >= version.parse("2.6.0") (matching the
    existing @implement_for("torch", "2.6.0", ...) decorator on _scan).
  • .github/unittest/linux_olddeps/scripts_gym_0_13/install.sh — install
    tensordict from source on PR/nightly runs, from PyPI only on release/*,
    matching linux/scripts/run_all.sh and linux_distributed/scripts/install.sh.
    Restores the pybind11[global] install needed by the source build.
  • .github/workflows/test-linux.yml — drop the now-unused
    TORCHRL_TENSORDICT_SPEC export.
  • test/test_distributions.py — seed and clamp inputs in
    test_tanhdelta_inv_ones so the test stays inside the SafeTanh roundtrip
    region.
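
For reference, the idea behind the version gate can be sketched without importing torch or `packaging`. The helpers below (`release_tuple`, `has_torch_scan`) are hypothetical stand-ins, not the code in rnn.py; unlike `packaging.version.parse`, this simplified parser ignores pre-release and local suffixes, so `2.6.0a0` is treated as `2.6.0`.

```python
import re


def release_tuple(version_string, width=3):
    """Leading numeric release of a version string, zero-padded to `width` parts."""
    match = re.match(r"\d+(?:\.\d+)*", version_string)
    if match is None:
        raise ValueError(f"no numeric release in {version_string!r}")
    parts = [int(p) for p in match.group(0).split(".")]
    parts += [0] * (width - len(parts))  # "2.6" -> [2, 6, 0]
    return tuple(parts)


def has_torch_scan(torch_version):
    """Hypothetical gate: True when the given torch version is at least 2.6.0."""
    return release_tuple(torch_version) >= (2, 6, 0)


for candidate in ("2.0.1", "2.6.0", "2.10.0+cu121"):
    print(candidate, has_torch_scan(candidate))
```

Comparing integer tuples rather than raw strings matters for releases such as 2.10, which sorts before 2.6 lexicographically.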

Trade-off

Going back to source builds for olddeps re-exposes the brittleness #3266 was
guarding against: the day tensordict main drops Python 3.10 the olddeps job
breaks until we bump the Python version or temporarily re-pin. tensordict main
currently still declares requires-python = ">=3.10", so this resolves today.

Test plan

  • unittests-gym passes (every gym version)
  • tests-olddeps passes (no more 3660 TypeError)
  • tests-optdeps passes (no flake in test_tanhdelta_inv_ones)
  • All other jobs green

* rnn.py: gate ``_has_torch_scan`` on ``torch.__version__ >= 2.6`` instead of
  ``importlib.util.find_spec("torch._higher_order_ops.scan")``. ``find_spec``
  eagerly imports the (missing) ``torch._higher_order_ops`` parent on
  torch < 2.4 and raises ``ModuleNotFoundError`` instead of returning ``None``,
  which broke every gym smoke test at import time (torch 2.0.1 stack).

* olddeps install.sh: install tensordict from source on PR / nightly runs and
  from PyPI only on ``release/*`` branches, matching every other CI job
  (``linux/scripts/run_all.sh``, ``linux_distributed/scripts/install.sh``).
  The previous ``tensordict>=0.12.0,<0.13.0`` pin (introduced in #3266 to
  guard against tensordict main dropping Python 3.10) froze us below the
  ``generator`` kwarg added in tensordict #1689, which #3704 forwards from
  ``SafeProbabilisticModule``, causing 3660 ``TypeError`` failures.

* test-linux.yml: drop the now-unused ``TORCHRL_TENSORDICT_SPEC`` export.

* test_distributions.py: ``test_tanhdelta_inv_ones`` was flaky on CUDA
  float32 -- with 4M unseeded ``randn`` samples a handful of draws had
  ``|x|`` past ``atanh(1 - finfo.resolution) ≈ 7.25`` and could not roundtrip
  through ``SafeTanhTransform``. Seed for determinism and clamp inputs into
  the non-saturated region.
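
The saturation threshold quoted for ``test_tanhdelta_inv_ones`` can be checked with plain Python floats. This is a sketch under two assumptions: the float32 resolution is taken as 1e-6 (the value ``torch.finfo(torch.float32).resolution`` reports), and the safe transform is modelled as clamping the tanh output to ±(1 - resolution), which may differ in detail from ``SafeTanhTransform``. The helper names are illustrative.

```python
import math

# Assumption: float32 resolution, as reported by torch.finfo(torch.float32).resolution.
RESOLUTION = 1e-6
# Largest |x| that survives the roundtrip under the clamping model below.
BOUND = math.atanh(1.0 - RESOLUTION)


def safe_tanh(x):
    """tanh with the output clamped away from +/-1 (assumed model of the safe transform)."""
    limit = 1.0 - RESOLUTION
    return max(-limit, min(limit, math.tanh(x)))


def roundtrip(x):
    """Forward through safe_tanh, then invert with atanh."""
    return math.atanh(safe_tanh(x))


inside = roundtrip(3.0)    # recovers the input
outside = roundtrip(8.0)   # saturates: comes back as BOUND, not 8.0
clamped_input = max(-BOUND, min(BOUND, 8.0))  # what clamping the test input does

print(round(BOUND, 2), inside, outside, roundtrip(clamped_input))
```

Clamping the inputs to the bound before the forward pass keeps every sample in the region where the inverse can recover it.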

Same bug pattern as the ``torch._higher_order_ops.scan`` probe: Triton 2.0
(shipped with torch 2.0.1 on the gym smoke matrix) lacks
``triton.language.extra``, so ``find_spec("triton.language.extra.libdevice")``
eagerly imports the missing parent and raises ``ModuleNotFoundError`` at
torchrl import time. Read the version from package metadata and gate on
``triton >= 2.2`` instead. Applied identically in ``rnn.py`` and
``_rnn_triton.py``.
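
Reading a version from package metadata avoids importing the package at all. A minimal sketch using the standard-library ``importlib.metadata``; the helper names are hypothetical, and the distribution name ``"triton"`` is an assumption (some builds may ship Triton under a different distribution name, in which case this lookup would report it as absent).

```python
import re
from importlib import metadata


def installed_release(dist_name):
    """Return (major, minor) of an installed distribution, or None if it is absent."""
    try:
        raw = metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None
    match = re.match(r"(\d+)\.(\d+)", raw or "")
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def has_libdevice():
    """Hypothetical gate: True only when a triton distribution >= 2.2 is installed."""
    release = installed_release("triton")  # assumed distribution name
    return release is not None and release >= (2, 2)


missing = installed_release("no-such-distribution-for-demo")
print(missing, has_libdevice())
```

Because nothing is imported, a missing or old Triton yields ``False`` rather than an exception at import time.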
vmoens merged commit 3df2f4a into main on May 12, 2026
110 checks passed
vmoens deleted the fix-old-opt-gym branch May 12, 2026 09:28

Labels

CI, CLA Signed, distributions, Integrations/torch_geometric, Integrations, Modules
