[Example] Isaac Lab RNN PPO with compact memory + knowledge-base notes by vmoens · Pull Request #3760 · pytorch/rl

vmoens · 2026-05-15T07:20:55Z

Stack from ghstack (oldest at bottom):

[Feature] Collector.fake_tensordict() / MultiCollector.fake_tensordict() #3761
-> [Example] Isaac Lab RNN PPO with compact memory + knowledge-base notes #3760
[Feature] MultiCollector: track policy version per fresh continue command #3759
[Feature] Collector final_obs: store true boundary next-obs for shifted-GAE #3758
[BugFix] GAE shifted=True: tolerate missing next obs, non-canonical strides, docs #3757
[Feature] timeit.mark_start/mark_end for non-context-manager timing #3756
[BugFix] PolicyVersion: int64 dtype + preserve tensordict device #3755
[Performance] LSTM/GRU scan: canonical strides + cuDNN flat-storage clones + thread-local recurrent mode #3754
[BugFix] Recurrent policy auto-register with policy_factory #3753

End-to-end recurrent PPO example targeting Isaac Lab, exercising the
collector / GAE features added below in the stack:

compact_obs=True to drop redundant ("next", obs) from each
rollout step and keep the buffer affordable for vectorized envs.
final_obs=True to carry the true window-boundary observation
under ("final", k), so GAE(shifted=True) bootstraps with
V(s_T) rather than V(s_{T-1}) at every soft boundary.
recurrent_backend="scan" (or "triton") for the LSTM, with the
canonical-stride and cuDNN flat-storage fixes in place.
timeit.mark_start / mark_end to instrument the worker loop.
track_policy_version=True on the MultiCollector so each batch is
tagged with the policy version at collection time.

isaaclab_rnn_ppo_memory_utils.py factors the network builder, the
collector factory, the GAE+critic warmup, and the cudagraph fake batch
plumbing so the main script reads as a thin orchestration loop. CLI
flags surface the compile / cudagraph / vectorized-GAE / cudagraph-trees
knobs and a configurable warmup-iterations floor (which auto-bumps to
cudagraph_warmup + 1).

knowledge_base/ISAACLAB.md adds the pip-based container flow we use
on the cluster (uv venv when ensurepip is missing, IsaacLab pip wheels,
the EULA / privacy-consent env vars to make Kit non-interactive) and a
recurrent-PPO-on-Isaac recipe pinning the collector-owned policy_factory,
the weight-sync scheme, and the compact_obs + shifted-GAE pairing.

[ghstack-poisoned]

pytorch-bot · 2026-05-15T07:20:59Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/rl/3760

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEVs

There are 1 currently active SEVs. If your PR is affected, please view them below:

Run pull request jobs on OSDC runners in shadow mode

❌ 4 New Failures, 15 Pending

As of commit a179af1 with merge base 0a01ee8 ():

NEW FAILURES - The following jobs have failed:

Continuous Benchmark (PR) / CPU Pytest benchmark (gh)
Process completed with exit code 1.
Continuous Benchmark (PR) / GPU Pytest benchmark (gh)
Process completed with exit code 1.
Lint / lint-done (gh)
Process completed with exit code 1.
Lint / python-source-and-configs / linux-job (gh)
RuntimeError: Command docker exec -t 79b1c7246bbb92dc965b4c8a6a4e552bc0ca88a557b43f9d9050bb36dccdff88 /exec failed with exit code 1

This comment was automatically generated by Dr. CI and updates every 15 minutes.

github-actions · 2026-05-15T07:21:11Z

⚠️ PR Title Label Error

Unknown or invalid prefix [Example].

Current title: [Example] Isaac Lab RNN PPO with compact memory + knowledge-base notes

Supported Prefixes (case-sensitive)

Your PR title must start with exactly one of these prefixes:

Prefix	Label Applied	Example
`[BugFix]`	BugFix	`[BugFix] Fix memory leak in collector`
`[Feature]`	Feature	`[Feature] Add new optimizer`
`[Doc]` or `[Docs]`	Documentation	`[Doc] Update installation guide`
`[Refactor]`	Refactoring	`[Refactor] Clean up module imports`
`[CI]`	CI	`[CI] Fix workflow permissions`
`[Test]` or `[Tests]`	Tests	`[Tests] Add unit tests for buffer`
`[Environment]` or `[Environments]`	Environments	`[Environments] Add Gymnasium support`
`[Data]`	Data	`[Data] Fix replay buffer sampling`
`[Performance]` or `[Perf]`	Performance	`[Performance] Optimize tensor ops`
`[BC-Breaking]`	bc breaking	`[BC-Breaking] Remove deprecated API`
`[Deprecation]`	Deprecation	`[Deprecation] Mark old function`
`[Quality]`	Quality	`[Quality] Fix typos and add codespell`

Note: Common variations like singular/plural are supported (e.g., [Doc] or [Docs]).

github-actions · 2026-05-15T07:21:18Z

⚠️ PR Title Label Error

Unknown or invalid prefix [Example].

Current title: [Example] Isaac Lab RNN PPO with compact memory + knowledge-base notes

Supported Prefixes (case-sensitive)

Your PR title must start with exactly one of these prefixes:

Prefix	Label Applied	Example
`[BugFix]`	BugFix	`[BugFix] Fix memory leak in collector`
`[Feature]`	Feature	`[Feature] Add new optimizer`
`[Doc]` or `[Docs]`	Documentation	`[Doc] Update installation guide`
`[Refactor]`	Refactoring	`[Refactor] Clean up module imports`
`[CI]`	CI	`[CI] Fix workflow permissions`
`[Test]` or `[Tests]`	Tests	`[Tests] Add unit tests for buffer`
`[Environment]` or `[Environments]`	Environments	`[Environments] Add Gymnasium support`
`[Data]`	Data	`[Data] Fix replay buffer sampling`
`[Performance]` or `[Perf]`	Performance	`[Performance] Optimize tensor ops`
`[BC-Breaking]`	bc breaking	`[BC-Breaking] Remove deprecated API`
`[Deprecation]`	Deprecation	`[Deprecation] Mark old function`
`[Quality]`	Quality	`[Quality] Fix typos and add codespell`

Note: Common variations like singular/plural are supported (e.g., [Doc] or [Docs]).

github-actions · 2026-05-15T07:49:49Z

⚠️ PR Title Label Error

Unknown or invalid prefix [Example].

Current title: [Example] Isaac Lab RNN PPO with compact memory + knowledge-base notes

Supported Prefixes (case-sensitive)

Your PR title must start with exactly one of these prefixes:

Prefix	Label Applied	Example
`[BugFix]`	BugFix	`[BugFix] Fix memory leak in collector`
`[Feature]`	Feature	`[Feature] Add new optimizer`
`[Doc]` or `[Docs]`	Documentation	`[Doc] Update installation guide`
`[Refactor]`	Refactoring	`[Refactor] Clean up module imports`
`[CI]`	CI	`[CI] Fix workflow permissions`
`[Test]` or `[Tests]`	Tests	`[Tests] Add unit tests for buffer`
`[Environment]` or `[Environments]`	Environments	`[Environments] Add Gymnasium support`
`[Data]`	Data	`[Data] Fix replay buffer sampling`
`[Performance]` or `[Perf]`	Performance	`[Performance] Optimize tensor ops`
`[BC-Breaking]`	bc breaking	`[BC-Breaking] Remove deprecated API`
`[Deprecation]`	Deprecation	`[Deprecation] Mark old function`
`[Quality]`	Quality	`[Quality] Fix typos and add codespell`

Note: Common variations like singular/plural are supported (e.g., [Doc] or [Docs]).

vmoens · 2026-05-15T08:11:06Z

+launch_args = sys.argv[1:]
+if not any(arg == "--headless" or arg.startswith("--headless=") for arg in launch_args):
+    launch_args = [*launch_args, "--headless"]
+args_cli, _ = _parser().parse_known_args(launch_args)
+app_launcher = AppLauncher(args_cli)
+simulation_app = app_launcher.app
+
+
+def _restore_local_editable_paths() -> None:
+    for module_name in ("tensordict", "torchrl"):
+        module = sys.modules.get(module_name)
+        if module is not None and getattr(module, "__file__", None) is None:
+            del sys.modules[module_name]
+
+    for path in ("/root/tensordict", "/root/td", "/root/rl-isaac"):
+        if Path(path).exists() and path not in sys.path:
+            sys.path.insert(0, path)


that looks terribly hacky and ad-hoc, remove at all cost

vmoens · 2026-05-15T08:11:59Z

+
+os.environ.setdefault("TORCHRL_PROFILING", "1")
+
+from isaaclab.app import AppLauncher


isaac should only run in the sub-process. Let's ditch the regular collector and just use MultiCollector with sync true or false. No isaac in-process here.

vmoens · 2026-05-15T08:12:22Z

+from isaaclab.app import AppLauncher
+
+
+def _parser() -> argparse.ArgumentParser:


a proper script is imports -> methods and classes -> execution. This is all mixed up

vmoens · 2026-05-15T08:12:56Z

+    backbone = make_backbone(
+        args.obs_dim,
+        args.hidden_size,
+        args.rnn_backend,
+        train_device,
+    )
+    actor_head = make_actor_head(args.action_dim, args.hidden_size, train_device)
+    actor = make_actor_from_modules(
+        backbone,
+        actor_head,
+        args.action_dim,
+        train_device,
+    )
+    critic = make_critic_head(args.hidden_size, train_device)


this could be bundled in a single "make_models" util

vmoens · 2026-05-15T08:14:17Z

+    if gae_rnn_backend == "auto":
+        gae_rnn_backend = (
+            "cudnn" if args.rnn_backend in ("scan", "triton") else args.rnn_backend
+        )
+    elif gae_rnn_backend == "same":
+        gae_rnn_backend = args.rnn_backend


GAE should be same, always. If that causes an issue we need to solve the issue.

vmoens · 2026-05-15T08:14:35Z

+    gae_backbone = None
+    gae_full_value = full_value
+    if gae_rnn_backend != args.rnn_backend:
+        gae_backbone = make_backbone(
+            args.obs_dim,
+            args.hidden_size,
+            gae_rnn_backend,
+            train_device,
+        )
+        gae_full_value = make_full_value(gae_backbone, critic)
+    compile_gae_value_network = compile_gae and gae_backbone is None
+    gae_precompile_calls = 2 if compile_gae_value_network else 1


my previous comment should make this a bit redundant.

vmoens · 2026-05-15T08:15:14Z

+    def compute_advantages(batch: TensorDictBase):
+        with torch.no_grad(), set_recurrent_mode(True):
+            if gae_backbone is not None:
+                gae_backbone.load_state_dict(backbone.state_dict())
+            return adv_module(batch)


no load_state_dict, are we mad? Let's have one single copy of our weights in a given process.

vmoens · 2026-05-15T08:15:29Z

+    def compute_loss(batch: TensorDictBase):
+        with set_recurrent_mode(True):
+            return loss_module(batch)


the added value of a function like this is 0 or negative

vmoens · 2026-05-15T08:16:03Z

+        actor_grad_norm = torch.nn.utils.clip_grad_norm_(
+            actor_params, max_norm=float("inf")
+        )
+        critic_grad_norm = torch.nn.utils.clip_grad_norm_(
+            critic_params, max_norm=float("inf")
+        )


clip grad norm to 1 or 10 or something, use the args for that.

vmoens · 2026-05-15T08:17:19Z

+    def _make_warmup_minibatch() -> TensorDictBase:
+        mini_batch = _make_fake_training_batch(
+            batch_size=sample_num_slices,
+            seq_len=sample_seq_len,
+            obs_dim=args.obs_dim,
+            action_dim=args.action_dim,
+            hidden_size=args.hidden_size,
+            device=train_device,
+        )
+        with torch.no_grad(), set_recurrent_mode(True):
+            value_data = full_value(mini_batch.clone())
+        state_value = value_data.get(loss_module.tensor_keys.value, None)
+        if state_value is None:
+            raise KeyError(
+                f"Could not find value key {loss_module.tensor_keys.value} in "
+                "the fake warmup value-network output."
+            )
+        mini_batch.set(loss_module.tensor_keys.value_target, state_value.detach())
+        mini_batch.set(loss_module.tensor_keys.advantage, torch.zeros_like(state_value))
+        return _canonicalize_batch(mini_batch)


we should be able not to use that now. We've solved the issue that made it necessary separately. Let's just run update() in the training loop.

vmoens · 2026-05-15T08:17:35Z

+    def _make_warmup_rollout() -> TensorDictBase:
+        fake_rollout = _make_fake_training_batch(
+            batch_size=args.num_envs,
+            seq_len=args.rollout_steps,
+            obs_dim=args.obs_dim,
+            action_dim=args.action_dim,
+            hidden_size=args.hidden_size,
+            device=train_device,
+        )
+        if args.compact_obs:
+            next_td = fake_rollout.get("next", None)
+            if next_td is not None and "policy" in next_td.keys():
+                next_td.del_("policy")
+        return _canonicalize_batch(fake_rollout)
+


ditto looks useless to me now

vmoens · 2026-05-15T08:17:42Z

+    def _restore_after_precompile(
+        param_snapshots: tuple[torch.Tensor, ...],
+    ) -> None:
+        with torch.no_grad():
+            for param, snapshot in zip(model_params, param_snapshots):
+                param.copy_(snapshot)
+        for state in optim.state.values():
+            for key, value in state.items():
+                if torch.is_tensor(value):
+                    value.zero_()
+                elif isinstance(value, int):
+                    state[key] = 0
+                elif isinstance(value, float):
+                    state[key] = 0.0
+        optim.zero_grad(set_to_none=True)


ditto, ditch

vmoens · 2026-05-15T08:18:18Z

+                    state[key] = 0.0
+        optim.zero_grad(set_to_none=True)
+
+    def _precompile_training() -> int:


let's not precompile anything. Compile and execute and that's it. We'll swallow the compile time in the training loop, amen

vmoens · 2026-05-15T08:19:13Z

+    if args.double_buffer_collector:
+        collector_replay_buffer = ReplayBuffer(
+            storage=_make_storage(
+                args,
+                torch.device("cpu"),
+                args.num_envs,
+                storage_kind=args.collector_buffer_storage,
+                name="collector_replay_buffer",
+            ),
+            batch_size=args.num_envs,
+        )


let's remove the double buffer option. We can just use storing_device=cpu. It already uses a buffer in shared mem to pass the data which is a bit like what our double buffer does but more clearly.

vmoens · 2026-05-15T08:20:10Z

+        max_episode_steps=args.max_episode_steps,
+        device=str(collector_device),
+    )
+    collector_rnn_backend = "cudnn"


if set_recurrent_mode is False, the rnn backend should always be cudnn. That should be documented. We should not set it here, it should be automatic

vmoens · 2026-05-15T08:20:25Z

+    if args.collector_backend == "single":
+        collector = Collector(
+            make_env_fn,
+            policy_factory=collector_policy_factory,
+            frames_per_batch=frames_per_batch,
+            total_frames=-1,
+            policy_device=collector_device,
+            no_cuda_sync=True,
+            trust_policy=True,
+            storing_device=storing_device,
+            compact_obs=args.compact_obs,
+            auto_register_policy_transforms=True,
+            track_policy_version=True,
+            replay_buffer=collector_replay_buffer,
+            extend_buffer=True,
+        )


as pointed above this can be removed

vmoens · 2026-05-15T08:21:02Z

+    config = vars(args).copy()
+    config.update(
+        {
+            "frames_per_batch": frames_per_batch,
+            "per_collector_envs": per_collector_envs,
+            "collector_rnn_backend": collector_rnn_backend,
+            "training_rnn_backend": args.rnn_backend,
+            "training_gae_rnn_backend": gae_rnn_backend,
+            "updates_per_epoch": updates_per_epoch,
+            "storing_device": str(storing_device)
+            if storing_device is not None
+            else None,
+            "double_buffer_collector": args.double_buffer_collector,
+            "collector_replay_buffer": args.double_buffer_collector,
+            "collector_policy_sync": "before_next",
+            "collector_policy_version_tracking": True,
+            "collector_weight_sync_scheme": "MultiProcessWeightSyncScheme"
+            if args.collector_backend != "single"
+            else "single_process_fallback",
+            "debug_data_flow": args.debug_data_flow,
+            "training_precompile": args.precompile_training,
+            "training_compile_gae_requested": compile_gae,
+            "training_compile_gae": compile_gae_value_network,
+            "training_compile_gae_target": "value_network"
+            if compile_gae_value_network
+            else "none",
+            "training_precompile_gae_calls": gae_precompile_calls,
+            "training_precompile_update_calls": precompile_update_calls,
+        }
+    )


do we need all of this? Doesn't make the code more readable

vmoens · 2026-05-15T08:21:40Z

+                collector.update_policy_weights_(
+                    weights=TensorDict.from_module(actor).data
+                )
+                collector.increment_version()


that should be called automatically by update_policy_weights_ IMO - if not we need to patch.

vmoens · 2026-05-15T08:22:13Z

+                collector.update_policy_weights_(
+                    weights=TensorDict.from_module(actor).data
+                )


this looks very standard, we should be able to just pass the actor no? Doesn't update_policy_weights_ do that for us under the hood?

vmoens · 2026-05-15T08:22:40Z

+                if collector_replay_buffer is None:
+                    if collected_batch is None:
+                        raise RuntimeError(
+                            "collector returned None but no replay buffer was provided"
+                        )
+                    data = collected_batch


this path can be removed

vmoens · 2026-05-15T08:23:28Z

+                                args.rollout_steps,
+                                "gae_batch",
+                            )
+                            epoch_data = _canonicalize_batch(data.to(train_device))


_canonicalize_batch should be called in the RNN if anything, not in the loop. Users should not care about this.

vmoens · 2026-05-15T08:24:06Z

+                        if args.debug_data_flow and epoch == 0:
+                            stored_data = train_buffer[:]
+                            _add_data_flow_metrics(
+                                batch_metrics, "data_flow/train_buffer", stored_data
+                            )
+                            _add_buffer_compare_metrics(
+                                batch_metrics, epoch_data, stored_data
+                            )


remove all debug paths from this file

vmoens · 2026-05-15T08:24:39Z

+                            mini_batch = _canonicalize_batch(
+                                mini_batch.to(train_device)
+                            )


ditto this looks like something users will be distracted by. I believe the RNN does this op for us so we can spare it here

Update

a179af1

[ghstack-poisoned]

github-actions Bot added the Examples label May 15, 2026

meta-cla Bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label May 15, 2026

vmoens mentioned this pull request May 15, 2026

[Feature] Collector.fake_tensordict() / MultiCollector.fake_tensordict() #3761

Closed

vmoens commented May 15, 2026

View reviewed changes

vmoens closed this May 15, 2026


		os.environ.setdefault("TORCHRL_PROFILING", "1")

		from isaaclab.app import AppLauncher

		from isaaclab.app import AppLauncher


		def _parser() -> argparse.ArgumentParser:

Conversation

vmoens commented May 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

pytorch-bot Bot commented May 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/rl/3760

❗ 1 Active SEVs

❌ 4 New Failures, 15 Pending

Uh oh!

github-actions Bot commented May 15, 2026

⚠️ PR Title Label Error

Supported Prefixes (case-sensitive)

Uh oh!

github-actions Bot commented May 15, 2026

⚠️ PR Title Label Error

Supported Prefixes (case-sensitive)

Uh oh!

github-actions Bot commented May 15, 2026

⚠️ PR Title Label Error

Supported Prefixes (case-sensitive)

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

vmoens commented May 15, 2026 •

edited

Loading

pytorch-bot Bot commented May 15, 2026 •

edited

Loading