Fix cartpole-camera frame-stacking performance regression by jmart-nv · Pull Request #5849 · isaac-sim/IsaacLab

jmart-nv · 2026-05-28T21:44:39Z

Description

Frame-stacking was added to cartpole-camera by #5574 to make the task solvable on the Newton+Warp backend. Single-frame RGB observations don't carry pole velocity, so a single-image policy can only recover ~100/300 mean episode reward. On PhysX+RTX, implicit damping in the dynamics plus temporal antialiasing in the renderer leak enough adjacent-frame correlation that a single-frame policy still hits ~296. Newton+Warp has neither, so explicit frame stacking is required for the task to be solvable on this backend.

The implementation regressed throughput by 42% (54.6 → 31.7 FPS) and inflated peak VRAM 2.7× (3.13 → 8.37 GB) at R256, 1024 envs. PR #5574 added a CircularBuffer ring + a stacked_image MDP term that:

Stored frames as float32 even when the camera output was uint8 → 4× ring memory and 4× shift bandwidth.
Ran the per-frame normalize (x/255 - mean(x)) before the buffer, forcing the buffer to be float32.
Re-rolled the buffer with torch.roll on every append → ring-sized temporary per env.step.
Built the channel-stacked policy obs via permute + reshape + clone → fresh (B, H, W, K*C) float32 tensor per env.step.
Inlined the normalize math at 5 call sites across DirectRLEnv and ManagerBasedEnv paths.

This PR rewrites the frame-stacking hot path; the result runs 24% faster than the pre-framestack baseline with identical PPO convergence.

Change	Why
uint8 ring buffer. Defer the `x/255 − mean` normalize past the frame-stack buffer for RGB-like data types.	Ring is 4× smaller; shift bandwidth per env.step drops 4×.
Free contiguous stacked output. New `CircularBuffer.stack_dim` arg + `.stacked` property arrange storage so the channel-stacked output is a contiguous view of internal storage.	Eliminates one `(B, H, W, K*C)`-sized allocation and a permute kernel per env.step.
Drop `torch.roll` on append. Replace with a front-to-back memmove.	One fewer ring-sized allocation and one fewer kernel launch per env.step.
Fused Warp normalize kernel (`normalize_image_uint8`). One kernel: `(uint8 / 255) − per-(batch, channel) mean → float32`.	Replaces upcast + reduce + subtract (three PyTorch passes) with one fused launch over the K-stacked input.
Tiled int32 partial-sum reduction kernel (`spatial_sum_uint8_tiled`). Warp kernel writes `(B, NUM_TILES, C)` int32 partials along H; PyTorch collapses the partial dim.	int32 accumulator is safe up to H·W·255 (128× headroom at R256) and ~3× faster than int64 / float32 on PyTorch's auto-tuned reduce path. C-innermost launch shape gives stride-1 reads on src's contiguous trailing dim.
Shared `isaaclab.utils.images.normalize_camera_image` dispatch. One helper used by both `image()` (DirectRLEnv) and `stacked_image.__call__` (ManagerBasedEnv); routes uint8 contiguous inputs through the Warp fast path and falls back to PyTorch otherwise.	Removes inline normalize math duplicated across 5 sites; the fast path picks itself up automatically.
`image(clone=False)` opt-out. Callers that immediately copy into their own storage skip the defensive clone.	Saves one obs-sized allocation per env.step at the consumer site (frame-stack buffer).
Allocator-managed obs lifetime. `_get_observations` allocates the normalize output fresh each call (no persistent `out=` buffer).	The RL trainer holds a reference to the previous-iteration observation until `record_transition` returns; reusing one buffer across env.step boundaries aliases the trainer's prior obs. Fresh allocation lets PyTorch's caching allocator hand out a different block while the prior is still referenced.

Type of change

Bug fix (non-breaking change which fixes an issue)

Results

R256 (256×256), 1024 envs, PPO via skrl 2.1.0, 32k training steps, seed 42, perf measured over n=3 warm trials:

Stage	FPS	step_ms	Peak VRAM	Mean reward (32k)
Pre-framestack baseline	54.62	18.67	3.13 GB	~102 (unsolvable — no temporal info)
Frame stacking from #5574	31.72	32.07	8.37 GB	297.68
This PR	67.87	15.32	4.41 GB	297.60

Screenshots

Environment

NVIDIA L40 (48 GB), driver 570.158.01. Newton+Warp backend (presets=newton_mjwarp,newton_renderer). skrl 2.1.0 trainer.

Checklist

I have read and understood the contribution guidelines
I have run the pre-commit checks with ./isaaclab.sh --format
I have made corresponding changes to the documentation (N/A - frame stacking already documented, these changes are implementation details)
My changes generate no new warnings
I have added tests that prove my fix is effective or that my feature works
I have updated the changelog and the corresponding version in the extension's config/extension.toml file
I have added my name to the CONTRIBUTORS.md or my name already exists there

…isaac-sim#5574. - circular_buffer.py: replace torch.roll with O(1) write-index append; add CPU mirror of max_len and CPU bool gate to skip GPU sync on the steady-state path; rotate on read via the `buffer` property. - observations.py (stacked_image): drop trailing `.clone()` from the permute().reshape() chain (reshape on non-contig already allocates). - cartpole_camera_presets_env.py: same `.clone()` drop in the direct env. - delay_buffer.py: same `.clone()` drop on the DelayBuffer.compute return.

…t on append

…stack consumers

…ut reshape

Skip per-frame /255 + mean-subtract on the stacking path; apply once on the stacked output. Ring buffer now holds native uint8 (4x cheaper per-step copies). Math is identical because K frames live in disjoint channel slices. Affects stacked_image MDP term and the cartpole DirectRLEnv parent (new `normalize: bool = True` kwarg) + subclass.

Replace the three-pass PyTorch normalize on the camera-observation hot path with a single fused Warp kernel + small mean reduction. Consolidate the per-data-type dispatch into a shared helper used by image(), stacked_image, CartpoleCameraEnv, and CartpoleCameraPresetsEnv. Frame-stacking subclasses pre-allocate the float32 output once and reuse it across steps, eliminating the per-step transient that the previous deferred-normalize path incurred.

The Warp kernel produces int32 partial sums along H; PyTorch collapses the partials and divides once. Channels-innermost launch shape gives coalesced reads on src's contiguous trailing dim. Partials scratch is cached per (shape, device).

- stacked_image / CartpoleCameraPresetsEnv: don't reuse a pre-allocated normalize-output buffer across env.step; the previous-iteration obs (held by the RL trainer) was overwritten before record_transition could read it, causing PPO to record degenerate transitions - normalize_image_uint8: docstring warning on the out= aliasing hazard - CircularBuffer.reset: setitem (= 0.0) instead of advanced-getitem + .zero_() so partial-batch resets actually zero the storage - CircularBuffer.stacked: drop unreachable empty-buffer guard - _uint8_sum_partials_cache: store the tensor directly, not a tuple - Tests: regressions for both bugs, image() clone kwarg, and DelayBuffer.compute() no-aliasing contract

isaaclab-review-bot

🤖 Isaac Lab Review Bot

Summary

This PR addresses a significant 42% throughput regression introduced by frame-stacking in #5574 for the cartpole-camera task. The fix implements several well-designed optimizations: (1) storing uint8 frames in the ring buffer instead of float32 (4× smaller), (2) deferring normalization past the frame-stack buffer, (3) replacing torch.roll with memmove-style shifting, (4) adding a fused Warp kernel for RGB normalize, and (5) providing a free contiguous stacked view via the new stack_dim parameter. The result is a 24% improvement over the pre-framestack baseline (67.87 FPS vs 54.62 FPS) with identical PPO convergence.

Incremental Update (`a86c6a7`)

The author addressed review feedback in commit a86c6a7:

✅ Docstring improvements — stack_dim parameter now includes a concrete example (stack_dim=-1 on (B, H, W, C) → (B, H, W, K*C)) making the API much more intuitive.

✅ In-place subtract in normalize fallback — images.py now uses images -= torch.mean(...) instead of allocating a new tensor, addressing the memory optimization suggestion.

✅ Cache growth documented — The _uint8_sum_partials_cache comment now states it "Typically holds one entry per training run (camera resolution and device are fixed)," making the bounded growth clear.

✅ Tile size rationale — _UINT8_SUM_TILE_HW comment now documents it was "Tuned on L40; robust across modern NVIDIA arches (Ampere/Ada/Hopper) at R256," addressing the hardcoded value concern.

✅ Shift loop documented — Added "Cheap at the typical frame-stack K=2-4" comment clarifying expected usage.

✅ Minor fix — Corrected TiledCameraCfg → CameraCfg in is_rgb_like docstring.

Test Coverage

✅ Excellent coverage (unchanged from previous push). The PR includes comprehensive tests for the circular buffer stack_dim mode, normalize dispatch paths, Warp kernel correctness, and critically a consecutive-call aliasing regression test.

CI Status

⏳ In progress — 9 checks passed, 6 running, 2 skipped on latest commit. No failures.

Verdict

Approve — All review suggestions have been addressed cleanly in the follow-up commit. The changes are documentation and minor optimization improvements that don't affect correctness. Ready to merge once CI completes.

…y-up) - images.py: TiledCameraCfg -> CameraCfg in is_rgb_like docstring (deprecated class) - images.py: PyTorch normalize fallback uses in-place subtract (one fewer alloc on the cold path) - circular_buffer.py: clarified stack_dim docstring with a (B,H,W,C) example; noted the typical K=2-4 range on the append shift loop - ops.py: noted the realistic 1-entry bound on _uint8_sum_partials_cache; noted L40 tuning provenance of _UINT8_SUM_TILE_HW=32

greptile-apps · 2026-06-03T14:26:34Z

Greptile Summary

This PR fixes a 42% throughput regression and 2.7× VRAM inflation from frame-stacking (#5574), delivering 24% faster throughput than the pre-framestack baseline. The rewrite defers RGB normalization past the ring buffer (uint8 storage), replaces torch.roll with a memmove loop, adds stack_dim to CircularBuffer for a zero-copy stacked view, and fuses normalize into a two-kernel Warp dispatch.

CircularBuffer.stack_dim: rearranges storage to (B, …, K, …) so .stacked is a free reshape view; a _need_reset CPU flag skips the per-append GPU sync in steady state.
isaaclab.utils.images: new normalization dispatch module routing contiguous uint8 input to a Warp kernel and falling back to PyTorch otherwise.
_uint8_spatial_mean: tiled int32 partial-sum reduction correct at R256, but silently overflows for images wider than ~2902×2902 pixels.

Confidence Score: 3/5

Safe at the stated R256 target; _uint8_spatial_mean produces silently wrong normalization for images wider than ~2902 pixels per side, which would corrupt training observations at larger camera resolutions without raising any error.

The _uint8_spatial_mean function accumulates per-channel pixel sums in int32 and collapses them with partials.sum(dim=1).float(). When H × W × 255 exceeds INT32_MAX the total wraps around silently — PPO receives incorrect zero-mean observations and trains on corrupted data with no exception. The function is now callable from the public normalize_camera_image API. The fix is a single dtype= argument. All other logic — ring-buffer memmove, warmup-fill across partial resets, aliasing prevention via fresh allocation, and the clone=False path — checks out correctly.

source/isaaclab/isaaclab/utils/warp/ops.py — the _uint8_spatial_mean return line needs dtype=torch.int64 promotion before this function is used at resolutions above R256.

Important Files Changed

Filename	Overview
source/isaaclab/isaaclab/utils/warp/ops.py	Adds `normalize_image_uint8` and `_uint8_spatial_mean`; int32 accumulator overflows silently for images larger than ~2902×2902 pixels.
source/isaaclab/isaaclab/utils/buffers/circular_buffer.py	Adds `stack_dim` mode with memmove append and `_need_reset` CPU gate; misleading RuntimeError when stack_dim set but buffer uninitialized.
source/isaaclab/isaaclab/utils/images.py	New normalization dispatch module; logic is correct and well-tested.
source/isaaclab/isaaclab/envs/mdp/observations.py	Deferred RGB normalize and `clone=False` opt-out implemented correctly; fresh allocation per call prevents trainer obs aliasing.
source/isaaclab_tasks/isaaclab_tasks/direct/cartpole/cartpole_camera_env.py	Inline normalize consolidated; clone/normalize kwargs leak optimization intent into the framework hook signature.
source/isaaclab_tasks/isaaclab_tasks/direct/cartpole/cartpole_camera_presets_env.py	Adopts stack_dim ring buffer and deferred normalize correctly; clone-on-return preserved for non-RGB paths.
source/isaaclab/isaaclab/utils/warp/kernels.py	Two new Warp kernels; tile clamping and per-tile int32 accumulation are safe at target resolutions.
source/isaaclab/isaaclab/utils/buffers/delay_buffer.py	Removes redundant .clone(); advanced indexing already returns an independent copy.

Comments Outside Diff (1)

source/isaaclab/isaaclab/utils/warp/ops.py, line 590 (link)

Silent int32 overflow for images larger than ~2902×2902

partials.sum(dim=1) accumulates per-(batch, channel) totals in int32 without promotion. When H × W × 255 > INT32_MAX (≈ 8.4 M pixels) the mean is silently wrong — normalisation is incorrect, PPO trains on corrupted observations, no exception is raised. The function is now reachable from normalize_camera_image as public API. Promoting via .sum(dim=1, dtype=torch.int64).float() removes the ceiling at negligible runtime cost.

_{Reviews (1): Last reviewed commit: "Address PR #5849 review feedback (docstr..." | Re-trigger Greptile}

greptile-apps · 2026-06-03T14:26:38Z

+        if self._stack_dim_internal is None:
+            raise RuntimeError("stacked is only available when CircularBuffer was created with stack_dim set.")


Misleading RuntimeError when stack_dim set but buffer uninitialized

When constructed with stack_dim set but append() never called, _stack_dim_internal is still None. Accessing .stacked raises RuntimeError("stacked is only available when CircularBuffer was created with stack_dim set.") — the opposite of the truth. Differentiating the two None cases avoids the misleading diagnostic.

Suggested change

if self._stack_dim_internal is None:

raise RuntimeError("stacked is only available when CircularBuffer was created with stack_dim set.")

if self._stack_dim_internal is None:

if self._stack_dim_arg is None:

raise RuntimeError("stacked is only available when CircularBuffer was created with stack_dim set.")

raise RuntimeError("stacked is not yet available: call append() at least once to initialize the buffer.")

greptile-apps · 2026-06-03T14:26:40Z

        self._cartpole.set_joint_effort_target_index(target=self.actions, joint_ids=self._cart_dof_idx)

-    def _get_observations(self) -> dict:
+    def _get_observations(self, clone: bool = True, normalize: bool = True) -> dict:


Optimization parameters leak into a framework override point

DirectRLEnv._get_observations() is called without arguments by the framework; adding clone and normalize with defaults preserves compatibility, but these parameters exist only to serve CartpoleCameraPresetsEnv's internal frame-stack optimization. Future subclasses and introspective tooling will see them as public API. Factoring the camera-data extraction into a private helper (_get_camera_data) and keeping _get_observations() zero-argument would be cleaner.

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

jmart-nv · 2026-06-03T16:58:25Z

Moving this back to draft while I rebase onto the cartpole refactor

jmart-nv added 10 commits May 28, 2026 15:16

Drop read-side torch.roll from CircularBuffer.buffer via memmove-shif…

32024a4

…t on append

Add CircularBuffer.copy_to_stacked and clone=False opt-out for frame-…

347f976

…stack consumers

Add stack_dim mode to CircularBuffer for free contiguous stacked-outp…

2dad31b

…ut reshape

Changelog fragments for framestack performance regression fix

8222de6

Restore original comments that were accidentally trimmed

c554513

github-actions Bot added the isaac-lab Related to Isaac Lab team label May 28, 2026

isaaclab-review-bot Bot reviewed May 28, 2026

View reviewed changes

pbarejko reviewed May 29, 2026

View reviewed changes

Comment thread source/isaaclab/isaaclab/utils/images.py Outdated

Comment thread source/isaaclab/isaaclab/utils/buffers/circular_buffer.py

Comment thread source/isaaclab/isaaclab/utils/buffers/circular_buffer.py

pbarejko approved these changes Jun 2, 2026

View reviewed changes

jmart-nv marked this pull request as ready for review June 3, 2026 14:05

jmart-nv requested review from jtigue-bdai, kellyguo11, ooctipus and pascal-roth as code owners June 3, 2026 14:05

greptile-apps Bot reviewed Jun 3, 2026

View reviewed changes

jmart-nv marked this pull request as draft June 3, 2026 16:58

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix cartpole-camera frame-stacking performance regression#5849

Fix cartpole-camera frame-stacking performance regression#5849
jmart-nv wants to merge 11 commits into
isaac-sim:developfrom
jmart-nv:jmart/framestack-perf

jmart-nv commented May 28, 2026 •

edited

Loading

Uh oh!

isaaclab-review-bot Bot left a comment •

edited

Loading

Uh oh!

Uh oh!

Uh oh!

Uh oh!

greptile-apps Bot commented Jun 3, 2026 •

edited

Loading

Comments Outside Diff (1)

Uh oh!

greptile-apps Bot Jun 3, 2026

Uh oh!

greptile-apps Bot Jun 3, 2026

Uh oh!

jmart-nv commented Jun 3, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

		if self._stack_dim_internal is None:
		raise RuntimeError("stacked is only available when CircularBuffer was created with stack_dim set.")

Conversation

jmart-nv commented May 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Type of change

Results

Screenshots

Environment

Checklist

Uh oh!

isaaclab-review-bot Bot left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

🤖 Isaac Lab Review Bot

Summary

Incremental Update (a86c6a7)

Test Coverage

CI Status

Verdict

Uh oh!

Uh oh!

Uh oh!

Uh oh!

greptile-apps Bot commented Jun 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Summary

Confidence Score: 3/5

Important Files Changed

Comments Outside Diff (1)

Uh oh!

greptile-apps Bot Jun 3, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot Jun 3, 2026

Choose a reason for hiding this comment

Uh oh!

jmart-nv commented Jun 3, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

jmart-nv commented May 28, 2026 •

edited

Loading

isaaclab-review-bot Bot left a comment •

edited

Loading

Incremental Update (`a86c6a7`)

greptile-apps Bot commented Jun 3, 2026 •

edited

Loading