Eliminate GPU sync overhead and CPU→GPU transfers across LTX2 pipeline#13564
ViktoriiaRomanova wants to merge 4 commits into huggingface:main from
Conversation
…or creation across the LTX2 pipeline, transformer, scheduler, and connector logic.
- Add `set_begin_index(0)` to schedulers to eliminate the DtoH sync in `_init_step_index`
- Replace `torch.tensor(..., device=...)` with on-device tensor construction for decode scaling
- Move RoPE-related tensor creation to the GPU to avoid memcpy overhead
- Refactor connector padding logic using vectorised masking instead of list-based ops
sayakpaul
left a comment
Thanks for the PR. Please add inline comments on the changes explaining how they eliminate the syncs.
```diff
-patch_size = (self.patch_size_t, self.patch_size, self.patch_size)
-patch_size_delta = torch.tensor(patch_size, dtype=grid.dtype, device=grid.device)
-patch_ends = grid + patch_size_delta.view(3, 1, 1, 1)
+patch_size_delta = torch.stack(
+    [
+        grid.new_ones(1) * self.patch_size_t,
+        grid.new_ones(1) * self.patch_size,
+        grid.new_ones(1) * self.patch_size,
+    ]
+).reshape(3, 1, 1, 1)
+patch_ends = grid + patch_size_delta
```
This refactor seems unnecessary.
This replaces host-side tensor construction with device-native ops to eliminate an implicit `cudaStreamSynchronize` (~60 ms).
This refactor eliminates a CPU→GPU sync: since the `patch_size` tuple lives on the CPU host, `torch.tensor(...)` must copy it to the GPU, which the refactor avoids by doing the construction on-device. Because the tuple is very small, though, the corresponding `cudaStreamSynchronize` block (and that of the similar `scale_tensor` refactor below) only takes about 2 ms. I think most of the removed non-scheduler sync time is in the connectors.py refactor.
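To make the pattern concrete, here is a minimal, device-agnostic sketch (runnable on CPU; the names `patch_size` and `grid` stand in for the pipeline's tensors) contrasting host-side tuple construction with on-device construction:

```python
import torch

patch_size = (1, 2, 2)  # hypothetical (patch_size_t, patch_size, patch_size)
grid = torch.zeros(3, 2, 2, 2)  # stand-in for the RoPE coordinate grid

# Host-side construction: torch.tensor(...) builds the tensor from a Python
# tuple on the host; on a CUDA device this implies a blocking HtoD copy.
delta_host = torch.tensor(patch_size, dtype=grid.dtype, device=grid.device).view(3, 1, 1, 1)

# Device-native construction: new_ones allocates directly on grid's device and
# the multiplies stay on-device, so no host-side staging or sync is needed.
delta_device = torch.stack([grid.new_ones(1) * s for s in patch_size]).reshape(3, 1, 1, 1)

assert torch.equal(delta_host, delta_device)
```

Both variants produce the same values; only where the construction happens differs.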
```diff
-scale_tensor = torch.tensor(self.scale_factors, device=latent_coords.device)
+scale_tensor = torch.stack([latent_coords.new_ones(1) * factor for factor in self.scale_factors])
```
Also avoids an implicit `cudaStreamSynchronize` (~60 ms) by replacing `torch.tensor(...)` with device-native tensor construction.
```diff
 # 5. Convert sigmas and timesteps to tensors and move to specified device
-sigmas = torch.from_numpy(sigmas).to(dtype=torch.float32, device=device)
+sigmas = torch.from_numpy(sigmas).pin_memory().to(dtype=torch.float32, device=device, non_blocking=True)
```
This is a hard no. We cannot pin memory and run the copy asynchronously inside the scheduler.
@sayakpaul Should I provide an alternative implementation to avoid the cudaStreamSynchronize, or is this sync considered acceptable in this case?
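For readers following the thread, a minimal sketch of the trade-off under discussion (the CUDA branch is guarded so the snippet also runs on CPU; variable names are illustrative, not the scheduler's):

```python
import numpy as np
import torch

sigmas_np = np.linspace(1.0, 0.0, 8, dtype=np.float32)

if torch.cuda.is_available():
    # The PR's variant: pinning the host staging buffer makes the HtoD copy
    # eligible to run asynchronously (non_blocking=True). The catch is that
    # pinned (page-locked) allocations are expensive and affect host memory
    # globally, which is why doing this inside a scheduler was rejected.
    sigmas = torch.from_numpy(sigmas_np).pin_memory().to(
        dtype=torch.float32, device="cuda", non_blocking=True
    )
else:
    # Plain path: torch.from_numpy shares memory with the NumPy array; the
    # .to(...) on a GPU target would be a synchronous HtoD copy.
    sigmas = torch.from_numpy(sigmas_np).to(dtype=torch.float32)

print(sigmas.shape, sigmas.dtype)
```

Note that `non_blocking=True` only actually overlaps the copy when the source tensor is in pinned memory; otherwise PyTorch falls back to a synchronous transfer.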
```diff
+self.scheduler.set_begin_index(0)
+audio_scheduler.set_begin_index(0)
```
Move it out of the `set_begin_index` `hasattr` check.
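For context on why `set_begin_index(0)` removes a device-to-host sync, here is a minimal sketch (a toy class, not the diffusers scheduler API) of the two code paths in step-index initialization:

```python
import torch

class TinyScheduler:
    """Toy illustration: with no begin index, finding the step index requires
    comparing against the timesteps tensor and calling .item(), which forces
    a device-to-host sync; with a preset begin index it is pure host-side
    bookkeeping."""

    def __init__(self, timesteps):
        self.timesteps = timesteps
        self._begin_index = None
        self._step_index = None

    def set_begin_index(self, begin_index=0):
        self._begin_index = begin_index

    def _init_step_index(self, timestep):
        if self._begin_index is None:
            # nonzero() + .item() materializes a GPU result on the host (DtoH sync).
            self._step_index = (self.timesteps == timestep).nonzero()[0].item()
        else:
            # No tensor ops at all: no sync.
            self._step_index = self._begin_index

sched = TinyScheduler(torch.tensor([999, 500, 0]))
sched.set_begin_index(0)
sched._init_step_index(torch.tensor(999))
assert sched._step_index == 0
```

The sync-free branch is why the PR calls `set_begin_index(0)` up front in the pipeline.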
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
@bot /style

Style bot fixed some files and pushed the changes.
dg845
left a comment
Thanks for the PR! I agree with #13564 (review) that having comments for the changes would be helpful.
Fixes performance issues identified by profiling LTX2 with `torch.profiler` as part of #13401.
Optimises LTX2 by removing unnecessary GPU synchronisation points and replacing CPU tensor creation with on-device tensor operations across the decoding pipeline, transformer RoPE computations, scheduler, and connector padding logic.
Pipeline Denoising Optimisation
Before (eager mode): *(profiler trace screenshot)*
Before (compile mode): *(profiler trace screenshot)*
After (compile mode, no sync gap): *(profiler trace screenshot)*
Transformer Model Optimisation
Replaced CPU tensor creation for patch sizes with on-device tensor construction.
Eliminates unnecessary CPU-to-GPU memcpy operations during RoPE coordinate preparation.
Connector Refactoring
Replaced list-comprehension-based padding logic with vectorised masking. This simplifies left-padding layout logic and eliminates unnecessary cudaStreamSynchronize calls.
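A small sketch of the general idea (illustrative shapes and names, not the connector's actual code): left-pad a batch of variable-length sequences either per-sample with Python lists, or in one shot with boolean masks.

```python
import torch

# Hypothetical batch: 3 sequences of valid lengths 2, 3, 1, padded to length 4.
lengths = torch.tensor([2, 3, 1])
max_len, dim = 4, 5
tokens = torch.randn(3, max_len, dim)  # valid tokens are right-padded at the end

# List-based left padding: one cat per sample, and lengths.tolist() itself
# forces a device-to-host sync when lengths lives on the GPU.
left_padded_list = torch.stack(
    [
        torch.cat([tokens.new_zeros(max_len - n, dim), tokens[i, :n]])
        for i, n in enumerate(lengths.tolist())
    ]
)

# Vectorised alternative: build source/destination masks and move all valid
# tokens in a single masked assignment, with no per-sample Python loop.
positions = torch.arange(max_len)
src_mask = positions.unsqueeze(0) < lengths.unsqueeze(1)                 # valid, right-padded
dst_mask = positions.unsqueeze(0) >= (max_len - lengths.unsqueeze(1))    # target, left-padded
left_padded_vec = tokens.new_zeros(3, max_len, dim)
left_padded_vec[dst_mask] = tokens[src_mask]

assert torch.equal(left_padded_list, left_padded_vec)
```

Masked assignment preserves per-row ordering, so the two layouts match exactly while the vectorised version stays entirely on-device.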
Performance Results
Profiler trace
https://drive.google.com/drive/folders/1cZn1xw-8Eon22mA2zP1uoF1nE4YCC3Wo?usp=drive_link
Who can review?
@sayakpaul @dg845