
Cache ModelMixin.dtype to avoid named_parameters walk per access#13571

Open
akshan-main wants to merge 1 commit into huggingface:main from akshan-main:cache-modelmixin-dtype
Conversation

@akshan-main
Contributor

@akshan-main akshan-main commented Apr 28, 2026

What does this PR do?

Addresses #13401

ModelMixin.dtype calls get_parameter_dtype(), which walks named_parameters() on every access. Pipelines call self.transformer.dtype / self.text_encoder.dtype / self.vae.dtype inside their denoise loops, so the walk fires on every step.

This PR caches the dtype on first access and invalidates via _apply (which .to(), .cpu(), .cuda(), .half(), .bfloat16() etc. all flow through). One small change benefits every pipeline that subclasses ModelMixin.

device is intentionally not cached: with group offloading, the effective device changes per-forward as groups onload/offload. Caching it would break that flow.

This is the same shape of fix as the centralized cache_context._set_context cache in #13356.
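The caching idea can be sketched as a small mixin. This is an illustrative shape only, not the actual diffusers implementation; attribute names like `_dtype_cache` and the `CachedDtypeMixin` class are hypothetical:

```python
import torch
import torch.nn as nn

class CachedDtypeMixin:
    """Sketch of the PR's idea: cache dtype on first access, invalidate
    in _apply. Names here are illustrative, not diffusers internals."""

    _dtype_cache = None

    @property
    def dtype(self) -> torch.dtype:
        # First access walks the parameters (what a get_parameter_dtype-style
        # helper does); later accesses return the cached value.
        if self._dtype_cache is None:
            self._dtype_cache = next(self.parameters()).dtype
        return self._dtype_cache

    def _apply(self, fn, *args, **kwargs):
        # .to(), .cpu(), .cuda(), .half(), .bfloat16() all funnel through
        # nn.Module._apply, so invalidating here covers every dtype change.
        self._dtype_cache = None
        return super()._apply(fn, *args, **kwargs)

class TinyModel(CachedDtypeMixin, nn.Module):
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(4, 4)

model = TinyModel()
print(model.dtype)  # torch.float32 (computed once, then cached)
model.half()        # flows through _apply, so the cache is cleared
print(model.dtype)  # torch.float16
```

Putting the invalidation in `_apply` rather than overriding each conversion method is what makes the change small: every dtype-changing entry point in `nn.Module` already funnels through it.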

  • Cache returns the same torch.dtype value get_parameter_dtype() would return; generation outputs are bit-identical.
  • .to() / .cpu() / .cuda() / .half() / .bfloat16() all flow through nn.Module._apply, so the cache is invalidated correctly when the actual dtype changes.
  • Microbench on AutoencoderKL: 87.81us → 0.09us per .dtype access (963x).
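The shape of that microbench can be reproduced with plain PyTorch. This is an illustrative harness, not the one used for the AutoencoderKL numbers above (the model and iteration counts here are arbitrary):

```python
import timeit
import torch.nn as nn

# Illustrative microbench shape only: compare a full parameter walk per
# .dtype access against a plain cached read. Absolute numbers will differ
# from the AutoencoderKL figures quoted in the PR.
model = nn.Sequential(*[nn.Linear(64, 64) for _ in range(200)])

def dtype_walk():
    # roughly what a named_parameters()-style walk per access costs
    last = None
    for p in model.parameters():
        last = p.dtype
    return last

cached_dtype = dtype_walk()  # computed once, then reused

n = 2_000
walk_us = timeit.timeit(dtype_walk, number=n) / n * 1e6
hit_us = timeit.timeit(lambda: cached_dtype, number=n) / n * 1e6
print(f"walk: {walk_us:.2f} us/access, cached: {hit_us:.2f} us/access")
```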

Profiling - surveyed across 10 pipelines (eager, 2 inference steps, H100)

| pipeline | `get_parameter_dtype` calls (before / after) | `get_parameter_dtype` total ms (before / after) | inter-step gap ms (before / after) | pipeline call ms (before / after) |
| --- | --- | --- | --- | --- |
| flux2 | 2 / 0 | 2.20 / 0.00 | 1.34 / 0.28 | 191.76 / 185.81 |
| qwenimage (full decode) | 1 / 0 | 1.07 / 0.00 | 0.09 / 0.09 | 1430.86 / 1421.92 |
| qwenimage_edit (full decode) | 1 / 0 | 1.04 / 0.00 | 0.12 / 0.13 | 3771.11 / 3772.56 |
| z_image | 2 / 0 | 4.21 / 0.00 | 5.46 / 3.51 | 852.15 / 845.42 |
| chroma | 0 / 0 | 0.00 / 0.00 | 0.05 / 0.06 | 1501.74 / 1499.49 |
| sdxl (full decode) | 2 / 0 | 1.80 / 0.00 | 0.00 / 0.00 | 328.60 / 330.17 |
| sana | 1 / 0 | 1.47 / 0.00 | 21.44 / 21.52 | 132.70 / 130.65 |
| hunyuanv15 (video) | 5 / 0 | 30.95 / 0.00 | 0.09 / 0.08 | 12884.55 / 12862.06 |
| wan2.2 (video) | 1 / 0 | 2.79 / 0.00 | 0.06 / 0.06 | 1891.21 / 1866.68 |
| ltx2 (video) | 0 / 0 | 0.00 / 0.00 | 166.77 / 166.82 | 3682.21 / 3679.66 |

The fix removes the walk wherever it appears (biggest impact on hunyuanv15: 30.95ms at 2 inference steps, and the cost scales linearly with num_inference_steps). On pipelines where the walk doesn't appear (chroma, ltx2), the fix is a no-op and there is no regression.
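Taking the linear-scaling claim at face value, a back-of-envelope extrapolation from the hunyuanv15 row (this projection is mine, not a measured figure from the PR):

```python
# hunyuanv15: 30.95 ms of get_parameter_dtype time measured at 2 steps.
# Assuming the cost is per-step and linear in num_inference_steps:
saved_at_2_steps_ms = 30.95
per_step_ms = saved_at_2_steps_ms / 2       # ~15.5 ms per step
print(per_step_ms * 50)                     # ~774 ms saved at 50 steps
```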

Reproduction notebook (Colab) - applies the central fix, profiles every pipeline before and after, consolidated table at bottom of notebook.

Before submitting

  • This PR fixes a typo or improves the docs.
  • Did you read the contributor guideline?
  • Did you read our philosophy doc?
  • Was this discussed/approved via a GitHub issue or the forum?
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?

Who can review?

@sayakpaul @dg845

@akshan-main
Contributor Author

akshan-main commented Apr 28, 2026

Profiled SD3 too (eager + compile, RTX PRO 6000 Blackwell, 2 steps) following the profiling guide.
Notebook: https://colab.research.google.com/gist/akshan-main/cb9ee83575806704e93e03496ba0d940/sd3_profiling.ipynb

Denoising loop is clean. 0 syncs in in_loop_body, in_transformer_forward, or in_scheduler_step after set_begin_index(0).

Pre-loop has 2x ~10ms aten::copy_ from scheduler.set_timesteps (numpy to GPU sigmas) and _get_clip_prompt_embeds (tokenizer ids to GPU). One 62ms aten::nonzero in the first _init_step_index call which set_begin_index(0) eliminates.
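The shape of that nonzero sync can be illustrated without the scheduler itself. This is a minimal sketch of the two lookup paths, not the actual scheduler code; the sigma schedule here is made up:

```python
import torch

# Finding the current step index by searching the timestep tensor
# (the _init_step_index-style path) materializes a nonzero() result and
# .item() forces a device sync. A preset begin index is a plain read.
timesteps = torch.linspace(1000, 0, 28)
t = timesteps[0]

# search path: nonzero() + .item() is where the sync happens on GPU
step_index = (timesteps == t).nonzero().item()

# set_begin_index(0)-style path: no tensor search at all
begin_index = 0
print(step_index, begin_index)  # 0 0
```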

Tested adding set_begin_index(0) (matches Flux/Wan/Flux2). Trace sync drops from 62ms to 0 but wall-clock is within noise:

| Mode | Before | After | Delta |
| --- | --- | --- | --- |
| Eager | 233.0 ± 0.6 ms | 231.9 ± 1.0 ms | -1.1 ms |
| Compile | 200.2 ± 0.3 ms | 199.9 ± 0.3 ms | -0.3 ms |

The sync was a queue drain: the GPU has to do that work anyway; the CPU just stops waiting for it. Unlike Z-Image (#13461), there is no per-step .item()/.cpu() to chase here, and the remaining pre-loop syncs are legitimate one-time copies. Not opening a PR for the SD3 profile.

@sayakpaul @dg845

@sayakpaul sayakpaul requested review from DN6 and yiyixuxu April 30, 2026 06:35
@akshan-main
Contributor Author

akshan-main commented May 1, 2026

Friendly ping @DN6 @yiyixuxu, would love a review today if time permits!

@akshan-main
Contributor Author

Friendly ping @yiyixuxu @DN6, could this be reviewed when possible?


Labels

models size/S PR with diff < 50 LOC
