
UPSTREAM PR #1322: feat: add spectrum caching method #78

Open
loci-dev wants to merge 2 commits into main from loci/pr-1322-spectrum

Conversation


@loci-dev loci-dev commented Mar 6, 2026

Note

Source pull request: leejet/stable-diffusion.cpp#1322

Yet another training-free acceleration method. This PR implements Adaptive Spectral Feature Forecasting for Diffusion Sampling Acceleration, currently for UNet models only; for DiT models, we already have enough options in my view.

This could replace and deprecate ucache, which was only ever an experimental method.

Example usage:

/build/bin/sd-cli -m models/model.safetensors -p "a cute cat" --steps 20 -H 1024 -W 1024 --fa -s 42 --cache-mode spectrum --scheduler simple --sampling-method euler
| Steps | Baseline | Spectrum |
|-------|----------|----------|
| 20 | output_20_baseline (×1.0) | output_20_spectrum (×1.82) |
| 30 | output_30_baseline (×1.0) | output_30_spectrum (×2.14) |
| 40 | output_40_baseline (×1.0) | output_40_spectrum (×2.50) |

@loci-dev loci-dev temporarily deployed to stable-diffusion-cpp-prod March 6, 2026 04:14 — with GitHub Actions Inactive

loci-review bot commented Mar 6, 2026

Overview

Analysis of commit e2a3d0c ("add spectrum") comparing base and target versions across build.bin.sd-server and build.bin.sd-cli binaries. Total functions: 49,806 (120 modified, 66 new, 0 removed). Power consumption increased minimally: build.bin.sd-server +0.25% (527,149 nJ → 528,461 nJ), build.bin.sd-cli +0.22% (491,453 nJ → 492,534 nJ).

Function Analysis

sd_cache_params_init (both binaries): Response and throughput time increased +94ns (+58%), adding initialization for seven Spectrum caching parameters (spectrum_w, spectrum_m, spectrum_lam, spectrum_window_size, spectrum_flex_window, spectrum_warmup_steps, spectrum_stop_percent). This one-time initialization cost enables step-skipping optimization with a potential 2-3x inference speedup.

Standard library improvements: std::vector::begin() improved by -181ns (-68%) in both binaries, std::basic_string::_M_set_length() (sd-cli) by -77ns (-54%), and std::vector::_S_max_size() (sd-cli) by -203ns (-63%). These compiler optimizations benefit text processing and Spectrum's history-buffer operations.

Standard library regressions: std::vector<TensorStorage*>::end() (sd-server) +183ns (+227%), __gnu_cxx::__ops::__pred_iter (sd-cli) +169ns (+213%), and std::shared_ptr::_M_destroy() (sd-server) +189ns (+180%). No source code changes—regressions stem from compiler optimization variations in non-critical paths (model loading, backend management, cleanup operations).

Other analyzed functions showed sub-microsecond changes in non-critical paths including HTTP utilities and metadata accessors.

Additional Findings

Spectrum caching targets the diffusion denoising loop—the primary performance bottleneck—by predicting when steps can be skipped. No GPU kernels or GGML operations were negatively impacted. The 94ns initialization overhead is negligible compared to potential millisecond-scale savings from skipping expensive denoising steps. String operation improvements particularly benefit CLIP tokenization and T5 encoding pipelines.

🔎 Full breakdown: Loci Inspector
💬 Questions? Tag @loci-dev


loci-review bot commented Mar 8, 2026

Overview

Analysis of 49,812 functions (126 modified, 66 new, 0 removed) across two binaries shows minimal performance impact from Spectrum cache implementation.

Power Consumption:

  • build.bin.sd-server: +0.164% (+863.41 nJ)
  • build.bin.sd-cli: +0.342% (+1,681.77 nJ)

Function Analysis

Intentional Feature Addition:

  • sd_cache_params_init (both binaries): +94ns (+58%) - Initializes 7 new Spectrum cache parameters (spectrum_w, spectrum_m, spectrum_lam, spectrum_window_size, spectrum_flex_window, spectrum_warmup_steps, spectrum_stop_percent). One-time setup cost enabling intelligent denoising step-skipping during inference.

Compiler-Induced Changes (STL functions, no source modifications):

  • std::vector::end() (both binaries): +183ns (+227-307%) - Added indirect jump pattern at entry
  • std::vector::begin() (both binaries): -181ns (-68-74%) - Optimized block consolidation (9→7 blocks)
  • std::shared_ptr::_M_destroy (LCMScheduler): +189ns (+61%) - Extra branching indirection at entry
  • std::basic_string::_M_set_length: -77ns (-41-54%) - Entry block optimization
  • std::vector::_S_max_size: -203ns (-57-63%) - Dead code elimination

Other analyzed functions (arange, T5Runner::get_desc, chrono::operator-, all_of, basic_string::_M_disjunct) showed compiler-generated code layout changes with minimal real-world impact.

Additional Findings

All impacted functions are outside the critical denoising loop. The 94ns initialization overhead enables runtime step-skipping optimization (potential 10-30% inference speedup). Compiler optimizations in some STL functions (-181ns to -203ns) partially offset regressions in others (+183ns to +189ns), resulting in negligible net impact (~519ns total across all functions, representing 0.000002-0.0000052% of typical inference time).

🔎 Full breakdown: Loci Inspector
💬 Questions? Tag @loci-dev
