UPSTREAM PR #1322: feat: add spectrum caching method#78
UPSTREAM PR #1322: feat: add spectrum caching method#78
Conversation
OverviewAnalysis of commit e2a3d0c ("add spectrum") comparing base and target versions across build.bin.sd-server and build.bin.sd-cli binaries. Total functions: 49,806 (120 modified, 66 new, 0 removed). Power consumption increased minimally: build.bin.sd-server +0.25% (527,149 nJ → 528,461 nJ), build.bin.sd-cli +0.22% (491,453 nJ → 492,534 nJ). Function Analysissd_cache_params_init (both binaries): Response and throughput time increased +94ns (+58%), adding initialization for 7-8 Spectrum caching parameters (spectrum_w, spectrum_m, spectrum_lam, spectrum_window_size, spectrum_flex_window, spectrum_warmup_steps, spectrum_stop_percent). This one-time initialization cost enables step-skipping optimization with potential 2-3x inference speedup. Standard library improvements: std::vector::begin() (sd-server) improved -181ns (-68%), std::vector::begin() (sd-cli) -181ns (-68%), std::basic_string::_M_set_length() (sd-cli) -77ns (-54%), and std::vector::_S_max_size() (sd-cli) -203ns (-63%). These compiler optimizations benefit text processing and Spectrum's history buffer operations. Standard library regressions: std::vector<TensorStorage*>::end() (sd-server) +183ns (+227%), __gnu_cxx::__ops::__pred_iter (sd-cli) +169ns (+213%), and std::shared_ptr::_M_destroy() (sd-server) +189ns (+180%). No source code changes—regressions stem from compiler optimization variations in non-critical paths (model loading, backend management, cleanup operations). Other analyzed functions showed sub-microsecond changes in non-critical paths including HTTP utilities and metadata accessors. Additional FindingsSpectrum caching targets the diffusion denoising loop—the primary performance bottleneck—by predicting when steps can be skipped. No GPU kernels or GGML operations were negatively impacted. The 94ns initialization overhead is negligible compared to potential millisecond-scale savings from skipping expensive denoising steps. String operation improvements particularly benefit CLIP tokenization and T5 encoding pipelines. 🔎 Full breakdown: Loci Inspector |
OverviewAnalysis of 49,812 functions (126 modified, 66 new, 0 removed) across two binaries shows minimal performance impact from Spectrum cache implementation. Power Consumption:
Function AnalysisIntentional Feature Addition:
Compiler-Induced Changes (STL functions, no source modifications):
Other analyzed functions (arange, T5Runner::get_desc, chrono::operator-, all_of, basic_string::_M_disjunct) showed compiler-generated code layout changes with minimal real-world impact. Additional FindingsAll impacted functions are outside the critical denoising loop. The 94ns initialization overhead enables runtime step-skipping optimization (potential 10-30% inference speedup). Compiler optimizations in some STL functions (-181ns to -203ns) partially offset regressions in others (+183ns to +189ns), resulting in negligible net impact (~519ns total across all functions, representing 0.000002-0.0000052% of typical inference time). 🔎 Full breakdown: Loci Inspector |
Note
Source pull request: leejet/stable-diffusion.cpp#1322
Yet another training-free acceleration method. This PR implements Adaptive Spectral Feature Forecasting for Diffusion Sampling Acceleration, currently for UNet only. For DiT models, we already have enough options in my view
This could replace and deprecate
ucache, which was only ever an experimental methodExample usage:
/build/bin/sd-cli -m models/model.safetensors -p "a cute cat" --steps 20 -H 1024 -W 1024 --fa -s 42 --cache-mode spectrum --scheduler simple --sampling-method euler