
Conversation

@loci-dev

Mirrored from leejet/stable-diffusion.cpp#1222

for cases like leejet/stable-diffusion.cpp#1220

The CLI should no longer save an empty image to disk if generation fails. If no image is successfully saved, it should exit with error code -1 instead of 0. (Maybe I should change it to return -1 if any image fails to save?)
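A minimal, self-contained sketch of the exit-code logic described above; all names here (`image_t`, `save_png`, `results`) are illustrative placeholders, not the actual stable-diffusion.cpp API:

```cpp
#include <cstdio>
#include <vector>

// Hypothetical stand-in for a generation result; names are illustrative.
struct image_t {
    unsigned char* data = nullptr;  // nullptr marks a failed generation
};

static bool save_png(const image_t& img, const char* path) {
    if (img.data == nullptr) {
        return false;  // never write an empty image to disk
    }
    std::printf("saved %s\n", path);  // stand-in for the real PNG encoder
    return true;
}

int main() {
    std::vector<image_t> results(4);  // stand-in for the generation results
    int saved = 0;
    for (size_t i = 0; i < results.size(); i++) {
        char path[64];
        std::snprintf(path, sizeof(path), "output_%zu.png", i);
        if (save_png(results[i], path)) {
            saved++;
        }
    }
    // Exit with -1 when nothing was saved, instead of the previous 0,
    // so shell scripts can detect the failure.
    return saved == 0 ? -1 : 0;
}
```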

@loci-dev temporarily deployed to stable-diffusion-cpp-prod on January 21, 2026 at 17:49 with GitHub Actions
@loci-agentic-ai

Performance Review Report: Stable Diffusion C++ - VAE Error Handling Improvements

Impact Classification: Moderate

Commit: 6aa3b9d "improve handling of VAE decode failures"
Binaries Analyzed: build.bin.sd-cli, build.bin.sd-server
Functions Analyzed: 14 (3 performance-critical, 11 utility/STL)

Executive Summary

The target version introduces error handling improvements for VAE decode operations, resulting in 7-8% response time increases (+4,600 nanoseconds absolute) in three GPU-critical functions. The changes prioritize production reliability over raw performance, with negligible user-facing impact (<0.001% of total generation time).

Performance-Critical Functions Impacted

1. StableDiffusionGGML::preview_image() [both binaries]

  • Base: 56,279-56,385 ns → Target: 60,881-60,977 ns
  • Change: +4,591-4,601 ns (+8.14-8.18%)
  • Called 20-50 times per generation for real-time previews

2. StableDiffusionGGML::decode_first_stage() [sd-cli]

  • Base: 63,713 ns → Target: 68,312 ns
  • Change: +4,599 ns (+7.22%)
  • Critical VAE decoder for final latent-to-image conversion

3. UpscalerGGML::upscale() [sd-cli]

  • Base: 61,032 ns → Target: 65,640 ns
  • Change: +4,608 ns (+7.55%)
  • ESRGAN 4x super-resolution with 345 convolution operations

Code Changes and Justification

All three functions received identical error handling enhancements:

  • Added return statements in tiling lambda callbacks to propagate compute() status
  • Implemented explicit error checking after VAE/TAE compute operations
  • Added free_compute_buffer() calls on error paths to prevent GPU memory leaks
  • Added memset initialization for result buffers

These changes prevent silent GPU failures, enable graceful degradation under memory constraints, and ensure proper resource cleanup—critical for production ML inference systems.
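A condensed sketch of the pattern the bullets describe. compute() and free_compute_buffer() are named in the report itself; everything else here (Tensor, Runner, process_tiles, the single-tile loop) is illustrative scaffolding, not the real stable-diffusion.cpp code:

```cpp
#include <cstring>     // memset
#include <functional>

// Hypothetical minimal types standing in for the real GGML-backed runner.
struct Tensor {
    float* data   = nullptr;
    size_t nbytes = 0;
};

struct Runner {
    bool compute(Tensor* /*in*/, Tensor* /*out*/) { return true; }  // stub
    void free_compute_buffer() {}                                   // stub
};

// Runs the callback per tile and stops on the first failure
// (single-tile stand-in for the real tiling loop).
static bool process_tiles(Tensor* in, Tensor* out,
                          const std::function<bool(Tensor*, Tensor*)>& cb) {
    return cb(in, out);
}

bool decode_first_stage_sketch(Runner* vae, Tensor* latent, Tensor* result) {
    // Zero the result buffer so a failed decode never yields garbage pixels.
    memset(result->data, 0, result->nbytes);

    // The tiling lambda now returns compute()'s status instead of discarding it.
    auto on_tile = [&](Tensor* in, Tensor* out) -> bool {
        return vae->compute(in, out);
    };

    bool ok = process_tiles(latent, result, on_tile);

    // Explicit check after the VAE/TAE compute; the buffer is freed on the
    // error path as well, so a failed decode cannot leak GPU memory.
    vae->free_compute_buffer();
    return ok;
}
```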

Cumulative Impact Analysis

Typical Generation (20 steps):

  • Preview overhead: 20 × 4.6 μs = 92 μs
  • Final decode overhead: 4.6 μs
  • Total: ~97 μs on 10-30 second generation (0.0003-0.001%)
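For concreteness, the arithmetic behind that percentage range, using the figures from the report:

```latex
20 \times 4.6\,\mu\text{s} + 4.6\,\mu\text{s} = 96.6\,\mu\text{s} \approx 97\,\mu\text{s}
\frac{96.6\,\mu\text{s}}{30\,\text{s}} \approx 0.0003\%,
\qquad
\frac{96.6\,\mu\text{s}}{10\,\text{s}} \approx 0.001\%
```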

Power Consumption: Negligible increase (<0.01W), imperceptible in 50-150W GPU inference workloads.

Additional Findings

Eleven STL functions show compiler-driven optimizations (no source changes) with 23-75% improvements in various metrics, providing distributed performance gains that partially offset error handling overhead. Notable: std::vector operations improved 40-75%, std::unordered_map::find() gained 44% throughput.

Conclusion

The 4.6 microsecond overhead per GPU operation is justified by substantial reliability improvements. The changes prevent crashes, memory leaks, and silent failures while adding sub-millisecond latency to multi-second operations. This represents mature engineering prioritizing correctness in production ML systems.
See the complete breakdown in Version Insights
Have questions? Tag @loci-dev to ask about this PR.

@noamshiovits-Aurora

@loci-dev what's the most substantial change in this PR?

@loci-dev force-pushed the master branch 2 times, most recently from 243db15 to 436639f on January 23, 2026 at 15:11