
Conversation

@loci-dev

Mirrored from leejet/stable-diffusion.cpp#1195

Since we've gone so far as to support SDXS, now it's only a small step to also support Segmind's Vega model.
This is another gift for users of small devices like the Raspberry Pi.

There are only minor changes to the code, and they don't affect any other models except SDXS. Here I changed the test on the U-Net block (in model.cpp) from
diffusion_model.output_blocks.7.1
to
diffusion_model.output_blocks.3.1.transformer_blocks.1
so that both models can easily be distinguished with a single test.

Thank you very much.

@loci-dev loci-dev temporarily deployed to stable-diffusion-cpp-prod January 16, 2026 10:41 — with GitHub Actions Inactive
@loci-agentic-ai

Explore the complete analysis inside the Version Insights



Performance Review Report

Overview

This review analyzes performance changes between two versions of the stable-diffusion.cpp project following commits 97255f9 and 75c0b7f, which add support for the Segmind-Vega distilled model. The changes modified 5 files, added 3 new files, and deleted 3 files across two binaries: build.bin.sd-server and build.bin.sd-cli.

Power Consumption Impact

  • sd-server: 0.113% increase (498,292.69 → 498,856.43 nJ)
  • sd-cli: 0.013% increase (469,148.81 → 469,209.21 nJ)

Total power consumption increased by approximately 624 nanojoules, representing negligible energy impact across both binaries.

Performance Analysis

Intentional Feature Addition

The primary source code change was adding VERSION_SDXL_VEGA support to the sd_version_is_sdxl() function, which now checks five SDXL variants instead of four. This function appears in multiple compilation units and shows consistent performance impact:

  • Absolute increase: +10.7ns per call
  • Percentage increase: ~18.8%
  • Justification: The additional conditional check is necessary for Segmind-Vega model classification and routing to appropriate SDXL-specific inference paths

This change is functionally required and the performance cost is minimal given the function's sub-70ns execution time and role as a lightweight classifier used 25 times across the codebase.

Compiler-Level Variations

The majority of performance changes stem from compiler optimization differences rather than source code modifications:

Improvements:

  • f8_e4m3_to_f16: -212ns (18% faster) from consolidated entry blocks in FP8-to-FP16 conversion, critical for quantized model inference
  • std::vector::begin() (httplib): -181ns (68% faster) from eliminated intermediate branching in HTTP routing path
  • std::sub_match::_M_str: -186ns (49% faster) from entry block consolidation in regex operations

Regressions:

  • std::vector::end() (thread): +183ns (227% slower) from added control flow indirection
  • std::vector::_S_max_size: +212ns (151% slower) from unnecessary unconditional branch
  • std::vector::_M_swap_data: +73ns (43% slower) from extra basic blocks in Flux modulation vector operations

These STL function changes show no source code modifications and represent compiler code generation differences, likely from optimization flag changes, compiler version updates, or security instrumentation adjustments between builds.

Mixed Optimization

The shared_ptr::operator= assignment used for scheduler polymorphism shows a favorable trade-off: response time rose by 80ns (8.3%), but throughput improved by 103%, indicating better instruction cache locality from the code reorganization.

Conclusion

The performance changes reflect intentional feature enhancement (Segmind-Vega support) with acceptable overhead and compiler-level optimizations that produce mixed results. The net power consumption increase of 0.113% for sd-server and 0.013% for sd-cli is negligible. The absolute timing changes range from -212ns to +212ns, which are insignificant in the context of ML inference workloads that operate in millisecond-to-second timescales. The code changes successfully enable new model variant support while maintaining overall system efficiency.

@loci-dev loci-dev temporarily deployed to stable-diffusion-cpp-prod January 17, 2026 07:35 — with GitHub Actions Inactive
@loci-agentic-ai

Explore the complete analysis inside the Version Insights

@loci-dev loci-dev force-pushed the master branch 7 times, most recently from 243db15 to 436639f Compare January 23, 2026 15:11