ci(linux): build fat package with GGML_BACKEND_DL + GGML_CPU_ALL_VARIANTS #5

Open
Geramy wants to merge 1 commit into `lemonade` from `geramy/cpu-all-variants`

Geramy commented May 6, 2026

What

Switch the Linux x86_64 `ubuntu-latest-cmake` (cpu) and `ubuntu-latest-rocm` (HIP) builds from the AVX2/FMA/F16C portable baseline (#3) to a fat-package build:

```
-DGGML_NATIVE=OFF
-DGGML_BACKEND_DL=ON
-DGGML_CPU_ALL_VARIANTS=ON
```

The build now produces:

  • `libstable-diffusion.so`: main shared library
  • `libggml-cpu-sandybridge.so`: AVX
  • `libggml-cpu-haswell.so`: AVX2 + FMA + F16C
  • `libggml-cpu-skylakex.so`: haswell features + AVX-512F
  • `libggml-cpu-icelake.so`: skylakex features + AVX-512 VNNI
  • `libggml-cpu-alderlake.so`: haswell features + AVX-VNNI (no AVX-512)
  • `libggml-cpu-x64.so`: baseline x86-64 fallback (no AVX)

At runtime, `ggml_backend_load_all_from_path` (already wired by upstream PR leejet#1448) dlopens each variant, calls its exported `ggml_backend_score` function to check the host's CPU features, and keeps the highest-scoring match. The same zip works on an AVX-only Sandy Bridge laptop and on a recent AVX-512 server, without the runner-of-the-day AVX-512 lottery that crashed master-593.
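The "pick the highest-tier match" rule can be pictured with a small sketch. This is an illustrative model only, not ggml's loader: the feature names, the subset of variants, and the "most required features wins" score are all assumptions made for the example.

```python
# Illustrative model of variant selection. The feature sets and the scoring
# rule are assumptions for this sketch, not ggml's actual implementation.
VARIANTS = {
    "x64": set(),  # baseline fallback, always usable
    "sandybridge": {"avx"},
    "haswell": {"avx", "avx2", "fma", "f16c"},
    "skylakex": {"avx", "avx2", "fma", "f16c", "avx512f"},
    "icelake": {"avx", "avx2", "fma", "f16c", "avx512f", "avx512_vnni"},
}


def pick_variant(host_features):
    """Return the usable variant that requires the most CPU features."""
    host = set(host_features)
    # A variant is usable only if the host has every feature it requires.
    usable = {name: req for name, req in VARIANTS.items() if req <= host}
    # Score each usable variant by how many features it requires.
    return max(usable, key=lambda name: len(usable[name]))
```

With this model, a host that reports AVX2, FMA and F16C but no AVX-512 lands on `haswell`, and a host with no AVX at all still gets `x64` instead of a SIGILL.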

Tradeoff

Linux x86_64 zip: ~12 MB → ~50–80 MB. Acceptable IMO — Lemonade and similar consumers cache the extracted dir across model loads, so this is paid once at install. In exchange you stop choosing between portability and AVX-512 perf — you get both.

Why not Windows / macOS

  • Windows AVX2 build already pins `GGML_NATIVE=OFF -DGGML_AVX2=ON`. Could get the same fat-package treatment for symmetry, but that's a separate change and not required to fix the consumer-side AVX-512 SIGILL pattern.
  • macOS arm64: all Apple Silicon parts share a uniform NEON+DOTPROD+i8mm+bf16 baseline (M1+), so `-march=native` does not introduce a portability problem there. Current Metal flags are already optimal.

Verification plan

  • Trigger a workflow_dispatch release on this branch (`create_release: true`).
  • Inspect the resulting `sd-{hash}-bin-Linux-Ubuntu-24.04-x86_64.zip` — it should contain `libstable-diffusion.so` plus several `libggml-cpu-*.so` files.
  • On an AVX-512-less host: `./sd-server -m model.safetensors` → uses haswell variant, no SIGILL.
  • On an AVX-512 host: same command → uses skylakex or icelake variant, perf matches a native-compiled build.
  • Bump `sd-cpp` pins on lemonade-sdk/lemonade PR #1777 to the new tag and confirm `Test ollama (ubuntu-latest)` passes.
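The zip inspection step could be scripted along these lines. This is a minimal sketch: `check_package` is a hypothetical helper name, and it assumes the libraries may sit in any subdirectory of the archive, so it compares base names only.

```python
import fnmatch
import zipfile


def check_package(zip_file):
    """Return (has_main_library, sorted CPU variant libraries) for a release zip.

    zip_file may be a filesystem path or an open binary file-like object.
    """
    with zipfile.ZipFile(zip_file) as zf:
        # Compare base names only, since the archive layout is an assumption.
        names = [entry.rsplit("/", 1)[-1] for entry in zf.namelist()]
    has_main = "libstable-diffusion.so" in names
    variants = sorted(
        name for name in names if fnmatch.fnmatchcase(name, "libggml-cpu-*.so")
    )
    return has_main, variants
```

Calling `check_package` on the downloaded release zip should report the main library as present and list several `libggml-cpu-*.so` entries; an empty variant list would suggest the fat-package flags did not take effect.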

Lineage

Replaces #3's portable AVX2 baseline. #3 fixed the SIGILL but left AVX-512-class hosts running AVX2 code; this gets full perf on those hosts.

Replace the AVX2/FMA/F16C portable baseline (#3) with a fat-package build
that produces one libstable-diffusion.so plus a libggml-cpu-*.so per CPU
variant: sandybridge, haswell, skylakex (AVX-512F), icelake (AVX-512 +
VNNI), alderlake (AVX2 + AVX-VNNI), and a pure-x64 fallback.

At runtime ggml dlopens the variants and picks the highest-tier one the
host CPU supports. AVX-512 hosts get AVX-512 perf; older boxes fall back
gracefully — no -march=native runner lottery, no SIGILL.

Tradeoff: zip grows from ~12 MB → ~50–80 MB. Acceptable for a one-time
download, especially since downstream consumers (Lemonade) cache the
extracted directory across model loads.

Applied to ubuntu-latest-cmake (CPU) and ubuntu-latest-rocm (HIP), since
the HIPBLAS build still uses ggml CPU ops for parts of the pipeline.

Windows AVX2 already pins GGML_NATIVE=OFF + AVX2 only, and macOS arm64
shares a uniform NEON+DOTPROD+i8mm+bf16 baseline across all Apple Silicon
generations, so neither needs the same treatment.

Upstream PR leejet#1448 (commit b8079e2) wired the
runtime backend discovery code into libstable-diffusion.so already; this
just enables the build flag that produces the variant .so files.