Git commit
3d6064b
Operating System & Version
Windows 11 25H2
GGML backends
CUDA
Command-line arguments used
sd-cli.exe --diffusion-model "D:\AI\anima\split_files\diffusion_models\anima-preview3-base.safetensors" --vae "D:\AI\anima\split_files\vae\qwen_image_vae.safetensors" --llm "D:\AI\anima\split_files\text_encoders\qwen_3_06b_base.safetensors" -p "a lovely cat holding a sign says 'anima.cpp'" --cfg-scale 4.5 --fa -H 1024 -W 1024 --steps 20 --sampling-method euler_a --scheduler sgm_uniform -v
Steps to reproduce
Just clone and build with -DSD_CUDA=ON
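For completeness, the clone-and-build steps are roughly the following (the repository URL is assumed to be the upstream stable-diffusion.cpp; substitute your fork if different):

```shell
# Assumed upstream repo; adjust if building from a fork.
git clone --recursive https://github.com/leejet/stable-diffusion.cpp
cd stable-diffusion.cpp
# Configure with the CUDA backend enabled (Ninja generator, Release build).
cmake -B build -G Ninja -DSD_CUDA=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release
```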
What you expected to happen
Inference with high GPU usage.
I also run Linux on the same machine. I built sd.cpp there and used the same command.
On Linux, GPU usage stays above 80% until inference finishes, at 4 s/it by default or 1 s/it with --type f16.
What actually happened
On Windows, every step triggers "graph has different number of nodes" followed by "reallocating buffers automatically".
As a result, the GPU runs at ~90% usage for about 1 second, then sits idle for about 3 seconds waiting for the reallocation.
The first step takes 2 s/it, then 6 s/it, and eventually more than 10 s/it.
Logs / error messages / stack trace
[DEBUG] ggml_extend.hpp:1883 - anima compute buffer size: 206.05 MB(VRAM)
[DEBUG] ggml_extend.hpp:58 - ggml_gallocr_needs_realloc: graph has different number of nodes
[DEBUG] ggml_extend.hpp:58 - ggml_gallocr_alloc_graph: reallocating buffers automatically
[DEBUG] ggml_extend.hpp:58 - ggml_gallocr_reserve_n_impl: reallocating CUDA0 buffer from size 206.05 MiB to 206.06 MiB
|==> | 1/20 - 2.07s/it[DEBUG] ggml_extend.hpp:58 - ggml_gallocr_needs_realloc: graph has different number of nodes
[DEBUG] ggml_extend.hpp:58 - ggml_gallocr_alloc_graph: reallocating buffers automatically
[DEBUG] ggml_extend.hpp:58 - ggml_gallocr_needs_realloc: graph has different number of nodes
[DEBUG] ggml_extend.hpp:58 - ggml_gallocr_alloc_graph: reallocating buffers automatically
|=====> | 2/20 - 6.88s/it[DEBUG] ggml_extend.hpp:58 - ggml_gallocr_needs_realloc: graph has different number of nodes
[DEBUG] ggml_extend.hpp:58 - ggml_gallocr_alloc_graph: reallocating buffers automatically
[DEBUG] ggml_extend.hpp:58 - ggml_gallocr_needs_realloc: graph has different number of nodes
[DEBUG] ggml_extend.hpp:58 - ggml_gallocr_alloc_graph: reallocating buffers automatically
|=======> | 3/20 - 8.73s/it[DEBUG] ggml_extend.hpp:58 - ggml_gallocr_needs_realloc: graph has different number of nodes
[DEBUG] ggml_extend.hpp:58 - ggml_gallocr_alloc_graph: reallocating buffers automatically
[DEBUG] ggml_extend.hpp:58 - ggml_gallocr_needs_realloc: graph has different number of nodes
[DEBUG] ggml_extend.hpp:58 - ggml_gallocr_alloc_graph: reallocating buffers automatically
|==========> | 4/20 - 9.64s/it[DEBUG] ggml_extend.hpp:58 - ggml_gallocr_needs_realloc: graph has different number of nodes
[DEBUG] ggml_extend.hpp:58 - ggml_gallocr_alloc_graph: reallocating buffers automatically
[DEBUG] ggml_extend.hpp:58 - ggml_gallocr_needs_realloc: graph has different number of nodes
[DEBUG] ggml_extend.hpp:58 - ggml_gallocr_alloc_graph: reallocating buffers automatically
|============> | 5/20 - 10.16s/it[DEBUG] ggml_extend.hpp:58 - ggml_gallocr_needs_realloc: graph has different number of nodes
[DEBUG] ggml_extend.hpp:58 - ggml_gallocr_alloc_graph: reallocating buffers automatically
[DEBUG] ggml_extend.hpp:58 - ggml_gallocr_needs_realloc: graph has different number of nodes
[DEBUG] ggml_extend.hpp:58 - ggml_gallocr_alloc_graph: reallocating buffers automatically
|===============> | 6/20 - 10.49s/it[DEBUG] ggml_extend.hpp:58 - ggml_gallocr_needs_realloc: graph has different number of nodes
[DEBUG] ggml_extend.hpp:58 - ggml_gallocr_alloc_graph: reallocating buffers automatically
[DEBUG] ggml_extend.hpp:58 - ggml_gallocr_needs_realloc: graph has different number of nodes
[DEBUG] ggml_extend.hpp:58 - ggml_gallocr_alloc_graph: reallocating buffers automatically
|=================> | 7/20 - 10.72s/it[DEBUG] ggml_extend.hpp:58 - ggml_gallocr_needs_realloc: graph has different number of nodes
[DEBUG] ggml_extend.hpp:58 - ggml_gallocr_alloc_graph: reallocating buffers automatically
[DEBUG] ggml_extend.hpp:58 - ggml_gallocr_needs_realloc: graph has different number of nodes
[DEBUG] ggml_extend.hpp:58 - ggml_gallocr_alloc_graph: reallocating buffers automatically
|====================> | 8/20 - 11.01s/it[DEBUG] ggml_extend.hpp:58 - ggml_gallocr_needs_realloc: graph has different number of nodes
[DEBUG] ggml_extend.hpp:58 - ggml_gallocr_alloc_graph: reallocating buffers automatically
[DEBUG] ggml_extend.hpp:58 - ggml_gallocr_needs_realloc: graph has different number of nodes
[DEBUG] ggml_extend.hpp:58 - ggml_gallocr_alloc_graph: reallocating buffers automatically
|======================> | 9/20 - 11.15s/it[DEBUG] ggml_extend.hpp:58 - ggml_gallocr_needs_realloc: graph has different number of nodes
[DEBUG] ggml_extend.hpp:58 - ggml_gallocr_alloc_graph: reallocating buffers automatically
[DEBUG] ggml_extend.hpp:58 - ggml_gallocr_needs_realloc: graph has different number of nodes
[DEBUG] ggml_extend.hpp:58 - ggml_gallocr_alloc_graph: reallocating buffers automatically
|=========================> | 10/20 - 11.21s/it[DEBUG] ggml_extend.hpp:58 - ggml_gallocr_needs_realloc: graph has different number of nodes
[DEBUG] ggml_extend.hpp:58 - ggml_gallocr_alloc_graph: reallocating buffers automatically
[DEBUG] ggml_extend.hpp:58 - ggml_gallocr_needs_realloc: graph has different number of nodes
[DEBUG] ggml_extend.hpp:58 - ggml_gallocr_alloc_graph: reallocating buffers automatically
|===========================> | 11/20 - 11.30s/it[DEBUG] ggml_extend.hpp:58 - ggml_gallocr_needs_realloc: graph has different number of nodes
[DEBUG] ggml_extend.hpp:58 - ggml_gallocr_alloc_graph: reallocating buffers automatically
[DEBUG] ggml_extend.hpp:58 - ggml_gallocr_needs_realloc: graph has different number of nodes
[DEBUG] ggml_extend.hpp:58 - ggml_gallocr_alloc_graph: reallocating buffers automatically
^C
Additional context / environment details
CPU: Intel Ivy Bridge, which lacks AVX2 (the pre-built binaries crash on it).
GPU: NVIDIA RTX 2080Ti with 22GB VRAM
The behavior is the same with no quantization, f16, or q8_0.
Running with or without --fa makes no difference either.
Compiled with CUDA 13.2, VS Build Tools 2026, CMake 4.3.2, and Ninja 1.13.2.