
feat: cross-stage offload modes and layer-streaming for low-VRAM GPUs #1477

Open

fszontagh wants to merge 68 commits into leejet:master from fszontagh:feature/vram-offloading-v2

Conversation

@fszontagh
Contributor

feat: cross-stage offload modes and layer-streaming for low-VRAM GPUs

Why

Two problems that come up on small GPUs running large diffusion models:

  1. Cross-stage component placement. Where does the text encoder live while diffusion runs? Where does diffusion go while the VAE decodes? On a 12 GB card running an 11.5 GB diffusion model, we need to move components in and out between stages or VAE decode hits OOM.
  2. Models that don't fit at all. When the diffusion weights themselves exceed VRAM, we need to stream them in per-layer rather than load all at once.

This PR adds a single new flag, --offload-mode, that handles cross-stage placement, plus a per-layer streaming path (--offload-mode layer_streaming) for the doesn't-fit-at-all case.

New CLI flags

| Flag | Description |
| --- | --- |
| --offload-mode <mode> | One of none, cond_only, cond_diffusion, aggressive, layer_streaming. Default none. |
| --offload-cond-stage / --no-offload-cond-stage | Override the cond-stage offload decision. |
| --offload-diffusion / --no-offload-diffusion | Override the diffusion-model offload decision. |
| --offload-log / --no-offload-log | Log offload events to stderr. |
| --vram-estimation <method> | dryrun (probe graph) or formula (analytic). |
| --streaming-prefetch <N> | Layers to prefetch ahead during streaming. Default 1. |
| --streaming-min-vram <MB> | Minimum free VRAM kept during streaming. Default 512. |

What each mode does

| Mode | What it does | Use case |
| --- | --- | --- |
| none (default) | No offload. Identical to current master behaviour. | Default; everything fits on GPU. |
| cond_only | Move text encoder to CPU after conditioning, keep diffusion on GPU. | Tight VRAM during diffusion. |
| cond_diffusion | Move both text encoder and diffusion model out between stages, swap them in for their stage. | VAE decode needs room; diffusion is too big to coexist with the VAE compute buffer. |
| aggressive | Evict every component as soon as it's not actively used; reload on demand. | Lowest VRAM footprint at any moment; pays reload costs each transition. |
| layer_streaming | Diffusion weights live in pinned host RAM; each transformer block uploads to GPU just before it runs and is evicted afterwards. Async prefetch keeps PCIe full. | Models that don't fit at all (Z-Image bf16 11.5 GB on 12 GB card). |
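A minimal usage sketch (binary name, model paths, and prompt are illustrative; the offload-related flags are the ones introduced by this PR):

```sh
# Model larger than VRAM: stream transformer blocks, prefetch 2 ahead,
# use the more accurate dry-run VRAM estimator, and log offload events.
./sd -m z_image_turbo_bf16.gguf -p "a cat" \
    --offload-mode layer_streaming \
    --streaming-prefetch 2 --streaming-min-vram 512 \
    --vram-estimation dryrun --offload-log

# Model fits, but the VAE compute buffer doesn't: evict between stages instead.
./sd -m z_image_turbo_q8.gguf -p "a cat" --offload-mode cond_diffusion
```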

How layer streaming works

Three pieces, each a known-but-effective optimization at a different layer of the stack:

  1. Pinned host buffer for streamed weights, so cudaMemcpyAsync actually goes async (a pageable source falls through to a synchronous bounce-buffer copy in the driver).
  2. Per-layer prefetch overlapped with the previous layer's compute - the next layer's H2D starts on a separate stream while the current kernel is still running.
  3. Chunk graph for the resident block - layers that fit on GPU stay there across sampling steps and run as one combined ggml graph dispatch instead of one mini-graph per layer.
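A compressed sketch of what that loop looks like against ggml's backend API. The helper names (layer_tensor_pairs, compute_layer, evict_layer) are hypothetical stand-ins for the PR's TensorRegistry/LayerExecutionEngine machinery, and the real implementation drives uploads on a separate copy stream rather than the single synchronize shown here:

```cpp
// Illustrative sketch only; helper functions below are hypothetical stand-ins.
#include "ggml-backend.h"
#include <cstddef>
#include <utility>
#include <vector>

// (host tensor in pinned RAM, GPU shadow tensor) pairs for one transformer block
std::vector<std::pair<ggml_tensor *, ggml_tensor *>> layer_tensor_pairs(int layer);
void compute_layer(ggml_backend_t gpu, int layer);  // dispatch one block's graph
void evict_layer(int layer);                        // drop the block's GPU copy

void stream_layers(ggml_backend_t cpu, ggml_backend_t gpu,
                   size_t total_param_bytes, int n_layers) {
    // 1) streamed weights live in a pinned host buffer so H2D copies can go async
    ggml_backend_buffer_type_t host_buft =
        ggml_backend_dev_host_buffer_type(ggml_backend_get_device(gpu));
    ggml_backend_buffer_t host_buf =
        ggml_backend_buft_alloc_buffer(host_buft, total_param_bytes);
    // (weight tensors would be created inside host_buf; if the pinned alloc
    //  fails, fall back to pageable memory and accept slower copies)

    // 2) prefetch block i+1 while block i computes
    for (int i = 0; i < n_layers; ++i) {
        if (i + 1 < n_layers) {
            for (auto & [host_t, gpu_t] : layer_tensor_pairs(i + 1)) {
                ggml_backend_tensor_copy_async(cpu, gpu, host_t, gpu_t);
            }
        }
        compute_layer(gpu, i);
        ggml_backend_synchronize(gpu);  // block i+1's weights have landed by now
        evict_layer(i);                 // 3) resident blocks skip this eviction and
                                        //    run as one combined chunk graph (not shown)
    }
    if (host_buf) {
        ggml_backend_buffer_free(host_buf);
    }
}
```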

A unified VRAM heuristic decides automatically which layers stay resident and which stream, based on actual free VRAM. Users don't have to pick a budget manually.
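As a rough illustration of that decision (the function name and the 768 MB / 512 MB reserves are illustrative here, echoing figures used elsewhere in this branch; the actual logic lives in memory_budget.hpp / compute_resident_block_count):

```cpp
// Sketch of the resident-vs-streamed split: keep as many leading blocks on GPU
// as fit after reserving room for the compute buffer, prefetch, and a margin.
#include "ggml-backend.h"
#include <algorithm>
#include <cstddef>

int resident_layer_count(ggml_backend_dev_t dev, size_t per_layer_bytes,
                         int n_layers, int prefetch_layers) {
    size_t free_vram = 0, total_vram = 0;
    ggml_backend_dev_memory(dev, &free_vram, &total_vram);  // real free VRAM, not a guess

    const size_t compute_buf  = 768ull * 1024 * 1024;        // room for the compute graph
    const size_t safety       = 512ull * 1024 * 1024;        // --streaming-min-vram style margin
    const size_t prefetch_hdr = (size_t)prefetch_layers * per_layer_bytes;

    if (free_vram <= compute_buf + safety + prefetch_hdr) {
        return 0;                                            // everything streams
    }
    size_t budget = free_vram - compute_buf - safety - prefetch_hdr;
    return std::min<int>(n_layers, (int)(budget / per_layer_bytes));
}
```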

Benchmarks - RTX 3060 (12 GB), PCIe 3.0 x16

Hardware: RTX 3060 12 GB. The card itself supports PCIe 4.0, but the board is DDR3-era so the slot is capped at PCIe 3.0 x16 (8.0 GT/s). PCIe bandwidth is the dominant cost during streaming, so faster boards (PCIe 4.0 x16, ~24 GB/s practical) should reduce these numbers materially.

All numbers below: batch_count=4, steps=12, resolution=688x1024, LoRA applied at runtime, same prompt/seed across configs.

Z-Image-Turbo bf16 (11.5 GB diffusion model — does NOT fit in 12 GB)

Workload: 4 images per generation, 12 sampling steps each, batch=4. This is where streaming matters most — without offload of some kind, the model can't even load.

| Config | generate_image | Notes |
| --- | --- | --- |
| --offload-mode layer_streaming | 175 s | This PR. GPU utilization steady >90%; effective PCIe TX ~3.5 GB/s during streaming windows. |
| --offload-to-cpu --max-vram 9 | 335 s | Existing graph-cut path. ~2× slower. |

Z-Image-Turbo Q8 (6.7 GB diffusion model — fits in VRAM, but VAE compute buffer doesn't)

Workload: 4 images per generation, 12 sampling steps each, batch=4. When the model fits, streaming gives up most of its advantage and the simpler existing offload paths are slightly faster. Listed for completeness.

| Config | generate_image | Notes |
| --- | --- | --- |
| --offload-to-cpu | 115 s | Fastest when model fits. |
| --vae-tiling | 118 s | Tile VAE compute on GPU. |
| --offload-mode layer_streaming | 122 s | Auto-picks coarse-stage; still goes through streaming bookkeeping (~6% overhead). |
| --offload-to-cpu --max-vram 6 | 152 s | Graph-cut adds dispatch overhead even when params fit. |
| --vae-on-cpu | 602 s | Reference; VAE on CPU is brutal. |

So the recommendation in the docs is: pick --offload-mode layer_streaming when the model doesn't fit (where it's ~2× faster than alternatives), and stick with the existing --offload-to-cpu (or no offload) when it does. --offload-mode none (default) keeps current master behaviour.

Architectures

The streaming runtime is shared via tensor_registry.hpp, layer_streaming.hpp, memory_budget.hpp. Verified end-to-end on RTX 3060:

  • Z-Image / Z-Image-Turbo (bf16 + Q8) - primary target
  • Flux schnell
  • Anima
  • Qwen Image

Implemented and built, but not personally verified by me - I'd appreciate someone with the hardware/models confirming:

  • MMDiT / SD3
  • UNet (SD1.x / SDXL)
  • WAN

Known issues

  • --lora-apply-mode immediately + --offload-mode layer_streaming crashes - the immediate path reaches into weight buffers that haven't been uploaded to GPU yet under streaming. Use at_runtime (the default auto already picks this in streaming mode). This is a pre-existing class of issue surfaced by streaming.
  • VRAM estimation isn't perfect; dryrun is more accurate but adds a small startup cost. Switch to dryrun if you hit OOM during the first step.
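As a concrete sketch covering both items (binary name and model path illustrative):

```sh
# Runtime LoRA path plus the dry-run estimator when streaming a large model.
./sd -m model.safetensors -p "a cat" --lora-apply-mode at_runtime \
    --offload-mode layer_streaming --vram-estimation dryrun
```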

Backwards compatibility

Default behaviour is unchanged. --offload-mode none matches current master byte-for-byte. All new flags are opt-in.

Bug fixes folded in

While exercising the offload paths I found and fixed a small set of pre-existing bugs. They're independent of the new offload modes and benefit users who never set --offload-mode. Happy to split these into a separate small PR if preferred.

  • GGMLRunner destructor leaked runtime_params_buffer and partial_runtime_params_buffer. free_params_buffer() only released the CPU-side params_buffer. When the runner had been staged onto the runtime backend (any offload mode active, including the segmented offload from #1476, "feat: add max-vram based segmented param offload"), the GPU-side weight buffer(s) leaked on destruction. Real leak under LoRA + offload — many short-lived runners are created during LoRA application. Two-line addition to the destructor (sketched just after this list).
  • CFG causing redundant model reloads under streaming.
  • t_emb buffer aliasing in Z-Image's per-layer path.
  • GGMLRunner scratch-buffer reuse.
  • VAE-encode OOM in aggressive mode.
  • Includes the empty-MultiLoraAdapter fix from #1469 ("Skip empty MultiLoraAdapter when no LoRAs target a model", already merged into master); will rebase to drop that commit at PR time.
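A minimal sketch of the destructor fix from the first bullet above, written with the member names used in that description (the real GGMLRunner destructor does more cleanup than this fragment shows):

```cpp
// Fragment of ~GGMLRunner(): release the runtime-backend buffers as well,
// not only the CPU-side params_buffer.
GGMLRunner::~GGMLRunner() {
    if (params_buffer != nullptr) {
        ggml_backend_buffer_free(params_buffer);                  // CPU-side weights
    }
    if (runtime_params_buffer != nullptr) {
        ggml_backend_buffer_free(runtime_params_buffer);          // GPU-side weights (previously leaked)
    }
    if (partial_runtime_params_buffer != nullptr) {
        ggml_backend_buffer_free(partial_runtime_params_buffer);  // segmented-offload buffer (#1476 path, previously leaked)
    }
}
```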

Documentation

docs/vram_offloading.md covers the modes, decision tree, and example commands.

fszontagh added 30 commits March 4, 2026 07:34
Add runtime tensor offloading to enable running large models (Q8+)
on GPUs with limited VRAM by dynamically moving components between
GPU and CPU memory.

- `cond_only`: Offload cond_stage (LLM/CLIP) after conditioning
- `cond_diffusion`: Offload both cond_stage and diffusion after use
- `aggressive`: Offload each component immediately after use

- Add OffloadConfig struct with mode, flags for cond_stage/diffusion
- Add move_params_to_cpu/gpu methods to GGMLRunner
- Add set_auto_offload() to control automatic offloading behavior
- Implement on-demand reload before conditioning/diffusion steps
- Track VRAM usage for offloaded components

Enables 1024x1024 generation with Z-Image Q8 (~7GB) + Qwen3-4B Q8
(~4GB) + VAE (~320MB) on 12GB GPU by offloading the ~4GB LLM after
conditioning completes, freeing VRAM for diffusion compute buffers.

Without offloading: CUDA OOM during diffusion
With cond_only offload: Successful generation in ~66s

Tested configurations:
- offload_mode=none: OOM at 1024x1024 with Q8 models
- offload_mode=cond_only: Success, ~66s generation time
- offload_mode=cond_only + vae_tiling: Success, ~149s
Expose the dynamic tensor offloading feature through CLI options:
- --offload-mode: Set offload mode (none, cond_only, cond_diffusion, aggressive)
- --offload-log: Enable offload event logging
- --no-offload-log: Disable offload event logging

The cond_only mode is particularly useful for 12GB GPUs running large
Q8 models with LLMs, as it offloads the LLM/CLIP to CPU after
conditioning, freeing VRAM for diffusion compute buffers.

Changes:
- Add sd_offload_mode_name() and str_to_offload_mode() helper functions
- Add sd_offload_config_init() for default configuration
- Add offload_config member to SDContextParams
- Wire offload_config through to_sd_ctx_params_t()
- Add CLI options in get_options()
When dynamic offloading is enabled and the LLM/CLIP model was offloaded
to CPU, attempting to reload it to GPU could fail if there's not enough
VRAM available. Previously, the code logged a misleading warning
"conditioning will run on CPU (slower)" but then crashed (SEGV) because:

1. move_params_to_gpu() failed and returned false
2. Code continued to call get_learned_condition()
3. compute() tried offload_params_to_runtime_backend() which failed again
4. compute() returned false but caller didn't check return value
5. Code tried to use uninitialized data, causing SEGV

Fix:
- Return NULL from generate_image/generate_video when GPU reload fails
- Return false from load() if initial GPU move fails
- This gives callers a proper error to handle instead of crashing

The user will see a clear error message suggesting to reduce resolution,
use smaller models, or disable dynamic offloading.
When offload_mode is enabled and LoRAs are being applied, the cond_stage
(LLM/CLIP) may still be on GPU from initial model loading. This uses up
VRAM and causes LoRA allocation to fail with OOM.

Fix: Before applying LoRAs in generate_image(), check if:
1. offload_mode is enabled
2. offload_cond_stage is true
3. We have LoRAs to apply
4. cond_stage is currently on GPU

If all conditions are met, offload cond_stage to CPU first to free VRAM
for LoRA allocation. The cond_stage will be reloaded on-demand before
conditioning runs.

This allows using LoRAs with large LLM models (like qwen3-4b) on 12GB GPUs
that would otherwise OOM during LoRA allocation.
When cond_stage reload fails due to LoRA buffers using VRAM:
1. Free LoRA buffers to make room
2. Retry cond_stage reload
3. Reload LoRA weights from disk

Added reload_params() method to LoraModel to support reloading
weights after buffer is freed and reallocated.

This enables using LoRA with cond_only offload mode on GPUs
where cond_stage + LoRA can't both fit alongside diffusion model.
- Add enable_offload parameter to LoraModel constructor
- Enable CPU offload for LoRA when dynamic offloading is active
- Use move_params_to_cpu()/move_params_to_gpu() for fast memory transfers
  instead of free_params_buffer()/reload_params() disk I/O

This makes LoRA offloading ~10-50ms instead of ~500-1000ms from disk.
When offload mode is enabled, GGMLRunner has both:
- params_buffer (CPU)
- runtime_params_buffer (GPU)

The destructor only freed params_buffer, causing GPU memory to
leak when LoRA models were destroyed while on GPU. This caused
OOM errors after multiple generations with LoRAs.
- Add sd_vram_estimation_t enum for estimation method selection
  - SD_VRAM_EST_DRYRUN (default): accurate graph-based estimation
  - SD_VRAM_EST_FORMULA: faster formula-based approximation

- Add estimate_compute_buffer_size() to GGMLRunner for dry-run
  allocation that returns required buffer size without allocating

- Add estimate_vae_decode_vram() to calculate VAE decode requirements
  using either dry-run or formula method

- Add smart_offload_for_vae() that estimates VRAM needed and
  offloads only what's necessary before VAE decode

- Call smart_offload_for_vae() before decode in image and video
  generation paths

This enables smarter offloading - only offload components when
actually needed based on accurate VRAM estimation.
- Add get_free_vram() helper to query actual GPU memory via CUDA
- Add estimate_diffusion_vram() for diffusion sampling memory estimate
- Add should_offload_cond_stage_for_diffusion() smart check
- Add should_offload_diffusion_for_vae() smart check
- Replace unconditional offload with VRAM-aware decisions
- Only offload when free_vram < next_phase_needs + 300MB margin
- Apply to both txt2img and img2img/video generation paths
- Update common.hpp for vram_estimation struct field order

On larger GPUs, components stay on GPU between phases for speed.
On tight VRAM, offloading still occurs as needed.
- Add reload_diffusion field to sd_offload_config_t struct
- Default to true (matches previous always-reload behavior)
- Make post-generation reload of diffusion model respect config
- Update both txt2img and video generation paths
- Allows keeping diffusion offloaded between generations for batch work

Benchmark results on 12GB GPU with Z-Image Q8_0:
- no_reload: 29-30s generation, 1.9GB GPU after
- reload: 32s generation, 8.1GB GPU after
New CLI options:
- --offload-cond-stage / --no-offload-cond-stage
- --offload-diffusion / --no-offload-diffusion
- --reload-cond-stage / --no-reload-cond-stage
- --reload-diffusion / --no-reload-diffusion
- --vram-estimation [dryrun|formula]

Also adds:
- sd_vram_estimation_name() and str_to_vram_estimation() API functions
- Extended toString() output showing all offload config details
This commit adds the foundation for layer-by-layer tensor streaming,
enabling models larger than VRAM to run by loading weights on-demand.

New components:
- TensorRegistry: Tracks individual tensor locations (GPU/CPU) by layer
- MemoryBudgetManager: Manages VRAM budget with eviction policies
- LayerExecutionEngine: Orchestrates per-layer execution with prefetch

Integration:
- FluxRunner gains enable_layer_streaming() for streaming mode
- New SD_OFFLOAD_LAYER_STREAMING offload mode
- CLI: --offload-mode layer_streaming

This is the infrastructure foundation. Per-block execution will be
added in subsequent commits.
GGMLBlock stores tensor names in its internal `params` map hierarchy,
but never calls ggml_set_name() on the actual GGML tensors. This caused
register_from_context() to get empty names for all tensors, mapping
everything to the "_global" layer (resulting in "registered 1 layers").

Fix: Add register_from_map() method that takes the tensor map from
get_param_tensors(), which preserves proper tensor names like
"model.diffusion_model.double_blocks.5.img_attn.qkv.weight".

Result: 58 layers now registered correctly for Flux models (19 double_blocks
+ 38 single_blocks + 1 _global) instead of just 1.
…cking

1. Skip move_params_to_gpu() for diffusion model in layer_streaming mode
   - Before sampling: don't bulk-load entire diffusion model to GPU
   - After generation: don't reload diffusion in streaming mode

2. Fix tensor name tracking in TensorRegistry::move_layer_to_gpu
   - Use stored tensor names instead of relying on ggml_get_name()
   - GGMLBlock doesn't call ggml_set_name() on original tensors

Known issue: Graph context invalidation in streaming path needs fixing
(alloc_compute_buffer resets compute_ctx after graph is built)
Two critical fixes for layer streaming mode:

1. Flux preprocessing: Add to_backend() calls for input tensors
   - The regular build_graph() converts external tensors to compute_ctx
   - Streaming preprocessing was missing this, causing mul_mat assertions
   - Now properly converts x, context, timesteps, y, guidance to backend

2. UNet streaming: Add skip_param_offload parameter to compute()
   - In streaming mode, weights are managed by the streaming engine
   - The regular compute() was trying to bulk-allocate all weights to GPU
   - This failed with OOM because streaming only loads layers on demand
   - New skip_param_offload=true prevents this bulk allocation

Testing: Successfully generated 512x512 image with SDXL model using
--offload-mode layer_streaming, 4 steps completed in 3.78s
MMDiT has no skip connections, making it ideal for layer streaming:
- Added mmdit_layer_pattern() to parse joint_blocks.N tensor names
- Added streaming infrastructure to MMDiTRunner (enable/disable/compute)
- Added compute_streaming() that loads all joint_blocks before execution
- Wired MMDiTModel to DiffusionModel streaming interface

MMDiT structure:
- 24 joint_blocks (each with context_block + x_block)
- Global tensors: x_embedder, t_embedder, y_embedder, context_embedder, final_layer
WAN has sequential transformer blocks ideal for streaming:
- Added wan_layer_pattern() to parse blocks.N and vace_blocks.N tensor names
- Added streaming infrastructure to WanRunner (enable/disable/compute)
- Added compute_streaming() that loads all blocks before execution
- Wired WanModel to DiffusionModel streaming interface

WAN structure:
- 30-40 blocks.N (main transformer blocks)
- Optional vace_blocks.N (VACE interleaved blocks)
- Global tensors: patch_embedding, text_embedding, time_embedding, head
- Add qwen_image_layer_pattern() for 60 transformer_blocks
- Add zimage_layer_pattern() for context_refiner + noise_refiner + layers
- Add streaming infrastructure to QwenImageRunner and ZImageRunner
- Wire both models to DiffusionModel streaming interface
- Update compute() methods to accept skip_param_offload parameter

All 6 diffusion model architectures now support layer streaming.
- Add ref_latents and increase_ref_index parameters to compute_streaming
- Update FluxModel::compute_streaming to pass ref_latents
- Convert ref_latents to backend in preprocessing graph
- Handle ref_latents patchification and concatenation

Note: Flux streaming still has tensor context issue in preprocessing
that needs investigation.
The per-layer mini-graph approach was architecturally broken because:
1. GGML tensors are bound to their compute context
2. alloc_compute_buffer() resets context internally
3. Intermediate results cannot be passed between separate graphs

Changed to coarse-stage approach:
1. Load all model weights to GPU via streaming engine
2. Execute full compute graph with skip_param_offload=true
3. This matches the working UNet streaming implementation

Also added skip_param_offload parameter to FluxRunner::compute()
In layer_streaming mode, the cond_stage (T5) must be offloaded before
layer streaming begins, otherwise there won't be enough VRAM for the
diffusion model layers.

Changes:
- Set free_params_immediately=false for layer_streaming mode in CLI
  This enables smart offload logic instead of immediate param freeing
- Add explicit layer_streaming check in should_offload_cond_stage_for_diffusion()
  Forces T5 offload regardless of VRAM heuristics

Without this fix, T5 (~9GB) stays on GPU while layer streaming tries to
load Flux layers (~6.5GB), causing OOM on 12GB cards.

Tested with Flux Schnell Q4_K + T5XXL fp16 on RTX 3060 12GB:
- T5 properly offloaded after conditioning
- Layer streaming loads all 58 layers successfully
- Image generation completes without OOM
Implements the same coarse-stage layer streaming approach used by
Flux, MMDiT, UNet, and other models for the new Anima diffusion model.

Changes:
- tensor_registry.hpp: Add anima_layer_pattern() for net.blocks.N extraction
- anima.hpp: Add streaming engine, enable/disable/compute_streaming methods
- diffusion_model.hpp: Add AnimaModel streaming wrapper methods

Anima has 28 transformer blocks by default, similar in structure to
other DiT models, making it a good candidate for VRAM offloading on
memory-constrained systems.
AnimaConditioner:
- Add GPU offloading methods (is_params_on_gpu, move_params_to_cpu,
  move_params_to_gpu, get_params_vram_size, set_auto_offload)
  delegating to underlying LLM
- This enables proper VRAM management for Anima's Qwen3 text encoder

Layer streaming state consistency:
- Skip diffusion model state manipulation in layer_streaming mode
- The TensorRegistry uses direct buffer pointer swapping which leaves
  GGMLRunner's internal state (params_on_runtime_backend) out of sync
- Querying or manipulating diffusion offload state after streaming
  would cause crashes due to this inconsistency
- cond_stage offload still works normally (not managed by streaming)

Tested: Anima model generates identical output with and without
layer_streaming enabled (verified via MD5 hash comparison)
Problem: After layer streaming completes, all diffusion model layers
remain on GPU. For large models like QwenImage (8.6GB), this leaves
insufficient VRAM for VAE decoding.

Solution: Add offload_streaming_layers() method to all streaming-enabled
models that moves all layers back to CPU before VAE decode.

Changes:
- Add offload_streaming_layers() to DiffusionModel base interface
- Implement in all runners: UNet, MMDiT, Flux, Anima, Wan, QwenImage, ZImage
- Add override methods in all Model wrapper classes
- Call offload_streaming_layers() in stable-diffusion.cpp before VAE decode

This enables running models larger than VRAM:
- QwenImage Edit (16GB model) now runs on 12GB GPU via layer_streaming
- Tested: Anima streaming produces identical output with ~1% overhead
- Add staged forward methods to QwenImageModel:
  - forward_input_stage(): patchify + input projections
  - forward_single_block(): execute one transformer block
  - forward_output_stage(): norm + proj + unpatchify

- Implement compute_streaming_true() for QwenImage that:
  - Executes each of the 60 transformer blocks as a separate mini-graph
  - Stores intermediate img/txt tensors in CPU memory between blocks
  - Loads/offloads ~140MB per block during execution
  - Enables running 8.5GB+ models on 12GB VRAM GPUs

- Update all model architectures (Flux, MMDiT, Anima, WAN, ZImage, UNet)
  with improved VRAM checking in compute_streaming()

This is true per-layer streaming where only ONE block's weights plus
activation memory is needed at any time, enabling models larger than
available VRAM to run.

Tested with Qwen-Image-Edit-2509-Q3_K_S.gguf (8.5GB) on RTX 3060 12GB.
…utput read

Bug: When compute() was called with free_compute_buffer_immediately=true,
the buffer holding output tensors was freed before ggml_backend_tensor_get()
could read them, causing "CUDA error: invalid device ordinal".

Fixes:
1. alloc_compute_buffer() now returns graph via out_gf parameter for reuse
2. compute() reuses graph from alloc_compute_buffer to avoid tensor mismatch
3. copy_data_to_backend_tensor() skips tensors without allocated buffers
4. All TRUE per-layer streaming stages now use free_compute_buffer_immediately=false
   and manually call free_compute_buffer() after reading outputs

Affected models: Flux, MMDiT, Anima, UNet, ZImage, QwenImage
- Add estimate_vae_encode_vram() for VRAM estimation before encoding
- Add smart_offload_for_vae_encode() to offload cond_stage and diffusion
  models before VAE encode operations
- Call smart_offload_for_vae_encode() before all encode_first_stage() and
  vae_encode() calls across generate_image and generate_video paths:
  - img2img init image encoding
  - ref image encoding (for edit modes)
  - control net image encoding
  - video frame encoding (WAN, VACE, Anima)

This prevents OOM during VAE encoding of large images by freeing VRAM
from models not needed during the encode phase. With layer_streaming mode,
this allows encoding images that previously caused OOM.
Key changes:
- Add async prefetch methods to LayerExecutionEngine: prefetch_layer(),
  wait_for_prefetch(), wait_for_all_prefetches()
- Add AsyncLoadState struct and async layer load methods to TensorRegistry:
  start_async_layer_load(), complete_async_layer_load()
- Use ggml_backend_tensor_copy_async() to overlap memory transfers with
  GPU computation during TRUE per-layer streaming
- Update qwen_image.hpp to start prefetching next block before computing
  current block, reducing GPU idle time
- Fix sd_offload_config_t initialization with correct field order
- Offload diffusion model layers to CPU at startup when layer_streaming
  mode is enabled, freeing VRAM for LLM/CLIP conditioning

This enables overlapped memory transfers during per-layer streaming,
reducing periodic GPU pauses caused by blocking PCIe transfers.
Adds async prefetching pattern to overlap PCIe memory transfer with GPU
computation during layer streaming. Before computing each block, prefetch
the next block's weights asynchronously.

Models updated:
- Flux: double_blocks and single_blocks loops
- UNet: input_blocks and output_blocks loops
- MMDiT: joint_blocks loop
- ZImage: layers loop
- Anima: blocks loop

Note: WAN model doesn't have true per-layer streaming yet (uses full graph).
When using CFG (multiple model calls per diffusion step), the VRAM check
didn't account for layers already loaded on GPU. This caused the second
CFG call to see full VRAM and switch to slow TRUE per-layer streaming.

Now tracks already_on_gpu and only checks remaining_to_load against
available VRAM. Second+ CFG calls complete in ~0.15s instead of 3+ seconds.

Applied to all 7 architectures: Flux, UNet, MMDiT, ZImage, Anima, WAN, QwenImage
fszontagh added 25 commits March 6, 2026 14:29
Extract common layer streaming infrastructure from 7 runners into
GGMLRunner base class: init_streaming(), analyze_vram_budget(),
load_all_layers_coarse(), is_streaming_enabled(), disable_layer_streaming(),
offload_streaming_layers(), get_streaming_engine(). Each runner's
enable_layer_streaming() is now ~4 lines and compute_streaming() ~20 lines.

Remove streaming_enabled_ bool from all runners — standardize on checking
engine config flag. Remove SDCPP_FORCE_TRUE_STREAMING and
SDCPP_FORCE_COARSE_STREAMING debug env vars.

Convert all Javadoc /** */ blocks to minimal // style and strip @param,
@return, @brief tags across streaming infrastructure and runner files.

Remove component prefixes from LOG calls: [LayerStreaming], [Offload],
FluxRunner:, ZImageRunner:, MMDiTRunner:, UNetRunner:, WanRunner:,
AnimaRunner:, QwenImageRunner:, IntermediateTensorManager:,
LayerExecutionEngine:, MemoryBudgetManager:, TensorRegistry:.
Document all offload modes, layer streaming internals, supported
architectures, usage examples, and quality impact of each technique.
Merge 34 upstream commits including sd::Tensor pipeline migration,
fused SwiGLU kernel, sampler refactoring, VAE optimization, spectrum
caching, webp support, and embedded WebUI. Added StreamingParamConverter
bridge and raw-tensor build_graph/compute overloads to preserve all
offloading/layer streaming infrastructure alongside upstream's new API.
Merge 38 upstream commits including sd-webui style Hires.fix support,
DPM++ 2S A and er_sde samplers, ernie image and SDXS-09 model support,
flux2 small decoder, restricted torch legacy checkpoint loading, and
major refactors: tokenizer module split, model_io module, examples
common split into header/source, async vid_gen API. Ported our offload
configuration into the new SDContextParams in common.h/common.cpp.
Upstream's rewrite of the sample loop replaced the explicit streaming
branch with a single compute() call, which routes to the bulk-allocate
path and OOMs when the model exceeds VRAM. Add compute_dispatch() that
selects compute_streaming() when layer streaming is enabled and bridges
its ggml_tensor* output back into sd::Tensor<float> for the new sampler.
compute_dispatch was allocating a 256 MB CPU-backed ggml_context per
sampling call to receive the streaming output. Replace with a no_alloc
context whose tensor metadata points directly at the destination
sd::Tensor's memory, eliminating the per-step malloc/free of 256 MB.
The main streaming loop was hardcoded to prefetch only one layer ahead,
ignoring the configured prefetch depth. Replace with a sliding window
that primes the first N layers and refills the prefetch slot each step,
where N comes from streaming_engine_->get_config().prefetch_layers.
This finally makes the prefetch_layers knob actually do something.
Every per-block streaming loop (anima, flux double/single, mmdit,
qwen_image, unet input/output, z_image) was hardcoded to prefetch only
one block ahead, ignoring streaming_prefetch_layers. Add prime_prefetch
and advance_prefetch helpers to LayerExecutionEngine and route every
runner through them.
Each main layer was destroying and recreating the ggml_gallocr_t
between iterations, idling the GPU during the rebuild. All main blocks
have the same shape, so the same allocator can serve every block of
every sampling step. Free only when transitioning to the output stage.
apply_loras_at_runtime always wrapped each model (cond_stage, diffusion,
first_stage) with a MultiLoraAdapter, even when no LoRA tensors matched
that model's prefix. The empty adapter routed every linear/conv through
forward_with_lora() instead of the direct kernel path. Skip the wrap
when the matching lora_models list is empty so unaffected models keep
the fast direct path.
The TRUE per-layer streaming path was unconditionally evicting every
block back to CPU after each forward pass, even when there was plenty
of free VRAM left. For an 8-step generation that re-streams the entire
model 7 extra times.

Decide once, on the first sampling step, how many leading blocks fit
permanently in VRAM (after subtracting prefetch headroom + compute
buffer + safety margin) and skip the eviction for those indices. Later
steps' prime_prefetch starts at the first non-resident block, so the
cache prefix is hit for free. Pattern follows ComfyUI's
ModelPatcher.partially_load() — a static partition is simpler and
cheaper than dynamic eviction for the cyclic-sequential access pattern
of diffusion sampling.

Also fix MemoryBudgetManager::query_device_memory(): the SD_USE_CUDA
guard was dead code after PR leejet#1448 switched to runtime backend
discovery, so every build was returning the hardcoded 8 GB / 4 GB
fallback regardless of the real GPU. Use ggml_backend_dev_memory()
instead — works for CUDA, Vulkan, Metal.

For ZImage 8 steps at 688x1024 on RTX 3060 12 GB:
  before: 7.21s/step steady, 57.80s sampling
  after:  4.45s/step steady, 39.64s sampling (1.46x)

Same caching helper (compute_resident_block_count) added to
LayerExecutionEngine and applied to z_image, flux (double + single),
mmdit, anima, qwen_image. UNet (skip connections) and WAN (no
per-layer streaming yet) unchanged.
When weights live on CPU but get transferred to GPU during compute,
allocate the params buffer from the GPU device's pinned host buffer
type. This makes ggml_backend_tensor_copy_async actually overlap with
compute on CUDA — without it, the backend silently falls back to a
staged sync copy through an internal bounce buffer.

For ZImage 8 steps with the layer cache from the previous commit:
  before: step1 8.10s, steady 4.45s, sampling 39.64s
  after:  step1 5.61s, steady 3.97s, sampling 33.95s (1.17x on top of cache)

Cold step gets the bigger win (-31%) because all 30 layers stream
once. Steady-state gain is smaller (-11%) because each streamed layer
still triggers a fresh cudaMalloc that serializes against the copy
stream — fixing that requires a buffer pool in tensor_registry, which
is a separate change.

One-time cost: model load takes longer because page-locking 11.7 GB of
host memory is slower than allocating pageable. Amortizes immediately
for any service that does more than one generation per load.

Falls back to pageable allocation if pinned alloc fails (system out of
locked pages). Applies to any GGMLRunner where params live on CPU but
runtime is GPU — diffusion model and CPU-resident LoRAs benefit;
clip-on-cpu paths skip cleanly because their runtime is also CPU.
The per-layer streaming loop bounces ~22 MB of activations through host
RAM between every layer (download txt_img output, re-upload as next
layer's input). With std::vector backing, the CUDA backend stages
those transfers through an internal pinned bounce buffer, which costs
roughly 16 ms per layer = 474 ms per sampling step.

Allocate the persistent_txt_img and persistent_t_emb backing storage
in a single GPU-pinned host buffer (via ggml_backend_dev_host_buffer_type)
so the same get/set calls run at full PCIe bandwidth. Falls back to
pageable std::vector if pinned alloc fails.

Also adds an opt-in per-step profile (SDCPP_STREAM_PROFILE=1) that
breaks out wait/load/advance/compute/tensor_get timings — used to
identify this hotspot and measure the fix.

For ZImage 8 steps at 688x1024 on RTX 3060 12 GB, prefetch=2:
  before: 33.95s sampling, ~3.97s/step steady, tensor_get=474 ms/step
  after:  29.32s sampling, ~3.45s/step steady, tensor_get=100 ms/step

Cumulative speedup across the layer-streaming work in this branch
(P1 cache + P2 pinned weights + P3a pinned activations): 58.31s → 29.32s,
just under 2x on the sampling loop for an 11.5 GB bf16 model on a 12 GB GPU.

The dominant remaining cost is `compute` itself (2.7 s/step), which is
graph build + gallocr + dispatch. Reducing that needs graph reuse
across layers — separate change.
The per-layer streaming loop was rebuilding the same DiT block graph
30 times per sampling step — same operations, just different weight
tensor instances. Profiling showed ~810 ms of pure CPU-side work per
step in graph build + gallocr (no GPU activity, GPU at 17 W / 46 °C).

Build the cgraph once for layer 0 and reuse it for layers 1..29 by
swapping the runtime tensor pointers (buffer/data/extra) between
layer 0 and layer N before each dispatch, then swapping back before
move_layer_to_cpu. All 30 main blocks share an identical
JointTransformerBlock structure, so the cached graph references valid
ops once layer N's data sits behind layer 0's tensor pointers.

Two new pieces:
- TensorRegistry::swap_layer_buffers(a,b) — exchanges the runtime
  buffer/data/extra fields between two structurally-identical layers.
- GGMLRunner::dispatch_cached_graph(gf) — runs alloc_graph + uploads
  + compute on a graph that's still alive in compute_ctx, skipping
  the build/reset cycle that compute() does each call.

Disabled when an at-runtime WeightAdapter (LoRA) is attached to the
runner: forward_with_lora() bakes layer-specific prefixes into the
adapter ops at graph-build time, so a cached graph would always apply
layer 0's LoRA delta to every layer. The fallback path is the
existing per-layer rebuild — bytewise identical output to before
this change (verified by md5 of the test image), so this is a free
improvement for non-LoRA workloads with zero risk for LoRA ones.

For ZImage 8 steps at 688x1024 on RTX 3060 12 GB, prefetch=2:
  with LoRA (fallback path):  29.19 s sampling (matches prior P3a)
  without LoRA (reuse active): 22.84 s sampling (1.28x vs P3a)
  steady step compute:         2710 ms -> 1890 ms (-30%)

Cumulative on the layer-streaming path vs the original baseline:
  with LoRA:    58.31s -> 29.19s (2.00x)
  without LoRA: 58.31s -> 22.84s (2.55x)

Also adds an opt-in per-step profile (SDCPP_STREAM_PROFILE=1) that
breaks out wait/load/advance/compute/tensor_get — used to identify
the build-cost hotspot this change targets.
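An illustrative sketch of the pointer-swap idea this commit describes (the real TensorRegistry::swap_layer_buffers also updates registry bookkeeping, and how the tensor pairs are discovered is assumed here):

```cpp
// Swap the runtime storage of two structurally identical layers so the cached
// layer-0 graph reads layer N's weights without being rebuilt.
#include "ggml.h"
#include <utility>
#include <vector>

// pairs of structurally identical tensors: (layer 0 tensor, layer N tensor)
void swap_layer_buffers(std::vector<std::pair<ggml_tensor *, ggml_tensor *>> & pairs) {
    for (auto & [a, b] : pairs) {
        std::swap(a->buffer, b->buffer);  // which backend buffer owns the data
        std::swap(a->data,   b->data);    // device pointer the graph ops will read
        std::swap(a->extra,  b->extra);   // backend-specific metadata (e.g. CUDA)
    }
}

// usage: for each streamed layer N, swap its weights behind layer 0's tensors,
// dispatch the cached layer-0 graph, then swap back before evicting layer N.
```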
z_image already pinned its persistent_txt_img / persistent_t_emb host
buffers (commit 9168495). The other DiT runners (flux, mmdit, anima,
qwen_image) still backed their per-block streaming activations with
pageable std::vector, forcing the CUDA backend to stage every
ggml_backend_tensor_get and copy_data_to_backend_tensor through an
internal bounce buffer.

Promote the pinning machinery onto GGMLRunner as a shared
ensure_pinned_act_buffers(sizes_bytes, out_ptrs) helper that allocates
a single GPU-pinned host buffer big enough for all the runner's
persistent activation regions and hands back 256-byte aligned start
pointers. Buffer is freed in ~GGMLRunner; falls back to pageable
std::vector if pinned alloc fails (output stays correct, just slower).

Each runner now declares its persistent_<name> regions as float* into
that shared buffer, with std::vector<float> fallbacks. Refactored
z_image to use the shared helper too — same bit-exact output as before
(verified: md5 of /tmp/bench_pin_smoke.png matches the previous P3a
baseline image).

For ZImage 8 steps at 688x1024 on RTX 3060 12GB, prefetch=2:
  before refactor: 29.32s sampling (already pinned)
  after refactor:  30.02s sampling (within run-to-run noise)

The bigger story: flux/mmdit/anima/qwen_image streaming users now get
the same ~10-15% activation-transfer speedup that z_image got from
P3a. Can't bench those directly without their respective models, but
the change is purely host-side memory allocation — same code path
ggml uses everywhere.
Streaming runs the K resident-on-GPU layers through one combined ggml graph
per step instead of building+dispatching a fresh tiny graph per layer. The
streamed-tail layers still use per-layer dispatch since their weights swap
in/out and topologies differ.

Adds a separate ggml_context, gallocr, and cgraph for the chunk on
ZImageRunner so the graph survives compute_ctx resets between streamed-tail
calls. Inputs (txt_img, t_emb, pe) are bound at chunk-graph build time and
re-uploaded each step via ggml_backend_tensor_set.

Measured on RTX 3060 / z_image_turbo bf16 / 8 steps:
  P3a baseline: 29.32s
  + chunk graph: 28.34s  (~3%)

Pixel-exact vs P3a baseline (md5 f54bf459...). Compounds with the dual-stream
H2D overlap on feature/pcie-overlap.
The Phase 4 chunk graph caches its input tensors (txt_img, t_emb, pe) with
the shapes from the first build, but token sequence length depends on the
prompt — different prompts produce different txt_img_ne[1]. Reusing the
cached graph in subsequent generate_image() calls left ggml_backend_tensor_set
writing the wrong byte count, the compute then ran on tensors with garbage
shape metadata, and ZImage layers eventually hit a divide-by-zero (SIGFPE).

Visible as sdcpp-restapi crashing on the second queue job.

Compares cached chunk_txt_img_in_/chunk_t_emb_in_/chunk_pe_ shapes against
the current call's; rebuilds the chunk graph if any shape (or the resident
layer count) differs.
The Phase 4 chunk-graph code (build / dispatch / shape-match / free) was
inlined into ZImageRunner. Moves it into a reusable helper in a new
src/chunk_graph.hpp so other DiT runners (flux, mmdit, anima, qwen_image,
unet, wan) can adopt it later by providing only:
  - the input shape vector,
  - a build callback that wires K layers using the supplied input tensors,
  - per-dispatch host data pointers.

The helper owns its own ggml_context + gallocr + cgraph, handles the cache
staleness rebuild from 4f445e2 internally, and exposes output() so callers
can read back the resulting tensor's shape.

ZImageRunner now stores a single LayerStreaming::ChunkGraph and provides
a small dispatch_resident_chunk wrapper that supplies the z_image-specific
build lambda (forward_layer_block over K resident layers).

Pixel-exact output preserved (verified vs P3a baseline).
GGMLRunner::prepare_build_in_tensor_before() creates two scalar tensors
(":one" / ":zero_int") on compute_ctx that op helpers like ggml_ext_full,
ggml_ext_zeros, ggml_ext_ones, and ggml_ext_cast_f32 look up by name via
ggml_get_tensor. The chunk graph uses a separate chunk_ctx_ that survives
across compute() calls, and those named tensors were never created on it
— so any lookup returned null and the next op SEGV'd.

Reproduces with short prompts: ggml_ext_attention_ext takes a KV-pad branch
that calls ggml_ext_full to build a -INF mask. Long prompts happen to
satisfy the alignment and skip that branch, which is why the bug stayed
hidden until a "a cat"-class prompt hit per-layer streaming with the chunk
graph engaged.

Mirrors prepare_build_in_tensor_before/after on chunk_ctx_: creates the
two named tensors before build_fn runs, adds them to the graph after, and
uploads the constant scalar values (1.0f / 0i) on every dispatch.
apply_loras_at_runtime() creates a fresh MultiLoraAdapter per call, replacing
the diffusion model's weight_adapter shared_ptr. The cached chunk graph still
holds raw ggml_tensor* references into ops emitted by the previous adapter —
once the old adapter is destroyed, those tensors are freed and the cache
becomes a use-after-free trap.

Adds an opaque state_token parameter to ChunkGraph::ensure_built that gets
compared alongside K and shapes; mismatch frees the cache and rebuilds. The
caller (z_image) fingerprints its weight_adapter pointer plus the runner
boolean flags (flash_attn / conv2d_direct / circular_x / circular_y) into
the token, so any of those changing across queue jobs forces a rebuild.

This is the third in a series fixing pre-existing Phase 4 bugs:
- 4f445e2: shape staleness across jobs (different prompt token counts)
- 836b0b1: missing build-in tensors (one / zero_int) in chunk_ctx
- this:    weight_adapter use-after-free across LoRA swaps
Phase 4's chunk graph and the resident-layer cache held GPU memory across
generate_image() calls indefinitely:

- The cached chunk graph kept its compute buffer (~500 MB) and references
  into the resident layers' GPU tensors.
- resident_layer_count_ was set once and never reset, so every subsequent
  call left the same 19 layers (~7.7 GB) on GPU even after
  offload_streaming_layers() evicted them. The chunk graph then carried
  pointers into the freed memory.

In long-running processes (sdcpp-restapi) with LoRA at_runtime, every
generation creates a fresh MultiLoraAdapter — state_token changes, so
ChunkGraph rebuilds. Each rebuild called clear() but the previous cache
plus stale pointers from earlier jobs accumulated VRAM until cudaMalloc
failed mid-generation (saw 9.8 GB used / 0.6 GB free after 4 jobs, OOM on
job 5).

Adds a virtual on_streaming_layers_offloaded() hook in GGMLRunner, called
at the end of offload_streaming_layers(). ZImageRunner overrides it to
clear chunk_graph_ and reset resident_layer_count_ so the next generation
recomputes the resident set against the actual free VRAM and builds a
clean chunk graph.

Verified on RTX 3060: 4 batch=4 / 12-step LoRA jobs back-to-back, VRAM
holds steady at ~9.7 GB free between jobs (was 0.6 GB before), per-job
time stable at 180-184s, no OOM. Within-generation reuse (12 steps × 4
batch images = 48 dispatches share one chunk graph) is preserved, so the
sampling speed is unchanged.
Brings in upstream's leejet#1476 max-vram graph-cut segmented param offload
alongside our layer-streaming work. Both mechanisms coexist:

- `--offload-mode layer_streaming` (ours) — per-layer streaming with
  prefetch, chunk graph for resident block. ~2× faster than graph-cut
  on bf16/12GB GPU based on A/B bench (175s vs 335s generate_image).
- `--offload-to-cpu --max-vram <GiB>` (upstream leejet#1476) — static
  segment plan, swap params per segment.

Notable conflict resolutions in src/ggml_extend.hpp:
- Kept upstream's two-step prepare_compute_graph + alloc_compute_buffer(gf)
  as the canonical path; added a backward-compatible single-call overload
  `alloc_compute_buffer(get_graph_cb_t, ggml_cgraph**)` that wraps both
  for the layer-streaming caller.
- copy_data_to_backend_tensor() ggml_cgraph parameter is now optional
  (default nullptr) — graph-cut passes the graph for filtering, layer
  streaming passes nullptr to upload everything from the map.
- free_compute_buffer() does both restore_partial_params/restore_all_params
  (upstream graph-cut cleanup) and our auto_offload_after_compute hook.
- compute() body takes upstream's graph-cut-aware dispatch verbatim.

Renamed our internal helpers to match upstream:
- offload_params_to_runtime_backend → offload_all_params
- offload_params_to_params_backend → restore_all_params

Dropped sd_ctx_params_t::flow_shift; upstream moved it to
sd_sample_params_t (sd_sample_params_init initialises it there).

Verified: Z-Image-Turbo Q8 builds and runs end-to-end on both paths
(layer_streaming and --max-vram) in this binary.
The destructor previously released runtime_params_buffer but missed
partial_runtime_params_buffer (the buffer used by the segmented param
offload path added in leejet#1476). On runner destruction with --max-vram
active, that GPU memory leaked.

Same class of leak as the existing runtime_params_buffer fix.
fszontagh force-pushed the feature/vram-offloading-v2 branch from f6815d6 to dc8e9e2 on May 6, 2026 19:46
fszontagh added 4 commits May 6, 2026 22:20
Per-layer streaming runs many short kernels and waits on each one. The
CUDA driver default schedule (cudaDeviceScheduleAuto) often picks Spin,
which busy-waits one host thread on each kernel return - shows as 100%
on one CPU core in top/nvtop even though the wait is idle work.

Document two fixes: CUDA_DEVICE_SCHEDULE=BlockingSync env var for
single-shot CLI runs, or cudaSetDeviceFlags(cudaDeviceScheduleBlockingSync)
at process startup for long-lived servers.

No code change here - just user-facing guidance to avoid the
"why is my CPU at 100%" question.
The frontend submodule pointer carried over from our fork was a SHA
from an older repo (leejet/stable-ui) that doesn't exist on
leejet/sdcpp-webui (the URL declared in .gitmodules). CI couldn't fetch
it and every job failed at the submodule init step.

Sync to upstream master's SHA (797ccf8). The webui isn't part of the
offload work and we don't need a fork-local version on this branch.
Two related budget-planner fixes for our streaming path:

1. Propagate --max-vram into MemoryBudgetManager so the same flag drives
   both leejet's graph-cut path and our layer-streaming planner. Lets
   users simulate a smaller card without a separate flag. The cap is
   applied via init_streaming() after the engine is created so it
   survives whichever order set_max_graph_vram_bytes() and the engine
   construction happen in.

2. Reserve a compute-buffer slice (default 768 MB, matches
   compute_resident_block_count's existing convention) when deciding
   coarse-stage vs per-layer in analyze_vram_budget(). Without this,
   params can fit in capped VRAM but params + CB tip over mid-step
   and crash cudaMalloc — visible on SDXL 1024x1024 with --max-vram 6
   where the compute graph wants 830 MB on top of 4.79 GB params.
UNet's compute_streaming had four bugs that didn't surface until SDXL
+ --max-vram pushed the planner into per-layer mode:

1. Coarse-stage path called regular compute() without
   skip_param_offload=true, double-allocating UNet params on the
   runtime backend (4.79 GB ZImage, 4.79 GB SDXL). Other architectures
   already pass true; only unet.hpp was missing it.

2. forward_input_block() called resblock_forward() for every
   input_blocks.X.0 entry, but at indices 3 and 6 the slot is a
   DownSampleBlock — the dynamic_pointer_cast<ResBlock> returned
   null and the next forward() segfaulted silently. Now dispatches
   DownSampleBlock vs ResBlock by actual type.

3. forward_output_block() called attention_layer_forward() for
   output_blocks.X.1, but on SD1.x's deepest output block (no
   attention at that resolution) the slot holds an UpSampleBlock,
   producing the same null-cast crash. Now walks .1 and .2 once
   each and dispatches UpSampleBlock vs SpatialTransformer by type.

4. get_num_input_blocks()/get_num_output_blocks() returned a
   hardcoded 12. SDXL has 9, tiny_unet variants have gaps. Replaced
   with a scan of the blocks map for the actual max index, so the
   streaming loop iterates over indices the model actually has.

Verified with --max-vram cap forcing per-layer streaming on SDXL
1024x1024, SD1.5 512x512, plus regression on Z-Image bf16, Z-Image
Q8, Flux schnell, Chroma, Anima, Qwen Image, and SD3.5 Large.