
feat: cross-stage offload modes and layer-streaming for low-VRAM GPUs #1477

Open

fszontagh wants to merge 68 commits into leejet:master from fszontagh:feature/vram-offloading-v2

Conversation

@fszontagh
Contributor

feat: cross-stage offload modes and layer-streaming for low-VRAM GPUs

Why

Two problems that come up on small GPUs running large diffusion models:

  1. Cross-stage component placement. Where does the text encoder live while diffusion runs? Where does diffusion go while the VAE decodes? On a 12 GB card running an 11.5 GB diffusion model, we need to move components in and out between stages or VAE decode hits OOM.
  2. Models that don't fit at all. When the diffusion weights themselves exceed VRAM, we need to stream them in per-layer rather than load all at once.

This PR adds a single new flag, --offload-mode, that handles cross-stage placement, plus a per-layer streaming path (--offload-mode layer_streaming) for the doesn't-fit-at-all case.

New CLI flags

| Flag | Description |
| --- | --- |
| --offload-mode <mode> | One of none, cond_only, cond_diffusion, aggressive, layer_streaming. Default none. |
| --offload-cond-stage / --no-offload-cond-stage | Override the cond-stage offload decision. |
| --offload-diffusion / --no-offload-diffusion | Override the diffusion-model offload decision. |
| --offload-log / --no-offload-log | Log offload events to stderr. |
| --vram-estimation <method> | dryrun (probe graph) or formula (analytic). |
| --streaming-prefetch <N> | Layers to prefetch ahead during streaming. Default 1. |
| --streaming-min-vram <MB> | Minimum free VRAM kept during streaming. Default 512. |

What each mode does

| Mode | What it does | Use case |
| --- | --- | --- |
| none (default) | No offload. Identical to current master behaviour. | Default; everything fits on GPU. |
| cond_only | Move text encoder to CPU after conditioning, keep diffusion on GPU. | Tight VRAM during diffusion. |
| cond_diffusion | Move both text encoder and diffusion model out between stages, swap them in for their stage. | VAE decode needs room; diffusion is too big to coexist with the VAE compute buffer. |
| aggressive | Evict every component as soon as it's not actively used; reload on demand. | Lowest VRAM footprint at any moment; pays reload costs each transition. |
| layer_streaming | Diffusion weights live in pinned host RAM; each transformer block uploads to GPU just before it runs and is evicted afterwards. Async prefetch keeps PCIe full. | Models that don't fit at all (Z-Image bf16 11.5 GB on 12 GB card). |
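A minimal usage sketch (binary name, model paths, and prompt are illustrative; the offload-related flags are the ones introduced by this PR):

```sh
# Model larger than VRAM: stream transformer blocks, prefetch 2 ahead,
# use the more accurate dry-run VRAM estimator, and log offload events.
./sd -m z_image_turbo_bf16.gguf -p "a cat" \
    --offload-mode layer_streaming \
    --streaming-prefetch 2 --streaming-min-vram 512 \
    --vram-estimation dryrun --offload-log

# Model fits, but the VAE compute buffer doesn't: evict between stages instead.
./sd -m z_image_turbo_q8.gguf -p "a cat" --offload-mode cond_diffusion
```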

How layer streaming works

Three pieces, each a known-but-effective optimization at a different layer of the stack:

  1. Pinned host buffer for streamed weights, so cudaMemcpyAsync actually goes async (a pageable source falls through to a synchronous bounce-buffer copy in the driver).
  2. Per-layer prefetch overlapped with the previous layer's compute - the next layer's H2D starts on a separate stream while the current kernel is still running.
  3. Chunk graph for the resident block - layers that fit on GPU stay there across sampling steps and run as one combined ggml graph dispatch instead of one mini-graph per layer.
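A compressed sketch of what that loop looks like against ggml's backend API. The helper names (layer_tensor_pairs, compute_layer, evict_layer) are hypothetical stand-ins for the PR's TensorRegistry/LayerExecutionEngine machinery, and the real implementation drives uploads on a separate copy stream rather than the single synchronize shown here:

```cpp
// Illustrative sketch only; helper functions below are hypothetical stand-ins.
#include "ggml-backend.h"
#include <cstddef>
#include <utility>
#include <vector>

// (host tensor in pinned RAM, GPU shadow tensor) pairs for one transformer block
std::vector<std::pair<ggml_tensor *, ggml_tensor *>> layer_tensor_pairs(int layer);
void compute_layer(ggml_backend_t gpu, int layer);  // dispatch one block's graph
void evict_layer(int layer);                        // drop the block's GPU copy

void stream_layers(ggml_backend_t cpu, ggml_backend_t gpu,
                   size_t total_param_bytes, int n_layers) {
    // 1) streamed weights live in a pinned host buffer so H2D copies can go async
    ggml_backend_buffer_type_t host_buft =
        ggml_backend_dev_host_buffer_type(ggml_backend_get_device(gpu));
    ggml_backend_buffer_t host_buf =
        ggml_backend_buft_alloc_buffer(host_buft, total_param_bytes);
    // (weight tensors would be created inside host_buf; if the pinned alloc
    //  fails, fall back to pageable memory and accept slower copies)

    // 2) prefetch block i+1 while block i computes
    for (int i = 0; i < n_layers; ++i) {
        if (i + 1 < n_layers) {
            for (auto & [host_t, gpu_t] : layer_tensor_pairs(i + 1)) {
                ggml_backend_tensor_copy_async(cpu, gpu, host_t, gpu_t);
            }
        }
        compute_layer(gpu, i);
        ggml_backend_synchronize(gpu);  // block i+1's weights have landed by now
        evict_layer(i);                 // 3) resident blocks skip this eviction and
                                        //    run as one combined chunk graph (not shown)
    }
    if (host_buf) {
        ggml_backend_buffer_free(host_buf);
    }
}
```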

A unified VRAM heuristic decides automatically which layers stay resident and which stream, based on actual free VRAM. Users don't have to pick a budget manually.
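As a rough illustration of that decision (the function name and the 768 MB / 512 MB reserves are illustrative here, echoing figures used elsewhere in this branch; the actual logic lives in memory_budget.hpp / compute_resident_block_count):

```cpp
// Sketch of the resident-vs-streamed split: keep as many leading blocks on GPU
// as fit after reserving room for the compute buffer, prefetch, and a margin.
#include "ggml-backend.h"
#include <algorithm>
#include <cstddef>

int resident_layer_count(ggml_backend_dev_t dev, size_t per_layer_bytes,
                         int n_layers, int prefetch_layers) {
    size_t free_vram = 0, total_vram = 0;
    ggml_backend_dev_memory(dev, &free_vram, &total_vram);  // real free VRAM, not a guess

    const size_t compute_buf  = 768ull * 1024 * 1024;        // room for the compute graph
    const size_t safety       = 512ull * 1024 * 1024;        // --streaming-min-vram style margin
    const size_t prefetch_hdr = (size_t)prefetch_layers * per_layer_bytes;

    if (free_vram <= compute_buf + safety + prefetch_hdr) {
        return 0;                                            // everything streams
    }
    size_t budget = free_vram - compute_buf - safety - prefetch_hdr;
    return std::min<int>(n_layers, (int)(budget / per_layer_bytes));
}
```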

Benchmarks - RTX 3060 (12 GB), PCIe 3.0 x16

Hardware: RTX 3060 12 GB. The card itself supports PCIe 4.0, but the board is DDR3-era so the slot is capped at PCIe 3.0 x16 (8.0 GT/s). PCIe bandwidth is the dominant cost during streaming, so faster boards (PCIe 4.0 x16, ~24 GB/s practical) should reduce these numbers materially.

All numbers below: batch_count=4, steps=12, resolution=688x1024, LoRA applied at runtime, same prompt/seed across configs.

Z-Image-Turbo bf16 (11.5 GB diffusion model — does NOT fit in 12 GB)

Workload: 4 images per generation, 12 sampling steps each, batch=4. This is where streaming matters most — without offload of some kind, the model can't even load.

| Config | generate_image | Notes |
| --- | --- | --- |
| --offload-mode layer_streaming | 175 s | This PR. GPU utilization steady >90%; effective PCIe TX ~3.5 GB/s during streaming windows. |
| --offload-to-cpu --max-vram 9 | 335 s | Existing graph-cut path. ~2× slower. |

Z-Image-Turbo Q8 (6.7 GB diffusion model — fits in VRAM, but VAE compute buffer doesn't)

Workload: 4 images per generation, 12 sampling steps each, batch=4. When the model fits, streaming gives up most of its advantage and the simpler existing offload paths are slightly faster. Listed for completeness.

| Config | generate_image | Notes |
| --- | --- | --- |
| --offload-to-cpu | 115 s | Fastest when model fits. |
| --vae-tiling | 118 s | Tile VAE compute on GPU. |
| --offload-mode layer_streaming | 122 s | Auto-picks coarse-stage; still goes through streaming bookkeeping (~6% overhead). |
| --offload-to-cpu --max-vram 6 | 152 s | Graph-cut adds dispatch overhead even when params fit. |
| --vae-on-cpu | 602 s | Reference; VAE on CPU is brutal. |

So the recommendation in the docs is: pick --offload-mode layer_streaming when the model doesn't fit (where it's ~2× faster than alternatives), and stick with the existing --offload-to-cpu (or no offload) when it does. --offload-mode none (default) keeps current master behaviour.

Architectures

The streaming runtime is shared via tensor_registry.hpp, layer_streaming.hpp, memory_budget.hpp. Verified end-to-end on RTX 3060:

  • Z-Image / Z-Image-Turbo (bf16 + Q8) - primary target
  • Flux schnell
  • Anima
  • Qwen Image

Implemented and built, but not personally verified by me - I'd appreciate someone with the hardware/models confirming:

  • MMDiT / SD3
  • UNet (SD1.x / SDXL)
  • WAN

Known issues

  • --lora-apply-mode immediately + --offload-mode layer_streaming crashes - the immediate path reaches into weight buffers that haven't been uploaded to GPU yet under streaming. Use at_runtime (the default auto already picks this in streaming mode). This is a pre-existing class of issue surfaced by streaming.
  • VRAM estimation isn't perfect; dryrun is more accurate but adds a small startup cost. Switch to dryrun if you hit OOM during the first step.
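As a concrete sketch covering both items (binary name and model path illustrative):

```sh
# Runtime LoRA path plus the dry-run estimator when streaming a large model.
./sd -m model.safetensors -p "a cat" --lora-apply-mode at_runtime \
    --offload-mode layer_streaming --vram-estimation dryrun
```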

Backwards compatibility

Default behaviour is unchanged. --offload-mode none matches current master byte-for-byte. All new flags are opt-in.

Bug fixes folded in

While exercising the offload paths I found and fixed a small set of pre-existing bugs. They're independent of the new offload modes and benefit users who never set --offload-mode. Happy to split these into a separate small PR if preferred.

  • GGMLRunner destructor leaked runtime_params_buffer and partial_runtime_params_buffer. free_params_buffer() only released the CPU-side params_buffer. When the runner had been staged onto the runtime backend (any offload mode active, including the segmented offload from #1476, "feat: add max-vram based segmented param offload"), the GPU-side weight buffer(s) leaked on destruction. Real leak under LoRA + offload — many short-lived runners are created during LoRA application. Two-line addition to the destructor (sketched just after this list).
  • CFG causing redundant model reloads under streaming.
  • t_emb buffer aliasing in Z-Image's per-layer path.
  • GGMLRunner scratch-buffer reuse.
  • VAE-encode OOM in aggressive mode.
  • Includes the empty-MultiLoraAdapter fix from #1469 ("Skip empty MultiLoraAdapter when no LoRAs target a model", already merged into master); will rebase to drop that commit at PR time.
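A minimal sketch of the destructor fix from the first bullet above, written with the member names used in that description (the real GGMLRunner destructor does more cleanup than this fragment shows):

```cpp
// Fragment of ~GGMLRunner(): release the runtime-backend buffers as well,
// not only the CPU-side params_buffer.
GGMLRunner::~GGMLRunner() {
    if (params_buffer != nullptr) {
        ggml_backend_buffer_free(params_buffer);                  // CPU-side weights
    }
    if (runtime_params_buffer != nullptr) {
        ggml_backend_buffer_free(runtime_params_buffer);          // GPU-side weights (previously leaked)
    }
    if (partial_runtime_params_buffer != nullptr) {
        ggml_backend_buffer_free(partial_runtime_params_buffer);  // segmented-offload buffer (#1476 path, previously leaked)
    }
}
```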

Documentation

docs/vram_offloading.md covers the modes, decision tree, and example commands.

fszontagh added 30 commits March 4, 2026 07:34
Add runtime tensor offloading to enable running large models (Q8+)
on GPUs with limited VRAM by dynamically moving components between
GPU and CPU memory.

- `cond_only`: Offload cond_stage (LLM/CLIP) after conditioning
- `cond_diffusion`: Offload both cond_stage and diffusion after use
- `aggressive`: Offload each component immediately after use

- Add OffloadConfig struct with mode, flags for cond_stage/diffusion
- Add move_params_to_cpu/gpu methods to GGMLRunner
- Add set_auto_offload() to control automatic offloading behavior
- Implement on-demand reload before conditioning/diffusion steps
- Track VRAM usage for offloaded components

Enables 1024x1024 generation with Z-Image Q8 (~7GB) + Qwen3-4B Q8
(~4GB) + VAE (~320MB) on 12GB GPU by offloading the ~4GB LLM after
conditioning completes, freeing VRAM for diffusion compute buffers.

Without offloading: CUDA OOM during diffusion
With cond_only offload: Successful generation in ~66s

Tested configurations:
- offload_mode=none: OOM at 1024x1024 with Q8 models
- offload_mode=cond_only: Success, ~66s generation time
- offload_mode=cond_only + vae_tiling: Success, ~149s
Expose the dynamic tensor offloading feature through CLI options:
- --offload-mode: Set offload mode (none, cond_only, cond_diffusion, aggressive)
- --offload-log: Enable offload event logging
- --no-offload-log: Disable offload event logging

The cond_only mode is particularly useful for 12GB GPUs running large
Q8 models with LLMs, as it offloads the LLM/CLIP to CPU after
conditioning, freeing VRAM for diffusion compute buffers.

Changes:
- Add sd_offload_mode_name() and str_to_offload_mode() helper functions
- Add sd_offload_config_init() for default configuration
- Add offload_config member to SDContextParams
- Wire offload_config through to_sd_ctx_params_t()
- Add CLI options in get_options()
When dynamic offloading is enabled and the LLM/CLIP model was offloaded
to CPU, attempting to reload it to GPU could fail if there's not enough
VRAM available. Previously, the code logged a misleading warning
"conditioning will run on CPU (slower)" but then crashed (SEGV) because:

1. move_params_to_gpu() failed and returned false
2. Code continued to call get_learned_condition()
3. compute() tried offload_params_to_runtime_backend() which failed again
4. compute() returned false but caller didn't check return value
5. Code tried to use uninitialized data, causing SEGV

Fix:
- Return NULL from generate_image/generate_video when GPU reload fails
- Return false from load() if initial GPU move fails
- This gives callers a proper error to handle instead of crashing

The user will see a clear error message suggesting to reduce resolution,
use smaller models, or disable dynamic offloading.
When offload_mode is enabled and LoRAs are being applied, the cond_stage
(LLM/CLIP) may still be on GPU from initial model loading. This uses up
VRAM and causes LoRA allocation to fail with OOM.

Fix: Before applying LoRAs in generate_image(), check if:
1. offload_mode is enabled
2. offload_cond_stage is true
3. We have LoRAs to apply
4. cond_stage is currently on GPU

If all conditions are met, offload cond_stage to CPU first to free VRAM
for LoRA allocation. The cond_stage will be reloaded on-demand before
conditioning runs.

This allows using LoRAs with large LLM models (like qwen3-4b) on 12GB GPUs
that would otherwise OOM during LoRA allocation.
When cond_stage reload fails due to LoRA buffers using VRAM:
1. Free LoRA buffers to make room
2. Retry cond_stage reload
3. Reload LoRA weights from disk

Added reload_params() method to LoraModel to support reloading
weights after buffer is freed and reallocated.

This enables using LoRA with cond_only offload mode on GPUs
where cond_stage + LoRA can't both fit alongside diffusion model.
- Add enable_offload parameter to LoraModel constructor
- Enable CPU offload for LoRA when dynamic offloading is active
- Use move_params_to_cpu()/move_params_to_gpu() for fast memory transfers
  instead of free_params_buffer()/reload_params() disk I/O

This makes LoRA offloading ~10-50ms instead of ~500-1000ms from disk.
When offload mode is enabled, GGMLRunner has both:
- params_buffer (CPU)
- runtime_params_buffer (GPU)

The destructor only freed params_buffer, causing GPU memory to
leak when LoRA models were destroyed while on GPU. This caused
OOM errors after multiple generations with LoRAs.
- Add sd_vram_estimation_t enum for estimation method selection
  - SD_VRAM_EST_DRYRUN (default): accurate graph-based estimation
  - SD_VRAM_EST_FORMULA: faster formula-based approximation

- Add estimate_compute_buffer_size() to GGMLRunner for dry-run
  allocation that returns required buffer size without allocating

- Add estimate_vae_decode_vram() to calculate VAE decode requirements
  using either dry-run or formula method

- Add smart_offload_for_vae() that estimates VRAM needed and
  offloads only what's necessary before VAE decode

- Call smart_offload_for_vae() before decode in image and video
  generation paths

This enables smarter offloading - only offload components when
actually needed based on accurate VRAM estimation.
- Add get_free_vram() helper to query actual GPU memory via CUDA
- Add estimate_diffusion_vram() for diffusion sampling memory estimate
- Add should_offload_cond_stage_for_diffusion() smart check
- Add should_offload_diffusion_for_vae() smart check
- Replace unconditional offload with VRAM-aware decisions
- Only offload when free_vram < next_phase_needs + 300MB margin
- Apply to both txt2img and img2img/video generation paths
- Update common.hpp for vram_estimation struct field order

On larger GPUs, components stay on GPU between phases for speed.
On tight VRAM, offloading still occurs as needed.
- Add reload_diffusion field to sd_offload_config_t struct
- Default to true (matches previous always-reload behavior)
- Make post-generation reload of diffusion model respect config
- Update both txt2img and video generation paths
- Allows keeping diffusion offloaded between generations for batch work

Benchmark results on 12GB GPU with Z-Image Q8_0:
- no_reload: 29-30s generation, 1.9GB GPU after
- reload: 32s generation, 8.1GB GPU after
New CLI options:
- --offload-cond-stage / --no-offload-cond-stage
- --offload-diffusion / --no-offload-diffusion
- --reload-cond-stage / --no-reload-cond-stage
- --reload-diffusion / --no-reload-diffusion
- --vram-estimation [dryrun|formula]

Also adds:
- sd_vram_estimation_name() and str_to_vram_estimation() API functions
- Extended toString() output showing all offload config details
This commit adds the foundation for layer-by-layer tensor streaming,
enabling models larger than VRAM to run by loading weights on-demand.

New components:
- TensorRegistry: Tracks individual tensor locations (GPU/CPU) by layer
- MemoryBudgetManager: Manages VRAM budget with eviction policies
- LayerExecutionEngine: Orchestrates per-layer execution with prefetch

Integration:
- FluxRunner gains enable_layer_streaming() for streaming mode
- New SD_OFFLOAD_LAYER_STREAMING offload mode
- CLI: --offload-mode layer_streaming

This is the infrastructure foundation. Per-block execution will be
added in subsequent commits.
GGMLBlock stores tensor names in its internal `params` map hierarchy,
but never calls ggml_set_name() on the actual GGML tensors. This caused
register_from_context() to get empty names for all tensors, mapping
everything to the "_global" layer (resulting in "registered 1 layers").

Fix: Add register_from_map() method that takes the tensor map from
get_param_tensors(), which preserves proper tensor names like
"model.diffusion_model.double_blocks.5.img_attn.qkv.weight".

Result: 58 layers now registered correctly for Flux models (19 double_blocks
+ 38 single_blocks + 1 _global) instead of just 1.
…cking

1. Skip move_params_to_gpu() for diffusion model in layer_streaming mode
   - Before sampling: don't bulk-load entire diffusion model to GPU
   - After generation: don't reload diffusion in streaming mode

2. Fix tensor name tracking in TensorRegistry::move_layer_to_gpu
   - Use stored tensor names instead of relying on ggml_get_name()
   - GGMLBlock doesn't call ggml_set_name() on original tensors

Known issue: Graph context invalidation in streaming path needs fixing
(alloc_compute_buffer resets compute_ctx after graph is built)
Two critical fixes for layer streaming mode:

1. Flux preprocessing: Add to_backend() calls for input tensors
   - The regular build_graph() converts external tensors to compute_ctx
   - Streaming preprocessing was missing this, causing mul_mat assertions
   - Now properly converts x, context, timesteps, y, guidance to backend

2. UNet streaming: Add skip_param_offload parameter to compute()
   - In streaming mode, weights are managed by the streaming engine
   - The regular compute() was trying to bulk-allocate all weights to GPU
   - This failed with OOM because streaming only loads layers on demand
   - New skip_param_offload=true prevents this bulk allocation

Testing: Successfully generated 512x512 image with SDXL model using
--offload-mode layer_streaming, 4 steps completed in 3.78s
MMDiT has no skip connections, making it ideal for layer streaming:
- Added mmdit_layer_pattern() to parse joint_blocks.N tensor names
- Added streaming infrastructure to MMDiTRunner (enable/disable/compute)
- Added compute_streaming() that loads all joint_blocks before execution
- Wired MMDiTModel to DiffusionModel streaming interface

MMDiT structure:
- 24 joint_blocks (each with context_block + x_block)
- Global tensors: x_embedder, t_embedder, y_embedder, context_embedder, final_layer
WAN has sequential transformer blocks ideal for streaming:
- Added wan_layer_pattern() to parse blocks.N and vace_blocks.N tensor names
- Added streaming infrastructure to WanRunner (enable/disable/compute)
- Added compute_streaming() that loads all blocks before execution
- Wired WanModel to DiffusionModel streaming interface

WAN structure:
- 30-40 blocks.N (main transformer blocks)
- Optional vace_blocks.N (VACE interleaved blocks)
- Global tensors: patch_embedding, text_embedding, time_embedding, head
- Add qwen_image_layer_pattern() for 60 transformer_blocks
- Add zimage_layer_pattern() for context_refiner + noise_refiner + layers
- Add streaming infrastructure to QwenImageRunner and ZImageRunner
- Wire both models to DiffusionModel streaming interface
- Update compute() methods to accept skip_param_offload parameter

All 6 diffusion model architectures now support layer streaming.
- Add ref_latents and increase_ref_index parameters to compute_streaming
- Update FluxModel::compute_streaming to pass ref_latents
- Convert ref_latents to backend in preprocessing graph
- Handle ref_latents patchification and concatenation

Note: Flux streaming still has tensor context issue in preprocessing
that needs investigation.
The per-layer mini-graph approach was architecturally broken because:
1. GGML tensors are bound to their compute context
2. alloc_compute_buffer() resets context internally
3. Intermediate results cannot be passed between separate graphs

Changed to coarse-stage approach:
1. Load all model weights to GPU via streaming engine
2. Execute full compute graph with skip_param_offload=true
3. This matches the working UNet streaming implementation

Also added skip_param_offload parameter to FluxRunner::compute()
In layer_streaming mode, the cond_stage (T5) must be offloaded before
layer streaming begins, otherwise there won't be enough VRAM for the
diffusion model layers.

Changes:
- Set free_params_immediately=false for layer_streaming mode in CLI
  This enables smart offload logic instead of immediate param freeing
- Add explicit layer_streaming check in should_offload_cond_stage_for_diffusion()
  Forces T5 offload regardless of VRAM heuristics

Without this fix, T5 (~9GB) stays on GPU while layer streaming tries to
load Flux layers (~6.5GB), causing OOM on 12GB cards.

Tested with Flux Schnell Q4_K + T5XXL fp16 on RTX 3060 12GB:
- T5 properly offloaded after conditioning
- Layer streaming loads all 58 layers successfully
- Image generation completes without OOM
Implements the same coarse-stage layer streaming approach used by
Flux, MMDiT, UNet, and other models for the new Anima diffusion model.

Changes:
- tensor_registry.hpp: Add anima_layer_pattern() for net.blocks.N extraction
- anima.hpp: Add streaming engine, enable/disable/compute_streaming methods
- diffusion_model.hpp: Add AnimaModel streaming wrapper methods

Anima has 28 transformer blocks by default, similar in structure to
other DiT models, making it a good candidate for VRAM offloading on
memory-constrained systems.
AnimaConditioner:
- Add GPU offloading methods (is_params_on_gpu, move_params_to_cpu,
  move_params_to_gpu, get_params_vram_size, set_auto_offload)
  delegating to underlying LLM
- This enables proper VRAM management for Anima's Qwen3 text encoder

Layer streaming state consistency:
- Skip diffusion model state manipulation in layer_streaming mode
- The TensorRegistry uses direct buffer pointer swapping which leaves
  GGMLRunner's internal state (params_on_runtime_backend) out of sync
- Querying or manipulating diffusion offload state after streaming
  would cause crashes due to this inconsistency
- cond_stage offload still works normally (not managed by streaming)

Tested: Anima model generates identical output with and without
layer_streaming enabled (verified via MD5 hash comparison)
Problem: After layer streaming completes, all diffusion model layers
remain on GPU. For large models like QwenImage (8.6GB), this leaves
insufficient VRAM for VAE decoding.

Solution: Add offload_streaming_layers() method to all streaming-enabled
models that moves all layers back to CPU before VAE decode.

Changes:
- Add offload_streaming_layers() to DiffusionModel base interface
- Implement in all runners: UNet, MMDiT, Flux, Anima, Wan, QwenImage, ZImage
- Add override methods in all Model wrapper classes
- Call offload_streaming_layers() in stable-diffusion.cpp before VAE decode

This enables running models larger than VRAM:
- QwenImage Edit (16GB model) now runs on 12GB GPU via layer_streaming
- Tested: Anima streaming produces identical output with ~1% overhead
- Add staged forward methods to QwenImageModel:
  - forward_input_stage(): patchify + input projections
  - forward_single_block(): execute one transformer block
  - forward_output_stage(): norm + proj + unpatchify

- Implement compute_streaming_true() for QwenImage that:
  - Executes each of the 60 transformer blocks as a separate mini-graph
  - Stores intermediate img/txt tensors in CPU memory between blocks
  - Loads/offloads ~140MB per block during execution
  - Enables running 8.5GB+ models on 12GB VRAM GPUs

- Update all model architectures (Flux, MMDiT, Anima, WAN, ZImage, UNet)
  with improved VRAM checking in compute_streaming()

This is true per-layer streaming where only ONE block's weights plus
activation memory is needed at any time, enabling models larger than
available VRAM to run.

Tested with Qwen-Image-Edit-2509-Q3_K_S.gguf (8.5GB) on RTX 3060 12GB.
…utput read

Bug: When compute() was called with free_compute_buffer_immediately=true,
the buffer holding output tensors was freed before ggml_backend_tensor_get()
could read them, causing "CUDA error: invalid device ordinal".

Fixes:
1. alloc_compute_buffer() now returns graph via out_gf parameter for reuse
2. compute() reuses graph from alloc_compute_buffer to avoid tensor mismatch
3. copy_data_to_backend_tensor() skips tensors without allocated buffers
4. All TRUE per-layer streaming stages now use free_compute_buffer_immediately=false
   and manually call free_compute_buffer() after reading outputs

Affected models: Flux, MMDiT, Anima, UNet, ZImage, QwenImage
- Add estimate_vae_encode_vram() for VRAM estimation before encoding
- Add smart_offload_for_vae_encode() to offload cond_stage and diffusion
  models before VAE encode operations
- Call smart_offload_for_vae_encode() before all encode_first_stage() and
  vae_encode() calls across generate_image and generate_video paths:
  - img2img init image encoding
  - ref image encoding (for edit modes)
  - control net image encoding
  - video frame encoding (WAN, VACE, Anima)

This prevents OOM during VAE encoding of large images by freeing VRAM
from models not needed during the encode phase. With layer_streaming mode,
this allows encoding images that previously caused OOM.
Key changes:
- Add async prefetch methods to LayerExecutionEngine: prefetch_layer(),
  wait_for_prefetch(), wait_for_all_prefetches()
- Add AsyncLoadState struct and async layer load methods to TensorRegistry:
  start_async_layer_load(), complete_async_layer_load()
- Use ggml_backend_tensor_copy_async() to overlap memory transfers with
  GPU computation during TRUE per-layer streaming
- Update qwen_image.hpp to start prefetching next block before computing
  current block, reducing GPU idle time
- Fix sd_offload_config_t initialization with correct field order
- Offload diffusion model layers to CPU at startup when layer_streaming
  mode is enabled, freeing VRAM for LLM/CLIP conditioning

This enables overlapped memory transfers during per-layer streaming,
reducing periodic GPU pauses caused by blocking PCIe transfers.
Adds async prefetching pattern to overlap PCIe memory transfer with GPU
computation during layer streaming. Before computing each block, prefetch
the next block's weights asynchronously.

Models updated:
- Flux: double_blocks and single_blocks loops
- UNet: input_blocks and output_blocks loops
- MMDiT: joint_blocks loop
- ZImage: layers loop
- Anima: blocks loop

Note: WAN model doesn't have true per-layer streaming yet (uses full graph).
When using CFG (multiple model calls per diffusion step), the VRAM check
didn't account for layers already loaded on GPU. This caused the second
CFG call to see full VRAM and switch to slow TRUE per-layer streaming.

Now tracks already_on_gpu and only checks remaining_to_load against
available VRAM. Second+ CFG calls complete in ~0.15s instead of 3+ seconds.

Applied to all 7 architectures: Flux, UNet, MMDiT, ZImage, Anima, WAN, QwenImage
fszontagh added 25 commits March 6, 2026 14:29
Extract common layer streaming infrastructure from 7 runners into
GGMLRunner base class: init_streaming(), analyze_vram_budget(),
load_all_layers_coarse(), is_streaming_enabled(), disable_layer_streaming(),
offload_streaming_layers(), get_streaming_engine(). Each runner's
enable_layer_streaming() is now ~4 lines and compute_streaming() ~20 lines.

Remove streaming_enabled_ bool from all runners — standardize on checking
engine config flag. Remove SDCPP_FORCE_TRUE_STREAMING and
SDCPP_FORCE_COARSE_STREAMING debug env vars.

Convert all Javadoc /** */ blocks to minimal // style and strip @param,
@return, @brief tags across streaming infrastructure and runner files.

Remove component prefixes from LOG calls: [LayerStreaming], [Offload],
FluxRunner:, ZImageRunner:, MMDiTRunner:, UNetRunner:, WanRunner:,
AnimaRunner:, QwenImageRunner:, IntermediateTensorManager:,
LayerExecutionEngine:, MemoryBudgetManager:, TensorRegistry:.
Document all offload modes, layer streaming internals, supported
architectures, usage examples, and quality impact of each technique.
Merge 34 upstream commits including sd::Tensor pipeline migration,
fused SwiGLU kernel, sampler refactoring, VAE optimization, spectrum
caching, webp support, and embedded WebUI. Added StreamingParamConverter
bridge and raw-tensor build_graph/compute overloads to preserve all
offloading/layer streaming infrastructure alongside upstream's new API.
Merge 38 upstream commits including sd-webui style Hires.fix support,
DPM++ 2S A and er_sde samplers, ernie image and SDXS-09 model support,
flux2 small decoder, restricted torch legacy checkpoint loading, and
major refactors: tokenizer module split, model_io module, examples
common split into header/source, async vid_gen API. Ported our offload
configuration into the new SDContextParams in common.h/common.cpp.
Upstream's rewrite of the sample loop replaced the explicit streaming
branch with a single compute() call, which routes to the bulk-allocate
path and OOMs when the model exceeds VRAM. Add compute_dispatch() that
selects compute_streaming() when layer streaming is enabled and bridges
its ggml_tensor* output back into sd::Tensor<float> for the new sampler.
compute_dispatch was allocating a 256 MB CPU-backed ggml_context per
sampling call to receive the streaming output. Replace with a no_alloc
context whose tensor metadata points directly at the destination
sd::Tensor's memory, eliminating the per-step malloc/free of 256 MB.
The main streaming loop was hardcoded to prefetch only one layer ahead,
ignoring the configured prefetch depth. Replace with a sliding window
that primes the first N layers and refills the prefetch slot each step,
where N comes from streaming_engine_->get_config().prefetch_layers.
This finally makes the prefetch_layers knob actually do something.
Every per-block streaming loop (anima, flux double/single, mmdit,
qwen_image, unet input/output, z_image) was hardcoded to prefetch only
one block ahead, ignoring streaming_prefetch_layers. Add prime_prefetch
and advance_prefetch helpers to LayerExecutionEngine and route every
runner through them.
Each main layer was destroying and recreating the ggml_gallocr_t
between iterations, idling the GPU during the rebuild. All main blocks
have the same shape, so the same allocator can serve every block of
every sampling step. Free only when transitioning to the output stage.
apply_loras_at_runtime always wrapped each model (cond_stage, diffusion,
first_stage) with a MultiLoraAdapter, even when no LoRA tensors matched
that model's prefix. The empty adapter routed every linear/conv through
forward_with_lora() instead of the direct kernel path. Skip the wrap
when the matching lora_models list is empty so unaffected models keep
the fast direct path.
The TRUE per-layer streaming path was unconditionally evicting every
block back to CPU after each forward pass, even when there was plenty
of free VRAM left. For an 8-step generation that re-streams the entire
model 7 extra times.

Decide once, on the first sampling step, how many leading blocks fit
permanently in VRAM (after subtracting prefetch headroom + compute
buffer + safety margin) and skip the eviction for those indices. Later
steps' prime_prefetch starts at the first non-resident block, so the
cache prefix is hit for free. Pattern follows ComfyUI's
ModelPatcher.partially_load() — a static partition is simpler and
cheaper than dynamic eviction for the cyclic-sequential access pattern
of diffusion sampling.

Also fix MemoryBudgetManager::query_device_memory(): the SD_USE_CUDA
guard was dead code after PR leejet#1448 switched to runtime backend
discovery, so every build was returning the hardcoded 8 GB / 4 GB
fallback regardless of the real GPU. Use ggml_backend_dev_memory()
instead — works for CUDA, Vulkan, Metal.

For ZImage 8 steps at 688x1024 on RTX 3060 12 GB:
  before: 7.21s/step steady, 57.80s sampling
  after:  4.45s/step steady, 39.64s sampling (1.46x)

Same caching helper (compute_resident_block_count) added to
LayerExecutionEngine and applied to z_image, flux (double + single),
mmdit, anima, qwen_image. UNet (skip connections) and WAN (no
per-layer streaming yet) unchanged.
When weights live on CPU but get transferred to GPU during compute,
allocate the params buffer from the GPU device's pinned host buffer
type. This makes ggml_backend_tensor_copy_async actually overlap with
compute on CUDA — without it, the backend silently falls back to a
staged sync copy through an internal bounce buffer.

For ZImage 8 steps with the layer cache from the previous commit:
  before: step1 8.10s, steady 4.45s, sampling 39.64s
  after:  step1 5.61s, steady 3.97s, sampling 33.95s (1.17x on top of cache)

Cold step gets the bigger win (-31%) because all 30 layers stream
once. Steady-state gain is smaller (-11%) because each streamed layer
still triggers a fresh cudaMalloc that serializes against the copy
stream — fixing that requires a buffer pool in tensor_registry, which
is a separate change.

One-time cost: model load takes longer because page-locking 11.7 GB of
host memory is slower than allocating pageable. Amortizes immediately
for any service that does more than one generation per load.

Falls back to pageable allocation if pinned alloc fails (system out of
locked pages). Applies to any GGMLRunner where params live on CPU but
runtime is GPU — diffusion model and CPU-resident LoRAs benefit;
clip-on-cpu paths skip cleanly because their runtime is also CPU.
The per-layer streaming loop bounces ~22 MB of activations through host
RAM between every layer (download txt_img output, re-upload as next
layer's input). With std::vector backing, the CUDA backend stages
those transfers through an internal pinned bounce buffer, which costs
roughly 16 ms per layer = 474 ms per sampling step.

Allocate the persistent_txt_img and persistent_t_emb backing storage
in a single GPU-pinned host buffer (via ggml_backend_dev_host_buffer_type)
so the same get/set calls run at full PCIe bandwidth. Falls back to
pageable std::vector if pinned alloc fails.

Also adds an opt-in per-step profile (SDCPP_STREAM_PROFILE=1) that
breaks out wait/load/advance/compute/tensor_get timings — used to
identify this hotspot and measure the fix.

For ZImage 8 steps at 688x1024 on RTX 3060 12 GB, prefetch=2:
  before: 33.95s sampling, ~3.97s/step steady, tensor_get=474 ms/step
  after:  29.32s sampling, ~3.45s/step steady, tensor_get=100 ms/step

Cumulative speedup across the layer-streaming work in this branch
(P1 cache + P2 pinned weights + P3a pinned activations): 58.31s → 29.32s,
just under 2x on the sampling loop for an 11.5 GB bf16 model on a 12 GB GPU.

The dominant remaining cost is `compute` itself (2.7 s/step), which is
graph build + gallocr + dispatch. Reducing that needs graph reuse
across layers — separate change.
The per-layer streaming loop was rebuilding the same DiT block graph
30 times per sampling step — same operations, just different weight
tensor instances. Profiling showed ~810 ms of pure CPU-side work per
step in graph build + gallocr (no GPU activity, GPU at 17 W / 46 °C).

Build the cgraph once for layer 0 and reuse it for layers 1..29 by
swapping the runtime tensor pointers (buffer/data/extra) between
layer 0 and layer N before each dispatch, then swapping back before
move_layer_to_cpu. All 30 main blocks share an identical
JointTransformerBlock structure, so the cached graph references valid
ops once layer N's data sits behind layer 0's tensor pointers.

Two new pieces:
- TensorRegistry::swap_layer_buffers(a,b) — exchanges the runtime
  buffer/data/extra fields between two structurally-identical layers.
- GGMLRunner::dispatch_cached_graph(gf) — runs alloc_graph + uploads
  + compute on a graph that's still alive in compute_ctx, skipping
  the build/reset cycle that compute() does each call.

Disabled when an at-runtime WeightAdapter (LoRA) is attached to the
runner: forward_with_lora() bakes layer-specific prefixes into the
adapter ops at graph-build time, so a cached graph would always apply
layer 0's LoRA delta to every layer. The fallback path is the
existing per-layer rebuild — bytewise identical output to before
this change (verified by md5 of the test image), so this is a free
improvement for non-LoRA workloads with zero risk for LoRA ones.

For ZImage 8 steps at 688x1024 on RTX 3060 12 GB, prefetch=2:
  with LoRA (fallback path):  29.19 s sampling (matches prior P3a)
  without LoRA (reuse active): 22.84 s sampling (1.28x vs P3a)
  steady step compute:         2710 ms -> 1890 ms (-30%)

Cumulative on the layer-streaming path vs the original baseline:
  with LoRA:    58.31s -> 29.19s (2.00x)
  without LoRA: 58.31s -> 22.84s (2.55x)

Also adds an opt-in per-step profile (SDCPP_STREAM_PROFILE=1) that
breaks out wait/load/advance/compute/tensor_get — used to identify
the build-cost hotspot this change targets.
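An illustrative sketch of the pointer-swap idea this commit describes (the real TensorRegistry::swap_layer_buffers also updates registry bookkeeping, and how the tensor pairs are discovered is assumed here):

```cpp
// Swap the runtime storage of two structurally identical layers so the cached
// layer-0 graph reads layer N's weights without being rebuilt.
#include "ggml.h"
#include <utility>
#include <vector>

// pairs of structurally identical tensors: (layer 0 tensor, layer N tensor)
void swap_layer_buffers(std::vector<std::pair<ggml_tensor *, ggml_tensor *>> & pairs) {
    for (auto & [a, b] : pairs) {
        std::swap(a->buffer, b->buffer);  // which backend buffer owns the data
        std::swap(a->data,   b->data);    // device pointer the graph ops will read
        std::swap(a->extra,  b->extra);   // backend-specific metadata (e.g. CUDA)
    }
}

// usage: for each streamed layer N, swap its weights behind layer 0's tensors,
// dispatch the cached layer-0 graph, then swap back before evicting layer N.
```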
z_image already pinned its persistent_txt_img / persistent_t_emb host
buffers (commit 9168495). The other DiT runners (flux, mmdit, anima,
qwen_image) still backed their per-block streaming activations with
pageable std::vector, forcing the CUDA backend to stage every
ggml_backend_tensor_get and copy_data_to_backend_tensor through an
internal bounce buffer.

Promote the pinning machinery onto GGMLRunner as a shared
ensure_pinned_act_buffers(sizes_bytes, out_ptrs) helper that allocates
a single GPU-pinned host buffer big enough for all the runner's
persistent activation regions and hands back 256-byte aligned start
pointers. Buffer is freed in ~GGMLRunner; falls back to pageable
std::vector if pinned alloc fails (output stays correct, just slower).

Each runner now declares its persistent_<name> regions as float* into
that shared buffer, with std::vector<float> fallbacks. Refactored
z_image to use the shared helper too — same bit-exact output as before
(verified: md5 of /tmp/bench_pin_smoke.png matches the previous P3a
baseline image).

For ZImage 8 steps at 688x1024 on RTX 3060 12GB, prefetch=2:
  before refactor: 29.32s sampling (already pinned)
  after refactor:  30.02s sampling (within run-to-run noise)

The bigger story: flux/mmdit/anima/qwen_image streaming users now get
the same ~10-15% activation-transfer speedup that z_image got from
P3a. Can't bench those directly without their respective models, but
the change is purely host-side memory allocation — same code path
ggml uses everywhere.
Streaming runs the K resident-on-GPU layers through one combined ggml graph
per step instead of building+dispatching a fresh tiny graph per layer. The
streamed-tail layers still use per-layer dispatch since their weights swap
in/out and topologies differ.

Adds a separate ggml_context, gallocr, and cgraph for the chunk on
ZImageRunner so the graph survives compute_ctx resets between streamed-tail
calls. Inputs (txt_img, t_emb, pe) are bound at chunk-graph build time and
re-uploaded each step via ggml_backend_tensor_set.

Measured on RTX 3060 / z_image_turbo bf16 / 8 steps:
  P3a baseline: 29.32s
  + chunk graph: 28.34s  (~3%)

Pixel-exact vs P3a baseline (md5 f54bf459...). Compounds with the dual-stream
H2D overlap on feature/pcie-overlap.
The Phase 4 chunk graph caches its input tensors (txt_img, t_emb, pe) with
the shapes from the first build, but token sequence length depends on the
prompt — different prompts produce different txt_img_ne[1]. Reusing the
cached graph in subsequent generate_image() calls left ggml_backend_tensor_set
writing the wrong byte count, the compute then ran on tensors with garbage
shape metadata, and ZImage layers eventually hit a divide-by-zero (SIGFPE).

Visible as sdcpp-restapi crashing on the second queue job.

Compares cached chunk_txt_img_in_/chunk_t_emb_in_/chunk_pe_ shapes against
the current call's; rebuilds the chunk graph if any shape (or the resident
layer count) differs.
The Phase 4 chunk-graph code (build / dispatch / shape-match / free) was
inlined into ZImageRunner. Moves it into a reusable helper in a new
src/chunk_graph.hpp so other DiT runners (flux, mmdit, anima, qwen_image,
unet, wan) can adopt it later by providing only:
  - the input shape vector,
  - a build callback that wires K layers using the supplied input tensors,
  - per-dispatch host data pointers.

The helper owns its own ggml_context + gallocr + cgraph, handles the cache
staleness rebuild from 4f445e2 internally, and exposes output() so callers
can read back the resulting tensor's shape.

ZImageRunner now stores a single LayerStreaming::ChunkGraph and provides
a small dispatch_resident_chunk wrapper that supplies the z_image-specific
build lambda (forward_layer_block over K resident layers).

Pixel-exact output preserved (verified vs P3a baseline).
GGMLRunner::prepare_build_in_tensor_before() creates two scalar tensors
(":one" / ":zero_int") on compute_ctx that op helpers like ggml_ext_full,
ggml_ext_zeros, ggml_ext_ones, and ggml_ext_cast_f32 look up by name via
ggml_get_tensor. The chunk graph uses a separate chunk_ctx_ that survives
across compute() calls, and those named tensors were never created on it
— so any lookup returned null and the next op SEGV'd.

Reproduces with short prompts: ggml_ext_attention_ext takes a KV-pad branch
that calls ggml_ext_full to build a -INF mask. Long prompts happen to
satisfy the alignment and skip that branch, which is why the bug stayed
hidden until a "a cat"-class prompt hit per-layer streaming with the chunk
graph engaged.

Mirrors prepare_build_in_tensor_before/after on chunk_ctx_: creates the
two named tensors before build_fn runs, adds them to the graph after, and
uploads the constant scalar values (1.0f / 0i) on every dispatch.
apply_loras_at_runtime() creates a fresh MultiLoraAdapter per call, replacing
the diffusion model's weight_adapter shared_ptr. The cached chunk graph still
holds raw ggml_tensor* references into ops emitted by the previous adapter —
once the old adapter is destroyed, those tensors are freed and the cache
becomes a use-after-free trap.

Adds an opaque state_token parameter to ChunkGraph::ensure_built that gets
compared alongside K and shapes; mismatch frees the cache and rebuilds. The
caller (z_image) fingerprints its weight_adapter pointer plus the runner
boolean flags (flash_attn / conv2d_direct / circular_x / circular_y) into
the token, so any of those changing across queue jobs forces a rebuild.

This is the third in a series fixing pre-existing Phase 4 bugs:
- 4f445e2: shape staleness across jobs (different prompt token counts)
- 836b0b1: missing build-in tensors (one / zero_int) in chunk_ctx
- this:    weight_adapter use-after-free across LoRA swaps
Phase 4's chunk graph and the resident-layer cache held GPU memory across
generate_image() calls indefinitely:

- The cached chunk graph kept its compute buffer (~500 MB) and references
  into the resident layers' GPU tensors.
- resident_layer_count_ was set once and never reset, so every subsequent
  call left the same 19 layers (~7.7 GB) on GPU even after
  offload_streaming_layers() evicted them. The chunk graph then carried
  pointers into the freed memory.

In long-running processes (sdcpp-restapi) with LoRA at_runtime, every
generation creates a fresh MultiLoraAdapter — state_token changes, so
ChunkGraph rebuilds. Each rebuild called clear() but the previous cache
plus stale pointers from earlier jobs accumulated VRAM until cudaMalloc
failed mid-generation (saw 9.8 GB used / 0.6 GB free after 4 jobs, OOM on
job 5).

Adds a virtual on_streaming_layers_offloaded() hook in GGMLRunner, called
at the end of offload_streaming_layers(). ZImageRunner overrides it to
clear chunk_graph_ and reset resident_layer_count_ so the next generation
recomputes the resident set against the actual free VRAM and builds a
clean chunk graph.

Verified on RTX 3060: 4 batch=4 / 12-step LoRA jobs back-to-back, VRAM
holds steady at ~9.7 GB free between jobs (was 0.6 GB before), per-job
time stable at 180-184s, no OOM. Within-generation reuse (12 steps × 4
batch images = 48 dispatches share one chunk graph) is preserved, so the
sampling speed is unchanged.
Brings in upstream's leejet#1476 max-vram graph-cut segmented param offload
alongside our layer-streaming work. Both mechanisms coexist:

- `--offload-mode layer_streaming` (ours) — per-layer streaming with
  prefetch, chunk graph for resident block. ~2× faster than graph-cut
  on bf16/12GB GPU based on A/B bench (175s vs 335s generate_image).
- `--offload-to-cpu --max-vram <GiB>` (upstream leejet#1476) — static
  segment plan, swap params per segment.

Notable conflict resolutions in src/ggml_extend.hpp:
- Kept upstream's two-step prepare_compute_graph + alloc_compute_buffer(gf)
  as the canonical path; added a backward-compatible single-call overload
  `alloc_compute_buffer(get_graph_cb_t, ggml_cgraph**)` that wraps both
  for the layer-streaming caller.
- copy_data_to_backend_tensor() ggml_cgraph parameter is now optional
  (default nullptr) — graph-cut passes the graph for filtering, layer
  streaming passes nullptr to upload everything from the map.
- free_compute_buffer() does both restore_partial_params/restore_all_params
  (upstream graph-cut cleanup) and our auto_offload_after_compute hook.
- compute() body takes upstream's graph-cut-aware dispatch verbatim.

Renamed our internal helpers to match upstream:
- offload_params_to_runtime_backend → offload_all_params
- offload_params_to_params_backend → restore_all_params

Dropped sd_ctx_params_t::flow_shift; upstream moved it to
sd_sample_params_t (sd_sample_params_init initialises it there).

Verified: Z-Image-Turbo Q8 builds and runs end-to-end on both paths
(layer_streaming and --max-vram) in this binary.
The destructor previously released runtime_params_buffer but missed
partial_runtime_params_buffer (the buffer used by the segmented param
offload path added in leejet#1476). On runner destruction with --max-vram
active, that GPU memory leaked.

Same class of leak as the existing runtime_params_buffer fix.
fszontagh force-pushed the feature/vram-offloading-v2 branch from f6815d6 to dc8e9e2 on May 6, 2026 19:46
fszontagh added 4 commits May 6, 2026 22:20
Per-layer streaming runs many short kernels and waits on each one. The
CUDA driver default schedule (cudaDeviceScheduleAuto) often picks Spin,
which busy-waits one host thread on each kernel return - shows as 100%
on one CPU core in top/nvtop even though the wait is idle work.

Document two fixes: CUDA_DEVICE_SCHEDULE=BlockingSync env var for
single-shot CLI runs, or cudaSetDeviceFlags(cudaDeviceScheduleBlockingSync)
at process startup for long-lived servers.

No code change here - just user-facing guidance to avoid the
"why is my CPU at 100%" question.
The frontend submodule pointer carried over from our fork was a SHA
from an older repo (leejet/stable-ui) that doesn't exist on
leejet/sdcpp-webui (the URL declared in .gitmodules). CI couldn't fetch
it and every job failed at the submodule init step.

Sync to upstream master's SHA (797ccf8). The webui isn't part of the
offload work and we don't need a fork-local version on this branch.
Two related budget-planner fixes for our streaming path:

1. Propagate --max-vram into MemoryBudgetManager so the same flag drives
   both leejet's graph-cut path and our layer-streaming planner. Lets
   users simulate a smaller card without a separate flag. The cap is
   applied via init_streaming() after the engine is created so it
   survives whichever order set_max_graph_vram_bytes() and the engine
   construction happen in.

2. Reserve a compute-buffer slice (default 768 MB, matches
   compute_resident_block_count's existing convention) when deciding
   coarse-stage vs per-layer in analyze_vram_budget(). Without this,
   params can fit in capped VRAM but params + CB tip over mid-step
   and crash cudaMalloc — visible on SDXL 1024x1024 with --max-vram 6
   where the compute graph wants 830 MB on top of 4.79 GB params.
UNet's compute_streaming had four bugs that didn't surface until SDXL
+ --max-vram pushed the planner into per-layer mode:

1. Coarse-stage path called regular compute() without
   skip_param_offload=true, double-allocating UNet params on the
   runtime backend (4.79 GB ZImage, 4.79 GB SDXL). Other architectures
   already pass true; only unet.hpp was missing it.

2. forward_input_block() called resblock_forward() for every
   input_blocks.X.0 entry, but at indices 3 and 6 the slot is a
   DownSampleBlock — the dynamic_pointer_cast<ResBlock> returned
   null and the next forward() segfaulted silently. Now dispatches
   DownSampleBlock vs ResBlock by actual type.

3. forward_output_block() called attention_layer_forward() for
   output_blocks.X.1, but on SD1.x's deepest output block (no
   attention at that resolution) the slot holds an UpSampleBlock,
   producing the same null-cast crash. Now walks .1 and .2 once
   each and dispatches UpSampleBlock vs SpatialTransformer by type.

4. get_num_input_blocks()/get_num_output_blocks() returned a
   hardcoded 12. SDXL has 9, tiny_unet variants have gaps. Replaced
   with a scan of the blocks map for the actual max index, so the
   streaming loop iterates over indices the model actually has.

Verified with --max-vram cap forcing per-layer streaming on SDXL
1024x1024, SD1.5 512x512, plus regression on Z-Image bf16, Z-Image
Q8, Flux schnell, Chroma, Anima, Qwen Image, and SD3.5 Large.