68 commits
6bc4025
feat: add dynamic VRAM offloading for large models
fszontagh Feb 24, 2026
d2f2836
feat(cli): add --offload-mode option for dynamic VRAM offloading
fszontagh Feb 24, 2026
ef40d61
fix: prevent SEGV when GPU reload fails during offload
fszontagh Feb 25, 2026
57a12c0
fix: offload cond_stage before LoRA application when using offload mode
fszontagh Feb 25, 2026
e6ea65e
Fix LoRA + offload VRAM conflict with retry mechanism
fszontagh Feb 25, 2026
c42b659
Use memory-based offload for LoRA instead of disk reload
fszontagh Feb 25, 2026
aa51517
Fix GPU memory leak in GGMLRunner destructor
fszontagh Feb 25, 2026
7839a54
Add smart VAE offload with dry-run VRAM estimation
fszontagh Feb 25, 2026
b454eb1
Implement smart VRAM-based offload decisions
fszontagh Feb 25, 2026
1febdbe
Add configurable reload_diffusion option for post-generation behavior
fszontagh Feb 25, 2026
f324b03
Add CLI options for all offload configuration settings
fszontagh Feb 25, 2026
af8f5fa
Add granular tensor offloading infrastructure
fszontagh Feb 28, 2026
cb82950
Fix layer registration: use tensor map instead of raw GGML context
fszontagh Mar 1, 2026
d3b989d
Skip bulk GPU allocation in layer_streaming mode + improve tensor tra…
fszontagh Mar 1, 2026
55a837a
Fix streaming mode: add to_backend conversion and skip_param_offload
fszontagh Mar 1, 2026
d109d00
Add layer streaming support for MMDiT/SD3
fszontagh Mar 1, 2026
5235117
Add layer streaming support for WAN video models
fszontagh Mar 1, 2026
b013674
Add layer streaming support for QwenImage and ZImage
fszontagh Mar 1, 2026
26c5e12
Add ref_latents support to Flux streaming (WIP)
fszontagh Mar 1, 2026
7a8dd53
Fix Flux streaming: use coarse-stage approach like UNet
fszontagh Mar 1, 2026
2037666
Fix layer_streaming mode: force T5 offload before diffusion
fszontagh Mar 1, 2026
07c08e1
Add layer streaming support for Anima model
fszontagh Mar 1, 2026
8272591
Fix AnimaConditioner offloading and layer_streaming state consistency
fszontagh Mar 1, 2026
a1b486a
Add offload_streaming_layers to free GPU memory before VAE decode
fszontagh Mar 1, 2026
258bb14
Implement true per-layer streaming for QwenImage
fszontagh Mar 1, 2026
c3e52a6
Fix TRUE per-layer streaming: defer compute buffer free until after o…
fszontagh Mar 2, 2026
c66f0d0
Add pre-VAE-encode offloading to prevent OOM during image encoding
fszontagh Mar 2, 2026
ebb8ddb
Implement async layer prefetching for layer streaming mode
fszontagh Mar 2, 2026
142013c
Add async prefetching to all TRUE per-layer streaming models
fszontagh Mar 2, 2026
582acb3
Fix CFG causing redundant model loading in layer streaming mode
fszontagh Mar 3, 2026
e220c67
Fix ZImage TRUE per-layer streaming: load refiner layers before refin…
fszontagh Mar 3, 2026
be36ea0
Disable broken TRUE per-layer streaming for ZImage, fall back to norm…
fszontagh Mar 3, 2026
10ee7a6
Add comprehensive debug logging for ZImage TRUE per-layer streaming
fszontagh Mar 3, 2026
e546fa6
Add extensive debugging for ZImage TRUE per-layer streaming
fszontagh Mar 3, 2026
7e59edb
Remove misleading input buffer check after GGML compute
fszontagh Mar 3, 2026
2ad9c8c
Clean up ZImage TRUE per-layer streaming debug code
fszontagh Mar 3, 2026
6fd7efa
Reduce verbose DEBUG logging in layer streaming
fszontagh Mar 3, 2026
88fd0b2
Fix t_emb buffer aliasing in ZImage TRUE per-layer streaming
fszontagh Mar 4, 2026
117f647
Fix non-streaming offload modes (cond_only, cond_diffusion, aggressiv…
fszontagh Mar 4, 2026
98d7f6c
Deduplicate streaming code into GGMLRunner and align style with upstream
fszontagh Mar 6, 2026
1ad143c
Add VRAM offloading documentation
fszontagh Mar 6, 2026
f865af4
Merge upstream/master: sd::Tensor migration, webp, spectrum caching
fszontagh Apr 4, 2026
8c3ad49
Merge upstream/master: hires fix, more samplers, tokenizer split
fszontagh Apr 27, 2026
77127d8
Fix layer streaming dispatch lost during upstream merge
fszontagh Apr 29, 2026
39fca39
Avoid 256 MB scratch alloc per streaming dispatch
fszontagh Apr 29, 2026
0da04f1
Honour streaming_prefetch_layers in z_image streaming loop
fszontagh Apr 29, 2026
b759cd2
Honour streaming_prefetch_layers across all DiT/UNet runners
fszontagh Apr 29, 2026
b705b36
Reuse compute buffer across z_image streaming layers
fszontagh Apr 29, 2026
7114e8c
Skip empty MultiLoraAdapter when no LoRAs target a model
fszontagh Apr 29, 2026
0509ad9
Merge remote-tracking branch 'upstream/master' into feature/vram-offl…
fszontagh May 1, 2026
e53f621
Cache resident layers across sampling steps in DiT streaming runners
fszontagh May 4, 2026
71e9c77
Allocate streamed weights in pinned host memory
fszontagh May 4, 2026
9168495
Pin host activation buffers in z_image streaming loop
fszontagh May 4, 2026
41c3ca2
Reuse a single layer graph across all z_image streaming layers
fszontagh May 4, 2026
b029a77
Revert "Reuse a single layer graph across all z_image streaming layers"
fszontagh May 4, 2026
00086a2
Pin host activation buffers across all DiT streaming runners
fszontagh May 4, 2026
44c1f99
Build a chunk graph for resident z_image streaming layers
fszontagh May 4, 2026
4f445e2
Rebuild z_image chunk graph when input shapes change
fszontagh May 5, 2026
857e9e0
Extract chunk-graph machinery into shared LayerStreaming::ChunkGraph
fszontagh May 5, 2026
836b0b1
ChunkGraph: create runner build-in tensors on the chunk context
fszontagh May 5, 2026
551ab2d
ChunkGraph: invalidate cache when weight_adapter or runner flags change
fszontagh May 5, 2026
43974de
Drop chunk graph + reset resident layers on layer offload
fszontagh May 5, 2026
2ad56ac
Merge upstream master into feature/vram-offloading-v2
fszontagh May 6, 2026
dc8e9e2
Free partial_runtime_params_buffer in GGMLRunner destructor
fszontagh May 6, 2026
1e9c287
docs(vram_offloading): note CPU spin-wait when --offload-mode is active
fszontagh May 6, 2026
6bcada3
Reset examples/server/frontend submodule to upstream's SHA
fszontagh May 6, 2026
5b19131
Hook --max-vram into layer-streaming budget + reserve CB headroom
fszontagh May 6, 2026
0fd40e5
Fix UNet layer_streaming under tight VRAM cap (SDXL/SD1.x)
fszontagh May 6, 2026
112 changes: 112 additions & 0 deletions docs/vram_offloading.md
@@ -0,0 +1,112 @@
# VRAM Offloading

Run models larger than your GPU memory by offloading weights to CPU RAM during generation.

## Offload Modes

Use `--offload-mode <mode>` to select the offloading strategy:

| Mode | Description | VRAM Usage | Speed | Quality |
|------|-------------|------------|-------|---------|
| `none` | Everything stays on GPU (default) | Highest | Fastest | No penalty |
| `cond_only` | Offload text encoder after conditioning | High | Near-full speed; only a brief transfer between conditioning and diffusion | No penalty |
| `cond_diffusion` | Offload both text encoder and diffusion model between stages | Medium | Slower; models are reloaded to the GPU before each stage that needs them | No penalty |
| `aggressive` | Aggressively offload all components when not in use | Low | Slowest of the non-streaming modes; frequent CPU↔GPU transfers | No penalty |
| `layer_streaming` | Stream transformer layers one-by-one through GPU | Lowest | Depends on model size (see below) | No penalty when using coarse-stage; per-layer streaming is lossless for most architectures |

The `--offload-to-cpu` flag is a shortcut that picks a reasonable offload mode automatically.
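
These modes are also available through the library API. A minimal sketch using the `sd_offload_config_t` added to `stable-diffusion.h` in this PR (the `sd_ctx_params_init`/`new_sd_ctx` entry points are upstream's; model-path setup is elided):

```c
#include "stable-diffusion.h"

sd_ctx_params_t params;
sd_ctx_params_init(&params);                    /* upstream defaults */
sd_offload_config_init(&params.offload_config); /* offload defaults from this PR */

/* Equivalent of --offload-mode layer_streaming on the CLI: */
params.offload_config.mode                      = SD_OFFLOAD_LAYER_STREAMING;
params.offload_config.streaming_prefetch_layers = 1; /* see --streaming-prefetch */

/* ... set model paths as usual, then: */
sd_ctx_t* ctx = new_sd_ctx(&params);
```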

## Layer Streaming

Layer streaming is the most memory-efficient mode. Instead of loading the entire diffusion model into VRAM, it loads one transformer block at a time.

### How it works

1. **Coarse-stage**: If the model fits in VRAM (e.g., quantized models), all layers are loaded at once and the full graph is executed normally. This is as fast as `--offload-mode none` with no quality penalty — the only overhead is the initial CPU→GPU weight transfer.
2. **Per-layer streaming**: If the model doesn't fit (e.g., bf16 models on small GPUs), each transformer block is loaded, executed as a mini-graph, then offloaded back to CPU before the next block (see the sketch below). This uses minimal VRAM but is significantly slower due to per-step CPU↔GPU transfers. Output quality is identical to full-model execution: the computation is mathematically equivalent, just split across separate graph evaluations.

The mode is chosen automatically based on available VRAM.
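
To make the per-layer path concrete, here is an illustrative sketch of the streaming loop. The helper names are hypothetical; the real logic lives in the `GGMLRunner`/`LayerStreaming` code added by this PR:

```c
/* Illustrative pseudocode: hypothetical helpers, not the actual runner API. */
for (int i = 0; i < n_layers; i++) {
    wait_until_on_gpu(layers[i]);       /* upload was started earlier by prefetch */
    if (i + 1 < n_layers) {
        prefetch_to_gpu(layers[i + 1]); /* overlap the next upload with compute */
    }
    hidden = run_block_graph(layers[i], hidden); /* one block as a mini-graph */
    offload_to_cpu(layers[i]);          /* release VRAM before the next block */
}
```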

### Supported architectures

- Flux (double_blocks + single_blocks)
- ZImage / Z-Image-Turbo (context_refiner + noise_refiner + layers)
- MMDiT / SD3 (joint_blocks)
- UNet / SD1.x / SDXL (input_blocks + middle_block + output_blocks)
- Anima (blocks)
- WAN (blocks + vace_blocks)
- Qwen Image (transformer_blocks)

### Examples

#### ZImage-Turbo Q8 with layer streaming

```
sd-cli --diffusion-model z_image_turbo-Q8_0.gguf \
--llm Qwen3-4b-Z-Engineer-V2.gguf \
--vae ae.safetensors \
-p "a cat" --cfg-scale 1.0 --diffusion-fa \
-H 1024 -W 688 -s 42 \
--offload-mode layer_streaming -v
```

The Q8 model (6.7 GB) fits in a 12 GB GPU, so coarse-stage streaming is used automatically:
```
[INFO ] z_image model fits in VRAM, using coarse-stage streaming
[INFO ] z_image coarse-stage streaming completed in 1.66s
```

#### Flux-dev Q4 with layer streaming

```
sd-cli --diffusion-model flux1-dev-q4_0.gguf \
--vae ae.safetensors \
--clip_l clip_l.safetensors \
--t5xxl t5xxl_fp16.safetensors \
-p "a lovely cat" --cfg-scale 1.0 --sampling-method euler \
--offload-mode layer_streaming -v
```

#### SD1.5 with aggressive offloading

```
sd-cli -m sd-v1-4.ckpt \
-p "a photograph of an astronaut riding a horse" \
--offload-mode aggressive -v
```

## Combining with other options

- `--diffusion-fa`: Flash attention reduces VRAM further. Recommended with all offload modes. No quality penalty.
- `--clip-on-cpu`: Run CLIP text encoder on CPU. Saves VRAM but slows conditioning. No quality penalty.
- Quantized models (`q4_0`, `q8_0`, etc.) reduce model size, making coarse-stage streaming more likely (faster). **Quantization does reduce output quality** — lower bit depths produce softer details and may introduce artifacts. See [quantization](./quantization_and_gguf.md) for quality comparisons. `q8_0` is nearly indistinguishable from full precision; `q4_0` and below show visible degradation on fine details.

## Quality impact summary

| Technique | Quality Impact |
|-----------|---------------|
| `--offload-mode` (any mode) | **None** — offloading only changes where weights are stored, not the computation |
| `--diffusion-fa` (flash attention) | **None** — mathematically equivalent, just more memory-efficient |
| `--clip-on-cpu` | **None** — same computation on CPU instead of GPU |
| Quantization (`q8_0`) | **Negligible** — nearly identical to full precision |
| Quantization (`q4_0`, `q4_k`) | **Minor** — slight softening, fine details may differ |
| Quantization (`q3_k`, `q2_k`) | **Noticeable** — visible quality loss, best for previews or VRAM-constrained setups |

## Troubleshooting

- **OOM during generation**: Try a more aggressive mode. `layer_streaming` uses the least VRAM.
- **Slow generation**: Coarse-stage streaming (model fits in VRAM) is nearly as fast as no offloading. Per-layer streaming is slower due to CPU-GPU transfers each step. Using quantized models often lets you stay in coarse-stage mode.
- **Black or corrupted output**: This is a bug. Please report it with the model, offload mode, and resolution used.
- **One CPU core pegged at 100% while the GPU is working**: this is the CUDA driver spin-waiting on kernel completion. The default schedule policy (`cudaDeviceScheduleAuto`) often picks `Spin` for short-kernel workloads like per-layer streaming, which busy-waits one host thread for each kernel return. It does *not* slow generation down (the wait is wasted heat, not blocking work), but it looks bad on `top`/`nvtop` and is unfriendly to shared-host setups. Two ways to silence it:

1. Per-run, no rebuild needed:
```
CUDA_DEVICE_SCHEDULE=BlockingSync sd-cli ...
```
2. Per-process, set once at startup:
```c
cudaSetDeviceFlags(cudaDeviceScheduleBlockingSync);
```
Long-lived processes (REST servers, queue workers) should do this.

With either fix, CPU usage drops to near zero and GPU performance is unchanged.
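
For long-lived processes, a fuller sketch of option 2 (`cudaSetDeviceFlags` and its error codes are real CUDA runtime API; the wrapper function is illustrative, and the call must happen before the first CUDA context is created):

```c
#include <cuda_runtime.h>
#include <stdio.h>

/* Call once at startup, before new_sd_ctx() or any other GPU work. */
static void set_blocking_sync(void) {
    cudaError_t err = cudaSetDeviceFlags(cudaDeviceScheduleBlockingSync);
    if (err != cudaSuccess) {
        /* cudaErrorSetOnActiveProcess means a context already exists,
         * i.e. this ran too late in startup. */
        fprintf(stderr, "cudaSetDeviceFlags: %s\n", cudaGetErrorString(err));
    }
}
```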
5 changes: 4 additions & 1 deletion examples/cli/main.cpp
@@ -698,7 +698,10 @@ int main(int argc, const char* argv[]) {
vae_decode_only = false;
}

sd_ctx_params_t sd_ctx_params = ctx_params.to_sd_ctx_params_t(vae_decode_only, true, cli_params.taesd_preview);
// For layer_streaming mode, we need smart offload logic instead of immediate freeing
// This allows should_offload_cond_stage_for_diffusion() to be called and offload T5 before streaming
bool free_params_immediately = (ctx_params.offload_config.mode != SD_OFFLOAD_LAYER_STREAMING);
sd_ctx_params_t sd_ctx_params = ctx_params.to_sd_ctx_params_t(vae_decode_only, free_params_immediately, cli_params.taesd_preview);

SDImageVec results;
int num_results = 0;
100 changes: 99 additions & 1 deletion examples/common/common.cpp
@@ -538,6 +538,78 @@ ArgOptions SDContextParams::get_options() {
return 1;
};

auto on_offload_mode_arg = [&](int argc, const char** argv, int index) {
if (++index >= argc) {
return -1;
}
const char* arg = argv[index];
offload_config.mode = str_to_offload_mode(arg);
if (offload_config.mode == SD_OFFLOAD_MODE_COUNT) {
LOG_ERROR("error: invalid offload mode %s", arg);
return -1;
}
return 1;
};

auto on_vram_estimation_arg = [&](int argc, const char** argv, int index) {
if (++index >= argc) {
return -1;
}
const char* arg = argv[index];
offload_config.vram_estimation = str_to_vram_estimation(arg);
if (offload_config.vram_estimation == SD_VRAM_EST_COUNT) {
LOG_ERROR("error: invalid VRAM estimation method %s", arg);
return -1;
}
return 1;
};

auto on_streaming_prefetch_arg = [&](int argc, const char** argv, int index) {
if (++index >= argc) {
return -1;
}
try {
offload_config.streaming_prefetch_layers = std::stoi(argv[index]);
if (offload_config.streaming_prefetch_layers < 0) {
LOG_ERROR("error: streaming prefetch must be >= 0");
return -1;
}
} catch (...) {
LOG_ERROR("error: invalid streaming prefetch value %s", argv[index]);
return -1;
}
return 1;
};

auto on_streaming_min_vram_arg = [&](int argc, const char** argv, int index) {
if (++index >= argc) {
return -1;
}
try {
int mb = std::stoi(argv[index]);
if (mb < 0) {
LOG_ERROR("error: streaming min VRAM must be >= 0");
return -1;
}
offload_config.streaming_min_free_vram = static_cast<size_t>(mb) * 1024 * 1024;
} catch (...) {
LOG_ERROR("error: invalid streaming min VRAM value %s", argv[index]);
return -1;
}
return 1;
};

options.bool_options.push_back({"", "--offload-log", "log offload events", true, &offload_config.log_offload_events});
options.bool_options.push_back({"", "--no-offload-log", "do not log offload events", false, &offload_config.log_offload_events});
options.bool_options.push_back({"", "--offload-cond-stage", "offload cond stage to CPU after use", true, &offload_config.offload_cond_stage});
options.bool_options.push_back({"", "--no-offload-cond-stage", "do not offload cond stage", false, &offload_config.offload_cond_stage});
options.bool_options.push_back({"", "--offload-diffusion", "offload diffusion model to CPU after use", true, &offload_config.offload_diffusion});
options.bool_options.push_back({"", "--no-offload-diffusion", "do not offload diffusion model", false, &offload_config.offload_diffusion});
options.bool_options.push_back({"", "--reload-cond-stage", "reload cond stage to GPU before use", true, &offload_config.reload_cond_stage});
options.bool_options.push_back({"", "--no-reload-cond-stage", "do not reload cond stage", false, &offload_config.reload_cond_stage});
options.bool_options.push_back({"", "--reload-diffusion", "reload diffusion to GPU before use", true, &offload_config.reload_diffusion});
options.bool_options.push_back({"", "--no-reload-diffusion", "do not reload diffusion", false, &offload_config.reload_diffusion});

options.manual_options = {
{"",
"--type",
@@ -564,6 +636,24 @@
"but it usually offers faster inference speed and, in some cases, lower memory usage. "
"The at_runtime mode, on the other hand, is exactly the opposite.",
on_lora_apply_mode_arg},
{"",
"--offload-mode",
"dynamic VRAM offloading mode, one of [none, cond_only, cond_diffusion, aggressive, layer_streaming] (default: none). "
"Use 'cond_only' to offload the LLM/CLIP model to CPU after conditioning. "
"Use 'layer_streaming' to stream model layers one-by-one (enables models larger than VRAM).",
on_offload_mode_arg},
{"",
"--vram-estimation",
"VRAM estimation method for smart offloading, one of [dryrun, formula] (default: dryrun)",
on_vram_estimation_arg},
{"",
"--streaming-prefetch",
"Number of layers to prefetch ahead during layer streaming (default: 1)",
on_streaming_prefetch_arg},
{"",
"--streaming-min-vram",
"Minimum VRAM to keep free during layer streaming, in MB (default: 512)",
on_streaming_min_vram_arg},
};

return options;
@@ -693,7 +783,14 @@ std::string SDContextParams::to_string() const {
<< " chroma_t5_mask_pad: " << chroma_t5_mask_pad << ",\n"
<< " prediction: " << sd_prediction_name(prediction) << ",\n"
<< " lora_apply_mode: " << sd_lora_apply_mode_name(lora_apply_mode) << ",\n"
<< " force_sdxl_vae_conv_scale: " << (force_sdxl_vae_conv_scale ? "true" : "false") << "\n"
<< " force_sdxl_vae_conv_scale: " << (force_sdxl_vae_conv_scale ? "true" : "false") << ",\n"
<< " offload_config: { mode=" << sd_offload_mode_name(offload_config.mode)
<< ", vram_est=" << sd_vram_estimation_name(offload_config.vram_estimation)
<< ", offload_cond=" << (offload_config.offload_cond_stage ? "true" : "false")
<< ", offload_diff=" << (offload_config.offload_diffusion ? "true" : "false")
<< ", reload_cond=" << (offload_config.reload_cond_stage ? "true" : "false")
<< ", reload_diff=" << (offload_config.reload_diffusion ? "true" : "false")
<< ", log=" << (offload_config.log_offload_events ? "true" : "false") << " }\n"
<< "}";
return oss.str();
}
@@ -751,6 +848,7 @@ sd_ctx_params_t SDContextParams::to_sd_ctx_params_t(bool vae_decode_only, bool f
chroma_t5_mask_pad,
qwen_image_zero_cond_t,
max_vram,
offload_config,
};
return sd_ctx_params;
}
6 changes: 6 additions & 0 deletions examples/common/common.h
@@ -135,6 +135,12 @@ struct SDContextParams {
bool force_sdxl_vae_conv_scale = false;

float flow_shift = INFINITY;

// Dynamic tensor offloading configuration
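    // Field order matches sd_offload_config_t: {mode, vram_estimation,
    //  offload_cond_stage, offload_diffusion, reload_cond_stage, reload_diffusion,
    //  log_offload_events, min_offload_size, target_free_vram,
    //  layer_streaming_enabled, streaming_prefetch_layers,
    //  streaming_keep_layers_behind, streaming_min_free_vram}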
sd_offload_config_t offload_config = {SD_OFFLOAD_NONE, SD_VRAM_EST_DRYRUN, true, false, false, true, true,
0, 2ULL * 1024 * 1024 * 1024,
false, 1, 0, 512ULL * 1024 * 1024};

ArgOptions get_options();
void build_embedding_map();
bool resolve(SDMode mode);
86 changes: 85 additions & 1 deletion include/stable-diffusion.h
@@ -147,6 +147,53 @@ enum lora_apply_mode_t {
LORA_APPLY_MODE_COUNT,
};

// Component identifiers for dynamic tensor offloading
enum sd_component_t {
SD_COMPONENT_COND_STAGE, // LLM/CLIP text embedder
SD_COMPONENT_CLIP_VISION, // CLIP vision encoder (for SVD/Wan i2v)
SD_COMPONENT_DIFFUSION, // UNet/DiT/Flux diffusion model
SD_COMPONENT_VAE, // VAE encoder/decoder
SD_COMPONENT_CONTROL_NET, // ControlNet (if loaded)
SD_COMPONENT_PMID, // PhotoMaker ID encoder (if loaded)
SD_COMPONENT_COUNT
};

// Offload mode for automatic GPU memory management
enum sd_offload_mode_t {
SD_OFFLOAD_NONE, // Keep all components on GPU (default, fastest)
SD_OFFLOAD_COND_ONLY, // Offload only conditioning (LLM/CLIP) after use
SD_OFFLOAD_COND_DIFFUSION, // Offload conditioning + diffusion, keep VAE
SD_OFFLOAD_AGGRESSIVE, // Offload each component after use (saves most VRAM)
SD_OFFLOAD_LAYER_STREAMING, // Stream layers one-by-one (enables models larger than VRAM)
SD_OFFLOAD_MODE_COUNT
};

// VRAM estimation method for smart offloading decisions
enum sd_vram_estimation_t {
SD_VRAM_EST_DRYRUN, // Dry-run graph allocation for exact size (default, accurate)
SD_VRAM_EST_FORMULA, // Formula-based estimation (faster, approximate)
SD_VRAM_EST_COUNT
};

// Offload configuration for fine-grained control
typedef struct {
enum sd_offload_mode_t mode; // Offload mode
enum sd_vram_estimation_t vram_estimation; // VRAM estimation method
bool offload_cond_stage; // Offload LLM/CLIP after conditioning
bool offload_diffusion; // Offload diffusion model after sampling
bool reload_cond_stage; // Reload LLM/CLIP for next generation
bool reload_diffusion; // Reload diffusion model for next generation
bool log_offload_events; // Log offload/reload events
size_t min_offload_size; // Minimum component size to offload (bytes), 0 = no minimum
size_t target_free_vram; // Target free VRAM before VAE decode (bytes), 0 = always offload when mode is set

// Layer streaming configuration (for SD_OFFLOAD_LAYER_STREAMING mode)
bool layer_streaming_enabled; // Enable layer-by-layer streaming execution
int streaming_prefetch_layers; // Number of layers to prefetch ahead (default: 1)
int streaming_keep_layers_behind; // Layers to keep after execution (for skip connections)
size_t streaming_min_free_vram; // Minimum VRAM to keep free during streaming (bytes)
} sd_offload_config_t;

typedef struct {
bool enabled;
int tile_size_x;
@@ -203,7 +250,8 @@ typedef struct {
bool chroma_use_t5_mask;
int chroma_t5_mask_pad;
bool qwen_image_zero_cond_t;
float max_vram;
float max_vram; // GiB budget for graph-cut segmented param offload (0 = disabled)
sd_offload_config_t offload_config; // Cross-stage and layer-streaming offload configuration
} sd_ctx_params_t;

typedef struct {
@@ -393,6 +441,11 @@ SD_API const char* sd_preview_name(enum preview_t preview);
SD_API enum preview_t str_to_preview(const char* str);
SD_API const char* sd_lora_apply_mode_name(enum lora_apply_mode_t mode);
SD_API enum lora_apply_mode_t str_to_lora_apply_mode(const char* str);
SD_API const char* sd_offload_mode_name(enum sd_offload_mode_t mode);
SD_API enum sd_offload_mode_t str_to_offload_mode(const char* str);
SD_API const char* sd_vram_estimation_name(enum sd_vram_estimation_t method);
SD_API enum sd_vram_estimation_t str_to_vram_estimation(const char* str);
SD_API void sd_offload_config_init(sd_offload_config_t* config);
SD_API const char* sd_hires_upscaler_name(enum sd_hires_upscaler_t upscaler);
SD_API enum sd_hires_upscaler_t str_to_sd_hires_upscaler(const char* str);

@@ -411,6 +464,9 @@ SD_API char* sd_sample_params_to_str(const sd_sample_params_t* sample_params);
SD_API enum sample_method_t sd_get_default_sample_method(const sd_ctx_t* sd_ctx);
SD_API enum scheduler_t sd_get_default_scheduler(const sd_ctx_t* sd_ctx, enum sample_method_t sample_method);

// Get the model architecture/version name (e.g., "SD 1.x", "SDXL", "Flux", "Z-Image", etc.)
SD_API const char* sd_get_model_version_name(const sd_ctx_t* sd_ctx);

SD_API void sd_img_gen_params_init(sd_img_gen_params_t* sd_img_gen_params);
SD_API char* sd_img_gen_params_to_str(const sd_img_gen_params_t* sd_img_gen_params);
SD_API sd_image_t* generate_image(sd_ctx_t* sd_ctx, const sd_img_gen_params_t* sd_img_gen_params);
@@ -450,6 +506,34 @@ SD_API bool preprocess_canny(sd_image_t image,
SD_API const char* sd_commit(void);
SD_API const char* sd_version(void);

// Dynamic tensor offloading API
// These functions allow runtime GPU memory management by moving model components
// between CPU and GPU. This enables running larger models on limited VRAM by
// keeping only the currently-active component on GPU.

// Offload component from GPU to CPU (frees GPU memory)
// Returns true on success, false if component doesn't exist or is already on CPU
SD_API bool sd_offload_to_cpu(sd_ctx_t* sd_ctx, enum sd_component_t component);

// Reload component from CPU to GPU (allocates GPU memory)
// Returns true on success, false if component doesn't exist or allocation failed
SD_API bool sd_reload_to_gpu(sd_ctx_t* sd_ctx, enum sd_component_t component);

// Query whether component is currently on GPU
// Returns true if on GPU, false if on CPU or component doesn't exist
SD_API bool sd_is_on_gpu(sd_ctx_t* sd_ctx, enum sd_component_t component);

// Get component's current memory usage in bytes
// Returns the buffer size if component exists, 0 otherwise
SD_API size_t sd_get_component_vram(sd_ctx_t* sd_ctx, enum sd_component_t component);

// Get human-readable name for a component
SD_API const char* sd_component_name(enum sd_component_t component);

// Free all GPU resources (offload all components to CPU and clear LoRAs)
// Call this before unloading a model to ensure GPU memory is released
SD_API void sd_free_gpu_resources(sd_ctx_t* sd_ctx);

#ifdef __cplusplus
}
#endif