68 commits
6bc4025
feat: add dynamic VRAM offloading for large models
fszontagh Feb 24, 2026
d2f2836
feat(cli): add --offload-mode option for dynamic VRAM offloading
fszontagh Feb 24, 2026
ef40d61
fix: prevent SEGV when GPU reload fails during offload
fszontagh Feb 25, 2026
57a12c0
fix: offload cond_stage before LoRA application when using offload mode
fszontagh Feb 25, 2026
e6ea65e
Fix LoRA + offload VRAM conflict with retry mechanism
fszontagh Feb 25, 2026
c42b659
Use memory-based offload for LoRA instead of disk reload
fszontagh Feb 25, 2026
aa51517
Fix GPU memory leak in GGMLRunner destructor
fszontagh Feb 25, 2026
7839a54
Add smart VAE offload with dry-run VRAM estimation
fszontagh Feb 25, 2026
b454eb1
Implement smart VRAM-based offload decisions
fszontagh Feb 25, 2026
1febdbe
Add configurable reload_diffusion option for post-generation behavior
fszontagh Feb 25, 2026
f324b03
Add CLI options for all offload configuration settings
fszontagh Feb 25, 2026
af8f5fa
Add granular tensor offloading infrastructure
fszontagh Feb 28, 2026
cb82950
Fix layer registration: use tensor map instead of raw GGML context
fszontagh Mar 1, 2026
d3b989d
Skip bulk GPU allocation in layer_streaming mode + improve tensor tra…
fszontagh Mar 1, 2026
55a837a
Fix streaming mode: add to_backend conversion and skip_param_offload
fszontagh Mar 1, 2026
d109d00
Add layer streaming support for MMDiT/SD3
fszontagh Mar 1, 2026
5235117
Add layer streaming support for WAN video models
fszontagh Mar 1, 2026
b013674
Add layer streaming support for QwenImage and ZImage
fszontagh Mar 1, 2026
26c5e12
Add ref_latents support to Flux streaming (WIP)
fszontagh Mar 1, 2026
7a8dd53
Fix Flux streaming: use coarse-stage approach like UNet
fszontagh Mar 1, 2026
2037666
Fix layer_streaming mode: force T5 offload before diffusion
fszontagh Mar 1, 2026
07c08e1
Add layer streaming support for Anima model
fszontagh Mar 1, 2026
8272591
Fix AnimaConditioner offloading and layer_streaming state consistency
fszontagh Mar 1, 2026
a1b486a
Add offload_streaming_layers to free GPU memory before VAE decode
fszontagh Mar 1, 2026
258bb14
Implement true per-layer streaming for QwenImage
fszontagh Mar 1, 2026
c3e52a6
Fix TRUE per-layer streaming: defer compute buffer free until after o…
fszontagh Mar 2, 2026
c66f0d0
Add pre-VAE-encode offloading to prevent OOM during image encoding
fszontagh Mar 2, 2026
ebb8ddb
Implement async layer prefetching for layer streaming mode
fszontagh Mar 2, 2026
142013c
Add async prefetching to all TRUE per-layer streaming models
fszontagh Mar 2, 2026
582acb3
Fix CFG causing redundant model loading in layer streaming mode
fszontagh Mar 3, 2026
e220c67
Fix ZImage TRUE per-layer streaming: load refiner layers before refin…
fszontagh Mar 3, 2026
be36ea0
Disable broken TRUE per-layer streaming for ZImage, fall back to norm…
fszontagh Mar 3, 2026
10ee7a6
Add comprehensive debug logging for ZImage TRUE per-layer streaming
fszontagh Mar 3, 2026
e546fa6
Add extensive debugging for ZImage TRUE per-layer streaming
fszontagh Mar 3, 2026
7e59edb
Remove misleading input buffer check after GGML compute
fszontagh Mar 3, 2026
2ad9c8c
Clean up ZImage TRUE per-layer streaming debug code
fszontagh Mar 3, 2026
6fd7efa
Reduce verbose DEBUG logging in layer streaming
fszontagh Mar 3, 2026
88fd0b2
Fix t_emb buffer aliasing in ZImage TRUE per-layer streaming
fszontagh Mar 4, 2026
117f647
Fix non-streaming offload modes (cond_only, cond_diffusion, aggressiv…
fszontagh Mar 4, 2026
98d7f6c
Deduplicate streaming code into GGMLRunner and align style with upstream
fszontagh Mar 6, 2026
1ad143c
Add VRAM offloading documentation
fszontagh Mar 6, 2026
f865af4
Merge upstream/master: sd::Tensor migration, webp, spectrum caching
fszontagh Apr 4, 2026
8c3ad49
Merge upstream/master: hires fix, more samplers, tokenizer split
fszontagh Apr 27, 2026
77127d8
Fix layer streaming dispatch lost during upstream merge
fszontagh Apr 29, 2026
39fca39
Avoid 256 MB scratch alloc per streaming dispatch
fszontagh Apr 29, 2026
0da04f1
Honour streaming_prefetch_layers in z_image streaming loop
fszontagh Apr 29, 2026
b759cd2
Honour streaming_prefetch_layers across all DiT/UNet runners
fszontagh Apr 29, 2026
b705b36
Reuse compute buffer across z_image streaming layers
fszontagh Apr 29, 2026
7114e8c
Skip empty MultiLoraAdapter when no LoRAs target a model
fszontagh Apr 29, 2026
0509ad9
Merge remote-tracking branch 'upstream/master' into feature/vram-offl…
fszontagh May 1, 2026
e53f621
Cache resident layers across sampling steps in DiT streaming runners
fszontagh May 4, 2026
71e9c77
Allocate streamed weights in pinned host memory
fszontagh May 4, 2026
9168495
Pin host activation buffers in z_image streaming loop
fszontagh May 4, 2026
41c3ca2
Reuse a single layer graph across all z_image streaming layers
fszontagh May 4, 2026
b029a77
Revert "Reuse a single layer graph across all z_image streaming layers"
fszontagh May 4, 2026
00086a2
Pin host activation buffers across all DiT streaming runners
fszontagh May 4, 2026
44c1f99
Build a chunk graph for resident z_image streaming layers
fszontagh May 4, 2026
4f445e2
Rebuild z_image chunk graph when input shapes change
fszontagh May 5, 2026
857e9e0
Extract chunk-graph machinery into shared LayerStreaming::ChunkGraph
fszontagh May 5, 2026
836b0b1
ChunkGraph: create runner build-in tensors on the chunk context
fszontagh May 5, 2026
551ab2d
ChunkGraph: invalidate cache when weight_adapter or runner flags change
fszontagh May 5, 2026
43974de
Drop chunk graph + reset resident layers on layer offload
fszontagh May 5, 2026
2ad56ac
Merge upstream master into feature/vram-offloading-v2
fszontagh May 6, 2026
dc8e9e2
Free partial_runtime_params_buffer in GGMLRunner destructor
fszontagh May 6, 2026
1e9c287
docs(vram_offloading): note CPU spin-wait when --offload-mode is active
fszontagh May 6, 2026
6bcada3
Reset examples/server/frontend submodule to upstream's SHA
fszontagh May 6, 2026
5b19131
Hook --max-vram into layer-streaming budget + reserve CB headroom
fszontagh May 6, 2026
0fd40e5
Fix UNet layer_streaming under tight VRAM cap (SDXL/SD1.x)
fszontagh May 6, 2026
112 changes: 112 additions & 0 deletions docs/vram_offloading.md
@@ -0,0 +1,112 @@
# VRAM Offloading

Run models larger than your GPU memory by offloading weights to CPU RAM during generation.

## Offload Modes

Use `--offload-mode <mode>` to select the offloading strategy:

| Mode | Description | VRAM Usage | Speed | Quality |
|------|-------------|------------|-------|---------|
| `none` | Everything stays on GPU (default) | Highest | Fastest | No penalty |
| `cond_only` | Offload text encoder after conditioning | High | Near-full speed; only a brief transfer between conditioning and diffusion | No penalty |
| `cond_diffusion` | Offload both text encoder and diffusion model between stages | Medium | Slower; models are reloaded to the GPU before each stage that needs them | No penalty |
| `aggressive` | Aggressively offload all components when not in use | Low | Slowest of the non-streaming modes; frequent CPU↔GPU transfers | No penalty |
| `layer_streaming` | Stream transformer layers one-by-one through GPU | Lowest | Depends on model size (see below) | No penalty when using coarse-stage; per-layer streaming is lossless for most architectures |

The `--offload-to-cpu` flag is a shortcut that picks a reasonable offload mode automatically.
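
These modes are also available through the library API. A minimal sketch using the `sd_offload_config_t` added to `stable-diffusion.h` in this PR (the `sd_ctx_params_init`/`new_sd_ctx` entry points are upstream's; model-path setup is elided):

```c
#include "stable-diffusion.h"

sd_ctx_params_t params;
sd_ctx_params_init(&params);                    /* upstream defaults */
sd_offload_config_init(&params.offload_config); /* offload defaults from this PR */

/* Equivalent of --offload-mode layer_streaming on the CLI: */
params.offload_config.mode                      = SD_OFFLOAD_LAYER_STREAMING;
params.offload_config.streaming_prefetch_layers = 1; /* see --streaming-prefetch */

/* ... set model paths as usual, then: */
sd_ctx_t* ctx = new_sd_ctx(&params);
```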

## Layer Streaming

Layer streaming is the most memory-efficient mode. Instead of loading the entire diffusion model into VRAM, it loads one transformer block at a time.

### How it works

1. **Coarse-stage**: If the model fits in VRAM (e.g., quantized models), all layers are loaded at once and the full graph is executed normally. This is as fast as `--offload-mode none` with no quality penalty — the only overhead is the initial CPU→GPU weight transfer.
2. **Per-layer streaming**: If the model doesn't fit (e.g., bf16 models on small GPUs), each transformer block is loaded, executed as a mini-graph, then offloaded back to CPU before the next block (see the sketch below). This uses minimal VRAM but is significantly slower due to per-step CPU↔GPU transfers. Output quality is identical to full-model execution: the computation is mathematically equivalent, just split across separate graph evaluations.

The mode is chosen automatically based on available VRAM.
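
To make the per-layer path concrete, here is an illustrative sketch of the streaming loop. The helper names are hypothetical; the real logic lives in the `GGMLRunner`/`LayerStreaming` code added by this PR:

```c
/* Illustrative pseudocode: hypothetical helpers, not the actual runner API. */
for (int i = 0; i < n_layers; i++) {
    wait_until_on_gpu(layers[i]);       /* upload was started earlier by prefetch */
    if (i + 1 < n_layers) {
        prefetch_to_gpu(layers[i + 1]); /* overlap the next upload with compute */
    }
    hidden = run_block_graph(layers[i], hidden); /* one block as a mini-graph */
    offload_to_cpu(layers[i]);          /* release VRAM before the next block */
}
```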

### Supported architectures

- Flux (double_blocks + single_blocks)
- ZImage / Z-Image-Turbo (context_refiner + noise_refiner + layers)
- MMDiT / SD3 (joint_blocks)
- UNet / SD1.x / SDXL (input_blocks + middle_block + output_blocks)
- Anima (blocks)
- WAN (blocks + vace_blocks)
- Qwen Image (transformer_blocks)

### Examples

#### ZImage-Turbo Q8 with layer streaming

```
sd-cli --diffusion-model z_image_turbo-Q8_0.gguf \
--llm Qwen3-4b-Z-Engineer-V2.gguf \
--vae ae.safetensors \
-p "a cat" --cfg-scale 1.0 --diffusion-fa \
-H 1024 -W 688 -s 42 \
--offload-mode layer_streaming -v
```

The Q8 model (6.7 GB) fits in a 12 GB GPU, so coarse-stage streaming is used automatically:
```
[INFO ] z_image model fits in VRAM, using coarse-stage streaming
[INFO ] z_image coarse-stage streaming completed in 1.66s
```

#### Flux-dev Q4 with layer streaming

```
sd-cli --diffusion-model flux1-dev-q4_0.gguf \
--vae ae.safetensors \
--clip_l clip_l.safetensors \
--t5xxl t5xxl_fp16.safetensors \
-p "a lovely cat" --cfg-scale 1.0 --sampling-method euler \
--offload-mode layer_streaming -v
```

#### SD1.5 with aggressive offloading

```
sd-cli -m sd-v1-4.ckpt \
-p "a photograph of an astronaut riding a horse" \
--offload-mode aggressive -v
```

## Combining with other options

- `--diffusion-fa`: Flash attention reduces VRAM further. Recommended with all offload modes. No quality penalty.
- `--clip-on-cpu`: Run CLIP text encoder on CPU. Saves VRAM but slows conditioning. No quality penalty.
- Quantized models (`q4_0`, `q8_0`, etc.) reduce model size, making coarse-stage streaming more likely (faster). **Quantization does reduce output quality** — lower bit depths produce softer details and may introduce artifacts. See [quantization](./quantization_and_gguf.md) for quality comparisons. `q8_0` is nearly indistinguishable from full precision; `q4_0` and below show visible degradation on fine details.

## Quality impact summary

| Technique | Quality Impact |
|-----------|---------------|
| `--offload-mode` (any mode) | **None** — offloading only changes where weights are stored, not the computation |
| `--diffusion-fa` (flash attention) | **None** — mathematically equivalent, just more memory-efficient |
| `--clip-on-cpu` | **None** — same computation on CPU instead of GPU |
| Quantization (`q8_0`) | **Negligible** — nearly identical to full precision |
| Quantization (`q4_0`, `q4_k`) | **Minor** — slight softening, fine details may differ |
| Quantization (`q3_k`, `q2_k`) | **Noticeable** — visible quality loss, best for previews or VRAM-constrained setups |

## Troubleshooting

- **OOM during generation**: Try a more aggressive mode. `layer_streaming` uses the least VRAM.
- **Slow generation**: Coarse-stage streaming (model fits in VRAM) is nearly as fast as no offloading. Per-layer streaming is slower due to CPU-GPU transfers each step. Using quantized models often lets you stay in coarse-stage mode.
- **Black or corrupted output**: This is a bug. Please report it with the model, offload mode, and resolution used.
- **One CPU core pegged at 100% while the GPU is working**: this is the CUDA driver spin-waiting on kernel completion. The default schedule policy (`cudaDeviceScheduleAuto`) often picks `Spin` for short-kernel workloads like per-layer streaming, which busy-waits one host thread for each kernel return. It does *not* slow generation down (the wait is wasted heat, not blocking work), but it looks bad on `top`/`nvtop` and is unfriendly to shared-host setups. Two ways to silence it:

1. Per-run, no rebuild needed:
```
CUDA_DEVICE_SCHEDULE=BlockingSync sd-cli ...
```
2. Per-process, set once at startup:
```c
cudaSetDeviceFlags(cudaDeviceScheduleBlockingSync);
```
Long-lived processes (REST servers, queue workers) should do this.

With either fix, CPU usage drops to near zero and GPU performance is unchanged.
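
For long-lived processes, a fuller sketch of option 2 (`cudaSetDeviceFlags` and its error codes are real CUDA runtime API; the wrapper function is illustrative, and the call must happen before the first CUDA context is created):

```c
#include <cuda_runtime.h>
#include <stdio.h>

/* Call once at startup, before new_sd_ctx() or any other GPU work. */
static void set_blocking_sync(void) {
    cudaError_t err = cudaSetDeviceFlags(cudaDeviceScheduleBlockingSync);
    if (err != cudaSuccess) {
        /* cudaErrorSetOnActiveProcess means a context already exists,
         * i.e. this ran too late in startup. */
        fprintf(stderr, "cudaSetDeviceFlags: %s\n", cudaGetErrorString(err));
    }
}
```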
5 changes: 4 additions & 1 deletion examples/cli/main.cpp
@@ -698,7 +698,10 @@ int main(int argc, const char* argv[]) {
vae_decode_only = false;
}

sd_ctx_params_t sd_ctx_params = ctx_params.to_sd_ctx_params_t(vae_decode_only, true, cli_params.taesd_preview);
// For layer_streaming mode, we need smart offload logic instead of immediate freeing
// This allows should_offload_cond_stage_for_diffusion() to be called and offload T5 before streaming
bool free_params_immediately = (ctx_params.offload_config.mode != SD_OFFLOAD_LAYER_STREAMING);
sd_ctx_params_t sd_ctx_params = ctx_params.to_sd_ctx_params_t(vae_decode_only, free_params_immediately, cli_params.taesd_preview);

SDImageVec results;
int num_results = 0;
100 changes: 99 additions & 1 deletion examples/common/common.cpp
@@ -538,6 +538,78 @@ ArgOptions SDContextParams::get_options() {
return 1;
};

auto on_offload_mode_arg = [&](int argc, const char** argv, int index) {
if (++index >= argc) {
return -1;
}
const char* arg = argv[index];
offload_config.mode = str_to_offload_mode(arg);
if (offload_config.mode == SD_OFFLOAD_MODE_COUNT) {
LOG_ERROR("error: invalid offload mode %s", arg);
return -1;
}
return 1;
};

auto on_vram_estimation_arg = [&](int argc, const char** argv, int index) {
if (++index >= argc) {
return -1;
}
const char* arg = argv[index];
offload_config.vram_estimation = str_to_vram_estimation(arg);
if (offload_config.vram_estimation == SD_VRAM_EST_COUNT) {
LOG_ERROR("error: invalid VRAM estimation method %s", arg);
return -1;
}
return 1;
};

auto on_streaming_prefetch_arg = [&](int argc, const char** argv, int index) {
if (++index >= argc) {
return -1;
}
try {
offload_config.streaming_prefetch_layers = std::stoi(argv[index]);
if (offload_config.streaming_prefetch_layers < 0) {
LOG_ERROR("error: streaming prefetch must be >= 0");
return -1;
}
} catch (...) {
LOG_ERROR("error: invalid streaming prefetch value %s", argv[index]);
return -1;
}
return 1;
};

auto on_streaming_min_vram_arg = [&](int argc, const char** argv, int index) {
if (++index >= argc) {
return -1;
}
try {
int mb = std::stoi(argv[index]);
if (mb < 0) {
LOG_ERROR("error: streaming min VRAM must be >= 0");
return -1;
}
offload_config.streaming_min_free_vram = static_cast<size_t>(mb) * 1024 * 1024;
} catch (...) {
LOG_ERROR("error: invalid streaming min VRAM value %s", argv[index]);
return -1;
}
return 1;
};

options.bool_options.push_back({"", "--offload-log", "log offload events", true, &offload_config.log_offload_events});
options.bool_options.push_back({"", "--no-offload-log", "do not log offload events", false, &offload_config.log_offload_events});
options.bool_options.push_back({"", "--offload-cond-stage", "offload cond stage to CPU after use", true, &offload_config.offload_cond_stage});
options.bool_options.push_back({"", "--no-offload-cond-stage", "do not offload cond stage", false, &offload_config.offload_cond_stage});
options.bool_options.push_back({"", "--offload-diffusion", "offload diffusion model to CPU after use", true, &offload_config.offload_diffusion});
options.bool_options.push_back({"", "--no-offload-diffusion", "do not offload diffusion model", false, &offload_config.offload_diffusion});
options.bool_options.push_back({"", "--reload-cond-stage", "reload cond stage to GPU before use", true, &offload_config.reload_cond_stage});
options.bool_options.push_back({"", "--no-reload-cond-stage", "do not reload cond stage", false, &offload_config.reload_cond_stage});
options.bool_options.push_back({"", "--reload-diffusion", "reload diffusion to GPU before use", true, &offload_config.reload_diffusion});
options.bool_options.push_back({"", "--no-reload-diffusion", "do not reload diffusion", false, &offload_config.reload_diffusion});

options.manual_options = {
{"",
"--type",
@@ -564,6 +636,24 @@
"but it usually offers faster inference speed and, in some cases, lower memory usage. "
"The at_runtime mode, on the other hand, is exactly the opposite.",
on_lora_apply_mode_arg},
{"",
"--offload-mode",
"dynamic VRAM offloading mode, one of [none, cond_only, cond_diffusion, aggressive, layer_streaming] (default: none). "
"Use 'cond_only' to offload the LLM/CLIP model to CPU after conditioning. "
"Use 'layer_streaming' to stream model layers one-by-one (enables models larger than VRAM).",
on_offload_mode_arg},
{"",
"--vram-estimation",
"VRAM estimation method for smart offloading, one of [dryrun, formula] (default: dryrun)",
on_vram_estimation_arg},
{"",
"--streaming-prefetch",
"Number of layers to prefetch ahead during layer streaming (default: 1)",
on_streaming_prefetch_arg},
{"",
"--streaming-min-vram",
"Minimum VRAM to keep free during layer streaming, in MB (default: 512)",
on_streaming_min_vram_arg},
};

return options;
@@ -693,7 +783,14 @@ std::string SDContextParams::to_string() const {
<< " chroma_t5_mask_pad: " << chroma_t5_mask_pad << ",\n"
<< " prediction: " << sd_prediction_name(prediction) << ",\n"
<< " lora_apply_mode: " << sd_lora_apply_mode_name(lora_apply_mode) << ",\n"
<< " force_sdxl_vae_conv_scale: " << (force_sdxl_vae_conv_scale ? "true" : "false") << "\n"
<< " force_sdxl_vae_conv_scale: " << (force_sdxl_vae_conv_scale ? "true" : "false") << ",\n"
<< " offload_config: { mode=" << sd_offload_mode_name(offload_config.mode)
<< ", vram_est=" << sd_vram_estimation_name(offload_config.vram_estimation)
<< ", offload_cond=" << (offload_config.offload_cond_stage ? "true" : "false")
<< ", offload_diff=" << (offload_config.offload_diffusion ? "true" : "false")
<< ", reload_cond=" << (offload_config.reload_cond_stage ? "true" : "false")
<< ", reload_diff=" << (offload_config.reload_diffusion ? "true" : "false")
<< ", log=" << (offload_config.log_offload_events ? "true" : "false") << " }\n"
<< "}";
return oss.str();
}
@@ -751,6 +848,7 @@ sd_ctx_params_t SDContextParams::to_sd_ctx_params_t(bool vae_decode_only, bool f
chroma_t5_mask_pad,
qwen_image_zero_cond_t,
max_vram,
offload_config,
};
return sd_ctx_params;
}
6 changes: 6 additions & 0 deletions examples/common/common.h
@@ -135,6 +135,12 @@ struct SDContextParams {
bool force_sdxl_vae_conv_scale = false;

float flow_shift = INFINITY;

// Dynamic tensor offloading configuration
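    // Field order matches sd_offload_config_t: {mode, vram_estimation,
    //  offload_cond_stage, offload_diffusion, reload_cond_stage, reload_diffusion,
    //  log_offload_events, min_offload_size, target_free_vram,
    //  layer_streaming_enabled, streaming_prefetch_layers,
    //  streaming_keep_layers_behind, streaming_min_free_vram}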
sd_offload_config_t offload_config = {SD_OFFLOAD_NONE, SD_VRAM_EST_DRYRUN, true, false, false, true, true,
0, 2ULL * 1024 * 1024 * 1024,
false, 1, 0, 512ULL * 1024 * 1024};

ArgOptions get_options();
void build_embedding_map();
bool resolve(SDMode mode);
86 changes: 85 additions & 1 deletion include/stable-diffusion.h
@@ -147,6 +147,53 @@ enum lora_apply_mode_t {
LORA_APPLY_MODE_COUNT,
};

// Component identifiers for dynamic tensor offloading
enum sd_component_t {
SD_COMPONENT_COND_STAGE, // LLM/CLIP text embedder
SD_COMPONENT_CLIP_VISION, // CLIP vision encoder (for SVD/Wan i2v)
SD_COMPONENT_DIFFUSION, // UNet/DiT/Flux diffusion model
SD_COMPONENT_VAE, // VAE encoder/decoder
SD_COMPONENT_CONTROL_NET, // ControlNet (if loaded)
SD_COMPONENT_PMID, // PhotoMaker ID encoder (if loaded)
SD_COMPONENT_COUNT
};

// Offload mode for automatic GPU memory management
enum sd_offload_mode_t {
SD_OFFLOAD_NONE, // Keep all components on GPU (default, fastest)
SD_OFFLOAD_COND_ONLY, // Offload only conditioning (LLM/CLIP) after use
SD_OFFLOAD_COND_DIFFUSION, // Offload conditioning + diffusion, keep VAE
SD_OFFLOAD_AGGRESSIVE, // Offload each component after use (saves most VRAM)
SD_OFFLOAD_LAYER_STREAMING, // Stream layers one-by-one (enables models larger than VRAM)
SD_OFFLOAD_MODE_COUNT
};

// VRAM estimation method for smart offloading decisions
enum sd_vram_estimation_t {
SD_VRAM_EST_DRYRUN, // Dry-run graph allocation for exact size (default, accurate)
SD_VRAM_EST_FORMULA, // Formula-based estimation (faster, approximate)
SD_VRAM_EST_COUNT
};

// Offload configuration for fine-grained control
typedef struct {
enum sd_offload_mode_t mode; // Offload mode
enum sd_vram_estimation_t vram_estimation; // VRAM estimation method
bool offload_cond_stage; // Offload LLM/CLIP after conditioning
bool offload_diffusion; // Offload diffusion model after sampling
bool reload_cond_stage; // Reload LLM/CLIP for next generation
bool reload_diffusion; // Reload diffusion model for next generation
bool log_offload_events; // Log offload/reload events
size_t min_offload_size; // Minimum component size to offload (bytes), 0 = no minimum
size_t target_free_vram; // Target free VRAM before VAE decode (bytes), 0 = always offload when mode is set

// Layer streaming configuration (for SD_OFFLOAD_LAYER_STREAMING mode)
bool layer_streaming_enabled; // Enable layer-by-layer streaming execution
int streaming_prefetch_layers; // Number of layers to prefetch ahead (default: 1)
int streaming_keep_layers_behind; // Layers to keep after execution (for skip connections)
size_t streaming_min_free_vram; // Minimum VRAM to keep free during streaming (bytes)
} sd_offload_config_t;

typedef struct {
bool enabled;
int tile_size_x;
@@ -203,7 +250,8 @@ typedef struct {
bool chroma_use_t5_mask;
int chroma_t5_mask_pad;
bool qwen_image_zero_cond_t;
float max_vram;
float max_vram; // GiB budget for graph-cut segmented param offload (0 = disabled)
sd_offload_config_t offload_config; // Cross-stage and layer-streaming offload configuration
} sd_ctx_params_t;

typedef struct {
@@ -393,6 +441,11 @@ SD_API const char* sd_preview_name(enum preview_t preview);
SD_API enum preview_t str_to_preview(const char* str);
SD_API const char* sd_lora_apply_mode_name(enum lora_apply_mode_t mode);
SD_API enum lora_apply_mode_t str_to_lora_apply_mode(const char* str);
SD_API const char* sd_offload_mode_name(enum sd_offload_mode_t mode);
SD_API enum sd_offload_mode_t str_to_offload_mode(const char* str);
SD_API const char* sd_vram_estimation_name(enum sd_vram_estimation_t method);
SD_API enum sd_vram_estimation_t str_to_vram_estimation(const char* str);
SD_API void sd_offload_config_init(sd_offload_config_t* config);
SD_API const char* sd_hires_upscaler_name(enum sd_hires_upscaler_t upscaler);
SD_API enum sd_hires_upscaler_t str_to_sd_hires_upscaler(const char* str);

@@ -411,6 +464,9 @@ SD_API char* sd_sample_params_to_str(const sd_sample_params_t* sample_params);
SD_API enum sample_method_t sd_get_default_sample_method(const sd_ctx_t* sd_ctx);
SD_API enum scheduler_t sd_get_default_scheduler(const sd_ctx_t* sd_ctx, enum sample_method_t sample_method);

// Get the model architecture/version name (e.g., "SD 1.x", "SDXL", "Flux", "Z-Image", etc.)
SD_API const char* sd_get_model_version_name(const sd_ctx_t* sd_ctx);

SD_API void sd_img_gen_params_init(sd_img_gen_params_t* sd_img_gen_params);
SD_API char* sd_img_gen_params_to_str(const sd_img_gen_params_t* sd_img_gen_params);
SD_API sd_image_t* generate_image(sd_ctx_t* sd_ctx, const sd_img_gen_params_t* sd_img_gen_params);
@@ -450,6 +506,34 @@ SD_API bool preprocess_canny(sd_image_t image,
SD_API const char* sd_commit(void);
SD_API const char* sd_version(void);

// Dynamic tensor offloading API
// These functions allow runtime GPU memory management by moving model components
// between CPU and GPU. This enables running larger models on limited VRAM by
// keeping only the currently-active component on GPU.

// Offload component from GPU to CPU (frees GPU memory)
// Returns true on success, false if component doesn't exist or is already on CPU
SD_API bool sd_offload_to_cpu(sd_ctx_t* sd_ctx, enum sd_component_t component);

// Reload component from CPU to GPU (allocates GPU memory)
// Returns true on success, false if component doesn't exist or allocation failed
SD_API bool sd_reload_to_gpu(sd_ctx_t* sd_ctx, enum sd_component_t component);

// Query whether component is currently on GPU
// Returns true if on GPU, false if on CPU or component doesn't exist
SD_API bool sd_is_on_gpu(sd_ctx_t* sd_ctx, enum sd_component_t component);

// Get component's current memory usage in bytes
// Returns the buffer size if component exists, 0 otherwise
SD_API size_t sd_get_component_vram(sd_ctx_t* sd_ctx, enum sd_component_t component);

// Get human-readable name for a component
SD_API const char* sd_component_name(enum sd_component_t component);

// Free all GPU resources (offload all components to CPU and clear LoRAs)
// Call this before unloading a model to ensure GPU memory is released
SD_API void sd_free_gpu_resources(sd_ctx_t* sd_ctx);

#ifdef __cplusplus
}
#endif