feat(diffusers): support large models and add Shutdown for dynamic reloading #8404
Open
JairoGuo wants to merge 2 commits into mudler:master
Conversation
…stribution

When loading very large models (e.g., Qwen-Image, ~95 GB) on GPUs with limited headroom, the model loads successfully but leaves no memory for inference. This PR adds support for multi-GPU distribution via device_map when LowVRAM is enabled:

1. Add low_cpu_mem_usage=True and device_map='balanced' during model loading to distribute large models across multiple GPUs
2. Skip enable_model_cpu_offload() when device_map is used, as the two conflict (ValueError: device mapping strategy doesn't allow enable_model_cpu_offload)
3. Skip .to(device) when device_map is used, as they also conflict (ValueError: device mapping strategy doesn't allow explicit device placement using to())

This enables running models like Qwen-Image on multi-GPU setups where a single GPU doesn't have enough memory for both the model weights and inference.

Tested with:
- Qwen-Image (~95 GB) on 3x NVIDIA H20 (96 GB each)
- Configuration: low_vram: true, pipeline_type: QwenImagePipeline
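As a rough illustration of the loading path described above, here is a minimal sketch assuming a diffusers DiffusionPipeline and a low_vram flag analogous to the PR's LowVRAM option; the function and variable names are illustrative, not the PR's actual code:

```python
import torch
from diffusers import DiffusionPipeline

def load_pipeline(model_name: str, low_vram: bool):
    kwargs = {"torch_dtype": torch.bfloat16}
    if low_vram and torch.cuda.device_count() > 1:
        # Shard the weights across all visible GPUs at load time instead
        # of materializing the full model on a single device.
        kwargs["low_cpu_mem_usage"] = True
        kwargs["device_map"] = "balanced"
    pipe = DiffusionPipeline.from_pretrained(model_name, **kwargs)
    if "device_map" not in kwargs:
        # Only valid without device_map: both .to(device) and
        # enable_model_cpu_offload() raise ValueError when a device
        # mapping strategy is active.
        pipe = pipe.to("cuda")
    return pipe
```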
Add Shutdown method to the diffusers backend that properly releases GPU memory when a model is unloaded. This enables dynamic model reloading with different configurations (e.g., switching LoRA adapters) without restarting the service.

The Shutdown method:
- Releases the pipeline, controlnet, and compel objects
- Clears the CUDA cache with torch.cuda.empty_cache()
- Resets state flags (img2vid, txt2vid, ltx2_pipeline)

This works with LocalAI's existing /backend/shutdown API endpoint, which terminates the gRPC process. The explicit cleanup ensures GPU memory is properly released before process termination.

Tested with Qwen-Image (~95 GB) on NVIDIA H20 GPUs.
mudler reviewed Feb 5, 2026
```python
def Health(self, request, context):
    return backend_pb2.Reply(message=bytes("OK", 'utf-8'))

def Shutdown(self, request, context):
    ...
```
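The Shutdown body is truncated in this diff excerpt; below is a minimal sketch of what it does, reconstructed from the commit message above. The attribute names (self.pipe, self.controlnet, self.compel) and the gc.collect() call are assumptions about the backend's internals:

```python
import gc
import torch
import backend_pb2  # generated gRPC module used by the backend

def Shutdown(self, request, context):
    # Drop references to the heavyweight objects so their tensors
    # become collectable.
    self.pipe = None
    self.controlnet = None
    self.compel = None
    # Reset state flags so a subsequent load starts clean.
    self.img2vid = False
    self.txt2vid = False
    self.ltx2_pipeline = False
    gc.collect()
    # Return cached CUDA blocks to the driver before the process exits.
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
    return backend_pb2.Reply(message=bytes("OK", 'utf-8'))
```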
Summary
This PR adds two features to the diffusers backend: multi-GPU distribution for large models, and a Shutdown method for dynamic reloading.
Problem

Very large models (e.g., Qwen-Image, ~95 GB) can load successfully but leave no GPU memory for inference, and unloading a model did not release GPU memory, preventing dynamic reloading (e.g., switching LoRA adapters) without a restart.

Solution
1. Multi-GPU Distribution (device_map)

When LowVRAM is enabled:
- Use low_cpu_mem_usage=True and device_map="balanced" during loading
- Skip enable_model_cpu_offload() (conflicts with device_map)
- Skip .to(device) (conflicts with device_map)

2. Shutdown Method
Add Shutdown() that:
- Releases the pipeline, controlnet, and compel objects
- Clears the CUDA cache with torch.cuda.empty_cache()
- Resets state flags

This enables dynamic LoRA switching:
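For instance, a hypothetical client-side workflow: the /backend/shutdown endpoint is named in the PR, but the request field name, port, and model names here are illustrative assumptions.

```python
import requests

# 1. Unload the running diffusers backend; LocalAI terminates the gRPC
#    process, and the new Shutdown() frees GPU memory first.
requests.post(
    "http://localhost:8080/backend/shutdown",
    json={"model": "qwen-image"},  # assumed request shape
)

# 2. The next image request against a model config that points at a
#    different LoRA adapter spawns a fresh backend with the new weights.
```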