
feat(diffusers): support large models and add Shutdown for dynamic reloading #8404

Open

JairoGuo wants to merge 2 commits into mudler:master from JairoGuo:fix/large-model-device-map-support

Conversation


@JairoGuo JairoGuo commented Feb 5, 2026

Summary

This PR adds two features to the diffusers backend:

  1. Multi-GPU support for large models - Enables loading models >80GB across multiple GPUs
  2. Shutdown method - Properly releases GPU memory for dynamic model reloading

Problem

  1. Very large models (e.g., Qwen-Image ~95GB) cause OOM when loading on a single GPU
  2. No way to release GPU memory without restarting the service, preventing dynamic LoRA switching

Solution

1. Multi-GPU Distribution (device_map)

When LowVRAM is enabled (see the sketch after this list):

  • Add low_cpu_mem_usage=True and device_map="balanced" during loading
  • Skip enable_model_cpu_offload() (conflicts with device_map)
  • Skip .to(device) (conflicts with device_map)
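
A minimal sketch of this loading path (not the exact PR diff), assuming diffusers' DiffusionPipeline API; the model id and the low_vram flag below are illustrative stand-ins for the backend's actual request options:

import torch
from diffusers import DiffusionPipeline

low_vram = True  # stands in for the LowVRAM option from the request

if low_vram:
    # Shard the pipeline across all visible GPUs instead of a single device
    pipe = DiffusionPipeline.from_pretrained(
        "Qwen/Qwen-Image",           # illustrative model id
        torch_dtype=torch.bfloat16,
        low_cpu_mem_usage=True,      # avoid materializing a full CPU copy while loading
        device_map="balanced",       # distribute pipeline components across GPUs
    )
    # With device_map set, do not call pipe.enable_model_cpu_offload() or
    # pipe.to(device): both raise ValueError against a device-mapped pipeline.
else:
    pipe = DiffusionPipeline.from_pretrained(
        "Qwen/Qwen-Image", torch_dtype=torch.bfloat16
    )
    pipe.to("cuda")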

2. Shutdown Method

Add a Shutdown() method (sketched after this list) that:

  • Releases pipeline, controlnet, and compel objects
  • Clears CUDA cache with torch.cuda.empty_cache()
  • Resets state flags
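
A minimal sketch of such a handler, assuming the gRPC servicer class and the backend_pb2 reply type that appear later in this thread; the attribute names (self.pipe, self.controlnet, self.compel, ...) are illustrative rather than the PR's exact field names:

import gc
import torch
import backend_pb2          # generated gRPC module used by the diffusers backend
import backend_pb2_grpc     # module name assumed from the usual protoc output

class BackendServicer(backend_pb2_grpc.BackendServicer):
    # ... LoadModel, Health, GenerateImage, etc. ...

    def Shutdown(self, request, context):
        # Drop references so the large pipeline tensors become collectable
        self.pipe = None
        self.controlnet = None
        self.compel = None
        # Reset state flags so the next LoadModel starts from a clean slate
        self.img2vid = False
        self.txt2vid = False
        self.ltx2_pipeline = False
        gc.collect()
        if torch.cuda.is_available():
            # Return cached CUDA blocks to the driver
            torch.cuda.empty_cache()
        return backend_pb2.Reply(message=bytes("OK", 'utf-8'))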

This enables dynamic LoRA switching:

# 1. Unload model
POST /backend/shutdown {"model": "qwen-image"}

# 2. Update config (change lora_adapters)

# 3. Request triggers reload with new config
POST /v1/images/generations {...}
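
The same workflow as a Python sketch, assuming a LocalAI instance on localhost:8080 (the default port); /backend/shutdown is the endpoint named in this PR and /v1/images/generations is the OpenAI-compatible images endpoint:

import requests

BASE = "http://localhost:8080"

# 1. Unload the model so its GPU memory is released
requests.post(f"{BASE}/backend/shutdown", json={"model": "qwen-image"})

# 2. Edit the model's YAML config on disk (e.g. change lora_adapters)

# 3. The next generation request reloads the model with the new config
resp = requests.post(
    f"{BASE}/v1/images/generations",
    json={"model": "qwen-image", "prompt": "a lighthouse at dusk", "size": "1024x1024"},
)
print(resp.status_code, resp.json())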

Testing

- Model: Qwen-Image (~95GB)
- Hardware: NVIDIA H20 (96GB) x3
- Tested: Multi-LoRA loading, dynamic LoRA switching, GPU memory release

fix(diffusers): support large models with device_map for multi-GPU distribution

When loading very large models (e.g., Qwen-Image ~95GB) on GPUs with limited
headroom, the model loads successfully but leaves no memory for inference.

This PR adds support for multi-GPU distribution via device_map when LowVRAM
is enabled:

1. Add low_cpu_mem_usage=True and device_map='balanced' during model loading
   to distribute large models across multiple GPUs

2. Skip enable_model_cpu_offload() when device_map is used, as they conflict
   with each other (ValueError: device mapping strategy doesn't allow
   enable_model_cpu_offload)

3. Skip .to(device) when device_map is used, as they also conflict
   (ValueError: device mapping strategy doesn't allow explicit device
   placement using to())

This enables running models like Qwen-Image on multi-GPU setups where a
single GPU doesn't have enough memory for both model weights and inference.

Tested with:
- Qwen-Image (~95GB) on 3x NVIDIA H20 (96GB each)
- Configuration: low_vram: true, pipeline_type: QwenImagePipeline
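
For reference, a hypothetical LocalAI model config matching that test setup; the key names low_vram, pipeline_type, and lora_adapters come from this PR, while their exact placement in the YAML and the remaining fields are assumptions, not copied from the test environment:

name: qwen-image
backend: diffusers
parameters:
  model: Qwen/Qwen-Image          # illustrative model reference
low_vram: true                    # enables the device_map="balanced" path added here
diffusers:
  pipeline_type: QwenImagePipeline
lora_adapters:                    # swap entries here, then POST /backend/shutdown to reload
  - /models/loras/example-style.safetensors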
@JairoGuo JairoGuo changed the title from "fix(diffusers): support large models with device_map for multi-GPU distribution" to "fix(diffusers): support large models with device_map for multi-GPU distribution" Feb 5, 2026
Add Shutdown method to the diffusers backend that properly releases GPU
memory when a model is unloaded. This enables dynamic model reloading
with different configurations (e.g., switching LoRA adapters) without
restarting the service.

The Shutdown method:
- Releases the pipeline, controlnet, and compel objects
- Clears CUDA cache with torch.cuda.empty_cache()
- Resets state flags (img2vid, txt2vid, ltx2_pipeline)

This works with LocalAI's existing /backend/shutdown API endpoint,
which terminates the gRPC process. The explicit cleanup ensures
GPU memory is properly released before process termination.

Tested with Qwen-Image (~95GB) on NVIDIA H20 GPUs.
@JairoGuo JairoGuo changed the title from "fix(diffusers): support large models with device_map for multi-GPU distribution" to "feat(diffusers): support large models and add Shutdown for dynamic reloading" Feb 5, 2026
    def Health(self, request, context):
        return backend_pb2.Reply(message=bytes("OK", 'utf-8'))

    def Shutdown(self, request, context):
Owner commented:

This is unused?
