feat(diffusers): support large models and add Shutdown for dynamic reloading #8404
Open
JairoGuo wants to merge 2 commits into mudler:master
Conversation
…stribution

When loading very large models (e.g., Qwen-Image, ~95 GB) on GPUs with limited headroom, the model loads successfully but leaves no memory for inference. This PR adds support for multi-GPU distribution via device_map when LowVRAM is enabled:

1. Add low_cpu_mem_usage=True and device_map='balanced' during model loading to distribute large models across multiple GPUs
2. Skip enable_model_cpu_offload() when device_map is used, as the two conflict (ValueError: device mapping strategy doesn't allow enable_model_cpu_offload)
3. Skip .to(device) when device_map is used, as they also conflict (ValueError: device mapping strategy doesn't allow explicit device placement using to())

This enables running models like Qwen-Image on multi-GPU setups where a single GPU doesn't have enough memory for both the model weights and inference.

Tested with:
- Qwen-Image (~95 GB) on 3x NVIDIA H20 (96 GB each)
- Configuration: low_vram: true, pipeline_type: QwenImagePipeline
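As a rough illustration of the loading path described above, here is a minimal sketch assuming a diffusers DiffusionPipeline and a low_vram flag analogous to the PR's LowVRAM option; the function and variable names are illustrative, not the PR's actual code:

```python
import torch
from diffusers import DiffusionPipeline

def load_pipeline(model_name: str, low_vram: bool):
    kwargs = {"torch_dtype": torch.bfloat16}
    if low_vram and torch.cuda.device_count() > 1:
        # Shard the weights across all visible GPUs at load time instead
        # of materializing the full model on a single device.
        kwargs["low_cpu_mem_usage"] = True
        kwargs["device_map"] = "balanced"
    pipe = DiffusionPipeline.from_pretrained(model_name, **kwargs)
    if "device_map" not in kwargs:
        # Only valid without device_map: both .to(device) and
        # enable_model_cpu_offload() raise ValueError when a device
        # mapping strategy is active.
        pipe = pipe.to("cuda")
    return pipe
```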
Add Shutdown method to the diffusers backend that properly releases GPU memory when a model is unloaded. This enables dynamic model reloading with different configurations (e.g., switching LoRA adapters) without restarting the service.

The Shutdown method:
- Releases the pipeline, controlnet, and compel objects
- Clears the CUDA cache with torch.cuda.empty_cache()
- Resets state flags (img2vid, txt2vid, ltx2_pipeline)

This works with LocalAI's existing /backend/shutdown API endpoint, which terminates the gRPC process. The explicit cleanup ensures GPU memory is properly released before process termination.

Tested with Qwen-Image (~95 GB) on NVIDIA H20 GPUs.
mudler reviewed Feb 5, 2026
```python
def Health(self, request, context):
    return backend_pb2.Reply(message=bytes("OK", 'utf-8'))

def Shutdown(self, request, context):
    ...
```
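The Shutdown body is truncated in this diff excerpt; below is a minimal sketch of what it does, reconstructed from the commit message above. The attribute names (self.pipe, self.controlnet, self.compel) and the gc.collect() call are assumptions about the backend's internals:

```python
import gc
import torch
import backend_pb2  # generated gRPC module used by the backend

def Shutdown(self, request, context):
    # Drop references to the heavyweight objects so their tensors
    # become collectable.
    self.pipe = None
    self.controlnet = None
    self.compel = None
    # Reset state flags so a subsequent load starts clean.
    self.img2vid = False
    self.txt2vid = False
    self.ltx2_pipeline = False
    gc.collect()
    # Return cached CUDA blocks to the driver before the process exits.
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
    return backend_pb2.Reply(message=bytes("OK", 'utf-8'))
```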
Summary
This PR adds two features to the diffusers backend: multi-GPU distribution for large models, and a Shutdown method for dynamic reloading.
Problem

Very large models (e.g., Qwen-Image, ~95 GB) can load successfully but leave no GPU memory for inference, and unloading a model did not release GPU memory, preventing dynamic reloading (e.g., switching LoRA adapters) without a restart.

Solution
1. Multi-GPU Distribution (device_map)

When LowVRAM is enabled:
- Use low_cpu_mem_usage=True and device_map="balanced" during loading
- Skip enable_model_cpu_offload() (conflicts with device_map)
- Skip .to(device) (conflicts with device_map)

2. Shutdown Method
Add Shutdown() that:
- Releases the pipeline, controlnet, and compel objects
- Clears the CUDA cache with torch.cuda.empty_cache()
- Resets state flags

This enables dynamic LoRA switching:
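For instance, a hypothetical client-side workflow: the /backend/shutdown endpoint is named in the PR, but the request field name, port, and model names here are illustrative assumptions.

```python
import requests

# 1. Unload the running diffusers backend; LocalAI terminates the gRPC
#    process, and the new Shutdown() frees GPU memory first.
requests.post(
    "http://localhost:8080/backend/shutdown",
    json={"model": "qwen-image"},  # assumed request shape
)

# 2. The next image request against a model config that points at a
#    different LoRA adapter spawns a fresh backend with the new weights.
```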