Support Qwen3 Omni#4411
Open
CUHKSZzxy wants to merge 26 commits into
Open
Conversation
8d64a7a to
4c6bc99
Compare
# Conflicts: # lmdeploy/model.py # lmdeploy/serve/processors/multimodal.py
# Conflicts: # lmdeploy/archs.py
Contributor
There was a problem hiding this comment.
Pull request overview
This PR adds PyTorch-backend support for Qwen3-Omni (thinker), extending the multimodal preprocessing pipeline to handle audio (alongside image/video) and registering the new architecture across model/config dispatch. It also updates the OpenAI-style multimodal message parsing and documentation to include audio inputs.
Changes:
- Register Qwen3-Omni for VL preprocessing + PyTorch model loading (module map, arch list, config builder, max-len derivation).
- Extend multimodal preprocessing/utilities to support audio features, plus mixed image/audio/video offset handling.
- Add audio media loading and update API-server multimodal parsing + docs/examples for audio usage.
Reviewed changes
Copilot reviewed 24 out of 24 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/test_lmdeploy/test_vl/test_qwen3_omni_processor.py | Adds unit tests for Qwen3-Omni preprocessing, mixed-modality offsets, and audio masking/mrope behavior. |
| tests/test_lmdeploy/test_content_merge.py | Extends multimodal parsing tests to include audio items and updates “unknown type” coverage. |
| lmdeploy/vl/model/qwen3_omni.py | Adds Qwen3-Omni VL model registration + HF processor integration and special-token wiring. |
| lmdeploy/vl/model/preprocess_utils.py | Expands bundled HF outputs for video/audio and sorts expanded items by offsets to restore prompt order. |
| lmdeploy/vl/model/builder.py | Registers Qwen3-Omni in the VL model builder import list. |
| lmdeploy/vl/model/base.py | Extends new-style VisionModel.preprocess to collect audio inputs and pass audio_kwargs to HF processors. |
| lmdeploy/vl/media/audio.py | Introduces audio MediaIO implementation (librosa/soundfile) for URL/file/base64 audio loading. |
| lmdeploy/utils.py | Adjusts max-length derivation to use thinker_config.text_config for Qwen3-Omni thinker configs. |
| lmdeploy/serve/processors/multimodal.py | Adds OpenAI-style audio_url/audio parsing using AudioMediaIO; updates multimodal type detection. |
| lmdeploy/pytorch/multimodal/data_type.py | Reorders MultiModalData fields to place modality before mrope_pos_ids. |
| lmdeploy/pytorch/models/utils/model.py | Extends multimodal mask computation to include audio token IDs. |
| lmdeploy/pytorch/models/qwen3_omni_moe_thinker.py | Adds the Qwen3-Omni thinker PyTorch model, including audio tower + mixed-modality input processing. |
| lmdeploy/pytorch/models/module_map.py | Maps HF arch Qwen3OmniMoeForConditionalGeneration to the PyTorch thinker implementation. |
| lmdeploy/pytorch/configurations/qwen3_omni.py | Adds a config builder for Qwen3-Omni (thinker text config + mrope enabled). |
| lmdeploy/model.py | Improves chat-template resolution by falling back to processor-provided chat_template when tokenizer lacks it. |
| lmdeploy/archs.py | Adds Qwen3-Omni to VL arch detection and marks it unsupported for TurboMind. |
| docs/zh_cn/multi_modal/vl_pipeline.md | Updates “see also” to include audio in multimodal inputs reference. |
| docs/zh_cn/multi_modal/multimodal_inputs.md | Adds audio input docs/examples and updates native video support note to include Qwen3-Omni. |
| docs/zh_cn/multi_modal/index.rst | Adjusts toctree structure for the multi-modal section. |
| docs/zh_cn/index.rst | Adds multimodal inputs page to the main Chinese documentation index. |
| docs/en/multi_modal/vl_pipeline.md | Updates “see also” to include audio in multimodal inputs reference. |
| docs/en/multi_modal/multimodal_inputs.md | Adds audio input docs/examples and updates native video support note to include Qwen3-Omni. |
| docs/en/multi_modal/index.rst | Adjusts toctree structure for the multi-modal section. |
| docs/en/index.rst | Adds multimodal inputs page to the main English documentation index. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
# Conflicts: # tests/test_lmdeploy/test_vl/test_preprocess_utils.py
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Support Qwen3-Omni thinker inference in the PyTorch backend.
This PR adds Qwen3-Omni model registration, HF processor integration, and multimodal preprocessing for image, video, audio, and mixed image/audio/video inputs. Audio support is currently limited to Qwen3-Omni.
Changes
get_input_prompt -> preprocesspath.Accuracy Check
Local run config:
Qwen3-Omni-30B-A3B-Instructtp=1temperature=0Qwen3-Omni-30B-A3B-Base-202507, not the exact Instruct checkpoint/harness.Qwen3-Omni-30B-A3B-Base-202507, not the exact Instruct checkpoint/harness.Artifacts are saved locally under
benchmark/e2e_qwen3_omni_gsm8k_ocrbench/.Reference: Qwen3-Omni Technical Report, Table 16: https://arxiv.org/pdf/2509.17765
Notes
use_audio_in_video=Trueinterleaving is not enabled in this patch.Related
Prerequisite PR
Assistance
Assisted with Codex + GPT-5.5 High