Skip to content

Support Qwen3 Omni#4411

Open
CUHKSZzxy wants to merge 26 commits into
InternLM:mainfrom
CUHKSZzxy:support-qwen3-omni
Open

Support Qwen3 Omni#4411
CUHKSZzxy wants to merge 26 commits into
InternLM:mainfrom
CUHKSZzxy:support-qwen3-omni

Conversation

@CUHKSZzxy
Copy link
Copy Markdown
Collaborator

@CUHKSZzxy CUHKSZzxy commented Mar 13, 2026

Summary

Support Qwen3-Omni thinker inference in the PyTorch backend.

This PR adds Qwen3-Omni model registration, HF processor integration, and multimodal preprocessing for image, video, audio, and mixed image/audio/video inputs. Audio support is currently limited to Qwen3-Omni.

Changes

  • Add Qwen3-Omni PyTorch thinker model support.
  • Add Qwen3-Omni VL preprocessor using the shared get_input_prompt -> preprocess path.
  • Support image-only, video-only, audio-only, and mixed image/audio/video prompts.
  • Keep Qwen3-Omni video expansion as whole-video spans, distinct from Qwen3VL per-frame timestamp handling.
  • Add audio media parsing for OpenAI-style multimodal messages.
  • Add multimodal input docs and examples, including Qwen3-Omni audio usage.

Accuracy Check

Local run config:

  • Model: Qwen3-Omni-30B-A3B-Instruct
  • Backend: LMDeploy PyTorch, tp=1
  • Server: OpenAI-compatible API
  • Decode: temperature=0
Benchmark LMDeploy local result Official related score Notes
GSM8K 1258 / 1314 = 95.74% 91.36 Official score is from Qwen3-Omni technical report Table 16 for Qwen3-Omni-30B-A3B-Base-202507, not the exact Instruct checkpoint/harness.
OCRBench 848 / 1000 = 84.80% 86.0 Official score is from Qwen3-Omni technical report Table 16 for Qwen3-Omni-30B-A3B-Base-202507, not the exact Instruct checkpoint/harness.

Artifacts are saved locally under benchmark/e2e_qwen3_omni_gsm8k_ocrbench/.

Reference: Qwen3-Omni Technical Report, Table 16: https://arxiv.org/pdf/2509.17765

Notes

  • Talker/audio-generation support is not included.
  • Audio input support is scoped to Qwen3-Omni.
  • Advanced use_audio_in_video=True interleaving is not enabled in this patch.

Related

Prerequisite PR

Assistance

Assisted with Codex + GPT-5.5 High

@CUHKSZzxy CUHKSZzxy force-pushed the support-qwen3-omni branch from 8d64a7a to 4c6bc99 Compare March 19, 2026 07:13
@CUHKSZzxy CUHKSZzxy changed the title [WIP] Support qwen3-omni Support Qwen3 Omni May 11, 2026
@CUHKSZzxy CUHKSZzxy marked this pull request as ready for review May 11, 2026 04:37
Copilot AI review requested due to automatic review settings May 11, 2026 04:37
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds PyTorch-backend support for Qwen3-Omni (thinker), extending the multimodal preprocessing pipeline to handle audio (alongside image/video) and registering the new architecture across model/config dispatch. It also updates the OpenAI-style multimodal message parsing and documentation to include audio inputs.

Changes:

  • Register Qwen3-Omni for VL preprocessing + PyTorch model loading (module map, arch list, config builder, max-len derivation).
  • Extend multimodal preprocessing/utilities to support audio features, plus mixed image/audio/video offset handling.
  • Add audio media loading and update API-server multimodal parsing + docs/examples for audio usage.

Reviewed changes

Copilot reviewed 24 out of 24 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
tests/test_lmdeploy/test_vl/test_qwen3_omni_processor.py Adds unit tests for Qwen3-Omni preprocessing, mixed-modality offsets, and audio masking/mrope behavior.
tests/test_lmdeploy/test_content_merge.py Extends multimodal parsing tests to include audio items and updates “unknown type” coverage.
lmdeploy/vl/model/qwen3_omni.py Adds Qwen3-Omni VL model registration + HF processor integration and special-token wiring.
lmdeploy/vl/model/preprocess_utils.py Expands bundled HF outputs for video/audio and sorts expanded items by offsets to restore prompt order.
lmdeploy/vl/model/builder.py Registers Qwen3-Omni in the VL model builder import list.
lmdeploy/vl/model/base.py Extends new-style VisionModel.preprocess to collect audio inputs and pass audio_kwargs to HF processors.
lmdeploy/vl/media/audio.py Introduces audio MediaIO implementation (librosa/soundfile) for URL/file/base64 audio loading.
lmdeploy/utils.py Adjusts max-length derivation to use thinker_config.text_config for Qwen3-Omni thinker configs.
lmdeploy/serve/processors/multimodal.py Adds OpenAI-style audio_url/audio parsing using AudioMediaIO; updates multimodal type detection.
lmdeploy/pytorch/multimodal/data_type.py Reorders MultiModalData fields to place modality before mrope_pos_ids.
lmdeploy/pytorch/models/utils/model.py Extends multimodal mask computation to include audio token IDs.
lmdeploy/pytorch/models/qwen3_omni_moe_thinker.py Adds the Qwen3-Omni thinker PyTorch model, including audio tower + mixed-modality input processing.
lmdeploy/pytorch/models/module_map.py Maps HF arch Qwen3OmniMoeForConditionalGeneration to the PyTorch thinker implementation.
lmdeploy/pytorch/configurations/qwen3_omni.py Adds a config builder for Qwen3-Omni (thinker text config + mrope enabled).
lmdeploy/model.py Improves chat-template resolution by falling back to processor-provided chat_template when tokenizer lacks it.
lmdeploy/archs.py Adds Qwen3-Omni to VL arch detection and marks it unsupported for TurboMind.
docs/zh_cn/multi_modal/vl_pipeline.md Updates “see also” to include audio in multimodal inputs reference.
docs/zh_cn/multi_modal/multimodal_inputs.md Adds audio input docs/examples and updates native video support note to include Qwen3-Omni.
docs/zh_cn/multi_modal/index.rst Adjusts toctree structure for the multi-modal section.
docs/zh_cn/index.rst Adds multimodal inputs page to the main Chinese documentation index.
docs/en/multi_modal/vl_pipeline.md Updates “see also” to include audio in multimodal inputs reference.
docs/en/multi_modal/multimodal_inputs.md Adds audio input docs/examples and updates native video support note to include Qwen3-Omni.
docs/en/multi_modal/index.rst Adjusts toctree structure for the multi-modal section.
docs/en/index.rst Adds multimodal inputs page to the main English documentation index.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread lmdeploy/pytorch/models/qwen3_omni_moe_thinker.py Outdated
Comment thread lmdeploy/vl/media/audio.py Outdated
@lvhan028 lvhan028 added enhancement New feature or request labels May 13, 2026
# Conflicts:
#	tests/test_lmdeploy/test_vl/test_preprocess_utils.py
@lvhan028 lvhan028 requested review from grimoire and lvhan028 May 20, 2026 04:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants