
feat(template): add MiniCPM-o-4.5 training template with audio support and fix image_bound bug#8307

Open
xxddccaa wants to merge 2 commits into modelscope:main from xxddccaa:feat/minicpmo4_5-template-audio-support

Conversation

@xxddccaa

This PR adds a dedicated MiniCPMO4_5Template class for training the MiniCPM-o-4.5 model with video + audio + text → text tasks, and fixes a critical bug that prevented training from running.

What was added

A new MiniCPMO4_5Template subclass of MiniCPMV2_6Template is introduced, registered under the minicpmo4_5 template type. It extends the existing video/image pipeline with full audio support:

  • replace_tag handles audio media type by loading waveforms via load_audio, truncating to a configurable max duration (default 60s), and inserting an <|audio_start|><unk>×N<|audio_end|> placeholder whose length is computed from the waveform length, hop size, and pooling step.
  • _encode calls the parent to handle video/image encoding, then processes audio through processor.process_audio and appends audio_features, audio_feature_lens, and audio_bounds to the encoded dict.
  • _data_collator pads and concatenates audio features across batch items, and collects audio_bounds and audio_feature_lens per sample.
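
The placeholder-length computation described above can be sketched as follows. The hop size, pooling step, and the exact frame-count formula here are illustrative assumptions, not the template's actual constants:

```python
def audio_placeholder(num_samples: int, hop_size: int = 160, pool_step: int = 2) -> str:
    """Build an <|audio_start|><unk>*N<|audio_end|> placeholder string.

    N is derived from the waveform length: the audio encoder emits roughly
    one frame per `hop_size` samples, and frames are then pooled in groups
    of `pool_step`. All numeric values are illustrative defaults.
    """
    num_frames = num_samples // hop_size + 1            # STFT-style frame count
    num_tokens = (num_frames + pool_step - 1) // pool_step  # ceil division
    return '<|audio_start|>' + '<unk>' * num_tokens + '<|audio_end|>'

# e.g. one second of 16 kHz audio -> 101 frames -> 51 <unk> placeholder tokens
```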

Bug fixed

MiniCPM-o-4.5 uses <unk> tokens as placeholders for both image patches and audio frames. The parent class MiniCPMV2_6Template._encode computes image_bound by finding all contiguous <unk> runs in input_ids, which incorrectly includes the audio <unk> regions.

This causes get_vllm_embedding in the model to call torch.stack on ranges of different lengths (e.g., 64 tokens per video frame vs. variable-length audio chunks), resulting in:

RuntimeError: stack expects each tensor to be equal size, but got [64] at entry 0 and [218] at entry 22
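
The size mismatch is easy to reproduce in isolation, since torch.stack requires every tensor to have the same shape:

```python
import torch

# Minimal repro: a 64-token visual patch run mixed with a 218-token
# audio <unk> run cannot be stacked into one tensor.
slices = [torch.zeros(64), torch.zeros(218)]
try:
    torch.stack(slices)
except RuntimeError as e:
    print('stack failed:', e)
```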

The fix masks out all <unk> positions that fall inside <|audio_start|>...<|audio_end|> spans before recomputing image_bound, so only visual patch tokens are included.
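
A minimal sketch of the masking idea, assuming integer token ids (the real fix operates on the template's tokenized input_ids; the helper name and the ids below are illustrative):

```python
import torch

def visual_unk_mask(input_ids: torch.Tensor, unk_id: int,
                    audio_start_id: int, audio_end_id: int) -> torch.Tensor:
    """Return a bool mask that is True only at <unk> positions *outside*
    <|audio_start|>...<|audio_end|> spans, so image_bound is computed
    from visual patch tokens alone."""
    inside_audio = torch.zeros_like(input_ids, dtype=torch.bool)
    in_span = False
    for i, tok in enumerate(input_ids.tolist()):
        if tok == audio_start_id:
            in_span = True
        elif tok == audio_end_id:
            in_span = False
        elif in_span:
            inside_audio[i] = True
    return (input_ids == unk_id) & ~inside_audio

# Example with unk=0, audio_start=1, audio_end=2: the two leading <unk>
# (image patches) and the trailing <unk> survive; audio <unk> are masked out.
ids = torch.tensor([0, 0, 1, 0, 0, 2, 0])
mask = visual_unk_mask(ids, unk_id=0, audio_start_id=1, audio_end_id=2)
```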

…t and fix image_bound bug

- Add MiniCPMO4_5Template subclass of MiniCPMV2_6Template, registered under minicpmo4_5 template type
- replace_tag handles audio media type: loads waveform via load_audio, truncates to configurable max duration (60s), inserts audio placeholder tokens
- _encode calls parent for video/image encoding, then processes audio via processor.process_audio, appends audio_features, audio_feature_lens, audio_bounds
- _data_collator pads/concatenates audio features, collects audio_bounds and audio_feature_lens per sample
- Fix image_bound bug: MiniCPMO4_5 uses <unk> tokens for both image patches and audio frames; mask out audio <unk> spans before computing image_bound to avoid torch.stack size mismatch in get_vllm_embedding
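
The collation step for audio could look roughly like this; the field names follow the PR description, while the feature shapes and padding strategy are assumptions:

```python
import torch
import torch.nn.functional as F

def collate_audio(batch: list) -> dict:
    """Pad audio features to the longest sample in the batch and stack
    them; keep per-sample bounds/lengths as lists.
    Field names follow the PR description; shapes are illustrative."""
    res = {
        'audio_bounds': [b.get('audio_bounds') for b in batch],
        'audio_feature_lens': [b.get('audio_feature_lens') for b in batch],
    }
    feats = [b['audio_features'] for b in batch if 'audio_features' in b]
    if feats:
        max_len = max(f.shape[-1] for f in feats)
        # pad the time dimension on the right to the batch maximum
        padded = [F.pad(f, (0, max_len - f.shape[-1])) for f in feats]
        res['audio_features'] = torch.cat(padded, dim=0)
    return res
```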
@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the MiniCPM-o-4.5 model's capabilities by integrating comprehensive audio processing support, enabling it to handle multimodal inputs including audio, video, and text. Concurrently, it addresses a critical bug in the image_bound calculation that previously hindered training, ensuring stable and correct operation for multimodal tasks.

Highlights

  • New MiniCPM-o-4.5 Template: Introduced a dedicated MiniCPMO4_5Template class, registered under the minicpmo4_5 template type, to support training the MiniCPM-o-4.5 model with video, audio, and text inputs for text generation tasks.
  • Audio Support Implementation: The new template extends the existing video/image pipeline with full audio support, including handling audio media types in replace_tag to load and truncate waveforms, processing audio features in _encode, and collating audio data in _data_collator.
  • Critical image_bound Bug Fix: Resolved a bug where image_bound was incorrectly computed due to <unk> tokens being used for both image patches and audio frames. The fix now masks out audio <unk> positions, preventing torch.stack errors caused by mismatched tensor sizes during embedding generation.


Changelog
  • swift/template/templates/minicpm.py
    • Added MiniCPMO4_5Template class to support audio processing for the MiniCPM-o-4.5 model.
    • Implemented audio loading, truncation, and placeholder generation within the replace_tag method.
    • Modified the _encode method to correctly process audio features and fix the image_bound calculation by excluding audio-related <unk> tokens.
    • Extended the _data_collator to handle padding and concatenation of audio features across batch items.
    • Updated template registration to use the new MiniCPMO4_5Template for MLLMTemplateType.minicpmo4_5.

@gemini-code-assist bot left a comment
Code Review

This pull request introduces a new template MiniCPMO4_5Template to support audio in MiniCPM-o-4.5 model training, and crucially fixes a bug where <unk> tokens for audio were incorrectly included in image_bound, leading to runtime errors. The changes are well-structured, with the new template handling audio loading, encoding, and collation. The bug fix correctly isolates image-only <unk> tokens before computing image_bound. I have a couple of suggestions to improve maintainability and correct a minor documentation inconsistency.


Audio placeholder: <|audio_start|><unk>*N<|audio_end|>
Model inputs added: audio_features, audio_feature_lens, audio_bounds
Audio is truncated to 30s max; sampling_rate defaults to 16000.

medium

The docstring states that audio is truncated to a maximum of 30 seconds, but the class constant MAX_AUDIO_SECONDS is set to 60. To avoid confusion, the docstring should be updated to reflect the actual value used in the code.

Suggested change
Audio is truncated to 30s max; sampling_rate defaults to 16000.
Audio is truncated to 60s max; sampling_rate defaults to 16000.

Comment on lines +354 to +358
    # collate image/video fields from parent
    res = {}
    for k in ['pixel_values', 'image_bound', 'tgt_sizes']:
        res[k] = self.gather_list(batch, k)
    res.update(Template._data_collator(self, batch, padding_to=padding_to))

medium

The logic for collating image and video fields is duplicated from the parent class MiniCPMVTemplate. You can simplify this by calling super()._data_collator(...) to reuse the parent's implementation, which improves code reuse and maintainability.

Suggested change
    # collate image/video fields from parent
    res = {}
    for k in ['pixel_values', 'image_bound', 'tgt_sizes']:
        res[k] = self.gather_list(batch, k)
    res.update(Template._data_collator(self, batch, padding_to=padding_to))
    res = super()._data_collator(batch, padding_to=padding_to)

- Fix docstring: update audio truncation limit from 30s to 60s to match MAX_AUDIO_SECONDS constant
- Simplify _data_collator: replace duplicated image/video collation logic with super()._data_collator() call
@Jintao-Huang
Collaborator

Thanks for the PR! Please write relevant test examples, referring to this: https://github.com/modelscope/ms-swift/blob/main/tests/test_align/test_template/test_audio.py#L58-L65

And run:

pip install pre-commit
pre-commit run --all-files
