Add Mistral3 multimodal support with Pixtral vision encoder #431
Conversation
This adds support for Mistral3 multimodal models (vision + text):

- `Bumblebee.Vision.Pixtral`: Pixtral vision encoder with RoPE support
- `Bumblebee.Text.Mistral3`: Mistral3 text decoder with interleaved attention
- `Bumblebee.Multimodal.Mistral3`: vision-language model combining Pixtral and Mistral3 with a multimodal projector for image-conditioned generation
- Ministral/Ministral3 variant support with interleaved attention
- Devstral 2 (Ministral3) model support

Supported architectures:

- PixtralVisionModel
- Mistral3Model, Mistral3ForCausalLM, Mistral3ForSequenceClassification
- Mistral3ForConditionalGeneration (multimodal)
- Ministral3ForCausalLM
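For context, below is a minimal usage sketch of how the proposed modules could be loaded through Bumblebee's existing loading API; the Hugging Face repository name and option values are illustrative assumptions, not part of this PR.

```elixir
# Hypothetical usage of the modules proposed in this PR. The repository name is
# an assumption for illustration; Bumblebee.load_model/2 with :module and
# :architecture overrides is existing Bumblebee API.
repo = {:hf, "mistralai/Mistral-Small-3.1-24B-Instruct-2503"}

{:ok, model_info} =
  Bumblebee.load_model(repo,
    module: Bumblebee.Multimodal.Mistral3,
    architecture: :for_conditional_generation
  )

{:ok, tokenizer} = Bumblebee.load_tokenizer(repo)
{:ok, featurizer} = Bumblebee.load_featurizer(repo)
```

Loading the tokenizer and featurizer from the same repository would also be needed for end-to-end, image-conditioned generation.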
jonatanklosko left a comment:
Hey @dai-yamashita, this has multiple parts, so I would split it into multiple PRs, for example:
- Changes to the existing Mistral.
- Pixtral model.
- Mistral3 model.
I also dropped two high-level comments inline.
Also note that for this to work end-to-end, an important piece is all the Pixtral image processing. It may be worth first doing a proof of concept where actual generation with images works, and then submitting the pieces as PRs.
| "Mistral3ForCausalLM" => {Bumblebee.Text.Mistral3, :for_causal_language_modeling}, | ||
| "Mistral3ForSequenceClassification" => | ||
| {Bumblebee.Text.Mistral3, :for_sequence_classification}, |
As far as I can tell, Mistral3ForCausalLM and Mistral3ForSequenceClassification don't exist in hf/transformers. Looking at modeling_mistral3.py, there's only Mistral3ForConditionalGeneration and the corresponding Mistral3Model, which should be a single module in Bumblebee.
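A sketch of the single-module mapping this comment points toward is shown below; the Bumblebee module name and architecture atoms are assumptions and would need to match whatever the PR settles on.

```elixir
# Sketch only: mirrors hf/transformers, where Mistral3Model and
# Mistral3ForConditionalGeneration are the only Mistral3 model classes.
# The Bumblebee module name and architecture atoms here are assumptions.
%{
  "PixtralVisionModel" => {Bumblebee.Vision.Pixtral, :base},
  "Mistral3Model" => {Bumblebee.Multimodal.Mistral3, :base},
  "Mistral3ForConditionalGeneration" =>
    {Bumblebee.Multimodal.Mistral3, :for_conditional_generation}
}
```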
```elixir
# Expected values from Bumblebee inference with model generated by:
# test/fixtures/scripts/generate_expected_values.py (torch.manual_seed(42))
assert_all_close(
  outputs.logits[[.., 1..3, 1..3]],
  Nx.tensor([
    [
      [3.5014779567718506, -3.962040662765503, -4.744167327880859],
```
We should not assert against values produced by Bumblebee itself, because that doesn't verify the current implementation is correct. We want the reference values to be generated using Python transformers; that way we know the implementation behaves the same. See #422 for an example of a complete PR with tests.
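A sketch of the intended test shape is below; the tensor values are placeholders and would be replaced with logits obtained by running the same checkpoint and inputs through Python transformers.

```elixir
# Placeholder values only: in the real test these numbers come from a Python
# transformers run of the same checkpoint and inputs, so the Elixir
# implementation is checked against the reference implementation rather than
# against itself.
assert_all_close(
  outputs.logits[[.., 1..3, 1..3]],
  Nx.tensor([
    [
      [0.0, 0.0, 0.0],
      [0.0, 0.0, 0.0],
      [0.0, 0.0, 0.0]
    ]
  ])
)
```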
Thank you very much for your review.
We truly appreciate your thoughtful advice on various pending issues and how to proceed.
We will carefully address the points you raised and resubmit the pull request accordingly.
Summary
This PR adds support for Mistral3 multimodal models (Ministral series):
- New configuration options: `attention_head_size`, `use_interleaved_attention`, `tie_word_embeddings`
- `Bumblebee.Vision.Pixtral`
- `Bumblebee.Text.Mistral3`
- `Bumblebee.Multimodal.Mistral3`

Supported architectures

- `PixtralVisionModel`
- `Mistral3Model`, `Mistral3ForCausalLM`, `Mistral3ForSequenceClassification`
- `Mistral3ForConditionalGeneration` (multimodal)

Key features

- Changes to `transformer.ex` for per-layer attention configuration

Test plan