
Conversation

@dai-yamashita

Summary

This PR adds support for Mistral3 multimodal models (Ministral series):

  • Mistral text model extensions: attention_head_size, use_interleaved_attention, tie_word_embeddings (see the sketch after this list)
  • Pixtral vision encoder: `Bumblebee.Vision.Pixtral`, with RoPE support
  • Mistral3 text decoder: `Bumblebee.Text.Mistral3`, with interleaved attention
  • Multimodal model: `Bumblebee.Multimodal.Mistral3`, combining Pixtral and Mistral3 with a multimodal projector for image-conditioned generation
  • Ministral/Ministral3 variant support, including Devstral 2 (Ministral3)
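
With these extensions, the new options could be set through Bumblebee's usual configuration flow. A minimal sketch, assuming the three options above are exposed as spec options; the repository name is a placeholder, not a real checkpoint, and the values are illustrative:

```elixir
# Placeholder repo name for illustration only.
{:ok, spec} = Bumblebee.load_spec({:hf, "example/mistral3-checkpoint"})

# Illustrative values for the options added by this PR.
spec =
  Bumblebee.configure(spec,
    attention_head_size: 128,
    use_interleaved_attention: true,
    tie_word_embeddings: true
  )
```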

Supported architectures

  • PixtralVisionModel
  • Mistral3Model, Mistral3ForCausalLM, Mistral3ForSequenceClassification
  • Mistral3ForConditionalGeneration (multimodal)
  • Ministral3ForCausalLM

Key features

  1. Function-based attention_window_size in transformer.ex for per-layer attention configuration (see the first sketch after this list)
  2. Interleaved attention pattern: even layers use global attention, odd layers use a sliding window (also covered by the first sketch)
  3. Multimodal projector: a patch merger plus linear layers that project vision features into the text embedding space (see the second sketch after this list)
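
A minimal sketch of how features 1 and 2 fit together, assuming `:attention_window_size` may be passed as a function of the layer index. Module, function names, and the window size are illustrative, not the actual transformer.ex internals:

```elixir
defmodule InterleavedAttentionSketch do
  @window_size 4096

  # Interleaved pattern: even layers attend globally (no window),
  # odd layers use a sliding window of @window_size tokens per side.
  def attention_window_size(layer_idx) do
    if rem(layer_idx, 2) == 0, do: nil, else: {@window_size, @window_size}
  end

  # How the block-building loop could resolve the option: accept
  # either a static window or a function of the layer index.
  def resolve_window(window_config, layer_idx) do
    case window_config do
      fun when is_function(fun, 1) -> fun.(layer_idx)
      static -> static
    end
  end
end

# Example:
#   InterleavedAttentionSketch.resolve_window(
#     &InterleavedAttentionSketch.attention_window_size/1, 3)
#   #=> {4096, 4096}
```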
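And a hedged sketch of the projector from item 3, written with Axon. The spec field and layer names are assumptions based on the PR description, and the patch merger is simplified to group consecutive patches rather than a true 2D spatial neighborhood:

```elixir
defmodule Mistral3ProjectorSketch do
  # vision_features is an Axon node of shape {batch, patches, hidden};
  # spec fields (spatial_merge_size, text_hidden_size) are assumed names.
  def multimodal_projector(vision_features, spec) do
    group = spec.spatial_merge_size * spec.spatial_merge_size

    vision_features
    # Patch merger (simplified): concatenate each run of `group`
    # consecutive patch vectors into a single vector.
    |> Axon.nx(fn features ->
      {batch, patches, hidden} = Nx.shape(features)
      Nx.reshape(features, {batch, div(patches, group), group * hidden})
    end)
    # Linear layers projecting merged patches into the text
    # embedding space consumed by the decoder.
    |> Axon.dense(spec.text_hidden_size,
      activation: :gelu,
      name: "projector.intermediate"
    )
    |> Axon.dense(spec.text_hidden_size, name: "projector.output")
  end
end
```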

Test plan

  • All existing tests pass (264 tests, 0 failures)
  • New tests for Pixtral, Mistral3, and Multimodal added


@jonatanklosko jonatanklosko left a comment


Hey @dai-yamashita, this has multiple parts, so I would split it into multiple PRs, for example:

  1. Changes to the existing Mistral.
  2. Pixtral model.
  3. Mistral3 model.

I also dropped two high-level comments inline.

Also note that for this to work end-to-end, an important piece is all the Pixtral image processing. It may be worth first doing a proof of concept where actual generation with images works, and then submitting the pieces as PRs.

Comment on lines +174 to +176
"Mistral3ForCausalLM" => {Bumblebee.Text.Mistral3, :for_causal_language_modeling},
"Mistral3ForSequenceClassification" =>
{Bumblebee.Text.Mistral3, :for_sequence_classification},

As far as I can tell Mistral3ForCausalLM and Mistral3ForSequenceClassification don't exist in hf/transformers. Looking at modeling_mistral.py there's only Mistral3ForConditionalGeneration and the corresponding Mistral3Model, which should be a single module in Bumblebee.

Comment on lines +28 to +34
# Expected values from Bumblebee inference with model generated by:
# test/fixtures/scripts/generate_expected_values.py (torch.manual_seed(42))
assert_all_close(
  outputs.logits[[.., 1..3, 1..3]],
  Nx.tensor([
    [
      [3.5014779567718506, -3.962040662765503, -4.744167327880859],

We should not assert against values produced by Bumblebee itself, because that doesn't verify that the current implementation is correct. We want these reference values to be generated with Python transformers; that way we know the implementation behaves the same. See #422 for an example of a complete PR with tests.

@dai-yamashita (Author)

@jonatanklosko

Thank you very much for your review.
We truly appreciate your thoughtful advice on various pending issues and how to proceed.

We will carefully address the points you raised and resubmit the request accordingly.
