
feat: vision encoder spec for multimodal models #195

Open
pradhyum6144 wants to merge 5 commits into modelpack:main from pradhyum6144:feat/vision-encoder-spec

Conversation

@pradhyum6144
Contributor

Summary

  • Adds a vision encoder specification to describe how multimodal models process image and video inputs
  • The current spec supports capabilities.inputTypes: ["image"] but has no architectural description of the vision encoder — this fills that gap
  • Covers the three major vision-language patterns: CLIP-ViT (LLaVA), native ViT with mRoPE (Qwen2-VL), and cross-attention fusion (LLaMA-3.2 Vision)

New Files

| File | Description |
| --- | --- |
| `schema/vision-encoder-schema.json` | JSON Schema for vision encoder architecture fields |
| `docs/vision-encoder.md` | Spec document with field descriptions, model coverage matrix, and JSON example |
| `tools/hf_parser.py` | Extended HF config parser with `parse_vision_config()` for multimodal models |
| `tools/hf_parser_test.py` | 50 tests (26 decoder-only + 24 vision) |

Vision Encoder Fields

  • Core: type, hidden_size, patch_size, image_size, num_layers, num_attention_heads
  • Projector: type (mlp/linear/cross_attention/perceiver), num_layers, activation
  • Special tokens: image_token_id, vision_start/end_token_id, video_token_id
  • Dynamic resolution: enabled, min/max_pixels, spatial_merge_size
  • Fusion: early (Qwen2-VL), late (LLaVA), cross_attention (LLaMA-3.2 Vision)
  • Position embedding: learned, rope, mrope (with per-modality sections)
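The field groups above combine into a single `vision_encoder` object. A minimal sketch of what such an object might look like — the exact nesting and all numeric values here are illustrative, not taken from the PR's schema or any shipped model:

```python
# Illustrative vision_encoder object combining the field groups above.
# Values are placeholders; the nesting of "dynamic_resolution" is an assumption.
vision_encoder = {
    "type": "clip_vit",
    "hidden_size": 1024,
    "patch_size": 14,
    "image_size": 336,
    "num_layers": 24,
    "num_attention_heads": 16,
    "projector": {"type": "mlp", "num_layers": 2, "activation": "gelu"},
    "special_tokens": {"image_token_id": 32000},
    "dynamic_resolution": {"enabled": False},
    "fusion_type": "late",
    "position_embedding": "learned",
}

print(sorted(vision_encoder))
```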

Model Coverage

| Model | Encoder | Patch Size | Projector | Fusion | Special Features |
| --- | --- | --- | --- | --- | --- |
| LLaVA 1.5 | CLIP-ViT-L/14 | 14 | 2-layer MLP | late | |
| Qwen2-VL | ViT | 14 | | early | mRoPE, dynamic resolution, video |
| LLaMA-3.2 Vision | CLIP-ViT | 14 | cross-attention | cross_attention | 8 gated cross-attn layers |

Test plan

  • 50 tests pass (pytest tools/hf_parser_test.py -v)
  • LLaVA 1.5: CLIP-ViT encoder, MLP projector, late fusion, image_token_id
  • Qwen2-VL: native ViT, mRoPE, dynamic resolution, temporal patches, 5 special tokens
  • LLaMA-3.2 Vision: cross-attention fusion, 8 cross-attn layers counted
  • Text-only models: no vision_encoder in output, parse_vision_config returns None
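The text-only fallback in the last bullet can be sketched with a minimal stand-in for `parse_vision_config` — this is not the PR's implementation (which maps many more fields), just the contract the test describes:

```python
def parse_vision_config(raw: dict):
    """Return a vision_encoder dict, or None for text-only configs.

    Minimal stand-in illustrating the contract from the test plan;
    the real parser in tools/hf_parser.py maps many more fields.
    """
    vision = raw.get("vision_config")
    if vision is None:
        return None  # text-only model: no vision_encoder in output
    return {
        "hidden_size": vision.get("hidden_size"),
        "patch_size": vision.get("patch_size"),
    }

# Text-only config yields None; a multimodal config yields a populated dict.
print(parse_vision_config({"model_type": "llama"}))
print(parse_vision_config({"vision_config": {"patch_size": 14}}))
```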

Relates to #150
Builds on #193


Adds a vision encoder specification to describe how multimodal models
process image and video inputs. The current spec supports declaring
image modality via capabilities.inputTypes but has no architectural
description of the vision encoder.

New files:
- schema/vision-encoder-schema.json: JSON Schema for vision encoder
  fields (encoder type, patch size, projector, fusion type, special
  tokens, dynamic resolution, mRoPE)
- docs/vision-encoder.md: Spec document with field descriptions and
  model coverage matrix (LLaVA, Qwen2-VL, LLaMA-3.2 Vision, Gemma 2 VL)
- tools/hf_parser.py: Extended HF config parser with vision model
  support (parse_vision_config function)
- tools/hf_parser_test.py: 50 tests (26 decoder + 24 vision) covering
  LLaVA 1.5, Qwen2-VL, LLaMA-3.2 Vision, and text-only models

Relates to modelpack#150

Signed-off-by: pradhyum6144 <pradhyum314@gmail.com>
@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the ModelPack specification by introducing a detailed vision encoder architecture. Previously, the spec could declare image input capabilities but lacked the granular architectural details necessary for inference engines to properly handle visual inputs. The new specification provides structured metadata for vision encoders, projectors, and vision-language fusion mechanisms, supporting diverse multimodal model architectures. This update ensures that ModelPack can accurately represent and configure a broader range of modern multimodal models.

Highlights

  • Vision Encoder Specification: Introduced a comprehensive vision encoder specification to describe how multimodal models process image and video inputs, filling a gap in the existing ModelPack spec.
  • Multimodal Model Coverage: The specification covers key vision-language patterns, including CLIP-ViT (LLaVA), native ViT with mRoPE (Qwen2-VL), and cross-attention fusion (LLaMA-3.2 Vision).
  • New Schema and Documentation: Added a new JSON schema (schema/vision-encoder-schema.json) for vision encoder architecture fields and a detailed documentation file (docs/vision-encoder.md) explaining the specification.
  • HuggingFace Parser Extension: Extended the hf_parser.py tool with a parse_vision_config() function to extract and map vision encoder details from HuggingFace config.json files.
  • Extensive Testing: Included 50 new tests in hf_parser_test.py to ensure accurate parsing of vision encoder configurations for various multimodal models and to confirm text-only models are handled correctly.


@gemini-code-assist (bot) left a comment


Code Review

This pull request introduces a new specification for vision encoder architectures in multimodal models, including a Markdown document, a JSON schema, and a Python parser for HuggingFace configurations. The review feedback suggests adding vision_token_id to the specification, schema, and parser to fully support models like Qwen2-VL. It also recommends refactoring the num_cross_attention_layers field into the projector object for better schema consistency, which would require corresponding updates in the documentation, schema, and parser. Additionally, a test case for vision_token_id is needed.

Comment on lines +101 to +119
- **special_tokens** _object_, OPTIONAL

  Special token IDs used for image and video inputs in the tokenizer.

  - **image_token_id** _integer_, OPTIONAL

    The token ID used as a placeholder for image input in the text sequence.

  - **vision_start_token_id** _integer_, OPTIONAL

    The token ID marking the start of a vision region (used by models like Qwen2-VL).

  - **vision_end_token_id** _integer_, OPTIONAL

    The token ID marking the end of a vision region.

  - **video_token_id** _integer_, OPTIONAL

    The token ID for video frame placeholders.

high

The specification for special_tokens is missing vision_token_id, which is used by models like Qwen2-VL. The PR description mentions that Qwen2-VL uses 5 special tokens, but only 4 are defined here and handled by the parser.

Please add vision_token_id to the special_tokens properties list to fully support the intended models.

Example:

- **vision_token_id** _integer_, OPTIONAL

  The token ID for a generic vision placeholder (used by models like Qwen2-VL).

Comment on lines +83 to +100
"properties": {
"image_token_id": {
"type": "integer",
"description": "Token ID used as a placeholder for image input"
},
"vision_start_token_id": {
"type": "integer",
"description": "Token ID marking the start of a vision region"
},
"vision_end_token_id": {
"type": "integer",
"description": "Token ID marking the end of a vision region"
},
"video_token_id": {
"type": "integer",
"description": "Token ID for video frame placeholder"
}
},

high

The special_tokens properties are missing vision_token_id. Models like Qwen2-VL use this token, and it's mentioned in the PR description as one of the 5 special tokens for that model, but it's not included in the schema.

Please add it to the properties list to ensure the schema is complete:

"vision_token_id": {
  "type": "integer",
  "description": "Token ID for a generic vision placeholder (e.g., used by Qwen2-VL)"
}

Comment on lines +400 to +406
token_fields = {
"image_token_id": "special_tokens.image_token_id",
"image_token_index": "special_tokens.image_token_id",
"vision_start_token_id": "special_tokens.vision_start_token_id",
"vision_end_token_id": "special_tokens.vision_end_token_id",
"video_token_id": "special_tokens.video_token_id",
}

high

The token_fields dictionary is missing an entry for vision_token_id. The PR description mentions that Qwen2-VL uses 5 special tokens, and the test configuration QWEN2_VL_CONFIG includes "vision_token_id": 151654, but it is not being parsed.

To fully support Qwen2-VL as intended, please:

  1. Add vision_token_id to the special_tokens object in the specification (docs/vision-encoder.md and schema/vision-encoder-schema.json).
  2. Add an entry to the token_fields map here to parse it: `"vision_token_id": "special_tokens.vision_token_id",`
  3. Add an assertion for it in tools/hf_parser_test.py in TestQwen2VL.test_special_tokens.
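The mapping-plus-nested-write pattern this suggestion targets can be sketched as follows. `_set_nested` is re-implemented here from its apparent usage elsewhere in the review (the real helper lives in `tools/hf_parser.py`), and the token IDs are the Qwen2-VL values cited in the tests:

```python
def _set_nested(result: dict, dotted_key: str, value) -> None:
    # Walk/create intermediate dicts for an "a.b.c"-style destination key.
    parts = dotted_key.split(".")
    node = result
    for part in parts[:-1]:
        node = node.setdefault(part, {})
    node[parts[-1]] = value

# token_fields including the vision_token_id entry the review asks for.
token_fields = {
    "image_token_id": "special_tokens.image_token_id",
    "image_token_index": "special_tokens.image_token_id",
    "vision_start_token_id": "special_tokens.vision_start_token_id",
    "vision_end_token_id": "special_tokens.vision_end_token_id",
    "vision_token_id": "special_tokens.vision_token_id",
    "video_token_id": "special_tokens.video_token_id",
}

raw = {"vision_token_id": 151654, "image_token_id": 151655}
result = {}
for src_key, dest_key in token_fields.items():
    if src_key in raw:
        _set_nested(result, dest_key, raw[src_key])

print(result["special_tokens"])
```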

Comment on lines +155 to +157
- **num_cross_attention_layers** _integer_, OPTIONAL

  The number of cross-attention layers for vision-language fusion. Only applicable when `fusion_type` is `"cross_attention"`.

medium

For better consistency in the specification, consider moving this num_cross_attention_layers field into the projector object and unifying it with the existing num_layers field (defined on lines 93-95).

Currently, num_layers is specified for MLP projectors, while num_cross_attention_layers is a top-level field for cross-attention fusion. This separation can be confusing since both represent the number of layers for a specific projector type.

A more unified structure would be:

projector:
  type: string
  num_layers: integer # Number of layers for MLP, cross-attention, etc.
  ...

This would make the schema more modular and easier to understand. The description for projector.num_layers would need to be updated to be more generic, e.g., "The number of layers in the projector (for MLP or cross-attention type projectors)."
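Under that suggestion, the unified projector schema fragment might look like this — a sketch expressed as a Python dict, with the property names mirroring the PR's schema but the exact wording illustrative:

```python
# Sketch of a unified projector schema fragment per the review suggestion.
# Enum values come from the PR's field list; descriptions are illustrative.
projector_schema = {
    "type": "object",
    "properties": {
        "type": {
            "type": "string",
            "enum": ["mlp", "linear", "cross_attention", "perceiver"],
        },
        "num_layers": {
            "type": "integer",
            "description": "Number of layers in the projector "
                           "(for MLP or cross-attention type projectors)",
        },
        "activation": {"type": "string"},
    },
}

print(sorted(projector_schema["properties"]))
```

Note there is no separate `num_cross_attention_layers` property: the cross-attention layer count folds into `projector.num_layers`, which is the point of the suggestion.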

Comment on lines +134 to +137
"num_cross_attention_layers": {
"type": "integer",
"description": "Number of cross-attention layers for vision-language fusion (for cross_attention fusion type)"
},

medium

To improve schema consistency, I recommend moving num_cross_attention_layers into the projector object and merging it with the existing num_layers field (defined on lines 69-72). This would make the structure more modular.

The projector.num_layers field is currently scoped to MLP projectors. It could be generalized to also cover cross-attention layers.

Suggested change:

  1. Remove this top-level num_cross_attention_layers property.
  2. Update the description of projector.properties.num_layers to something like: "Number of layers in the projector (e.g., for MLP or cross-attention projectors)."

This would centralize projector-related properties and align with the suggestion for docs/vision-encoder.md.

Comment on lines +417 to +420
if model_type == "mllama":
cross_attn_layers = raw.get("cross_attention_layers")
if cross_attn_layers is not None:
result["num_cross_attention_layers"] = len(cross_attn_layers)

medium

This logic for parsing num_cross_attention_layers is consistent with the current spec. However, if you adopt the suggestion to move this property into the projector object for better modularity (see comment on docs/vision-encoder.md), this code will need to be updated.

The updated code would look like this:

if model_type == "mllama":
    cross_attn_layers = raw.get("cross_attention_layers")
    if cross_attn_layers is not None:
        _set_nested(result, "projector.num_layers", len(cross_attn_layers))

This would align the implementation with the proposed more consistent schema structure.

Comment on lines +459 to +465
def test_special_tokens(self):
    spec = parse_hf_config(QWEN2_VL_CONFIG)
    tokens = spec["vision_encoder"]["special_tokens"]
    assert tokens["image_token_id"] == 151655
    assert tokens["vision_start_token_id"] == 151652
    assert tokens["vision_end_token_id"] == 151653
    assert tokens["video_token_id"] == 151656

medium

This test is missing an assertion for vision_token_id. The QWEN2_VL_CONFIG includes "vision_token_id": 151654, but it's not being parsed or tested.

Once the parser is updated to handle vision_token_id (as suggested in my comment on tools/hf_parser.py), please add an assertion here to ensure it's parsed correctly:

assert tokens["vision_token_id"] == 151654

Fixes MD040 markdown lint error (fenced-code-language).

Signed-off-by: pradhyum6144 <pradhyum314@gmail.com>
The example_test.go validator has no registered schema for the
vision-encoder mediatype, causing CI to fail. Remove the mediatype
tag until the validator is extended to support it.

Signed-off-by: pradhyum6144 <pradhyum314@gmail.com>
schema.go uses //go:embed *.json which embeds ALL json files in
schema/. The validator then requires each embedded file to have an
entry in specURLs, which vision-encoder-schema.json does not have.

Moved to docs/schemas/ to avoid breaking the Go embed/validation
pipeline until the vision encoder is integrated into the validator.

Signed-off-by: pradhyum6144 <pradhyum314@gmail.com>
- Add missing vision_token_id to spec, schema, and parser (Qwen2-VL
  uses 5 special tokens, not 4)
- Move num_cross_attention_layers into projector.num_layers for
  consistent schema structure across MLP and cross-attention projectors
- Add vision_token_id assertion in Qwen2-VL test
- Update projector.num_layers description to cover both MLP and
  cross-attention types

Signed-off-by: pradhyum6144 <pradhyum314@gmail.com>