feat: vision encoder spec for multimodal models #195

pradhyum6144 wants to merge 5 commits into modelpack:main
Conversation
Adds a vision encoder specification to describe how multimodal models process image and video inputs. The current spec supports declaring image modality via `capabilities.inputTypes` but has no architectural description of the vision encoder.

New files:

- `schema/vision-encoder-schema.json`: JSON Schema for vision encoder fields (encoder type, patch size, projector, fusion type, special tokens, dynamic resolution, mRoPE)
- `docs/vision-encoder.md`: Spec document with field descriptions and model coverage matrix (LLaVA, Qwen2-VL, LLaMA-3.2 Vision, Gemma 2 VL)
- `tools/hf_parser.py`: Extended HF config parser with vision model support (`parse_vision_config` function)
- `tools/hf_parser_test.py`: 50 tests (26 decoder + 24 vision) covering LLaVA 1.5, Qwen2-VL, LLaMA-3.2 Vision, and text-only models

Relates to modelpack#150

Signed-off-by: pradhyum6144 <pradhyum314@gmail.com>
Summary of Changes

This pull request enhances the ModelPack specification by introducing a detailed vision encoder architecture. Previously, the spec could declare image input capabilities but lacked the granular architectural details necessary for inference engines to properly handle visual inputs. The new specification provides structured metadata for vision encoders, projectors, and vision-language fusion mechanisms, supporting diverse multimodal model architectures. This update ensures that ModelPack can accurately represent and configure a broader range of modern multimodal models.
Code Review
This pull request introduces a new specification for vision encoder architectures in multimodal models, including a Markdown document, a JSON schema, and a Python parser for HuggingFace configurations. The review feedback suggests adding vision_token_id to the specification, schema, and parser to fully support models like Qwen2-VL. It also recommends refactoring the num_cross_attention_layers field into the projector object for better schema consistency, which would require corresponding updates in the documentation, schema, and parser. Additionally, a test case for vision_token_id is needed.
- **special_tokens** _object_, OPTIONAL

  Special token IDs used for image and video inputs in the tokenizer.

  - **image_token_id** _integer_, OPTIONAL

    The token ID used as a placeholder for image input in the text sequence.

  - **vision_start_token_id** _integer_, OPTIONAL

    The token ID marking the start of a vision region (used by models like Qwen2-VL).

  - **vision_end_token_id** _integer_, OPTIONAL

    The token ID marking the end of a vision region.

  - **video_token_id** _integer_, OPTIONAL

    The token ID for video frame placeholders.
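For concreteness, a populated `special_tokens` object for a Qwen2-VL-style model might look like the sketch below. The token IDs come from this PR's `QWEN2_VL_CONFIG` test fixture and are illustrative; real values depend on the model's tokenizer.

```python
# Illustrative special_tokens block for a Qwen2-VL-style model.
# Token IDs match the QWEN2_VL_CONFIG fixture used in this PR's tests.
special_tokens = {
    "image_token_id": 151655,         # placeholder for image input
    "vision_start_token_id": 151652,  # start of a vision region
    "vision_end_token_id": 151653,    # end of a vision region
    "video_token_id": 151656,         # placeholder for video frames
}

# Per the schema, every field is an OPTIONAL integer.
assert all(isinstance(v, int) for v in special_tokens.values())
```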
The specification for special_tokens is missing vision_token_id, which is used by models like Qwen2-VL. The PR description mentions that Qwen2-VL uses 5 special tokens, but only 4 are defined here and handled by the parser.
Please add vision_token_id to the special_tokens properties list to fully support the intended models.
Example:

- **vision_token_id** _integer_, OPTIONAL

  The token ID for a generic vision placeholder (used by models like Qwen2-VL).

```json
"properties": {
  "image_token_id": {
    "type": "integer",
    "description": "Token ID used as a placeholder for image input"
  },
  "vision_start_token_id": {
    "type": "integer",
    "description": "Token ID marking the start of a vision region"
  },
  "vision_end_token_id": {
    "type": "integer",
    "description": "Token ID marking the end of a vision region"
  },
  "video_token_id": {
    "type": "integer",
    "description": "Token ID for video frame placeholder"
  }
},
```
The special_tokens properties are missing vision_token_id. Models like Qwen2-VL use this token, and it's mentioned in the PR description as one of the 5 special tokens for that model, but it's not included in the schema.
Please add it to the properties list to ensure the schema is complete:
"vision_token_id": {
"type": "integer",
"description": "Token ID for a generic vision placeholder (e.g., used by Qwen2-VL)"
}| token_fields = { | ||
| "image_token_id": "special_tokens.image_token_id", | ||
| "image_token_index": "special_tokens.image_token_id", | ||
| "vision_start_token_id": "special_tokens.vision_start_token_id", | ||
| "vision_end_token_id": "special_tokens.vision_end_token_id", | ||
| "video_token_id": "special_tokens.video_token_id", | ||
| } |
The token_fields dictionary is missing an entry for vision_token_id. The PR description mentions that Qwen2-VL uses 5 special tokens, and the test configuration QWEN2_VL_CONFIG includes "vision_token_id": 151654, but it is not being parsed.
To fully support Qwen2-VL as intended, please:

- Add `vision_token_id` to the `special_tokens` object in the specification (`docs/vision-encoder.md` and `schema/vision-encoder-schema.json`).
- Add an entry to the `token_fields` map here to parse it:

  ```python
  "vision_token_id": "special_tokens.vision_token_id",
  ```

- Add an assertion for it in `tools/hf_parser_test.py` in `TestQwen2VL.test_special_tokens`.
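A minimal sketch of how a `token_fields`-style mapping could drive the parse, including the suggested `vision_token_id` entry. The `_set_nested` helper here is a stand-in (the name appears in this review's mllama suggestion), and `parse_special_tokens` is a hypothetical wrapper; the actual logic in `tools/hf_parser.py` may differ.

```python
def _set_nested(result, dotted_key, value):
    """Assign value at a dot-separated path, creating dicts as needed."""
    keys = dotted_key.split(".")
    node = result
    for k in keys[:-1]:
        node = node.setdefault(k, {})
    node[keys[-1]] = value

# HF config key -> spec path, with the suggested vision_token_id entry.
# Note image_token_index aliases to the same spec field as image_token_id.
token_fields = {
    "image_token_id": "special_tokens.image_token_id",
    "image_token_index": "special_tokens.image_token_id",
    "vision_start_token_id": "special_tokens.vision_start_token_id",
    "vision_end_token_id": "special_tokens.vision_end_token_id",
    "video_token_id": "special_tokens.video_token_id",
    "vision_token_id": "special_tokens.vision_token_id",
}

def parse_special_tokens(raw):
    """Extract special token IDs present in a raw HF config dict."""
    result = {}
    for hf_key, spec_path in token_fields.items():
        if hf_key in raw:
            _set_nested(result, spec_path, raw[hf_key])
    return result
```

With this mapping, `parse_special_tokens({"vision_token_id": 151654})` yields `{"special_tokens": {"vision_token_id": 151654}}`.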
docs/vision-encoder.md (Outdated)
- **num_cross_attention_layers** _integer_, OPTIONAL

  The number of cross-attention layers for vision-language fusion. Only applicable when `fusion_type` is `"cross_attention"`.
For better consistency in the specification, consider moving this num_cross_attention_layers field into the projector object and unifying it with the existing num_layers field (defined on lines 93-95).
Currently, num_layers is specified for MLP projectors, while num_cross_attention_layers is a top-level field for cross-attention fusion. This separation can be confusing since both represent the number of layers for a specific projector type.
A more unified structure would be:

```yaml
projector:
  type: string
  num_layers: integer  # Number of layers for MLP, cross-attention, etc.
  ...
```

This would make the schema more modular and easier to understand. The description for `projector.num_layers` would need to be updated to be more generic, e.g., "The number of layers in the projector (for MLP or cross-attention type projectors)."
| "num_cross_attention_layers": { | ||
| "type": "integer", | ||
| "description": "Number of cross-attention layers for vision-language fusion (for cross_attention fusion type)" | ||
| }, |
To improve schema consistency, I recommend moving num_cross_attention_layers into the projector object and merging it with the existing num_layers field (defined on lines 69-72). This would make the structure more modular.
The projector.num_layers field is currently scoped to MLP projectors. It could be generalized to also cover cross-attention layers.
Suggested change:
- Remove this top-level `num_cross_attention_layers` property.
- Update the description of `projector.properties.num_layers` to something like: "Number of layers in the projector (e.g., for MLP or cross-attention projectors)."
This would centralize projector-related properties and align with the suggestion for docs/vision-encoder.md.
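The proposed unified shape, expressed as a Python dict mirroring the JSON Schema fragment, might look like this. It is a sketch of the review's suggestion, not the merged schema itself.

```python
# Sketch of the unified projector schema proposed in this review:
# projector.num_layers covers both MLP and cross-attention projectors,
# replacing the top-level num_cross_attention_layers property.
projector_schema = {
    "type": "object",
    "properties": {
        "type": {
            "type": "string",
            "description": "Projector type (e.g. mlp, cross_attention)",
        },
        "num_layers": {
            "type": "integer",
            "description": (
                "Number of layers in the projector "
                "(for MLP or cross-attention type projectors)"
            ),
        },
    },
}

# No separate top-level field: both projector types report their
# depth via projector.num_layers.
assert "num_cross_attention_layers" not in projector_schema["properties"]
```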
tools/hf_parser.py (Outdated)
```python
if model_type == "mllama":
    cross_attn_layers = raw.get("cross_attention_layers")
    if cross_attn_layers is not None:
        result["num_cross_attention_layers"] = len(cross_attn_layers)
```
This logic for parsing num_cross_attention_layers is consistent with the current spec. However, if you adopt the suggestion to move this property into the projector object for better modularity (see comment on docs/vision-encoder.md), this code will need to be updated.
The updated code would look like this:

```python
if model_type == "mllama":
    cross_attn_layers = raw.get("cross_attention_layers")
    if cross_attn_layers is not None:
        _set_nested(result, "projector.num_layers", len(cross_attn_layers))
```

This would align the implementation with the proposed more consistent schema structure.
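Run end to end, the mllama handling reduces to counting the entries of `cross_attention_layers` (in LLaMA-3.2 Vision configs this lists the decoder layer indices that carry cross-attention). The `_set_nested` helper and the sample layer indices below are illustrative, not taken from the PR's code.

```python
def _set_nested(result, dotted_key, value):
    """Assign value at a dot-separated path, creating dicts as needed."""
    keys = dotted_key.split(".")
    node = result
    for k in keys[:-1]:
        node = node.setdefault(k, {})
    node[keys[-1]] = value

# Hypothetical raw HF config fragment for an mllama-style model:
# cross_attention_layers lists which layers use cross-attention,
# so its length becomes projector.num_layers.
raw = {
    "model_type": "mllama",
    "cross_attention_layers": [3, 8, 13, 18, 23, 28, 33, 38],
}

result = {}
if raw["model_type"] == "mllama":
    cross_attn_layers = raw.get("cross_attention_layers")
    if cross_attn_layers is not None:
        _set_nested(result, "projector.num_layers", len(cross_attn_layers))

# result → {"projector": {"num_layers": 8}}
```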
```python
def test_special_tokens(self):
    spec = parse_hf_config(QWEN2_VL_CONFIG)
    tokens = spec["vision_encoder"]["special_tokens"]
    assert tokens["image_token_id"] == 151655
    assert tokens["vision_start_token_id"] == 151652
    assert tokens["vision_end_token_id"] == 151653
    assert tokens["video_token_id"] == 151656
```
This test is missing an assertion for vision_token_id. The QWEN2_VL_CONFIG includes "vision_token_id": 151654, but it's not being parsed or tested.
Once the parser is updated to handle vision_token_id (as suggested in my comment on tools/hf_parser.py), please add an assertion here to ensure it's parsed correctly:
assert tokens["vision_token_id"] == 151654Fixes MD040 markdown lint error (fenced-code-language). Signed-off-by: pradhyum6144 <pradhyum314@gmail.com>
The example_test.go validator has no registered schema for the vision-encoder mediatype, causing CI to fail. Remove the mediatype tag until the validator is extended to support it. Signed-off-by: pradhyum6144 <pradhyum314@gmail.com>
schema.go uses //go:embed *.json which embeds ALL json files in schema/. The validator then requires each embedded file to have an entry in specURLs, which vision-encoder-schema.json does not have. Moved to docs/schemas/ to avoid breaking the Go embed/validation pipeline until the vision encoder is integrated into the validator. Signed-off-by: pradhyum6144 <pradhyum314@gmail.com>
- Add missing vision_token_id to spec, schema, and parser (Qwen2-VL uses 5 special tokens, not 4)
- Move num_cross_attention_layers into projector.num_layers for consistent schema structure across MLP and cross-attention projectors
- Add vision_token_id assertion in Qwen2-VL test
- Update projector.num_layers description to cover both MLP and cross-attention types

Signed-off-by: pradhyum6144 <pradhyum314@gmail.com>
Summary

- The current spec supports declaring image modality via `capabilities.inputTypes: ["image"]` but has no architectural description of the vision encoder — this fills that gap

New Files

- `schema/vision-encoder-schema.json`
- `docs/vision-encoder.md`
- `tools/hf_parser.py` — `parse_vision_config()` for multimodal models
- `tools/hf_parser_test.py`

Vision Encoder Fields

Model Coverage

Test plan

- `pytest tools/hf_parser_test.py -v`

Relates to #150

Builds on #193