
feat: vision encoder spec for multimodal models #195

Open
pradhyum6144 wants to merge 5 commits into modelpack:main from pradhyum6144:feat/vision-encoder-spec

Conversation

@pradhyum6144
Contributor

Summary

  • Adds a vision encoder specification to describe how multimodal models process image and video inputs
  • The current spec supports capabilities.inputTypes: ["image"] but has no architectural description of the vision encoder — this fills that gap
  • Covers the three major vision-language patterns: CLIP-ViT (LLaVA), native ViT with mRoPE (Qwen2-VL), and cross-attention fusion (LLaMA-3.2 Vision)

New Files

| File | Description |
| --- | --- |
| `schema/vision-encoder-schema.json` | JSON Schema for vision encoder architecture fields |
| `docs/vision-encoder.md` | Spec document with field descriptions, model coverage matrix, and JSON example |
| `tools/hf_parser.py` | Extended HF config parser with `parse_vision_config()` for multimodal models |
| `tools/hf_parser_test.py` | 50 tests (26 decoder-only + 24 vision) |

Vision Encoder Fields

  • Core: type, hidden_size, patch_size, image_size, num_layers, num_attention_heads
  • Projector: type (mlp/linear/cross_attention/perceiver), num_layers, activation
  • Special tokens: image_token_id, vision_start/end_token_id, video_token_id
  • Dynamic resolution: enabled, min/max_pixels, spatial_merge_size
  • Fusion: early (Qwen2-VL), late (LLaVA), cross_attention (LLaMA-3.2 Vision)
  • Position embedding: learned, rope, mrope (with per-modality sections)
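The field groups above combine into a single `vision_encoder` object. A minimal sketch of what such an object might look like — the exact nesting and all numeric values here are illustrative, not taken from the PR's schema or any shipped model:

```python
# Illustrative vision_encoder object combining the field groups above.
# Values are placeholders; the nesting of "dynamic_resolution" is an assumption.
vision_encoder = {
    "type": "clip_vit",
    "hidden_size": 1024,
    "patch_size": 14,
    "image_size": 336,
    "num_layers": 24,
    "num_attention_heads": 16,
    "projector": {"type": "mlp", "num_layers": 2, "activation": "gelu"},
    "special_tokens": {"image_token_id": 32000},
    "dynamic_resolution": {"enabled": False},
    "fusion_type": "late",
    "position_embedding": "learned",
}

print(sorted(vision_encoder))
```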

Model Coverage

| Model | Encoder | Patch Size | Projector | Fusion | Special Features |
| --- | --- | --- | --- | --- | --- |
| LLaVA 1.5 | CLIP-ViT-L/14 | 14 | 2-layer MLP | late | |
| Qwen2-VL | ViT | 14 | | early | mRoPE, dynamic resolution, video |
| LLaMA-3.2 Vision | CLIP-ViT | 14 | cross-attention | cross_attention | 8 gated cross-attn layers |

Test plan

  • 50 tests pass (pytest tools/hf_parser_test.py -v)
  • LLaVA 1.5: CLIP-ViT encoder, MLP projector, late fusion, image_token_id
  • Qwen2-VL: native ViT, mRoPE, dynamic resolution, temporal patches, 5 special tokens
  • LLaMA-3.2 Vision: cross-attention fusion, 8 cross-attn layers counted
  • Text-only models: no vision_encoder in output, parse_vision_config returns None
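The text-only fallback in the last bullet can be sketched with a minimal stand-in for `parse_vision_config` — this is not the PR's implementation (which maps many more fields), just the contract the test describes:

```python
def parse_vision_config(raw: dict):
    """Return a vision_encoder dict, or None for text-only configs.

    Minimal stand-in illustrating the contract from the test plan;
    the real parser in tools/hf_parser.py maps many more fields.
    """
    vision = raw.get("vision_config")
    if vision is None:
        return None  # text-only model: no vision_encoder in output
    return {
        "hidden_size": vision.get("hidden_size"),
        "patch_size": vision.get("patch_size"),
    }

# Text-only config yields None; a multimodal config yields a populated dict.
print(parse_vision_config({"model_type": "llama"}))
print(parse_vision_config({"vision_config": {"patch_size": 14}}))
```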

Relates to #150
Builds on #193


Adds a vision encoder specification to describe how multimodal models
process image and video inputs. The current spec supports declaring
image modality via capabilities.inputTypes but has no architectural
description of the vision encoder.

New files:
- schema/vision-encoder-schema.json: JSON Schema for vision encoder
  fields (encoder type, patch size, projector, fusion type, special
  tokens, dynamic resolution, mRoPE)
- docs/vision-encoder.md: Spec document with field descriptions and
  model coverage matrix (LLaVA, Qwen2-VL, LLaMA-3.2 Vision, Gemma 2 VL)
- tools/hf_parser.py: Extended HF config parser with vision model
  support (parse_vision_config function)
- tools/hf_parser_test.py: 50 tests (26 decoder + 24 vision) covering
  LLaVA 1.5, Qwen2-VL, LLaMA-3.2 Vision, and text-only models

Relates to modelpack#150

Signed-off-by: pradhyum6144 <pradhyum314@gmail.com>
@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the ModelPack specification by introducing a detailed vision encoder architecture. Previously, the spec could declare image input capabilities but lacked the granular architectural details necessary for inference engines to properly handle visual inputs. The new specification provides structured metadata for vision encoders, projectors, and vision-language fusion mechanisms, supporting diverse multimodal model architectures. This update ensures that ModelPack can accurately represent and configure a broader range of modern multimodal models.

Highlights

  • Vision Encoder Specification: Introduced a comprehensive vision encoder specification to describe how multimodal models process image and video inputs, filling a gap in the existing ModelPack spec.
  • Multimodal Model Coverage: The specification covers key vision-language patterns, including CLIP-ViT (LLaVA), native ViT with mRoPE (Qwen2-VL), and cross-attention fusion (LLaMA-3.2 Vision).
  • New Schema and Documentation: Added a new JSON schema (schema/vision-encoder-schema.json) for vision encoder architecture fields and a detailed documentation file (docs/vision-encoder.md) explaining the specification.
  • HuggingFace Parser Extension: Extended the hf_parser.py tool with a parse_vision_config() function to extract and map vision encoder details from HuggingFace config.json files.
  • Extensive Testing: Included 50 new tests in hf_parser_test.py to ensure accurate parsing of vision encoder configurations for various multimodal models and to confirm text-only models are handled correctly.


@gemini-code-assist (bot) left a comment


Code Review

This pull request introduces a new specification for vision encoder architectures in multimodal models, including a Markdown document, a JSON schema, and a Python parser for HuggingFace configurations. The review feedback suggests adding vision_token_id to the specification, schema, and parser to fully support models like Qwen2-VL. It also recommends refactoring the num_cross_attention_layers field into the projector object for better schema consistency, which would require corresponding updates in the documentation, schema, and parser. Additionally, a test case for vision_token_id is needed.

Comment on lines +101 to +119
- **special_tokens** _object_, OPTIONAL

  Special token IDs used for image and video inputs in the tokenizer.

  - **image_token_id** _integer_, OPTIONAL

    The token ID used as a placeholder for image input in the text sequence.

  - **vision_start_token_id** _integer_, OPTIONAL

    The token ID marking the start of a vision region (used by models like Qwen2-VL).

  - **vision_end_token_id** _integer_, OPTIONAL

    The token ID marking the end of a vision region.

  - **video_token_id** _integer_, OPTIONAL

    The token ID for video frame placeholders.

high

The specification for special_tokens is missing vision_token_id, which is used by models like Qwen2-VL. The PR description mentions that Qwen2-VL uses 5 special tokens, but only 4 are defined here and handled by the parser.

Please add vision_token_id to the special_tokens properties list to fully support the intended models.

Example:

- **vision_token_id** _integer_, OPTIONAL

  The token ID for a generic vision placeholder (used by models like Qwen2-VL).

Comment on lines +83 to +100
"properties": {
"image_token_id": {
"type": "integer",
"description": "Token ID used as a placeholder for image input"
},
"vision_start_token_id": {
"type": "integer",
"description": "Token ID marking the start of a vision region"
},
"vision_end_token_id": {
"type": "integer",
"description": "Token ID marking the end of a vision region"
},
"video_token_id": {
"type": "integer",
"description": "Token ID for video frame placeholder"
}
},

high

The special_tokens properties are missing vision_token_id. Models like Qwen2-VL use this token, and it's mentioned in the PR description as one of the 5 special tokens for that model, but it's not included in the schema.

Please add it to the properties list to ensure the schema is complete:

"vision_token_id": {
  "type": "integer",
  "description": "Token ID for a generic vision placeholder (e.g., used by Qwen2-VL)"
}

Comment on lines +400 to +406
token_fields = {
"image_token_id": "special_tokens.image_token_id",
"image_token_index": "special_tokens.image_token_id",
"vision_start_token_id": "special_tokens.vision_start_token_id",
"vision_end_token_id": "special_tokens.vision_end_token_id",
"video_token_id": "special_tokens.video_token_id",
}

high

The token_fields dictionary is missing an entry for vision_token_id. The PR description mentions that Qwen2-VL uses 5 special tokens, and the test configuration QWEN2_VL_CONFIG includes "vision_token_id": 151654, but it is not being parsed.

To fully support Qwen2-VL as intended, please:

  1. Add vision_token_id to the special_tokens object in the specification (docs/vision-encoder.md and schema/vision-encoder-schema.json).
  2. Add an entry to the token_fields map here to parse it: `"vision_token_id": "special_tokens.vision_token_id",`
  3. Add an assertion for it in tools/hf_parser_test.py in TestQwen2VL.test_special_tokens.
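The mapping-plus-nested-write pattern this suggestion targets can be sketched as follows. `_set_nested` is re-implemented here from its apparent usage elsewhere in the review (the real helper lives in `tools/hf_parser.py`), and the token IDs are the Qwen2-VL values cited in the tests:

```python
def _set_nested(result: dict, dotted_key: str, value) -> None:
    # Walk/create intermediate dicts for an "a.b.c"-style destination key.
    parts = dotted_key.split(".")
    node = result
    for part in parts[:-1]:
        node = node.setdefault(part, {})
    node[parts[-1]] = value

# token_fields including the vision_token_id entry the review asks for.
token_fields = {
    "image_token_id": "special_tokens.image_token_id",
    "image_token_index": "special_tokens.image_token_id",
    "vision_start_token_id": "special_tokens.vision_start_token_id",
    "vision_end_token_id": "special_tokens.vision_end_token_id",
    "vision_token_id": "special_tokens.vision_token_id",
    "video_token_id": "special_tokens.video_token_id",
}

raw = {"vision_token_id": 151654, "image_token_id": 151655}
result = {}
for src_key, dest_key in token_fields.items():
    if src_key in raw:
        _set_nested(result, dest_key, raw[src_key])

print(result["special_tokens"])
```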

Comment on lines +155 to +157
- **num_cross_attention_layers** _integer_, OPTIONAL

  The number of cross-attention layers for vision-language fusion. Only applicable when `fusion_type` is `"cross_attention"`.

medium

For better consistency in the specification, consider moving this num_cross_attention_layers field into the projector object and unifying it with the existing num_layers field (defined on lines 93-95).

Currently, num_layers is specified for MLP projectors, while num_cross_attention_layers is a top-level field for cross-attention fusion. This separation can be confusing since both represent the number of layers for a specific projector type.

A more unified structure would be:

projector:
  type: string
  num_layers: integer # Number of layers for MLP, cross-attention, etc.
  ...

This would make the schema more modular and easier to understand. The description for projector.num_layers would need to be updated to be more generic, e.g., "The number of layers in the projector (for MLP or cross-attention type projectors)."
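Under that suggestion, the unified projector schema fragment might look like this — a sketch expressed as a Python dict, with the property names mirroring the PR's schema but the exact wording illustrative:

```python
# Sketch of a unified projector schema fragment per the review suggestion.
# Enum values come from the PR's field list; descriptions are illustrative.
projector_schema = {
    "type": "object",
    "properties": {
        "type": {
            "type": "string",
            "enum": ["mlp", "linear", "cross_attention", "perceiver"],
        },
        "num_layers": {
            "type": "integer",
            "description": "Number of layers in the projector "
                           "(for MLP or cross-attention type projectors)",
        },
        "activation": {"type": "string"},
    },
}

print(sorted(projector_schema["properties"]))
```

Note there is no separate `num_cross_attention_layers` property: the cross-attention layer count folds into `projector.num_layers`, which is the point of the suggestion.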

Comment on lines +134 to +137
"num_cross_attention_layers": {
"type": "integer",
"description": "Number of cross-attention layers for vision-language fusion (for cross_attention fusion type)"
},

medium

To improve schema consistency, I recommend moving num_cross_attention_layers into the projector object and merging it with the existing num_layers field (defined on lines 69-72). This would make the structure more modular.

The projector.num_layers field is currently scoped to MLP projectors. It could be generalized to also cover cross-attention layers.

Suggested change:

  1. Remove this top-level num_cross_attention_layers property.
  2. Update the description of projector.properties.num_layers to something like: "Number of layers in the projector (e.g., for MLP or cross-attention projectors)."

This would centralize projector-related properties and align with the suggestion for docs/vision-encoder.md.

Comment on lines +417 to +420
if model_type == "mllama":
cross_attn_layers = raw.get("cross_attention_layers")
if cross_attn_layers is not None:
result["num_cross_attention_layers"] = len(cross_attn_layers)

medium

This logic for parsing num_cross_attention_layers is consistent with the current spec. However, if you adopt the suggestion to move this property into the projector object for better modularity (see comment on docs/vision-encoder.md), this code will need to be updated.

The updated code would look like this:

if model_type == "mllama":
    cross_attn_layers = raw.get("cross_attention_layers")
    if cross_attn_layers is not None:
        _set_nested(result, "projector.num_layers", len(cross_attn_layers))

This would align the implementation with the proposed more consistent schema structure.

Comment on lines +459 to +465
def test_special_tokens(self):
    spec = parse_hf_config(QWEN2_VL_CONFIG)
    tokens = spec["vision_encoder"]["special_tokens"]
    assert tokens["image_token_id"] == 151655
    assert tokens["vision_start_token_id"] == 151652
    assert tokens["vision_end_token_id"] == 151653
    assert tokens["video_token_id"] == 151656

medium

This test is missing an assertion for vision_token_id. The QWEN2_VL_CONFIG includes "vision_token_id": 151654, but it's not being parsed or tested.

Once the parser is updated to handle vision_token_id (as suggested in my comment on tools/hf_parser.py), please add an assertion here to ensure it's parsed correctly:

assert tokens["vision_token_id"] == 151654

Fixes MD040 markdown lint error (fenced-code-language).

Signed-off-by: pradhyum6144 <pradhyum314@gmail.com>
The example_test.go validator has no registered schema for the
vision-encoder mediatype, causing CI to fail. Remove the mediatype
tag until the validator is extended to support it.

Signed-off-by: pradhyum6144 <pradhyum314@gmail.com>
schema.go uses //go:embed *.json which embeds ALL json files in
schema/. The validator then requires each embedded file to have an
entry in specURLs, which vision-encoder-schema.json does not have.

Moved to docs/schemas/ to avoid breaking the Go embed/validation
pipeline until the vision encoder is integrated into the validator.

Signed-off-by: pradhyum6144 <pradhyum314@gmail.com>
- Add missing vision_token_id to spec, schema, and parser (Qwen2-VL
  uses 5 special tokens, not 4)
- Move num_cross_attention_layers into projector.num_layers for
  consistent schema structure across MLP and cross-attention projectors
- Add vision_token_id assertion in Qwen2-VL test
- Update projector.num_layers description to cover both MLP and
  cross-attention types

Signed-off-by: pradhyum6144 <pradhyum314@gmail.com>