
Conversation

@Edwardf0t1
Contributor

@Edwardf0t1 Edwardf0t1 commented Jan 9, 2026

What does this PR do?

Type of change: New feature

Overview:

The primary goal of this PR is to allow the model optimizer to use image-text pair data during the calibration phase of quantization, which is likely to help improve the accuracy of quantized VLMs like Nemotron VL, particularly on visual understanding tasks, compared to text-only calibration data.

  • New Feature: Adds support for VLM calibration using image-text data.
  • Dataset Integration: Introduces support for sampling from the Nemotron-VLM-Dataset-v2.
  • Refactoring: Creates a separate utility for VLM datasets to keep the main Hugging Face PTQ script (hf_ptq.py) clean.
  • Simplifies the logic for handling multimodal inputs.
  • Addresses specific issues encountered when calibrating the Nemotron-Nano-VL-12B-V2 model with image data.
  • Documentation: Updates the README to include instructions and examples for VLM calibration.

This PR complements #347 and we will consolidate llm_ptq and vlm_ptq examples in follow-up PRs.

Usage

python3 hf_ptq.py \
  --pyt_ckpt_path /home/scratch.omniml_data_2/models/Nemotron-Nano-VL-12B-V2 \
  --qformat nvfp4 \
  --export_path /home/omniml_data_3/zhiyuc/checkpoints/Nemotron-Nano-VL-12B-V2-NVFP4-doccalib \
  --trust_remote_code \
  --kv_cache_qformat none \
  --calib_with_images \
  --vlm_dataset nemotron_vlm_dataset_v2 \
  --vlm_subsets sparsetables,plotqa_cot \
  --calib_size 512

Testing

Before your PR is "Ready for review"

  • Make sure you read and follow Contributor guidelines and your commits are signed.
  • Is this change backward compatible?: Yes
  • Did you write any new necessary tests?: Yes/No
  • Did you add or update any necessary documentation?: Yes
  • Did you update Changelog?: Not yet

Additional Information

Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
@copy-pr-bot

copy-pr-bot bot commented Jan 9, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@codecov

codecov bot commented Jan 9, 2026

Codecov Report

❌ Patch coverage is 9.84615% with 293 lines in your changes missing coverage. Please review.
✅ Project coverage is 73.13%. Comparing base (307fe71) to head (161fd56).
⚠️ Report is 33 commits behind head on main.

Files with missing lines                           | Patch % | Lines
modelopt/torch/utils/vlm_dataset_utils.py          |   8.37% | 175 Missing ⚠️
modelopt/torch/utils/nemotron_vlm_dataset_utils.py |  11.94% | 118 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #755      +/-   ##
==========================================
- Coverage   74.66%   73.13%   -1.53%     
==========================================
  Files         192      193       +1     
  Lines       18975    19555     +580     
==========================================
+ Hits        14167    14302     +135     
- Misses       4808     5253     +445     


@Edwardf0t1 Edwardf0t1 self-assigned this Jan 14, 2026
@Edwardf0t1 Edwardf0t1 marked this pull request as ready for review January 14, 2026 01:16
@Edwardf0t1 Edwardf0t1 requested review from a team as code owners January 14, 2026 01:16
@shengliangxu
Contributor

shengliangxu commented Jan 14, 2026

So, we only support image quantization for just nemotron-vl? If yes, why?

# limitations under the License.

"""Utility functions for getting samples and forward loop function for different vlm datasets."""
"""Utility functions for getting samples and dataloader for different VLM calibration datasets.
Collaborator

@ajrasane could you review this change?

@cjluo-nv
Collaborator

@Edwardf0t1 do you have experiments evaluating the accuracy impact of using the new dataset?

@Edwardf0t1
Contributor Author

So, we only support image quantization for just nemotron-vl? If yes, why?

At this time, only Nemotron VL has been tested. We can extend the logic to support other VLMs later. Note that different VLMs may have different forward functions—e.g., the way the vision encoder interacts with the language decoder can vary across models.

Do you have a preferred VL model you’d like us to support next? For instance, Qwen3-VL?
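For illustration, a conceptual sketch (not code from this PR) of why a single calibration forward loop does not fit every VLM; the argument names below (pixel_values, image_grid_thw) are examples and not necessarily what Nemotron VL uses:

import torch


def vlm_style_a_forward(model: torch.nn.Module, batch: dict[str, torch.Tensor]) -> None:
    # Some VLMs take the image tensor alongside input_ids in a single forward call.
    model(input_ids=batch["input_ids"], pixel_values=batch["pixel_values"])


def vlm_style_b_forward(model: torch.nn.Module, batch: dict[str, torch.Tensor]) -> None:
    # Others (e.g. the Qwen VL family) additionally expect image grid metadata,
    # so the way the vision encoder feeds the language decoder differs per model.
    model(
        input_ids=batch["input_ids"],
        pixel_values=batch["pixel_values"],
        image_grid_thw=batch["image_grid_thw"],
    )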

@Edwardf0t1
Contributor Author

@Edwardf0t1 do you have experiments evaluating the accuracy impact of using the new dataset?

Tested on two benchmarks, DocVQA and InfoVQA, for Nemotron Nano VL v2 with the vLLM backend:

  • BF16 baseline: 94.2184 (DocVQA), 79.1404 (InfoVQA)
  • NVFP4, text-only calibration: 93.9472 (DocVQA), 77.7221 (InfoVQA)
  • NVFP4, image-text calibration: 94.0854 (DocVQA), 77.9598 (InfoVQA)

Image-text calibration is only marginally better in these cases, but the calibration flow in this PR should be ready. Follow-up experiments could include:

  1. Choosing different subsets of Nemotron-VLM-Dataset-v2, or another image-text dataset, for calibration.
  2. Checking more evaluation metrics.
  3. Running benchmarks on other VLMs such as Nemotron Parse and Qwen3-VL.

--qformat nvfp4 \
--export_path <quantized_ckpt_path> \
--trust_remote_code \
--calib_with_images \
Collaborator

qq: Can the user choose which VLM dataset to use, or do we just provide one option?

Contributor Author

When --calib_with_images is used, the calibration dataset is hardcoded to nemotron_vlm_dataset_v2. It is a very large dataset, and we can choose a few subsets from it.

Collaborator

Could you document the dataset name in the above description?

# prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
# inputs = processor(text=[prompt], images=[pil_image], ...)

def _collate_fn(examples: list[dict[str, Any]]) -> dict[str, torch.Tensor] | dict[str, Any]:
Collaborator

Why do we need to introduce these when the original one does not?

Contributor Author

Previously we didn't use image-text data for calibration, and the standard DataLoader collation doesn't work for VLMs, for a few reasons (see the sketch after this list):

  • The dataset has inconsistent image formats.
  • We need to convert the conversational format into the model's input format.
  • The processor must process images and text together so they align properly.
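For illustration, a minimal sketch of such a collate step (the helper and field names are illustrative, not the exact code in this PR):

from typing import Any

import torch


def _collate_image_text(examples: list[dict[str, Any]], processor, device) -> dict[str, torch.Tensor]:
    texts, images = [], []
    for ex in examples:
        # Convert the conversational "messages" format into a prompt string the model expects.
        prompt = processor.apply_chat_template(
            ex["messages"], tokenize=False, add_generation_prompt=True
        )
        texts.append(prompt)
        images.append(ex["image"])  # assumes the image was already decoded (e.g. to PIL)
    # Processing text and images in one call keeps image placeholder tokens and
    # pixel values aligned, which per-field default collation cannot guarantee.
    batch = processor(text=texts, images=images, padding=True, return_tensors="pt")
    return {k: v.to(device) for k, v in batch.items()}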

Contributor

Should we create a class for this collate function?

class VLMCollator:
    def __init__(self, processor, dataset_name, require_image, max_length, device):
        self.processor = processor
        self.repo_id = (
            SUPPORTED_VLM_DATASET_CONFIG[dataset_name]["config"]["path"]
            if dataset_name == "nemotron_vlm_dataset_v2"
            else None
        )
        self.image_root = getattr(processor, "_modelopt_vlm_image_root", None)
        self.require_image = require_image
        self.max_length = max_length
        self.device = device

    def __call__(self, examples):
        # ... the collate logic

This would make it more readable and easier to test.

Contributor

@jingyu-ml jingyu-ml left a comment

LGTM. I only reviewed the dataset processing part, which behaves as expected: it loads the dataset on demand rather than downloading the entire dataset.

use_media_shards=True,
max_shards=1,
)
elif model_type == "mllama":
Collaborator

@cjluo-nv cjluo-nv Jan 22, 2026

Can this new dataset be used for mllama too? If yes, maybe we can remove this branch.

device,
trust_remote_code=args.trust_remote_code,
)
elif is_nemotron_vl_model and args.calib_with_images:
Collaborator

Does calib_with_images only work with is_nemotron_vl_model, i.e. can it not be used for other VLMs?

)
calibrate_loop = None
if use_calibration:
base_forward_loop = create_forward_loop(dataloader=calib_dataloader)
Collaborator

nit: you could combine lines 514 and 520.

Comment on lines +65 to +72
if not (isinstance(part, dict) and part.get("type") == "image"):
    continue
if "image" in part:
    return part["image"]
# fallback
for key in ("images", "path", "image_url", "url", "value", "data"):
    if key in part:
        return part[key]
Contributor

Can be simplified to:

            if isinstance(part, dict) and part.get("type") == "image":
                for key in ("image", "images", "path", "image_url", "url", "value", "data"):
                    if key in part:
                        return part[key]

for shard in shard_list:
    if yielded_total >= self.num_samples or not needed:
        break
    local_tar = hf_hub_download(
Contributor

We are downloading the shards twice, once here and once on line 145. Is there a way we can cache the results downloaded on line 145?
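One option, sketched below with an assumed wrapper name (hf_hub_download already reuses the local Hugging Face cache on disk, so a second call for the same shard should not re-download it, but memoizing also skips the repeated metadata lookups):

from functools import lru_cache

from huggingface_hub import hf_hub_download


@lru_cache(maxsize=None)
def _cached_shard_download(repo_id: str, filename: str, repo_type: str = "dataset") -> str:
    # Resolve and download each shard at most once per process; repeated calls
    # return the same local path without touching the Hub again.
    return hf_hub_download(repo_id=repo_id, filename=filename, repo_type=repo_type)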

Comment on lines +401 to +405
if img is None:
    img = ex.get("images", None)
if img is None and messages is not None:
    img = _extract_first_image_from_messages(messages)
img = _maybe_load_image(img, repo_id=repo_id, image_root=image_root)
Contributor

This logic is also used on line 291. Can we create a util:

def _get_image_from_example(ex: dict) -> Any:
    """Extract image from an example, checking common field names."""
    img = ex.get("image") or ex.get("images")
    if img is None:
        img = _extract_first_image_from_messages(ex.get("messages"))
    return img

This will also simplify the lambda


# Match the model's preferred vision dtype (usually bf16).
vision_dtype = None
with contextlib.suppress(Exception):
Contributor

Could you specify the Exceptions to be suppressed? Same for the other calls.
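For example, something along these lines (a sketch only; vision_model is an assumed attribute name, not necessarily the one this code probes):

import contextlib

import torch


def _detect_vision_dtype(model: torch.nn.Module) -> torch.dtype | None:
    vision_dtype = None
    # AttributeError: the model exposes no vision tower under this attribute.
    # StopIteration: the vision tower has no parameters to inspect.
    with contextlib.suppress(AttributeError, StopIteration):
        vision_dtype = next(model.vision_model.parameters()).dtype
    return vision_dtype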

Comment on lines 38 to 47
SUPPORTED_VLM_DATASET_CONFIG: dict[str, dict[str, Any]] = {
    "scienceqa": {"config": {"path": "derek-thomas/ScienceQA", "split": "train"}},
    # Large multi-subset dataset (use streaming to avoid downloading the entire dataset)
    "nemotron_vlm_dataset_v2": {
        "config": {"path": "nvidia/Nemotron-VLM-Dataset-v2", "split": "train", "streaming": True},
        # Provide a sane default that (a) includes in-repo media shards and (b) is document-centric.
        # Subsets like docvqa_cot/chartqa_cot are JSONL-only in the dataset repo and require --vlm_image_root.
        "default_subsets": ["sparsetables", "plotqa_cot", "wiki_en"],
    },
}
Contributor

Should we create a dataclass for this? Something like:

from dataclasses import dataclass, field

@dataclass
class VLMDatasetConfig:
    path: str
    split: str = "train"
    streaming: bool = False
    default_subsets: list[str] = field(default_factory=list)

SUPPORTED_VLM_DATASETS = {
    "scienceqa": VLMDatasetConfig(path="derek-thomas/ScienceQA"),
    "nemotron_vlm_dataset_v2": VLMDatasetConfig(
        path="nvidia/Nemotron-VLM-Dataset-v2",
        streaming=True,
        default_subsets=["sparsetables", "plotqa_cot", "wiki_en"],
    ),
}

cfg = SUPPORTED_VLM_DATASET_CONFIG[dataset_name]["config"].copy()
streaming = bool(cfg.pop("streaming", False))

if dataset_name == "nemotron_vlm_dataset_v2":
Contributor

Should we move this logic to a separate function like _get_nemotron_dataset()?
