Protobuf 7.34 fix for model size > 2 GiB #1058

Open

hthadicherla wants to merge 1 commit into main from hthadicherla/onnx-protobuf-fix

Conversation

@hthadicherla
Contributor

@hthadicherla hthadicherla commented Mar 17, 2026

What does this PR do?

Type of change: Bug fix

With newer protobuf versions, serializing a model larger than 2 GB to a string fails with an error:

Traceback (most recent call last):
  File "quantize.py", line 643, in <module>
    main(args)
  File "quantize.py", line 428, in main
    quantized_onnx_model = quantize_int4(
        args.onnx_path,
        ...
        use_column_major=args.use_column_major,
    )
  File ".venv\Lib\site-packages\modelopt\onnx\quantization\int4.py", line 1529, in quantize
    onnx_model = _quantize_awq_lite(
        onnx_model,
        ...
    )
  File ".venv\Lib\site-packages\modelopt\onnx\quantization\int4.py", line 1088, in _quantize_awq_lite
    save_onnx(augmented_model, augmented_onnx_path, use_external_data_format)
  File ".venv\Lib\site-packages\modelopt\onnx\utils.py", line 646, in save_onnx
    model_proto = model.SerializeToString()
google.protobuf.message.EncodeError: Failed to serialize proto

The suggested fix is to force saving with external data when this error occurs.

Summary by CodeRabbit

  • Bug Fixes
    • Enhanced error handling when saving large models with improved error messaging and automatic external data storage fallback for serialization issues.

… it fails to serialize the model.

Signed-off-by: Hrishith Thadicherla <hthadicherla@nvidia.com>
@hthadicherla hthadicherla requested a review from a team as a code owner March 17, 2026 12:22
@hthadicherla hthadicherla requested a review from gcunhase March 17, 2026 12:22
@coderabbitai
Contributor

coderabbitai bot commented Mar 17, 2026

📝 Walkthrough


The save_onnx function in the ONNX utilities module now catches an additional google.protobuf.message.EncodeError exception alongside the existing ValueError. When either exception occurs during model serialization, external data storage is unconditionally enabled, with the error details included in the warning message.

Changes

Exception Handling Enhancement — modelopt/onnx/utils.py
Expanded exception handling to catch google.protobuf.message.EncodeError in addition to ValueError when saving ONNX models. The warning message now includes error details, and external data storage is enabled unconditionally for large-model serialization.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
  • Description Check — Passed: Check skipped - CodeRabbit's high-level summary is enabled.
  • Title Check — Passed: The title clearly identifies the specific bug being fixed (Protobuf 7.34 compatibility) and the root-cause condition (model size > 2GB), directly matching the PR's main objective to handle large model serialization.
  • Docstring Coverage — Passed: Docstring coverage is 100.00%, which is sufficient. The required threshold is 80.00%.
  • Security Anti-Patterns — Passed: Pull request only modifies exception handling in modelopt/onnx/utils.py without introducing any of the six CRITICAL security anti-patterns.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
  • 📝 Generate docstrings (stacked PR)
  • 📝 Generate docstrings (commit on current branch)
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch hthadicherla/onnx-protobuf-fix
📝 Coding Plan
  • Generate coding plan for human review comments

Comment @coderabbitai help to get the list of available commands and usage tips.

Tip

You can customize the tone of the review comments and chat replies.

Configure the tone_instructions setting to customize the tone of the review comments and chat replies. For example, you can set the tone to Act like a strict teacher, Act like a pirate and more.
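For illustration, a minimal `.coderabbit.yaml` snippet (the `tone_instructions` key is the one named above; the exact value is just an example):

```yaml
# .coderabbit.yaml — illustrative tone configuration
tone_instructions: "Act like a strict teacher"
```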

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@modelopt/onnx/utils.py`:
- Around line 654-658: The except block that currently catches ValueError and
google.protobuf.message.EncodeError and unconditionally sets
save_as_external_data and logs "Model exceeds 2GB limit" should instead
distinguish true size-overflow from schema/initialization failures: when
catching EncodeError (and ValueError), check the model's serialized size via
model.ByteSize() against the 2GB threshold (or the existing size_threshold
constant) and only set save_as_external_data and log the size message if
ByteSize() > size_threshold; otherwise re-raise the exception so
schema/initialization errors surface to the caller; update the logger call to
remain in the size branch and keep the original exception propagated in the
non-size branch.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 7da9dc33-34d9-4cc9-aea8-bd0bd2ce85bb

📥 Commits

Reviewing files that changed from the base of the PR and between 00fa5bd and b2c31e4.

📒 Files selected for processing (1)
  • modelopt/onnx/utils.py

Comment on lines +654 to +658
except (ValueError, google.protobuf.message.EncodeError) as e:
    logger.warning(
        "Model exceeds 2GB limit, switching to external data storage. Error message: [%s]", e
    )
    save_as_external_data = True
Contributor

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🌐 Web query:

In Python protobuf, under what conditions can google.protobuf.message.EncodeError be raised by SerializeToString(), and are all such failures size-related?

💡 Result:

SerializeToString() raises google.protobuf.message.EncodeError when the message is not “initialized”, i.e. any required (proto2) field is missing, including missing required fields in nested submessages; Python’s implementation explicitly checks IsInitialized() and raises EncodeError if it fails. (beam.apache.org)

These EncodeError failures are not size-related—they’re schema/initialization-related (proto2 required). (beam.apache.org)

Size-related serialization problems (e.g., exceeding protobuf’s ~2 GiB single-message limit) typically surface as other exception types (commonly ValueError/RuntimeError depending on wrapper/library), not EncodeError. (github.com)



🏁 Scripts executed (against NVIDIA/Model-Optimizer):

# Locate and examine the file
fd -t f "utils.py" "modelopt/onnx"

# Read the context around lines 654-658
cat -n modelopt/onnx/utils.py | sed -n '640,670p'

# See the entire save_onnx function to understand the full context
ast-grep --pattern $'def save_onnx($$$) {
  $$$
}'

# Check how save_onnx is called and what state the model is in
rg -B5 "save_onnx\(" modelopt/onnx/ -A2 | head -40

# Look for any validation of the model before save_onnx is called
rg "model\.IsInitialized\|CheckInitialized" modelopt/onnx/

# Check if there are any proto2 vs proto3 requirements or model validation
rg "required.*field\|proto.*version\|proto2\|proto3" modelopt/onnx/utils.py | head -20

# Get the full save_onnx function
cat -n modelopt/onnx/utils.py | sed -n '643,685p'

# Check if there's any model validation before save_onnx is called
rg -B10 "save_onnx\(" modelopt/onnx/utils.py | grep -A10 "def.*save_onnx\|ByteSize\|IsInitialized"

# Check whether exception messages distinguish size from schema errors
rg "EncodeError\|ValueError" modelopt/onnx/utils.py -B3 -A3

# Check ONNX protobuf usage patterns to see if models are validated before save
rg "CheckInitialized\|IsInitialized" modelopt/ --include="*.py"

# Check what exceptions the nested onnx.save_model call might raise
rg "onnx\.save_model\|onnx\.save" modelopt/onnx/utils.py -A5 -B5

# Check the protobuf version and understand the actual error behavior
rg "protobuf" modelopt/onnx/ -A1 -B1 | head -30

# Check for test cases that show what exceptions can be raised
fd -t f "test.*onnx\|.*test.*onnx" modelopt/ | head -10

# Look at usage of save_onnx to understand context better
rg "save_onnx" modelopt/onnx/utils.py -B2 -A2


Distinguish size-limit errors from schema/initialization failures in exception handling.

At lines 654–658, catching both ValueError and EncodeError and treating all cases as size overflow is incorrect. Per protobuf semantics, EncodeError is raised when a message has uninitialized required fields (proto2 schema violations), not size issues. The current logging message "Model exceeds 2GB limit" misattributes these failures and masks actual data model problems. Only force external-data fallback when the actual message size exceeds the 2GB threshold.

Suggested fix: Before enabling external-data mode in the exception handler, verify that the serialization failure is actually size-related (e.g., by checking model.ByteSize() > size_threshold). If not, re-raise the exception to surface schema validation issues to the caller.


Contributor
@hthadicherla is this comment valid?

@codecov

codecov bot commented Mar 17, 2026

Codecov Report

❌ Patch coverage is 25.00000% with 3 lines in your changes missing coverage. Please review.
✅ Project coverage is 70.32%. Comparing base (00fa5bd) to head (b2c31e4).

Files with missing lines Patch % Lines
modelopt/onnx/utils.py 25.00% 3 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1058      +/-   ##
==========================================
+ Coverage   70.30%   70.32%   +0.02%     
==========================================
  Files         227      227              
  Lines       25847    25845       -2     
==========================================
+ Hits        18172    18176       +4     
+ Misses       7675     7669       -6     

☔ View full report in Codecov by Sentry.