Protobuf 7.34 fix for model size > 2 GiB #1058

Open

hthadicherla wants to merge 1 commit into main from hthadicherla/onnx-protobuf-fix

Conversation

@hthadicherla
Contributor

@hthadicherla hthadicherla commented Mar 17, 2026

What does this PR do?

Type of change: Bug fix

With newer protobuf versions, serializing a model larger than 2 GB to a string fails with an error:

Traceback (most recent call last):
  File "quantize.py", line 643, in <module>
    main(args)
  File "quantize.py", line 428, in main
    quantized_onnx_model = quantize_int4(
        args.onnx_path,
        ...
        use_column_major=args.use_column_major,
    )
  File ".venv\Lib\site-packages\modelopt\onnx\quantization\int4.py", line 1529, in quantize
    onnx_model = _quantize_awq_lite(
        onnx_model,
        ...
    )
  File ".venv\Lib\site-packages\modelopt\onnx\quantization\int4.py", line 1088, in _quantize_awq_lite
    save_onnx(augmented_model, augmented_onnx_path, use_external_data_format)
  File ".venv\Lib\site-packages\modelopt\onnx\utils.py", line 646, in save_onnx
    model_proto = model.SerializeToString()
google.protobuf.message.EncodeError: Failed to serialize proto

The suggested fix is to force saving with external data when this error occurs.

Summary by CodeRabbit

  • Bug Fixes
    • Enhanced error handling when saving large models with improved error messaging and automatic external data storage fallback for serialization issues.

… it fails to serialize the model.

Signed-off-by: Hrishith Thadicherla <hthadicherla@nvidia.com>
@hthadicherla hthadicherla requested a review from a team as a code owner March 17, 2026 12:22
@hthadicherla hthadicherla requested a review from gcunhase March 17, 2026 12:22
@coderabbitai
Contributor

coderabbitai bot commented Mar 17, 2026

📝 Walkthrough


The save_onnx function in the ONNX utilities module now catches an additional google.protobuf.message.EncodeError exception alongside the existing ValueError. When either exception occurs during model serialization, external data storage is unconditionally enabled, with the error details included in the warning message.

Changes

Exception Handling Enhancement — modelopt/onnx/utils.py
Expanded exception handling to catch google.protobuf.message.EncodeError in addition to ValueError when saving ONNX models. The warning message now includes error details, and external data storage is enabled unconditionally for large-model serialization.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
  • Description Check — Passed: Check skipped - CodeRabbit's high-level summary is enabled.
  • Title Check — Passed: The title clearly identifies the specific bug being fixed (Protobuf 7.34 compatibility) and the root-cause condition (model size > 2GB), directly matching the PR's main objective to handle large model serialization.
  • Docstring Coverage — Passed: Docstring coverage is 100.00%, which is sufficient. The required threshold is 80.00%.
  • Security Anti-Patterns — Passed: Pull request only modifies exception handling in modelopt/onnx/utils.py without introducing any of the six CRITICAL security anti-patterns.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
  • 📝 Generate docstrings (stacked PR)
  • 📝 Generate docstrings (commit on current branch)
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch hthadicherla/onnx-protobuf-fix
📝 Coding Plan
  • Generate coding plan for human review comments

Comment @coderabbitai help to get the list of available commands and usage tips.

Tip

You can customize the tone of the review comments and chat replies.

Configure the tone_instructions setting to customize the tone of the review comments and chat replies. For example, you can set the tone to Act like a strict teacher, Act like a pirate and more.
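For illustration, a minimal `.coderabbit.yaml` snippet (the `tone_instructions` key is the one named above; the exact value is just an example):

```yaml
# .coderabbit.yaml — illustrative tone configuration
tone_instructions: "Act like a strict teacher"
```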

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@modelopt/onnx/utils.py`:
- Around line 654-658: The except block that currently catches ValueError and
google.protobuf.message.EncodeError and unconditionally sets
save_as_external_data and logs "Model exceeds 2GB limit" should instead
distinguish true size-overflow from schema/initialization failures: when
catching EncodeError (and ValueError), check the model's serialized size via
model.ByteSize() against the 2GB threshold (or the existing size_threshold
constant) and only set save_as_external_data and log the size message if
ByteSize() > size_threshold; otherwise re-raise the exception so
schema/initialization errors surface to the caller; update the logger call to
remain in the size branch and keep the original exception propagated in the
non-size branch.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 7da9dc33-34d9-4cc9-aea8-bd0bd2ce85bb

📥 Commits

Reviewing files that changed from the base of the PR and between 00fa5bd and b2c31e4.

📒 Files selected for processing (1)
  • modelopt/onnx/utils.py

Comment on lines +654 to +658
except (ValueError, google.protobuf.message.EncodeError) as e:
    logger.warning(
        "Model exceeds 2GB limit, switching to external data storage. Error message: [%s]", e
    )
    save_as_external_data = True
Contributor

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🌐 Web query:

In Python protobuf, under what conditions can google.protobuf.message.EncodeError be raised by SerializeToString(), and are all such failures size-related?

💡 Result:

SerializeToString() raises google.protobuf.message.EncodeError when the message is not “initialized”, i.e. any required (proto2) field is missing, including missing required fields in nested submessages; Python’s implementation explicitly checks IsInitialized() and raises EncodeError if it fails. (beam.apache.org)

These EncodeError failures are not size-related—they’re schema/initialization-related (proto2 required). (beam.apache.org)

Size-related serialization problems (e.g., exceeding protobuf’s ~2 GiB single-message limit) typically surface as other exception types (commonly ValueError/RuntimeError depending on wrapper/library), not EncodeError. (github.com)



🏁 Scripts executed (against NVIDIA/Model-Optimizer):

# Locate and examine the file
fd -t f "utils.py" "modelopt/onnx"

# Read the context around lines 654-658
cat -n modelopt/onnx/utils.py | sed -n '640,670p'

# See the entire save_onnx function to understand the full context
ast-grep --pattern $'def save_onnx($$$) {
  $$$
}'

# Check how save_onnx is called and what state the model is in
rg -B5 "save_onnx\(" modelopt/onnx/ -A2 | head -40

# Look for any validation of the model before save_onnx is called
rg "model\.IsInitialized\|CheckInitialized" modelopt/onnx/

# Check if there are any proto2 vs proto3 requirements or model validation
rg "required.*field\|proto.*version\|proto2\|proto3" modelopt/onnx/utils.py | head -20

# Get the full save_onnx function
cat -n modelopt/onnx/utils.py | sed -n '643,685p'

# Check if there's any model validation before save_onnx is called
rg -B10 "save_onnx\(" modelopt/onnx/utils.py | grep -A10 "def.*save_onnx\|ByteSize\|IsInitialized"

# Check whether exception messages distinguish size from schema errors
rg "EncodeError\|ValueError" modelopt/onnx/utils.py -B3 -A3

# Check ONNX protobuf usage patterns to see if models are validated before save
rg "CheckInitialized\|IsInitialized" modelopt/ --include="*.py"

# Check what exceptions the nested onnx.save_model call might raise
rg "onnx\.save_model\|onnx\.save" modelopt/onnx/utils.py -A5 -B5

# Check the protobuf version and understand the actual error behavior
rg "protobuf" modelopt/onnx/ -A1 -B1 | head -30

# Check for test cases that show what exceptions can be raised
fd -t f "test.*onnx\|.*test.*onnx" modelopt/ | head -10

# Look at usage of save_onnx to understand context better
rg "save_onnx" modelopt/onnx/utils.py -B2 -A2


Distinguish size-limit errors from schema/initialization failures in exception handling.

At lines 654–658, catching both ValueError and EncodeError and treating all cases as size overflow is incorrect. Per protobuf semantics, EncodeError is raised when a message has uninitialized required fields (proto2 schema violations), not size issues. The current logging message "Model exceeds 2GB limit" misattributes these failures and masks actual data model problems. Only force external-data fallback when the actual message size exceeds the 2GB threshold.

Suggested fix: Before enabling external-data mode in the exception handler, verify that the serialization failure is actually size-related (e.g., by checking model.ByteSize() > size_threshold). If not, re-raise the exception to surface schema validation issues to the caller.


Contributor
@hthadicherla is this comment valid?

@codecov

codecov bot commented Mar 17, 2026

Codecov Report

❌ Patch coverage is 25.00000% with 3 lines in your changes missing coverage. Please review.
✅ Project coverage is 70.32%. Comparing base (00fa5bd) to head (b2c31e4).

Files with missing lines Patch % Lines
modelopt/onnx/utils.py 25.00% 3 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1058      +/-   ##
==========================================
+ Coverage   70.30%   70.32%   +0.02%     
==========================================
  Files         227      227              
  Lines       25847    25845       -2     
==========================================
+ Hits        18172    18176       +4     
+ Misses       7675     7669       -6     

☔ View full report in Codecov by Sentry.