[5525939] Allow user to select target opset in MOQ; upgrade onnxruntime #809
```diff
@@ -286,6 +286,15 @@ def get_parser() -> argparse.ArgumentParser:
             "The currently supported precisions are {fp16, int8, fp8}."
         ),
     )
+    argparser.add_argument(
+        "--opset",
+        type=int,
+        help=(
+            "Target ONNX opset version for the quantized model. If not specified, uses default minimum opset "
+            "(19 for fp16 scales support, 21 for int4, 23 for nvfp4). The opset may be automatically increased "
+            "if certain operations require a higher version."
+        ),
+    )
```
Comment on lines +289 to +297 (Contributor):

Clarify the BF16 minimum opset in the `--opset` help text.

✏️ Suggested doc tweak:

```diff
-            "Target ONNX opset version for the quantized model. If not specified, uses default minimum opset "
-            "(19 for fp16 scales support, 21 for int4, 23 for nvfp4). The opset may be automatically increased "
+            "Target ONNX opset version for the quantized model. If not specified, uses default minimum opset "
+            "(19 for fp16 scales support, 22 for bf16, 21 for int4, 23 for nvfp4). The opset may be automatically increased "
             "if certain operations require a higher version."
```

Also applies to: 364-364
```diff
 
     return argparser
```
```diff
@@ -352,6 +361,7 @@ def main():
         simplify=args.simplify,
         calibrate_per_node=args.calibrate_per_node,
         direct_io_types=args.direct_io_types,
+        opset=args.opset,
     )
```
```diff
@@ -69,6 +69,8 @@
 )
 from modelopt.onnx.trt_utils import interpret_trt_plugins_precision_flag, load_onnx_model
 from modelopt.onnx.utils import (
+    BASE_MIN_OPSET,
+    QDQ_PRECISION_MIN_OPSET,
     duplicate_shared_constants,
     get_opset_version,
     name_onnx_nodes,
```
```diff
@@ -88,6 +90,7 @@ def _preprocess_onnx(
     override_shapes: str,
     simplify: bool = False,
     quantize_mode: str = "int8",
+    opset: int | None = None,
 ) -> tuple[str, onnx.ModelProto, list[str], bool, bool, bool, dict, dict]:
     logger.info(f"Preprocessing the model {onnx_path}")
     intermediate_generated_files = []
```
```diff
@@ -118,16 +121,43 @@
             " '--trt_plugins' flag (requires TRT 10+)."
         )
 
-    # Per-Channel support with QDQ format requires onnx opset version 13 or above
-    opset_version = get_opset_version(onnx_model)
+    # Opset 19 is the minimum required for fp16 scales in Q/DQ nodes
+    # Higher opsets required for specific quantization modes (int4: 21, nvfp4: 23)
+    original_opset_version = get_opset_version(onnx_model)
 
-    required_opset_version = 13
-    if opset_version < required_opset_version and opset_version != 1:
-        opset_version = required_opset_version
-        onnx_model = onnx.version_converter.convert_version(onnx_model, opset_version)
-        onnx_path = os.path.join(output_dir, f"{model_name}_opset{opset_version}.onnx")
+    # Determine minimum required opset based on quantization mode
+    mode_min_opset = QDQ_PRECISION_MIN_OPSET.get(quantize_mode, BASE_MIN_OPSET)
```
Comment on lines +124 to +130 (Contributor):

Normalize `quantize_mode` before the lookup. Elsewhere (e.g., line 523) the code uses substring matching (`"int4" in quantize_mode`), so int4 variant modes that contain but do not equal `"int4"` will miss the exact-key lookup here and silently fall back to `BASE_MIN_OPSET`.

🔧 Suggested fix:

```diff
-    mode_min_opset = QDQ_PRECISION_MIN_OPSET.get(quantize_mode, BASE_MIN_OPSET)
+    if "int4" in quantize_mode:
+        mode_min_opset = QDQ_PRECISION_MIN_OPSET["int4"]
+    elif quantize_mode in QDQ_PRECISION_MIN_OPSET:
+        mode_min_opset = QDQ_PRECISION_MIN_OPSET[quantize_mode]
+    else:
+        mode_min_opset = BASE_MIN_OPSET
```
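To make the fallback concrete, a minimal self-contained sketch (the dictionary values are copied from this PR; `int4_awq` is only an illustrative variant mode name, not taken from this diff):

```python
# Minimal illustration of the lookup gap described in the review comment above.
BASE_MIN_OPSET = 19
QDQ_PRECISION_MIN_OPSET = {
    "int8": 19,
    "float8_e4m3fn": 19,
    "int4": 21,
    "uint4": 21,
    "float4_e2m1fn": 23,
}

print(QDQ_PRECISION_MIN_OPSET.get("int4", BASE_MIN_OPSET))      # 21: exact key is found
print(QDQ_PRECISION_MIN_OPSET.get("int4_awq", BASE_MIN_OPSET))  # 19: variant silently falls back
```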
```diff
+    # Determine target opset version
+    if opset is not None:
+        target_opset = opset
+        # Warn if user-specified opset is below mode minimum (but still respect it)
+        if opset < mode_min_opset:
+            logger.warning(
+                f"Opset {opset} is below the minimum opset {mode_min_opset} required for "
+                f"{quantize_mode} quantization. Upgrading to opset {mode_min_opset}."
+            )
+            target_opset = mode_min_opset
+        # Warn if user-specified opset is lower than original
+        if opset < original_opset_version:
+            logger.warning(
+                f"Specified opset {opset} is lower than the original model's opset {original_opset_version}. "
+                f"Using original model's opset {original_opset_version}."
+            )
+            target_opset = max(target_opset, original_opset_version)
+    else:
+        # Use model's opset if it's >= mode_min_opset, otherwise upgrade to mode_min_opset
+        target_opset = (
+            max(original_opset_version, mode_min_opset)
+            if original_opset_version != 1
+            else mode_min_opset
+        )
 
+    if original_opset_version < target_opset and original_opset_version != 1:
+        onnx_model = onnx.version_converter.convert_version(onnx_model, target_opset)
+        onnx_path = os.path.join(output_dir, f"{model_name}_opset{target_opset}.onnx")
         save_onnx(onnx_model, onnx_path, use_external_data_format)
-        logger.info(f"Model is cloned to {onnx_path} with opset_version {opset_version}")
+        logger.info(f"Model is cloned to {onnx_path} with opset_version {target_opset}")
         intermediate_generated_files.append(onnx_path)
 
     # Simplify model if requested
```
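A condensed, standalone sketch of the selection rule the hunk above implements (the helper name and simplified signature are mine, not part of the PR):

```python
def select_target_opset(original: int, mode_min: int, requested: int | None) -> int:
    """Condensed restatement of the target-opset decision in _preprocess_onnx above."""
    if requested is None:
        # No --opset given: keep the model's opset unless it is below the mode minimum;
        # an opset of 1 is treated as "unknown" and replaced by the mode minimum.
        return max(original, mode_min) if original != 1 else mode_min
    target = max(requested, mode_min)  # upgraded (with a warning in the real code) if too low
    return max(target, original)       # never downgrade below the original model's opset

# Worked examples, assuming the minimums introduced in this PR:
assert select_target_opset(original=13, mode_min=19, requested=None) == 19  # fp16/int8 baseline
assert select_target_opset(original=17, mode_min=21, requested=18) == 21    # int4 minimum wins
assert select_target_opset(original=22, mode_min=19, requested=20) == 22    # original opset kept
```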
```diff
@@ -231,6 +261,7 @@ def quantize(
     calibrate_per_node: bool = False,
     input_shapes_profile: Sequence[dict[str, str]] | None = None,
     direct_io_types: bool = False,
+    opset: int | None = None,
     **kwargs: Any,
 ) -> None:
     """Quantizes the provided ONNX model.
```
```diff
@@ -350,6 +381,10 @@
         direct_io_types:
             If True, modify the I/O types in the quantized ONNX model to be lower precision whenever possible.
             If False, keep the I/O types in the quantized ONNX model the same as in the given ONNX model.
+        opset:
+            Target ONNX opset version for the quantized model. If None, uses required minimum opset
+            (19 for int8/fp8, 21 for int4, 23 for nvfp4). If the specified opset is lower than the required minimum,
+            a warning will be issued and the opset will be upgraded to the required minimum.
         kwargs:
             Additional keyword arguments for int4 quantization, including:
             - awqlite_alpha_step (float): Alpha step for lite, range [0, 1].
```
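For orientation, a hedged sketch of calling the updated API with the new keyword (only `quantize_mode` and `opset` are visible in this diff; the other parameter names are assumptions used for illustration):

```python
# Sketch only: parameter names other than `quantize_mode` and `opset` are assumed,
# not confirmed by this diff.
from modelopt.onnx.quantization.quantize import quantize

quantize(
    onnx_path="model.onnx",  # assumed name of the input-model argument
    quantize_mode="int8",
    opset=21,                # new in this PR: explicit target opset for the quantized model
)
```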
```diff
@@ -420,6 +455,7 @@
         override_shapes,  # type: ignore[arg-type]
         simplify,
         quantize_mode,
+        opset,
     )
     trt_plugins = update_trt_ep_support(calibration_eps, has_dds_op, has_custom_op, trt_plugins)  # type: ignore[arg-type]
```
```diff
@@ -481,6 +517,7 @@
             calibrate_per_node=calibrate_per_node,
             custom_ops_to_quantize=list(custom_ops_to_quantize.keys()),
             direct_io_types=direct_io_types,
+            opset=opset,
             **kwargs,
         )
     elif "int4" in quantize_mode:
```
```diff
@@ -30,6 +30,9 @@
 from modelopt.onnx.logging_config import logger
 
+# Base minimum opset for quantization (opset 19 is the first to support fp16 scales)
+BASE_MIN_OPSET = 19
+
 
 def get_input_names_from_bytes(model_bytes: bytes, external_inputs_only: bool = True) -> list[str]:
     """This function returns the inputs names of the given onnx model in bytes.
```
```diff
@@ -696,6 +699,67 @@ def get_opset_version(model: onnx.ModelProto) -> int:
     return ai_onnx_domain[0].version
 
 
+def get_qdq_precisions(model: onnx.ModelProto) -> set:
+    """Gets the Q/DQ precision types present in the model.
+
+    Args:
+        model: Loaded in-memory onnx ModelProto.
+
+    Returns:
+        set: Set of Q/DQ precision types present in the model (e.g., 'float8_e4m3fn', 'int8',
+            'int4', 'float4_e2m1fn').
+    """
+    graph = gs.import_onnx(model)
+    precisions = set()
+
+    # Check for custom 'NVFP4' nodes
+    custom_fp4_q_nodes = [node for node in graph.nodes if node.op == "TRT_FP4DynamicQuantize"]
+    if custom_fp4_q_nodes:
+        precisions.add("float4_e2m1fn")
+
+    # Check for precision in DQ nodes
+    dq_nodes = [node for node in graph.nodes if node.op == "DequantizeLinear"]
+    for dq_node in dq_nodes:
+        if len(dq_node.inputs) >= 3 and dq_node.inputs[2] is not None:
+            # If zero-point is set, return that as the quantization mode
+            if isinstance(dq_node.inputs[2], Constant) and dq_node.inputs[2].values is not None:
+                precisions.add(dq_node.inputs[2].values.dtype.name)
+        elif isinstance(dq_node.inputs[0], Constant) and dq_node.inputs[0].values is not None:
+            # Else, return the node's input precision (ex: 'NVFP4' weight quantization)
+            precisions.add(dq_node.inputs[0].values.dtype.name)
+
+    return precisions
```
Comment on lines +702 to +731 (Contributor):

The function only checks DequantizeLinear nodes, and only when the inputs/zero-points are Constant types. It never processes QuantizeLinear nodes or handles Variable-typed inputs, which are typical for activations. This causes activation Q/DQ, and weight QuantizeLinear with non-constant parameters, to be under-reported. Since this set is used for opset version selection (e.g., int4 requires opset 21, float4_e2m1fn requires opset 23), insufficient precision detection could result in incompatible opset versions being selected. Add checks for the QuantizeLinear `output_dtype` attribute and consider inspecting Variables or value_info types to cover non-constant inputs.
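A hedged sketch of what the reviewer's suggestion could look like; the helper name and the enum-to-name table are mine, the table covers only the precisions this PR deals with, and it assumes a recent onnx release that exposes the INT4/UINT4/FLOAT4E2M1 enum members:

```python
# Sketch: also collect precisions declared via QuantizeLinear.output_dtype (opset >= 21),
# which the current get_qdq_precisions misses for activation Q/DQ.
import onnx
import onnx_graphsurgeon as gs

# Hand-written, partial mapping from TensorProto dtype enums to the precision names
# used by QDQ_PRECISION_MIN_OPSET; an assumption, not an existing constant.
_OUTPUT_DTYPE_TO_PRECISION = {
    onnx.TensorProto.INT8: "int8",
    onnx.TensorProto.FLOAT8E4M3FN: "float8_e4m3fn",
    onnx.TensorProto.INT4: "int4",
    onnx.TensorProto.UINT4: "uint4",
    onnx.TensorProto.FLOAT4E2M1: "float4_e2m1fn",
}


def get_q_precisions_from_output_dtype(model: onnx.ModelProto) -> set:
    """Collect precisions declared on QuantizeLinear nodes via the output_dtype attribute."""
    graph = gs.import_onnx(model)
    precisions = set()
    for node in graph.nodes:
        if node.op == "QuantizeLinear":
            dtype = node.attrs.get("output_dtype")
            if dtype in _OUTPUT_DTYPE_TO_PRECISION:
                precisions.add(_OUTPUT_DTYPE_TO_PRECISION[dtype])
    return precisions
```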
```diff
+
+
+# Minimum opset requirements by quantization mode/precision
+# Base minimum is 19 (first opset that allows fp16 scales in Q/DQ nodes)
+# Supports both quantize modes (e.g., "fp8") and dtype prefixes (e.g., "float8" for "float8_e4m3fn")
+QDQ_PRECISION_MIN_OPSET = {
+    "int8": BASE_MIN_OPSET,
+    "float8_e4m3fn": BASE_MIN_OPSET,
+    "int4": 21,
+    "uint4": 21,
+    "float4_e2m1fn": 23,
+}
+
+
+def get_min_opset_for_precisions(precisions: set) -> int:
+    """Gets the minimum required opset version for a set of Q/DQ precision types.
+
+    Args:
+        precisions: Set of precision type strings (e.g., 'float8_e4m3fn', 'int4').
+
+    Returns:
+        int: Minimum required opset version for the given precisions.
+    """
+    min_opset = BASE_MIN_OPSET  # Base minimum for fp16 scales support
+    for precision in precisions:
+        # Direct lookup first
+        if precision in QDQ_PRECISION_MIN_OPSET:
+            min_opset = max(min_opset, QDQ_PRECISION_MIN_OPSET[precision])
+    return min_opset
+
+
 def bfloat16_to_float32(bf16_array):
     """Converts a bfloat16 array (as raw data) to a float32 array."""
     uint32_array = bf16_array.astype(np.uint32) << 16
```
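A short usage sketch tying the two new helpers together (the model path is a placeholder):

```python
# Inspect an already-quantized model and compute the minimum opset it needs.
import onnx

from modelopt.onnx.utils import get_min_opset_for_precisions, get_qdq_precisions

model = onnx.load("quantized_model.onnx")             # placeholder path
precisions = get_qdq_precisions(model)                # e.g. {"int8"} or {"float4_e2m1fn"}
min_opset = get_min_opset_for_precisions(precisions)  # 19, 21, or 23 depending on precisions
print(f"Q/DQ precisions: {precisions}; minimum opset: {min_opset}")
```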
```diff
@@ -48,8 +48,8 @@
     "onnx-graphsurgeon",
     "onnx~=1.19.0",
     "onnxconverter-common~=1.16.0",
-    "onnxruntime~=1.22.0 ; platform_machine == 'aarch64' or platform_system == 'Darwin'",
-    "onnxruntime-gpu~=1.22.0 ; platform_machine != 'aarch64' and platform_system != 'Darwin'",
+    "onnxruntime~=1.23.0 ; platform_machine == 'aarch64' or platform_system == 'Darwin'",
+    "onnxruntime-gpu~=1.23.0 ; platform_machine != 'aarch64' and platform_system != 'Darwin'",
```
Comment (Collaborator):

Let's leave the Windows version unchanged, as they saw some regressions. cc @hthadicherla

Comment (Contributor):

The regression was initially observed by @ajrasane. He saw that quantizing with the latest ort was causing accuracy degradations in some vision models on Linux. When I tested these models later, I found the exact same regressions on Windows. Better to leave it at 1.22 in setup.py. In the LLM quantization examples, we reinstall the latest ort version by having ort==1.23 in requirements.txt.
```diff
     "onnxscript",  # For autocast opset conversion and test_onnx_dynamo_export unit test
     "onnxslim>=0.1.76",
     "polygraphy>=0.49.22",
```
Comment:

Docstring mismatch: the FP16 minimum opset is now 19, not 13. The implementation enforces `base_min_opset = 19` for fp16, but the docstring still says 13. Please align the docstring.