17 changes: 15 additions & 2 deletions CHANGELOG-Windows.rst
@@ -1,6 +1,19 @@
NVIDIA Model Optimizer Changelog (Windows)
==========================================

0.41 (TBD)
^^^^^^^^^^

**Bug Fixes**

- Fix ONNX 1.19 compatibility issues with CuPy during ONNX INT4 AWQ quantization. ONNX 1.19 uses ml_dtypes.int4 instead of numpy.int8, which caused CuPy failures.

**New Features**

- Add support for ONNX Mixed Precision Weight-only quantization using INT4 and INT8 precisions. Refer to the quantization `example for GenAI LLMs <https://github.com/NVIDIA/Model-Optimizer/tree/main/examples/windows/onnx_ptq/genai_llm>`_.
- Add support for quantization of select diffusion models on Windows. Refer to the `example script <https://github.com/NVIDIA/Model-Optimizer/tree/main/examples/windows/torch_onnx/diffusers>`_ for details.
- Add `Perplexity <https://github.com/NVIDIA/Model-Optimizer/tree/main/examples/windows/accuracy_benchmark/perplexity_metrics>`_ and `KL-Divergence <https://github.com/NVIDIA/Model-Optimizer/tree/main/examples/windows/accuracy_benchmark/kl_divergence_metrics>`_ accuracy benchmarks.

0.33 (2025-07-21)
^^^^^^^^^^^^^^^^^

@@ -25,8 +38,8 @@ NVIDIA Model Optimizer Changelog (Windows)

- This is the first official release of Model Optimizer for Windows
- **ONNX INT4 Quantization:** :meth:`modelopt.onnx.quantization.quantize_int4 <modelopt.onnx.quantization.int4.quantize>` now supports ONNX INT4 quantization for DirectML and TensorRT* deployment. See :ref:`Support_Matrix` for details about supported features and models.
- **LLM Quantization with Olive:** Enabled LLM quantization through Olive, streamlining model optimization workflows. Refer `example <https://github.com/microsoft/Olive/tree/main/examples/phi3#quantize-models-with-nvidia-Model-Optimizer>`_
- **DirectML Deployment Guide:** Added DML deployment guide. Refer :ref:`DirectML_Deployment`.
- **LLM Quantization with Olive:** Enabled LLM quantization through Olive, streamlining model optimization workflows. Refer to the `Olive example <https://github.com/microsoft/Olive/tree/main/examples/phi3#quantize-models-with-nvidia-Model-Optimizer>`_.
- **DirectML Deployment Guide:** Added DML deployment guide. Refer to the :ref:`Onnxruntime_Deployment` deployment guide for details.
- **MMLU Benchmark for Accuracy Evaluations:** Introduced `MMLU benchmarking <https://github.com/NVIDIA/Model-Optimizer/tree/main/examples/windows/accuracy_benchmark/README.md>`_ for accuracy evaluation of ONNX models on DirectML (DML).
- **Published quantized ONNX models collection:** Published quantized ONNX models at HuggingFace `NVIDIA collections <https://huggingface.co/collections/nvidia/optimized-onnx-models-for-nvidia-rtx-gpus>`_.

@@ -1,11 +1,19 @@
.. _DirectML_Deployment:
.. _Onnxruntime_Deployment:

===================
DirectML
===================
===========
Onnxruntime
===========

Once an ONNX FP16 model is quantized using Model Optimizer on Windows, the resulting quantized ONNX model can be deployed via the `ONNX Runtime GenAI <https://onnxruntime.ai/docs/genai/>`_ or `ONNX Runtime <https://onnxruntime.ai/>`_.

Once an ONNX FP16 model is quantized using Model Optimizer on Windows, the resulting quantized ONNX model can be deployed on the DirectML (DML) backend via the `ONNX Runtime GenAI <https://onnxruntime.ai/docs/genai/>`_ or `ONNX Runtime <https://onnxruntime.ai/>`_.
ONNX Runtime uses execution providers (EPs) to run models efficiently across a range of backends, including:

- **CUDA EP:** Utilizes NVIDIA GPUs for fast inference with CUDA and cuDNN libraries.
- **DirectML EP:** Enables deployment on a wide range of GPUs.
- **TensorRT-RTX EP:** Targets NVIDIA RTX GPUs, leveraging TensorRT for further optimized inference.
- **CPU EP:** Provides a fallback to run inference on the system's CPU when specialized hardware is unavailable.

Choose the EP that best matches your model, hardware, and deployment requirements.

.. note:: Currently, the DirectML backend doesn't support 8-bit precision, so 8-bit quantized models should be deployed on other backends such as ORT-CUDA. The DML path does, however, support INT4 quantized models.
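
For illustration, a minimal sketch of how an EP is selected when creating an ONNX Runtime inference session; the model path is a placeholder, and the provider list assumes *onnxruntime-directml* is installed:

.. code-block:: python

    import onnxruntime as ort

    # Placeholder path to a quantized ONNX model produced by ModelOpt-Windows.
    model_path = "model.onnx"

    # EPs are tried in order; ONNX Runtime falls back to the next entry
    # in the list if a provider is unavailable on the system.
    providers = ["DmlExecutionProvider", "CPUExecutionProvider"]

    session = ort.InferenceSession(model_path, providers=providers)
    print("Active providers:", session.get_providers())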

@@ -21,6 +29,10 @@ ONNX Runtime GenAI offers a streamlined solution for deploying generative AI mod
- **Control Options**: Use the high-level ``generate()`` method for rapid deployment or execute each iteration of the model in a loop for fine-grained control.
- **Multi-Language API Support**: Provides APIs for Python, C#, and C/C++, allowing seamless integration across a range of applications.

.. note::

ONNX Runtime GenAI models are typically tied to the execution provider (EP) they were built with; a model exported for one EP (e.g., CUDA or DirectML) is generally not compatible with other EPs. To run inference on a different backend, re-export or convert the model specifically for that target EP.
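
For orientation, a rough sketch of a token-generation loop with the ONNX Runtime GenAI Python API is shown below. Method names have shifted between onnxruntime-genai releases (for example, older versions set ``params.input_ids`` and call ``compute_logits()``), and the model directory is a placeholder, so treat this as an outline and follow the official documentation for your installed version:

.. code-block:: python

    import onnxruntime_genai as og

    # Placeholder: folder containing the exported GenAI model and genai_config.json.
    model = og.Model("path/to/exported_model_dir")
    tokenizer = og.Tokenizer(model)

    params = og.GeneratorParams(model)
    params.set_search_options(max_length=128)

    generator = og.Generator(model, params)
    generator.append_tokens(tokenizer.encode("What is model quantization?"))

    # Generate one token per iteration until the stop condition is reached.
    while not generator.is_done():
        generator.generate_next_token()

    print(tokenizer.decode(generator.get_sequence(0)))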

**Getting Started**:

Refer to the `ONNX Runtime GenAI documentation <https://onnxruntime.ai/docs/genai/>`_ for an in-depth guide on installation, setup, and usage.
@@ -42,4 +54,4 @@ For further details and examples, please refer to the `ONNX Runtime documentatio
Collection of optimized ONNX models
===================================

The ready-to-deploy optimized ONNX models from ModelOpt-Windows are available at HuggingFace `NVIDIA collections <https://huggingface.co/collections/nvidia/optimized-onnx-models-for-nvidia-rtx-gpus>`_. These models can be deployed using DirectML backend. Follow the instructions provided along with the published models for deployment.
The ready-to-deploy optimized ONNX models from ModelOpt-Windows are available at HuggingFace `NVIDIA collections <https://huggingface.co/collections/nvidia/optimized-onnx-models-for-nvidia-rtx-gpus>`_. Follow the instructions provided along with the published models for deployment.
2 changes: 1 addition & 1 deletion docs/source/getting_started/1_overview.rst
@@ -11,7 +11,7 @@ is a library comprising state-of-the-art model optimization techniques including
It accepts a torch or ONNX model as input and provides Python APIs for users to easily stack different model optimization
techniques to produce optimized & quantized checkpoints. Seamlessly integrated within the NVIDIA AI software ecosystem, the quantized checkpoint generated from Model Optimizer is ready for deployment in downstream inference frameworks like `TensorRT-LLM <https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/quantization>`_ or `TensorRT <https://github.com/NVIDIA/TensorRT>`_ (Linux). ModelOpt is integrated with `NVIDIA NeMo <https://github.com/NVIDIA-NeMo/NeMo>`_ and `Megatron-LM <https://github.com/NVIDIA/Megatron-LM>`_ for training-in-the-loop optimization techniques. For enterprise users, the 8-bit quantization with Stable Diffusion is also available on `NVIDIA NIM <https://developer.nvidia.com/blog/nvidia-nim-offers-optimized-inference-microservices-for-deploying-ai-models-at-scale/>`_.

For Windows users, the `Model Optimizer for Windows <https://github.com/NVIDIA/Model-Optimizer/tree/main/examples/windows/README.md>`_ (ModelOpt-Windows) delivers model compression techniques, including quantization, on Windows RTX PC systems. ModelOpt-Windows is optimized for efficient quantization, featuring local GPU calibration, reduced system and video memory consumption, and swift processing times. It integrates seamlessly with the Windows ecosystem, with optimized ONNX models as output for `Microsoft DirectML <https://github.com/microsoft/DirectML>`_ backends. Furthermore, ModelOpt-Windows supports SDKs such as `Microsoft Olive <https://github.com/microsoft/Olive>`_ and `ONNX Runtime <https://github.com/microsoft/onnxruntime>`_, enabling the deployment of quantized models across various independent hardware vendors through the DirectML path.
For Windows users, the `Model Optimizer for Windows <https://github.com/NVIDIA/Model-Optimizer/tree/main/examples/windows>`_ (ModelOpt-Windows) delivers model compression techniques, including quantization, on Windows RTX PC systems. ModelOpt-Windows is optimized for efficient quantization, featuring local GPU calibration, reduced system and video memory consumption, and swift processing times. It integrates seamlessly with the Windows ecosystem, with optimized ONNX models as output for `Microsoft DirectML <https://github.com/microsoft/DirectML>`_ and `TensorRT-RTX <https://github.com/NVIDIA/TensorRT-RTX>`_ backends. Furthermore, ModelOpt-Windows supports SDKs such as `Microsoft Olive <https://github.com/microsoft/Olive>`_ and `ONNX Runtime <https://github.com/microsoft/onnxruntime>`_, enabling the deployment of quantized models across various independent hardware vendors through the DirectML path.

Model Optimizer for both Linux and Windows are available for free for all developers on `NVIDIA PyPI <https://pypi.org/project/nvidia-modelopt/>`_. Visit the `Model Optimizer GitHub repository <https://github.com/NVIDIA/Model-Optimizer>`_ for end-to-end
example scripts and recipes optimized for NVIDIA GPUs.
@@ -25,7 +25,7 @@ The following system requirements are necessary to install and use Model Optimiz
+-------------------------+-----------------------------+

.. note::
- Make sure to use GPU-compatible driver and other dependencies (e.g. torch etc.). For instance, support for Blackwell GPU might be present in Nvidia 570+ driver, and CUDA-12.8.
- Make sure to use a GPU-compatible driver and other dependencies (e.g., torch). For instance, Blackwell GPU support may require an NVIDIA 570+ driver and CUDA 12.8+.
- We currently support *Single-GPU* configuration.

The Model Optimizer - Windows can be used in the following ways:
31 changes: 13 additions & 18 deletions docs/source/getting_started/windows/_installation_standalone.rst
@@ -13,6 +13,7 @@ Before using ModelOpt-Windows, the following components must be installed:
- NVIDIA GPU and Graphics Driver
- Python version >= 3.10 and < 3.13
- Visual Studio 2022 / MSVC / C/C++ Build Tools
- CUDA Toolkit and cuDNN, for using the CUDA path during calibration (e.g., calibrating ONNX models with `onnxruntime-gpu` or the CUDA EP)

Update ``PATH`` environment variable as needed for above prerequisites.

@@ -26,45 +27,38 @@ It is recommended to use a virtual environment for managing Python dependencies.
$ python -m venv .\myEnv
$ .\myEnv\Scripts\activate

In the newly created virtual environment, none of the required packages (e.g., onnx, onnxruntime, onnxruntime-directml, onnxruntime-gpu, nvidia-modelopt) will be pre-installed.
In the newly created virtual environment, none of the required packages (e.g., onnx, onnxruntime, onnxruntime-directml, onnxruntime-gpu, nvidia-modelopt etc.) will be pre-installed.

**3. Install ModelOpt-Windows Wheel**

To install the ModelOpt-Windows wheel, run the following command:
To install the ONNX module of ModelOpt-Windows, run the following command:

.. code-block:: bash

pip install "nvidia-modelopt[onnx]"

This command installs ModelOpt-Windows and its ONNX module, along with the *onnxruntime-directml* (v1.20.0) package. If ModelOpt-Windows is installed without the additional parameter, only the bare minimum dependencies will be installed, without the relevant module and dependencies.
If you install ModelOpt-Windows without the extra ``[onnx]`` option, only the minimal core dependencies and the PyTorch module (``torch``) will be installed. Support for ONNX model quantization requires installing with ``[onnx]``.
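
As a quick sanity check after installation, you can confirm that the package imports (assuming the package exposes a version string, as recent releases do):

.. code-block:: python

    import modelopt

    # Should print the installed ModelOpt-Windows version, e.g. 0.x.y.
    print(modelopt.__version__)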

**4. Setup ONNX Runtime (ORT) for Calibration**
**4. ONNX Model Quantization: Setup ONNX Runtime Execution Provider for Calibration**

The ONNX Post-Training Quantization (PTQ) process involves running the base model with user-supplied inputs, a process called calibration. The user-supplied model inputs are referred to as calibration data. To perform calibration, the base model must be run using a suitable ONNX Execution Provider (EP), such as *DmlExecutionProvider* (DirectML EP) or *CUDAExecutionProvider* (CUDA EP). There are different ONNX Runtime packages for each EP:
The Post-Training Quantization (PTQ) process for ONNX models usually involves running the base model with user-supplied inputs, a process called calibration. The user-supplied model inputs are referred to as calibration data. To perform calibration, the base model must be run using a suitable ONNX Execution Provider (EP), such as *DmlExecutionProvider* (DirectML EP) or *CUDAExecutionProvider* (CUDA EP). There are different ONNX Runtime packages for each EP:

- *onnxruntime-directml* provides the DirectML EP.
- *onnxruntime-trt-rtx* provides the TensorRT-RTX EP.
- *onnxruntime-gpu* provides the CUDA EP.
- *onnxruntime* provides the CPU EP.

By default, ModelOpt-Windows installs *onnxruntime-directml* and uses the DirectML EP (v1.20.0) for calibration. No additional dependencies are required.
If you prefer to use the CUDA EP for calibration, uninstall the existing *onnxruntime-directml* package and install the *onnxruntime-gpu* package, which requires CUDA and cuDNN dependencies:

- Uninstall *onnxruntime-directml*:

.. code-block:: bash

pip uninstall onnxruntime-directml
By default, ModelOpt-Windows installs *onnxruntime-gpu*. Since v1.19.0, the default CUDA version for *onnxruntime-gpu* is 12.x. The *onnxruntime-gpu* package (i.e., the CUDA EP) has CUDA and cuDNN dependencies:

- Install CUDA and cuDNN:
- For the ONNX Runtime GPU package, you need to install the appropriate versions of CUDA and cuDNN. Refer to the `CUDA Execution Provider requirements <https://onnxruntime.ai/docs/install/#cuda-and-cudnn/>`_ for compatible versions.

- Install ONNX Runtime GPU (CUDA 12.x):
If you need to use a different EP for calibration, uninstall the existing *onnxruntime-gpu* package and install the corresponding ONNX Runtime package. For example, to use the DirectML EP, uninstall *onnxruntime-gpu* and install *onnxruntime-directml*:

.. code-block:: bash

pip install onnxruntime-gpu

- The default CUDA version for *onnxruntime-gpu* since v1.19.0 is 12.x.
pip uninstall onnxruntime-gpu
pip install onnxruntime-directml
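
To verify which EP your installed ONNX Runtime package actually provides before starting calibration, a small check such as the following can help (the expected output depends on which package is installed):

.. code-block:: python

    import onnxruntime as ort

    # With onnxruntime-gpu this typically includes 'CUDAExecutionProvider';
    # with onnxruntime-directml it includes 'DmlExecutionProvider'.
    print(ort.get_available_providers())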

**5. Setup GPU Acceleration Tool for Quantization**

@@ -75,8 +69,9 @@ By default, ModelOpt-Windows utilizes the `cupy-cuda12x <https://cupy.dev//>`_ t
Ensure the following steps are verified:
- **Task Manager**: Check that the GPU appears in the Task Manager, indicating that the graphics driver is installed and functioning.
- **Python Interpreter**: Open the command line and type python. The Python interpreter should start, displaying the Python version.
- **Onnxruntime Package**: Ensure that one of the following is installed:
- **Onnxruntime Package**: Ensure that exactly one of the following is installed:
- *onnxruntime-directml* (DirectML EP)
- *onnxruntime-trt-rtx* (TensorRT-RTX EP)
- *onnxruntime-gpu* (CUDA EP)
- *onnxruntime* (CPU EP)
- **Onnx and Onnxruntime Import**: Ensure that the following Python command runs successfully.
10 changes: 5 additions & 5 deletions docs/source/getting_started/windows/_installation_with_olive.rst
@@ -4,7 +4,7 @@
Install ModelOpt-Windows with Olive
===================================

ModelOpt-Windows can be installed and used through Olive to quantize Large Language Models (LLMs) in ONNX format for deployment with DirectML. Follow the steps below to configure Olive for use with ModelOpt-Windows.
ModelOpt-Windows can be installed and used through Olive to perform model optimization using quantization. Follow the steps below to configure Olive for use with ModelOpt-Windows.

Setup Steps for Olive with ModelOpt-Windows
-------------------------------------------
@@ -17,7 +17,7 @@ Setup Steps for Olive with ModelOpt-Windows

pip install olive-ai[nvmo]

- **Install Prerequisites:** Ensure all required dependencies are installed. Use the following commands to install the necessary packages:
- **Install Prerequisites:** Ensure all required dependencies are installed. For example, to use the DirectML Execution Provider (EP) based onnxruntime and onnxruntime-genai packages, run the following commands:

.. code-block:: shell

@@ -31,11 +31,11 @@ Setup Steps for Olive with ModelOpt-Windows
**2. Configure Olive for Model Optimizer – Windows**

- **New Olive Pass:** Olive introduces a new pass, ``NVModelOptQuantization`` (or “nvmo”), specifically designed for model quantization using Model Optimizer – Windows.
- **Add to Configuration:** To apply quantization to your target model, include this pass in the Olive configuration file. [Refer `phi3 <https://github.com/microsoft/Olive/tree/main/examples/phi3#quantize-models-with-nvidia-Model-Optimizer>`_ Olive example].
- **Add to Configuration:** To apply quantization to your target model, include this pass in the Olive configuration file. [Refer to `this guide <https://github.com/microsoft/Olive/blob/main/docs/source/features/quantization.md#nvidia-tensorrt-model-optimizer-windows>`_ for details about this pass.]

**3. Setup Other Passes in Olive Configuration**

- **Add Other Passes:** Add additional passes to the Olive configuration file as needed for the desired Olive workflow of your input model. [Refer `phi3 <https://github.com/microsoft/Olive/tree/main/examples/phi3#quantize-models-with-nvidia-Model-Optimizer>`_ Olive example]
- **Add Other Passes:** Add additional passes to the Olive configuration file as needed for the desired Olive workflow of your input model.

**4. Install other dependencies**

@@ -62,4 +62,4 @@ Setup Steps for Olive with ModelOpt-Windows
**Note**:

#. Currently, the Model Optimizer - Windows only supports ONNX Runtime GenAI based LLM models in the Olive workflow.
#. To try out different LLMs and EPs in the Olive workflow of ModelOpt-Windows, refer the details provided in `phi3 <https://github.com/microsoft/Olive/tree/main/examples/phi3#quantize-models-with-nvidia-Model-Optimizer>`_ Olive example.
#. To get started with Olive, refer to the official `Olive documentation <https://microsoft.github.io/Olive/>`_.