
Conversation

@Jaswanth51

Description

Synchronizing intel/onnxruntime ovep-develop branch with latest changes from microsoft/onnxruntime master branch.

adrianlizarraga and others added 30 commits January 15, 2026 21:11
### Description
- Adds API functions to get information about the subgraphs/nodes
assigned to the EPs in the session.
  - `Session_GetEpGraphAssignmentInfo`: Returns a list of "subgraphs",
each with information about the assigned EP and nodes.
    - Note: The app must enable the session configuration
`"session.record_ep_graph_assignment_info"` to signal ORT to collect
this information. If not enabled, the API returns empty results.
  - `EpAssignedSubgraph_GetEpName`: Returns the name of the EP to which
the subgraph is assigned
  - `EpAssignedSubgraph_GetNodes`: Returns a list of assigned nodes
  - `EpAssignedNode_GetName`: Returns the assigned node's name
  - `EpAssignedNode_GetDomain`: Returns the assigned node's domain
  - `EpAssignedNode_GetOperatorType`: Returns the assigned node's operator type
- Also adds C++ and Python bindings

#### Structure of returned information
The API returns a list of "subgraphs". Each subgraph has the following
information:
- Subgraph info:
  - EP name: The name of the execution provider to which this subgraph is
assigned.
  - nodes: Name and operator type of each node. Ex: `[{"multiply", "Mul"}, ...]`

Python example program (taken from unit tests):
```python
    def test_get_graph_provider_assignment_info(self):
        """
        Tests querying for information about the nodes assigned to the CPU EP.
        """

        # Create session options that enables recording EP graph partitioning info.
        session_options = onnxrt.SessionOptions()
        session_options.add_session_config_entry("session.record_ep_graph_assignment_info", "1")

        session = onnxrt.InferenceSession(get_name("add_mul_add.onnx"), sess_options=session_options)

        # Query session for information on each subgraph assigned to an EP.
        ep_subgraphs = session.get_provider_graph_assignment_info()

        # Check that all 3 nodes are assigned to CPU EP (each in its own subgraph)
        self.assertEqual(len(ep_subgraphs), 3)
        for ep_subgraph in ep_subgraphs:
            self.assertEqual(ep_subgraph.ep_name, "CPUExecutionProvider")
            self.assertEqual(len(ep_subgraph.get_nodes()), 1)

        # Serialize each node to an identifier (concatenates operator type and node name)
        node_ids: list[str] = [f"{n.op_type}/{n.name}" for s in ep_subgraphs for n in s.get_nodes()]

        # Should have 1 Mul and 2 Adds.
        self.assertEqual(len(node_ids), 3)
        self.assertIn("Add/add_0", node_ids)
        self.assertIn("Add/add_1", node_ids)
        self.assertIn("Mul/mul_0", node_ids)
```

C++ program (taken from unit test):
```c++
  // Check the ep graph partitioning (Mul on plugin EP, others on CPU EP).
  // Model has 3 subgraphs (in no particular order):
  // - Subgraph 1: Add assigned to CPU EP.
  // - Subgraph 2: Mul assigned to plugin EP.
  // - Subgraph 3: Add assigned to CPU EP.
  std::vector<Ort::ConstEpAssignedSubgraph> ep_subgraphs = session.GetEpGraphAssignmentInfo();
  ASSERT_EQ(ep_subgraphs.size(), 3);

  for (Ort::ConstEpAssignedSubgraph ep_subgraph : ep_subgraphs) {
    std::string ep_name = ep_subgraph.EpName();
    ASSERT_TRUE(ep_name == Utils::example_ep_info.ep_name || ep_name == kCpuExecutionProvider);

    const std::vector<Ort::ConstEpAssignedNode> ep_nodes = ep_subgraph.GetNodes();
    ASSERT_GE(ep_nodes.size(), 1);  // All of these subgraphs just have one node.

    if (ep_name == kCpuExecutionProvider) {
      std::string op_type = ep_nodes[0].OpType();
      std::string node_name = ep_nodes[0].Name();

      ASSERT_EQ(op_type, "Add");
      ASSERT_TRUE(node_name == "add_0" || node_name == "add_1");
    } else {
      ASSERT_TRUE(ep_name == Utils::example_ep_info.ep_name);

      std::string op_type = ep_nodes[0].OpType();
      std::string node_name = ep_nodes[0].Name();
      ASSERT_EQ(op_type, "Mul");
      ASSERT_EQ(node_name, "mul_0");
    }
  }
```


---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…icrosoft#27007)

### Description
This PR refines OpenVINO EP backend execution and input validation,
improving reshape handling, symbolic vs dynamic dimension checks, and
execution consistency. It also adds explicit support for
stateful/KV-cache inference by introducing cache index tracking,
validation, and reset logic across backend, context, and interface
layers, with corresponding test updates.

---------

Signed-off-by: bfilipek <bartlomiej.filipek@intel.com>
Signed-off-by: dependabot[bot] <support@github.com>
Signed-off-by: Jonathan Clohessy <jonathan.clohessy@arm.com>
Signed-off-by: Christian Bourjau <christian.bourjau@quantco.com>
Co-authored-by: jatinwadhwa921 <110383850+jatinwadhwa921@users.noreply.github.com>
Co-authored-by: jatinwadhwa921 <jatin.wadhwa@intel.com>
Co-authored-by: saurabh <saurabh1.kale@intel.com>
Co-authored-by: Ankit Maheshkar <ankit.maheshkar@intel.com>
Co-authored-by: sfatimar <sahar.fatima@intel.com>
Co-authored-by: Javier Martinez <javier.e.martinez@intel.com>
Co-authored-by: Bartlomiej Filipek <bartlomiej.filipek@intel.com>
Co-authored-by: bopeng1234 <bo.peng@intel.com>
Co-authored-by: Eric Crawford <eric.r.crawford@intel.com>
Co-authored-by: MayureshV1 <47039074+MayureshV1@users.noreply.github.com>
Co-authored-by: TejalKhade28 <tejal.khade@intel.com>
Co-authored-by: Vishnudas Thaniel S <vishnudas.thaniel.s@intel.com>
Co-authored-by: Yaru Du <yaru.du@intel.com>
Co-authored-by: Ryan Metcalfe <107415876+RyanMetcalfeInt8@users.noreply.github.com>
Co-authored-by: Dvoretckii, Mikhail <mikhail.dvoretckii@intel.com>
Co-authored-by: Pallavi Gupta <pallavi.gupta@intel.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Jianhui Dai <jianhui.j.dai@intel.com>
Co-authored-by: Jiajia Qin <jiajiaqin@microsoft.com>
Co-authored-by: Adrian Lizarraga <adlizarraga@microsoft.com>
Co-authored-by: Changming Sun <chasun@microsoft.com>
Co-authored-by: Fei Chen <feich@microsoft.com>
Co-authored-by: Yulong Wang <7679871+fs-eire@users.noreply.github.com>
Co-authored-by: Tianlei Wu <tlwu@microsoft.com>
Co-authored-by: vraspar <vrajang@outlook.com>
Co-authored-by: qti-yuduo <yuduow@qti.qualcomm.com>
Co-authored-by: Akupadhye <aupadhye@qti.qualcomm.com>
Co-authored-by: Wang Ning <ning4.wang@intel.com>
Co-authored-by: Maximilian Müller <44298237+gedoensmax@users.noreply.github.com>
Co-authored-by: Chi Lo <54722500+chilo-ms@users.noreply.github.com>
Co-authored-by: George Wu <jywu@microsoft.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Wanming Lin <wanming.lin@intel.com>
Co-authored-by: quic-calvnguy <quic_calvnguy@quicinc.com>
Co-authored-by: Hector Li <hecli@microsoft.com>
Co-authored-by: Jie Chen <jie.a.chen@intel.com>
Co-authored-by: xhcao <xinghua.cao@intel.com>
Co-authored-by: Wei-Sheng Chin <wschin@outlook.com>
Co-authored-by: quic-hungjuiw <quic_hungjuiw@quicinc.com>
Co-authored-by: Ian Hunter <ianfhunter@gmail.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: kunal-vaishnavi <115581922+kunal-vaishnavi@users.noreply.github.com>
Co-authored-by: Jeff Kilpatrick <jkilpatrick@qti.qualcomm.com>
Co-authored-by: Jeff Kilpatrick <jkilpat@qti.qualcomm.com>
Co-authored-by: Scott McKay <skottmckay@gmail.com>
Co-authored-by: Nenad Banfic <46795300+nenad1002@users.noreply.github.com>
Co-authored-by: derdeljan-msft <derdeljan@microsoft.com>
Co-authored-by: n1harika <niharika.sathish@intel.com>
Co-authored-by: Ryan Metcalfe <ryan.metcalfe@intel.com>
Co-authored-by: Jaswanth Gannamaneni <jaswanth.gannamaneni@intel.com>
Co-authored-by: Klimenko, Mikhail <mikhail.klimenko@intel.com>
Co-authored-by: liang <gxgaoliang@126.com>
Co-authored-by: Garth Long <garth.long@intel.com>
Co-authored-by: Jonathan Clohessy <jonathan.clohessy@arm.com>
Co-authored-by: Akshay Sonawane <111780983+apsonawane@users.noreply.github.com>
Co-authored-by: Christopher Warrington <chwarr@microsoft.com>
Co-authored-by: Ishwar Raut <iraut@nvidia.com>
Co-authored-by: Gaurav Garg <gaugarg@nvidia.com>
Co-authored-by: Xinpeng Dou <15529241576@163.com>
Co-authored-by: adrastogi <aditya.rastogi@microsoft.com>
Co-authored-by: Aditya Rastogi <adityar@ntdev.microsoft.com>
Co-authored-by: qti-hungjuiw <hungjuiw@qti.qualcomm.com>
Co-authored-by: Pradeep Sakhamoori <psakhamoori@microsoft.com>
Co-authored-by: Adam Pocock <adam.pocock@oracle.com>
Co-authored-by: mingyue <131847423+mingyueliuh@users.noreply.github.com>
Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
Co-authored-by: Susanta Bhattacharjee <susanta.bhattacharjee@intel.com>
Co-authored-by: Jozef Wludzik <jozef.wludzik@intel.com>
Co-authored-by: Rajeev Sekar <rajeevsekar21@gmail.com>
Co-authored-by: Mayuresh M Varerkar <mayuresh.m.varerkar@intel.com>
Co-authored-by: Copilot <198982749+Copilot@users.noreply.github.com>
Co-authored-by: Wenqin Yang <wenqin.yang@intel.com>
Co-authored-by: xieofxie <xieofxie@126.com>
Co-authored-by: hualxie <hualxie@microsoft.com>
Co-authored-by: Joshua Lochner <admin@xenova.com>
Co-authored-by: Christian Bourjau <cbourjau@users.noreply.github.com>
Co-authored-by: Xiaofei Han <xiaofeihan@microsoft.com>
Co-authored-by: Dmitri Smirnov <yuslepukhin@users.noreply.github.com>
Co-authored-by: chunghow-qti <chunghow@qti.qualcomm.com>
Co-authored-by: Guenther Schmuelling <guschmue@microsoft.com>
Co-authored-by: Jiawei Shao <jiawei.shao@intel.com>
Co-authored-by: czekun <chen.zekun@intel.com>
Co-authored-by: Jaskaran Singh Nagi <jaskaran.singh.nagi@intel.com>
…ng (microsoft#27026)

### Description
As title

### Motivation and Context
Keep CI check happy

---------

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
### Description
Add support for translating the MatMulNBits contrib op to the QNN
FullyConnected operation with INT4 block-quantized weights.

Implementation details:
- Translate MatMulNBits to FullyConnected in OpBuilder
- Support QNN_QUANTIZATION_ENCODING_BLOCK for INT4 weights
- Pass INT4 weights and quant params as BlockQuantization encoding params in QNN

Testing:
- Added new unit tests for MNB -> QNN-GPU
- Validated all OnnxRuntime tests
- Validated the following LLMs through Olive and ORT-GenAI execution flow
  - LlaMA3.2 1B
  - Qwen2.5
  - DeepSeek-R1-Qwen 1.5b
  - Phi3.5-mini-instruct

### Motivation and Context
The INT4 quantization pass in Olive generates models containing
MatMulNBits contrib ops.
To run these ops via QNN-EP, MatMulNBits is translated to the QNN
FullyConnected op with INT4 weights.
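
A minimal NumPy sketch of the INT4 block-dequantization that a block-quantized
FullyConnected implies (shapes, block size, and the int4-in-int8 storage here
are illustrative, not the actual QNN layout):

```python
import numpy as np

def dequantize_int4_blocks(q, scales, zero_points, block_size):
    """q: (N, K) int4 values stored in int8 as [0, 15];
    scales/zero_points: (N, K // block_size), one pair per block."""
    n, k = q.shape
    blocks = q.reshape(n, k // block_size, block_size).astype(np.float32)
    deq = (blocks - zero_points[..., None]) * scales[..., None]
    return deq.reshape(n, k)

rng = np.random.default_rng(0)
block_size, n, k = 32, 8, 64
q = rng.integers(0, 16, size=(n, k), dtype=np.int8)
scales = rng.random((n, k // block_size), dtype=np.float32)
zps = np.full((n, k // block_size), 8, dtype=np.float32)

w = dequantize_int4_blocks(q, scales, zps, block_size)  # effective FP32 weights (N, K)
x = rng.random((1, k), dtype=np.float32)
y = x @ w.T  # FullyConnected output: (1, N)
```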

---------

Co-authored-by: tirupath-qti <tirupath@qti.qualcomm.com>
…using NEON intrinsics (microsoft#26688)

### Description

**Motivation and approach taken:**

Add a dedicated depthwise convolution kernel for the most common
depthwise convolution configuration (3x3 filter, stride = 1, pad <= 1,
dilation = 1) using NEON intrinsics. This performs significantly better
than the current `Im2Col + SGemm` approach: the Im2Col step wastefully
extracts convolution patches, and for a 3x3 filter the resulting SGemm
has K = 9, a small `K` value that GEMMs are generally not optimized for.
Hence, a dedicated kernel works much better.

Initially, I ported over the Winograd-based NEON-accelerated depthwise
convolution kernel from PyTorch, but I found that its performance is not
very good. Its poor performance is probably due to applying the
Winograd transformation to the filter repeatedly. A better approach may
be to transform the filter offline, and this can be considered
later (I reverted the PyTorch Winograd implementation in this
commit:
microsoft@2820a84).

The depthwise kernel added in this PR was authored by
GPT5.1-Codex; with some minor bug fixes it is now functionally
correct and provides the perf boost we are seeking.

**Unit tests:**
Depthwise convolution tests already exist in the
codebase. I don't see a need for new ones at this point.

**Kernel benchmarking:**
This is the kernel-level perf improvement from the MLAS Conv benchmarks
(about a 50% kernel latency improvement):

<img width="1055" height="90" alt="image"
src="https://github.com/user-attachments/assets/ead9eb83-2d62-4157-a065-70c67c8c7517"
/>

### Motivation and Context
A key customer model has a few depthwise convolution operations, and this
change provides a **non-negligible ~3% throughput improvement** using
the customer-provided benchmarking setup.

For those interested,
microsoft#26654 adds support for the
same type of convolution variant but leverages SME1/SME2 through
KleidiAI. This PR is conceptually the same but targets NEON-only
platforms.
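
For reference, a NumPy sketch of the convolution variant this kernel targets
(3x3 filter, stride 1, dilation 1). The loop over the nine filter taps is the
shape of computation a NEON kernel vectorizes, in contrast to materializing
Im2Col patches; shapes are illustrative, not the MLAS code:

```python
import numpy as np

def depthwise_conv3x3(x, w, pad=1):
    # x: (C, H, W) input, w: (C, 3, 3) per-channel filters; stride 1, dilation 1.
    c, h, wd = x.shape
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    out = np.zeros((c, h + 2 * pad - 2, wd + 2 * pad - 2), dtype=x.dtype)
    for kh in range(3):
        for kw in range(3):
            # Each filter tap is a shifted elementwise multiply-accumulate,
            # which vectorizes directly across channels/columns.
            out += w[:, kh, kw][:, None, None] * xp[:, kh:kh + out.shape[1], kw:kw + out.shape[2]]
    return out

x = np.random.rand(16, 32, 32).astype(np.float32)
w = np.random.rand(16, 3, 3).astype(np.float32)
y = depthwise_conv3x3(x, w)  # (16, 32, 32) with pad=1
```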

---------

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…microsoft#26060)

### Description
Small change to make QNN Preprocess accept a Mul node (with A=B)
in place of a Pow node (with Y=2) for layernorm fusion.
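
The equivalence behind the change, as a small NumPy check: squaring the
mean-deviation with Pow(d, 2) or Mul(d, d) yields identical variances, so the
layernorm fusion can accept either pattern (a sketch, not the fusion code):

```python
import numpy as np

x = np.random.rand(4, 8).astype(np.float32)
mean = x.mean(axis=-1, keepdims=True)
d = x - mean
# The LayerNorm variance subgraph may square the deviation with either
# Pow(d, 2) or Mul(d, d); both produce identical values, so the fusion
# can accept a Mul node whose two inputs are the same tensor.
var_pow = np.mean(d ** 2, axis=-1, keepdims=True)
var_mul = np.mean(d * d, axis=-1, keepdims=True)
assert np.allclose(var_pow, var_mul)
```
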
**Key changes**

This PR improves the performance of dynamic QGemms by
implementing tiling and threading across operations.

The changes introduce thread-local buffers for reusing memory during
inference, and use them in dynamically quantized MatMul operations
backed by KleidiAI kernels.

It also updates KleidiAI to version 1.15.0.

**Example performance**

single thread : 
<img width="2100" height="900"
alt="ort_ops_compare_encoder_1_2025-10-02_17-21-32_vs_encoder_1_2025-10-02_16-54-55"
src="https://github.com/user-attachments/assets/c23c808d-5fab-4995-997e-a57a66a23d68"
/>

2 threads :
<img width="2100" height="900"
alt="ort_ops_compare_encoder_2_2025-10-02_17-21-47_vs_encoder_2_2025-10-02_16-55-13"
src="https://github.com/user-attachments/assets/31a0eb7a-7ff4-40c9-9425-b70231f131e8"
/>

---------

Signed-off-by: melkap01 <melike.kaptan@arm.com>
Signed-off-by: Jonathan Clohessy <jonathan.clohessy@arm.com>
Co-authored-by: Damien Dooley <damien.dooley@arm.com>
Co-authored-by: Jonathan Clohessy <jonathan.clohessy@arm.com>
### Description
As title - it looks like the duration of the job is very close to the
timeout

### Motivation and Context
Reduce retry attempts for the iOS sim job.

My own PR (microsoft#26688) keeps
timing out on this job leg.
…crosoft#26988)

### Description
Enable device stream override using RunOptions for a particular run.
The stream is restored after Run() completes.

### Motivation and Context
GPU InterOp requirements.
When enabled, inference runs on the specified stream, with
proper synchronization against imported external synchronization
facilities.
Supported operations with QMX: SGEMM, QGEMM, Convolution
### Description
Use the descriptor struct in the external resource handle. 

We were copying most fields, but the setup is a little more intuitive
when the descriptor is used directly.

…7034)

### Description
Add support for the QuickGELU operator in the QNN provider:
- Implement QuickGeluOpBuilder to handle QuickGELU operations
- Add registration for QuickGELU in op_builder_factory
- Add comprehensive tests for CPU and HTP backends
- Support both float and quantized (QDQ) versions

### Motivation and Context
- QNN doesn't have a direct operator to map QuickGelu to, so it is
decomposed as x * sigmoid(alpha * x), keeping the whole model on HTP to
improve inference time.
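
A reference NumPy sketch of the decomposition (alpha defaults to 1.702 for
the QuickGelu contrib op; treat the exact default value as an assumption here):

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def quick_gelu(x, alpha=1.702):
    # QuickGelu as decomposed for QNN HTP: x * Sigmoid(alpha * x).
    # alpha = 1.702 is assumed to be the contrib op's default.
    return x * sigmoid(alpha * x)

x = np.linspace(-3, 3, 7, dtype=np.float32)
print(quick_gelu(x))
```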

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
## Description

This PR adds a BF16 (bfloat16) pointwise convolution kernel for ARM64
NCHWc format, leveraging the existing SBGEMM infrastructure. When the
`mlas.enable_gemm_fastmath_arm64_bfloat16` session option is enabled on
supported ARM64 Linux hardware, Pointwise Conv is rerouted to use this
BF16 implementation. This is an opt-in feature, similar to how BF16
matmul is opt-in.

Added a bool ZeroMode field to `MLAS_SBGEMM_DATA_PARAMS` (default `true`
for backward compatibility) to enable per-batch control over output
accumulation. This mirrors the beta parameter in FP32's `MlasGemmBatch`
and is required for Pointwise convolutions with >128 input channels,
where multiple GEMM calls must accumulate into the same output buffer.

## Motivation and Context

The existing `mlas.enable_gemm_fastmath_arm64_bfloat16` session option
accelerates MatMul operations on ARM64 processors with BF16 support, but
convolution operations did not benefit from this optimization. Pointwise
convolutions (1x1 kernels) are essentially batched matrix
multiplications.

This change extends the BF16 fastmath optimization to pointwise NCHWc
convolutions, reusing the same session option. The implementation
mirrors the FP32 pointwise kernel structure while delegating the actual
computation to SBGEMM, ensuring correctness and maintainability.
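
A NumPy sketch of why a pointwise convolution is a GEMM, and why the
>128-input-channel path needs accumulation (the ZeroMode=false behavior,
analogous to beta=1 in SGEMM); shapes and the 128 chunk size follow the
description above and are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
cin, cout, h, w = 256, 64, 14, 14
x = rng.random((cin, h * w), dtype=np.float32)   # 1x1 conv input as a (Cin, H*W) matrix
k = rng.random((cout, cin), dtype=np.float32)    # pointwise kernel as (Cout, Cin)

ref = k @ x  # the whole pointwise conv is one GEMM

# With >128 input channels the kernel splits Cin into chunks; every GEMM
# after the first must accumulate into the same output buffer
# (ZeroMode=false) instead of overwriting it.
out = np.zeros_like(ref)
for c0 in range(0, cin, 128):
    out += k[:, c0:c0 + 128] @ x[c0:c0 + 128, :]

assert np.allclose(ref, out, atol=1e-4)
```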

## Performance improvement
Measured a 15-20% gain on Mobilenet inference on an AWS Graviton4
instance.

Before (FP32)
```
./build/Linux/Release/onnxruntime_perf_test -C "mlas.enable_gemm_fastmath_arm64_bfloat16|0" -x 32 -I -m times -r 2000 ~/scripts/mobilenet.onnx

Number of inferences per second: 559.154
```

After (BF16)
```
./build/Linux/Release/onnxruntime_perf_test -C "mlas.enable_gemm_fastmath_arm64_bfloat16|1" -x 32 -I -m times -r 2000 ~/scripts/mobilenet.onnx

Number of inferences per second: 651.221

```
…microsoft#27049)

### Description
Fix some issues when building with the latest CUDA and cuDNN versions on
Windows.

* Latest cuDNN install has the CUDA toolkit version in the path.
  * Adjust cmake files to support that.
* CUDA 13.x drops support for compute capability 6.0 and 7.0.
  * Remove from CMAKE_CUDA_ARCHITECTURES.
* Remove a LINK_LANGUAGE:CUDA flag for CETCOMPAT.
  * The syntax doesn't seem to be supported with MSVC. Build is successful
without this (CUDA 13.1, cuDNN 9.17).
  * `LINK : warning LNK4044: unrecognized option '/Xlinker=/CETCOMPAT';
ignored
[D:\src\github\ort.cuda\build\Windows.CUDA\Debug\onnxruntime_providers_cuda_ut.vcxproj]`
* Memory leak checker fixes
  * onnxruntime_providers_cuda_ut was incorrectly linking against ORT
common, causing a duplicate symbol when the debug leak checker is enabled
(multiple overrides of `new` and `delete`).
  * As the CUDA EP is built as a separate library it shouldn't need to
link against `common`.
  * Use the debug alloc/free for the provider bridge when the leak checker
is enabled.
  * Ignore EtwEventWriteNoRegistration in leak checker output as we don't
control that.


### Description

upgrade emsdk to 4.0.23 from 4.0.21


### Motivation and Context

This version fixes a problem that breaks the build on Windows when
using emscan-deps.bat.
…stom ops (microsoft#27050)

### Description

The two newly added APIs, `CreateCustomOpDomains()` and
`GetNumCustomOpDomains`, are used when running inference on a model that
contains EP-specific custom operations.

Workflow:
1. The EP implements these functions to supply a list of
`OrtCustomOpDomain` instances.
2. The application either 1) calls
`SessionOptionsAppendExecutionProvider_V2()` with an `OrtEpDevice`
containing the plugin EP's factory, or 2) enables auto EP selection.
3. ORT then either 1) appends the provided OrtCustomOpDomains to the
session options in `SessionOptionsAppendExecutionProvider_V2()`, or
2) registers the OrtCustomOpDomains from the selected EP devices.

As a result, any session created from these session options will have
these custom op domains registered in ORT, ensuring that the custom ops
are properly recognized and validated when the model is loaded.

Plugin EPs can provide two types of custom ops:
1. A full OrtCustomOp with a concrete kernel implementation
   - This Example EP demonstrates this approach.
   - In GetCapability(), it calls EpGraphSupportInfo_AddSingleNode() to
inform ORT that the custom node should NOT be fused or compiled. Instead,
ORT should invoke the custom node's Compute() function at runtime.
2. A "placeholder" OrtCustomOp with an empty kernel implementation
   - A compile-based plugin EP can supply an OrtCustomOp whose
CustomKernel::Compute() does nothing. The purpose is to satisfy model
validation during model loading by registering the custom op as a valid
operator in the session.
   - In GetCapability(), the EP should call
EpGraphSupportInfo_AddNodesToFuse() to notify ORT that this custom node
should be fused and compiled by the EP.
   - In Compile(), the EP executes its compiled bits to perform inference
for the fused custom node.


### Motivation and Context

Currently, the provider-bridge TRT RTX EP and TRT EP support
registering a custom op domain list in session options so
that they can run models containing TRT-specific custom ops.

This PR adds the same feature for plugin EPs.
## Reland reason

Reland microsoft#26466

The previous PR was reverted because it failed the following tests:
1. Windows GPU CUDA CI Pipeline Test Job
2. Windows GPU TensorRT CI Pipeline Test Job

This PR includes the correct
[fix](microsoft@cc7a947).

---

## Description

This pull request introduces significant improvements and expanded
support for multi-head attention kernels in ONNX Runtime, particularly
focusing on supporting both 3D (`BSNH`) and 4D (`BNSH`) QKV input
formats. The changes enhance flexibility, correctness, and
maintainability for attention operations across CPU and CUDA
implementations.

### Expanded QKV Input Format Support

* Added support for 4D QKV input format (`Q_K_V_BNSH`) in CUDA attention
kernels, including proper handling for both cases with and without
past/present states, and enforcing that bias is not supported for this
format. This includes logic to avoid unnecessary transposes and to write
outputs directly when possible.
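
For orientation, the two layouts differ only by a reshape and transpose
(a NumPy sketch; dimensions are illustrative):

```python
import numpy as np

batch, seq, num_heads, head_size = 2, 16, 8, 64

# 3D "BSNH" packs heads into the hidden dim: (B, S, N*H).
q_3d = np.random.rand(batch, seq, num_heads * head_size).astype(np.float32)

# 4D BNSH is a reshape plus transpose of that layout; supporting BNSH
# natively lets the kernel skip this copy when inputs already arrive as BNSH.
q_bnsh = q_3d.reshape(batch, seq, num_heads, head_size).transpose(0, 2, 1, 3)
assert q_bnsh.shape == (batch, num_heads, seq, head_size)
```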

### Kernel and Operator Documentation Updates

* Updated `OperatorKernels.md` to document the new `Attention` operator
inputs and outputs for both 3D and 4D formats, specifying supported
tensor types for each input.

### Correctness and Consistency Fixes

* Fixed the computation of causal attention indices in CUDA softmax
kernels by clarifying and correcting the offset calculation for causal
masking.
* Updated workspace allocation logic for QKV preparation to ensure
correct workspace usage for new formats.

### Attention Parameter and Helper Refactoring

* Added `is_output_bnsh` field to `AttentionParameters` to indicate
output format and updated logic to use this for output placement and
transposition decisions.
* Refactored CPU attention implementation to use the new
`attention_helper` namespace for output mode enums and output shape
computation, improving code clarity and maintainability.

### Minor Cleanups

* Removed outdated asserts and improved debug output strings for QKV
preparation functions to clarify format and state handling.

These changes collectively improve the flexibility, correctness, and
maintainability of attention kernel implementations in ONNX Runtime,
especially for advanced transformer models and large language model
workloads.

**NOT supported in this PR**
- Boolean mask
- GQA
- Softcap
- Softmax precision
- qk_output_mode other than -1 and 0
- **is_causal=True && q_sequence_length != kv_sequence_length**
### Description

Adds new pipeline for CUDA 13 Nuget builds


### Motivation and Context

The artifacts from this pipeline will be used by the release pipeline to
publish the nuget packages to our public feed.

---------

Co-authored-by: Copilot <198982749+Copilot@users.noreply.github.com>
Co-authored-by: eserscor <247253654+eserscor@users.noreply.github.com>
### Description

This PR mainly modifies the followings:

- Update `Graph_GetGraphView()` implementation.
- Make sure EpGraph maintains the min/max node index, so that when
querying node outside that range, it can return null.
- Provide option to create an EpGraph that contains its parent node when
the graph is the subgraph of a control flow op.


#### Update Graph_GetGraphView() implementation
In some cases, e.g. when the model has a node whose output is consumed
by multiple nodes, calling the current implementation of
`Graph_GetGraphView()` to get a subgraph returns an incorrect `OrtGraph`.
- Original graph:

<img width="414" height="356" alt="image"
src="https://github.com/user-attachments/assets/739c092d-0880-4f6e-9351-e08e0e141b35"
/>


- Incorrect graph after calling `Graph_GetGraphView()` to get the
subgraph:

  It includes three of the nodes from the original graph.
`topk_indices` is an output of `TopK` and it shouldn't be added
as a graph input as shown in the graph below.
  The API implementation has an issue handling this case:
if we feed this subgraph into the TRT parser, it fails to parse the
graph.
  
<img width="349" height="341" alt="image"
src="https://github.com/user-attachments/assets/1306e22c-7c5d-45a2-bc18-6864fa2966ba"
/>

- Correct graph after calling `Graph_GetGraphView()` to get the
subgraph:

  It includes three of the nodes from the original graph.
`topk_indices` is now not added as a graph input. Instead, it
is added as a graph output, which is expected: `Mod` is
in another subgraph that consumes it, so this subgraph has to make
`topk_indices` a graph output.

<img width="413" height="350" alt="image"
src="https://github.com/user-attachments/assets/b9135690-a341-41b2-9495-184030ab5cff"
/>
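
A minimal Python sketch of the edge-classification rule the fix implements
(generic graph logic, not the actual ORT implementation): a value produced
inside the view is never a view input, and it becomes a view output when
something outside the view consumes it, as `Mod` consumes `topk_indices` here:

```python
def classify_edges(nodes_in_view, all_consumers):
    # nodes_in_view: list of (name, inputs, outputs) tuples selected for the view.
    produced = {o for _, _, outs in nodes_in_view for o in outs}
    consumed = {i for _, ins, _ in nodes_in_view for i in ins}
    view_names = {n for n, _, _ in nodes_in_view}

    # Inputs: values consumed inside the view but produced elsewhere.
    graph_inputs = consumed - produced
    # Outputs: values produced inside the view that something outside consumes.
    graph_outputs = {o for o in produced
                     if any(c not in view_names for c in all_consumers.get(o, []))}
    return graph_inputs, graph_outputs

nodes = [("TopK", ["x"], ["topk_values", "topk_indices"]),
         ("Relu", ["topk_values"], ["y"])]
consumers = {"topk_values": ["Relu"], "topk_indices": ["Mod"]}  # Mod is outside the view
ins, outs = classify_edges(nodes, consumers)
assert "topk_indices" in outs and "topk_indices" not in ins
```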


### Description

This PR fixes a bug in `im2col` for `pads` in some dimensions.
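
For context, a generic NumPy im2col with per-side pads (a sketch of the
transformation, not the MLAS code; the bug concerned handling of `pads`
values in some dimensions):

```python
import numpy as np

def im2col_2d(x, kh, kw, pads):
    # x: (H, W); pads = (pad_top, pad_left, pad_bottom, pad_right).
    # Padding can differ per side and per dimension.
    pt, pl, pb, pr = pads
    xp = np.pad(x, ((pt, pb), (pl, pr)))
    oh = xp.shape[0] - kh + 1
    ow = xp.shape[1] - kw + 1
    cols = np.empty((kh * kw, oh * ow), dtype=x.dtype)
    for i in range(kh):
        for j in range(kw):
            cols[i * kw + j] = xp[i:i + oh, j:j + ow].ravel()
    # Column p holds the flattened receptive field of output position p.
    return cols

patches = im2col_2d(np.arange(16.0).reshape(4, 4), 3, 3, (1, 0, 1, 0))
```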

…utputs unittest (microsoft#26975)

### Description

- Added dimension override in topk_and_multiple_graph_outputs model to
match K value

### Motivation and Context
- The topk_and_multiple_graph_outputs model in the TestSessionOutputs unit
test was failing due to a shape constraint: `TopK: length of reduction
axis (N) is smaller than K (300) Condition '==' violated: 0 != 1`
### Description 
Add support for the FusedMatMul operator in the QNN execution provider.
 FusedMatMul is a contrib operator in the Microsoft domain that performs
a fused matrix multiplication with optional bias addition and
activation.

Implementation details:
- Added FusedMatMulOpBuilder class that decomposes FusedMatMul into:
  1. MatMul operation
  2. Optional bias addition
  3. Optional activation (Relu, Sigmoid, Tanh, Gelu)
- Handles various attributes: transA, transB, alpha, and activation
- Supports higher rank tensors and different data types
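
A NumPy sketch of the reference semantics being decomposed (bias and
activation handling as described above; argument names are illustrative):

```python
import numpy as np

def fused_matmul(a, b, alpha=1.0, trans_a=False, trans_b=False,
                 bias=None, activation=None):
    # Reference semantics of the decomposition: MatMul, optional bias Add,
    # optional activation, honoring transA/transB/alpha.
    a = np.swapaxes(a, -1, -2) if trans_a else a
    b = np.swapaxes(b, -1, -2) if trans_b else b
    y = alpha * (a @ b)
    if bias is not None:
        y = y + bias
    if activation is not None:
        y = activation(y)
    return y

a = np.random.rand(2, 3, 4).astype(np.float32)
b = np.random.rand(2, 5, 4).astype(np.float32)
y = fused_matmul(a, b, alpha=0.5, trans_b=True,
                 activation=lambda v: np.maximum(v, 0.0))  # Relu; y: (2, 3, 5)
```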

Added comprehensive tests:
- Basic functionality tests with various configurations
- Tests for both CPU and HTP backends
- QDQ (Quantize-Dequantize) tests for 8-bit and 16-bit precision


### Motivation and Context
Since QNN HTP doesn't support FusedMatMul directly, we decompose it into
QNN-HTP-supported operators to improve the inference time of customer
models that contain the FusedMatMul operator.
### Description
This test seems to be flaky and fails the `Linux QNN CI Pipeline`. Disabling
this test until I figure out the root cause of the inaccuracy.


### Motivation and Context
…ft#27083)

# Fix Doxygen documentation build errors from recent PRs

Fixes multiple Doxygen errors introduced by recent API additions that
cause the nightly documentation build to fail (`WARN_AS_ERROR=YES`).

## Root Cause Analysis

| Error | File | Line | Introduced By | Commit | Fix |
|-------|------|------|---------------|--------|-----|
| Duplicate `\addtogroup Global` | onnxruntime_c_api.h | 973 | PR microsoft#26828 - OrtExternalResourceImporter API | c54be3c | Remove redundant group markers |
| Unresolved `::SetSessionLogSeverityLevel()` | onnxruntime_c_api.h | 1065 | PR microsoft#26971 - CreateEnvWithOptions API | 3874516 | Use `OrtApi::SetSessionLogSeverityLevel` |
| Unresolved `::RunOptionsSetRunLogSeverityLevel()` | onnxruntime_c_api.h | 1066 | PR microsoft#26971 - CreateEnvWithOptions API | 3874516 | Use `OrtApi::RunOptionsSetRunLogSeverityLevel` |
| `<ep_name>` interpreted as HTML | onnxruntime_c_api.h | 1119 | PR microsoft#26971 - CreateEnvWithOptions API | 3874516 | Escape as `\<ep_name\>` |
| `\param[in] importer` not found | onnxruntime_c_api.h | 7982 | PR microsoft#26828 - OrtExternalResourceImporter API | c54be3c | Use `\param[in] input` (macro expands to `input`) |
| `\param[in] handle` not found | onnxruntime_c_api.h | 8025 | PR microsoft#26828 - OrtExternalResourceImporter API | c54be3c | Use `\param[in] input` |
| `\param[in] handle` not found | onnxruntime_c_api.h | 8091 | PR microsoft#26828 - OrtExternalResourceImporter API | c54be3c | Use `\param[in] input` |
| Unresolved `::CreateLoopKernel()` | onnxruntime_ep_c_api.h | 667 | PR microsoft#26927 - Control flow kernels API | 1ed8fd9 | Use `OrtEpApi::CreateLoopKernel` |
| Unresolved `::CreateScanKernel()` | onnxruntime_ep_c_api.h | 710 | PR microsoft#26927 - Control flow kernels API | 1ed8fd9 | Use `OrtEpApi::CreateScanKernel` |
| `<ep_name>` interpreted as HTML | onnxruntime_ep_c_api.h | 1434 | PR microsoft#26971 - CreateEnvWithOptions API | 3874516 | Escape as `\<ep_name\>` |
| `\param[out] out` not found | onnxruntime_ep_c_api.h | 1440 | PR microsoft#26971 - CreateEnvWithOptions API | 3874516 | Use `\param[out] config_entries` |

## Summary by PR

| PR | Issues |
|----|--------|
| **microsoft#26828** (c54be3c) - OrtExternalResourceImporter API for D3D12 | Duplicate Doxygen group, incorrect `\param` names for `ORT_CLASS_RELEASE` macros |
| **microsoft#26927** (1ed8fd9) - Control flow kernels API | `::Method()` syntax unresolvable by Doxygen |
| **microsoft#26971** (3874516) - CreateEnvWithOptions API | `::Method()` syntax, `<ep_name>` HTML interpretation, incorrect param name |

## Technical Details

### `ORT_CLASS_RELEASE` Macro Issue

The `ORT_CLASS_RELEASE(X)` macro at line 164 expands to:
```cpp
void(ORT_API_CALL * Release##X)(_Frees_ptr_opt_ Ort##X * input)
```

The parameter is always named `input`, but the documentation in PR
microsoft#26828 used semantic names like `importer` and `handle`. Doxygen
validates `\param` names against actual parameter names in the expanded
code.

### Doxygen Link Resolution

Doxygen 1.9.8 cannot resolve `::MethodName()` as a link to a method. The
correct syntax is to qualify with the struct name: `OrtApi::MethodName`.

## Testing

Verified locally with Doxygen 1.9.8 (matches CI configuration).
### Description
Fix a build error in the dump node inputs and outputs build option.
…27055)

### Description

move IO binding test to onnxruntime_provider_tests from
onnxruntime_test_all

### Motivation and Context

Currently, a few unit test cases are included in `onnxruntime_test_all`.
This works well until we want to support WebGPU EP as a plugin EP.

The main difference between `onnxruntime_test_all` and
`onnxruntime_provider_tests` is that the latter respects plugin EP
configuration. The IO Binding test cases involve EP configuration (e.g.
"enableGraphCapture" for WebGPU), so they no longer work with the WebGPU EP.
### Description
Fix GPU JAR testing


### Motivation and Context
Testing JAR for GPU was missing libcustom_library.so on Linux.

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Increase version number to 1.25.0.
…soft#26942)

### Description
This PR migrates the `OIHW2OHWI` Program from `Im2ColMatMul` to the
`Transpose` operator. By centralizing this logic, we leverage the
specialized shader to optimize generic 4D transpositions (specifically
the {0, 2, 3, 1} permutation pattern) while reducing code duplication.

While this shader is capable of supporting 2D/3D transpositions, those
optimizations are reserved for follow-up PRs.

### Motivation and Context
See above.
adrastogi and others added 23 commits January 21, 2026 14:10
…el metadata (microsoft#27015)

### Description
This change proposes a new helper ORT API for callers that need to
extract the model compatibility string from a precompiled model.

### Motivation and Context
See microsoft#25749 for more background on the model compatibility concept and
infrastructure; microsoft#25841 provides a related helper API that an application
can call to do a validation check using the compatibility info string.
However, there is no direct way to get to the model metadata without
creating a session (which some callers may prefer to avoid) or taking
a dependency on a separate library to parse the model's protobuf (which
again callers may prefer to avoid).

This change proposes a separate helper API which can be used to retrieve
the compatibility info string, thereby avoiding session creation or an
external dependency. This does incur some redundant work, in that the
model protobuf will be parsed again during session creation, but for
some callers this tradeoff may be acceptable.

---------

Co-authored-by: Aditya Rastogi <adityar@ntdev.microsoft.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Copilot <198982749+Copilot@users.noreply.github.com>
Co-authored-by: adrastogi <8368026+adrastogi@users.noreply.github.com>
…osoft#27037)

### Description
The current infrastructure for validating compatibility of a precompiled
model does the check after session initialization occurs, which turns
out to be quite costly. The check should ideally happen beforehand, to
short-circuit those expensive operations.


### Motivation and Context
This change will make it more tractable for applications to rely on the
existing session machinery to check compatibility of any of their
models.

Co-authored-by: Aditya Rastogi <adityar@ntdev.microsoft.com>
### Description
- factor duplicated test target settings into helper functions
- reuse helpers for onnxruntime_test_all and onnxruntime_provider_test
- keep target-specific settings intact


### Motivation and Context

There is some duplicated code in the onnxruntime_unittests. Originally
there was only one unit test, `onnxruntime_test_all`, and later it was
split into two: `onnxruntime_test_all` and `onnxruntime_provider_test`.
Some lines for setting up build flags were simply copied. This creates a
potential risk of future inconsistency.

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…itional API structs. (microsoft#27100)

### Description

Fix OrtApi 1.24 API size static_assert violation triggered by addition
of new APIs in
microsoft@f481b17.

Add version update instructions for updating additional API structs.

### Motivation and Context

Fix build on main.

Add info about other API structs to version update instructions.
### Description

This change adds PCIe bus_id to the properties detected
during Linux device discovery.

This property is used to enable device discovery on Linux for the
TRT-RTX execution provider.

### Motivation and Context
I want to use device discovery for TRT-EP also on Linux.


These changes have already been tested with the newly added inference
samples:
microsoft/onnxruntime-inference-examples#529.

@gedoensmax for visibilty
Some PRs that use core/common/inlined_containers.h can cause failures in
the CUDA CI pipeline.

```
E:\_work\_temp\build\RelWithDebInfo\vcpkg_installed\x64-windows-static-md\include\absl/hash/internal/hash.h(481): error #68-D: integer conversion resulted in a change of sign [E:\_work\_temp\build\RelWithDebInfo\onnxruntime_providers_cuda.vcxproj]
          sizeof(T) == -1,
                       ^
  Remark: The warnings can be suppressed with "-diag-suppress <warning-number>"

E:\_work\_temp\build\RelWithDebInfo\vcpkg_installed\x64-windows-static-md\include\absl/hash/hash.h(337): error #549-D: variable "s" is used before its value is set [E:\_work\_temp\build\RelWithDebInfo\onnxruntime_providers_cuda.vcxproj]
        return s;
               ^
E:\_work\_temp\build\RelWithDebInfo\vcpkg_installed\x64-windows-static-md\include\absl/container/internal/raw_hash_set.h(468): error #69-D: integer conversion resulted in truncation [E:\_work\_temp\build\RelWithDebInfo\onnxruntime_providers_cuda.vcxproj]
          static_cast<uint16_t>(reinterpret_cast<uintptr_t>(&seed));
                      ^
  3 errors detected in the compilation of "E:/_work/onnxruntime/onnxruntime/onnxruntime/contrib_ops/cuda/sparse/block_mask.cu".
```

This change adds a patch to Abseil to mitigate those failures.


This solution has been verified to be effective in PR
microsoft#27087.
### Description
Support run-level profiling

This PR adds support for profiling individual Run executions, similar to
session-level profiling. Developers can enable run-level profiling by
setting `enable_profiling` and `profile_file_prefix` in RunOptions. Once
the run completes, a JSON profiling file will be saved using
profile_file_prefix + timestamp.

<img width="514" height="281" alt="png (2)"
src="https://github.com/user-attachments/assets/8a997068-71d9-49ed-8a5c-00e0fa8853af"
/>
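
A sketch of the intended usage from Python, assuming the `enable_profiling`
and `profile_file_prefix` entries named above are exposed as RunOptions
config entries (key names taken from this description; model path and input
handling are illustrative):

```python
import numpy as np
import onnxruntime as ort

sess = ort.InferenceSession("model.onnx")  # model path is illustrative

run_options = ort.RunOptions()
# Key names as given in this PR's description; assumed to be exposed
# as RunOptions config entries.
run_options.add_run_config_entry("enable_profiling", "1")
run_options.add_run_config_entry("profile_file_prefix", "my_run_profile")

# Assumes a single float32 input; dynamic dims are pinned to 1 for the demo.
inp = sess.get_inputs()[0]
shape = [d if isinstance(d, int) else 1 for d in inp.shape]
sess.run(None, {inp.name: np.zeros(shape, dtype=np.float32)}, run_options=run_options)
# A my_run_profile_<timestamp>.json trace is written when this Run completes.
```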


### Key Changes
1. Introduced a local variable `run_profiler` in
`InferenceSession::Run`, which is destroyed after the run completes.
Using a dedicated profiler per run ensures that profiling data is
isolated and prevents interleaving or corruption across runs.
2. To maintain accurate execution time when both session-level and
run-level profiling are enabled, overloaded `Start` and
`EndTimeAndRecordEvent` functions have been added. These allow the
caller to provide timestamps instead of relying on
`std::chrono::high_resolution_clock::now()`, avoiding potential timing
inaccuracies.
3. Added a TLS variable `tls_run_profiler_` to support run-level
profiling with WebGPU Execution Provider (EP). This ensures that when
multiple threads enable run-level profiling, each thread logs only to
its own WebGPU profiler, keeping thread-specific data isolated.
4. Use `HH:MM:SS.mm` instead of `HH:MM:SS` in the JSON filename to
prevent conflicts when profiling multiple consecutive runs.

### Motivation and Context
Previously, profiling was available only at the session level. Developers
sometimes want to profile a specific run, hence this PR.


### Some details

When profiling is enabled via RunOptions, it should ideally collect two
types of events:
1. Profiler events,
used to calculate the CPU execution time of each operator.
2. Execution Provider (EP) profiler events,
used to measure GPU kernel execution time.

Unlike session-level profiling, we need to ensure that event collection
is correct in multi-threaded scenarios.

For 1, this can be supported easily (sequential_executor.cc). We use a
thread-local storage (TLS) variable, RunLevelState (defined in
profiler.h), to maintain run-level profiling state for each thread.

For 2, each Execution Provider (EP) has its own profiler implementation,
and each EP must ensure correct behavior under run-level profiling. This
PR ensures that the WebGPU profiler works correctly with run-level
profiling.

# Test Cases

| Scenario | Example | Expected Result |
|----------|---------|-----------------|
| Concurrent runs on the same session with different run-level profiling settings | t1: `sess1.Run({ enable_profiling: true })`<br>t2: `sess1.Run({ enable_profiling: false })`<br>t3: `sess1.Run({ enable_profiling: true })` | Two trace JSON files are generated: one for `t1` and one for `t3`. |
| Run-level profiling enabled together with session-level profiling | `sess1 = OrtSession({ enable_profiling: true })`<br>`sess1.Run({ enable_profiling: true })` | Two trace JSON files are generated: one corresponding to session-level profiling and one corresponding to run-level profiling. |
Fix [microsoft#27079](microsoft#27079) -
Qwen3 model quality regression on CUDA backend.

### Root Cause Analysis

The parity issue was caused by **buffer pointer misconfiguration** in
the GQA (Group Query Attention) QKV preprocessing pipeline. The original
implementation used multiple separate kernels for:
1. Unpacking the packed QKV tensor
2. Applying RoPE (Rotary Position Embedding) to Q and K
3. Appending K/V to cache

This multi-kernel approach created opportunities for misconfiguration:
- Buffers were allocated but not properly used
- Pointers could reference memory that was not yet allocated or
initialized
- Buffer sharing logic was fragmented across different code paths

### Solution
Consolidate QKV preprocessing into a **single fused kernel**
(`UnpackRoPEAppend`) that performs all operations in one pass:
1. **Unified kernel design**: A single kernel handles unpacking, RoPE
application, and cache append operations
2. **Simplified buffer management**: The new `PrepareQKV` function
clearly manages buffer allocation and ensures proper initialization
3. **Explicit past-to-present cache copy**: When
`past_present_share_buffer` is false, explicitly copy past KV cache to
present buffer before appending new tokens
4. **Zero-initialization for non-shared buffers**: Clear present KV
buffers when not sharing with past to ensure deterministic output
### Changes Summary

| File | Changes |
|------|---------|
| `group_query_attention_qkv.cuh` | New fused `UnpackRoPEAppend` kernel with shared memory optimization for non-interleaved RoPE |
| `group_query_attention_impl.cu` | New `PrepareQKV` helper function that orchestrates buffer setup and kernel launch |
| `group_query_attention.cc` | Simplified operator logic by delegating QKV prep to unified helper |
| `test_gqa.py` | Enhanced test coverage for various QKV configurations |
### Key Improvements
- **Reduced kernel launches**: From 4-5 separate kernel calls to a
single fused kernel
- **Better memory safety**: All buffer pointers are validated in a
single location
- **Improved RoPE handling**: Uses shared memory for efficient
non-interleaved RoPE computation
- **Deterministic output**: Explicit buffer initialization ensures
consistent results across runs
- **Compatible with quantized KV cache**: The new preprocessing kernel
design supports future quantization work
### Testing
- All existing GQA unit tests pass
- Verified Qwen3 model no longer produces gibberish output
- Tested both fp16/bf16 and various head configurations
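
For reference, a NumPy sketch of non-interleaved RoPE, the variant the fused
kernel stages through shared memory (it rotates the first half of the head
dimension against the second half; a generic sketch, not the kernel code):

```python
import numpy as np

def rope_non_interleaved(x, pos, theta=10000.0):
    # x: (seq, head_size). Non-interleaved RoPE pairs dim d with dim
    # d + head_size/2 (rather than even/odd interleaved pairs).
    seq, d = x.shape
    half = d // 2
    freqs = pos[:, None] / theta ** (np.arange(half)[None, :] / half)
    cos, sin = np.cos(freqs), np.sin(freqs)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

x = np.random.rand(4, 8).astype(np.float32)
out = rope_non_interleaved(x, np.arange(4))
```
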
…icrosoft#24931)

## Problem

QNN error messages were being logged at VERBOSE level instead of ERROR
level, making them invisible unless verbose logging was enabled. Users
would only see unhelpful generic error messages like:

```
Failed to finalize QNN graph. Error code: 1002 at location qnn_model.cc:167 FinalizeGraphs
```

But the actual detailed error messages from QNN were hidden in verbose
logs:

```
tcm_migration.cc:2088:ERROR:Operator named q::*InputSlicePad (0x1654900000002) not sufficiently tiled to fit in TCM. Requires 12441600 bytes
graph_prepare.cc:2808:ERROR:Graph prepare TCM Migration action failed
graph_prepare.cc:2868:ERROR:Graph prepare failed during optimization with err: 17, Fatal Optimize
```

## Root Cause

The `QnnLogging` callback function in `qnn_backend_manager.cc` was
ignoring the `level` parameter from QNN and hardcoding all messages as
`kVERBOSE` severity:

```cpp
void QnnLogging(const char* format, QnnLog_Level_t level, uint64_t timestamp, va_list argument_parameter) {
  ORT_UNUSED_PARAMETER(level);  // ❌ Ignoring the actual log level
  // ...
  const auto severity = ::onnxruntime::logging::Severity::kVERBOSE;  // ❌ Hardcoded as VERBOSE
```

## Solution

Modified the `QnnLogging` function to properly map QNN log levels to
appropriate ORT severity levels:

- `QNN_LOG_LEVEL_ERROR` → `logging::Severity::kERROR` ✅ **Key fix**
- `QNN_LOG_LEVEL_WARN` → `logging::Severity::kWARNING`
- `QNN_LOG_LEVEL_INFO` → `logging::Severity::kINFO`
- `QNN_LOG_LEVEL_VERBOSE/DEBUG` → `logging::Severity::kVERBOSE`

## Changes Made

1. **Modified `QnnLogging` function**: Removed hardcoded `kVERBOSE` and
added proper level mapping
2. **Added `MapQNNLogLevelToOrtSeverity` function**: For potential
future reuse
3. **Minimal and surgical changes**: Only 37 lines added, 2 removed

## Impact

QNN error messages will now appear as ERROR-level logs in normal logging
output, making debugging much easier for users without requiring verbose
logging to be enabled.

Fixes microsoft#24876.


---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: vraspar <51386888+vraspar@users.noreply.github.com>
Co-authored-by: yuslepukhin <11303988+yuslepukhin@users.noreply.github.com>
Co-authored-by: Dmitri Smirnov <yuslepukhin@users.noreply.github.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Dmitri Smirnov <dmitrism@microsoft.com>
BUG microsoft#27068

---------

Co-authored-by: Yulong Wang <7679871+fs-eire@users.noreply.github.com>
### Description

Fix the bug discovered by microsoft#27014:

```
SkipLayerNormTest.SkipLayerNormBatch2_Skip_Broadcast_No_Batch_Size
SkipLayerNormTest.SkipLayerNormBatch2_Skip_Broadcast_Batch_Size_1
```
### Description  
 - Add new registerInt64Ops option to WebGpuExecutionProviderConfig
- Int64 support now enabled when enable_graph_capture OR
register_int64_ops is true
- Refactor Range kernel registration to support conditional int64
registration
  - Update kernel registry caching to handle all 4 combinations of flags
- Rename parameters from enable_graph_capture to enable_int64 for
clarity
- Add config parsing in webgpu_provider_factory.cc for registerInt64Ops
option

### Motivation
Needed for updating position ids with an ONNX model in GenAI.

Continuous decoding mode: `position_ids[i] = i + total_length -
new_kv_length`

We can use an ONNX model that includes a Range op to update
the position ids:
- Inputs: start (total_length - new_kv_length), limit (total_length), delta (1)
- Output: position_ids (1D tensor of size new_kv_length)
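
A sketch of such a model built with the onnx helper (assuming scalar int64
inputs, per the Range op spec):

```python
import onnx
from onnx import TensorProto, helper

# position_ids = Range(start, limit, delta) with scalar int64 inputs.
graph = helper.make_graph(
    [helper.make_node("Range", ["start", "limit", "delta"], ["position_ids"])],
    "update_position_ids",
    [helper.make_tensor_value_info(n, TensorProto.INT64, []) for n in ("start", "limit", "delta")],
    [helper.make_tensor_value_info("position_ids", TensorProto.INT64, ["new_kv_length"])],
)
model = helper.make_model(graph)
onnx.checker.check_model(model)
# Running this on WebGPU relies on int64 ops, hence the registerInt64Ops option.
```
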
* Extend compile_ep_context to also support plugin eps
* Adds compile_only option to skip execution, can be used when compiling
for virtual devices

compile_ep_context (physical device)
<img width="1259" height="510" alt="image"
src="https://github.com/user-attachments/assets/14650c17-0c8a-4002-a7ce-e8e4c815a516"
/>

compile_ep_context + compile_only (virtual device)
<img width="1262" height="173" alt="image"
src="https://github.com/user-attachments/assets/2f0844cc-5e83-4b2d-bf0a-0d815d9bad29"
/>
…oft#27070)

### Description
This PR fixes the legacy `TODO: fix the warnings` in the `Det` operator.
The arithmetic overflow warning (C26451) is addressed by using `int64_t`
for tensor dimension and batch size calculations, ensuring safe pointer
arithmetic.

### Motivation and Context
- Removes unused warning suppression pragma.
- Prevents potential overflow when handling large batches of matrices.
Added support for engine validation check for EP Context models.

### Motivation and Context
We wanted to implement support for the GetModelCompatibilityForEpDevices()
API and thus have an end-user-facing API for the engine
validation check for EP context models. Added this support and the
necessary function implementation.
Fix microsoft#27125 

It does fix the build issue on Linux, but I am not entirely sure whether
this is the optimal fix.
### Description
Models with corresponding Olive recipes are deprecated.


### Motivation and Context
Olive and Olive-recipes are the entry points for model optimization. We
want onnxruntime to be for runtime only, so we are deprecating examples
that are already present in Olive-recipes.
…t#27134)

Bumps [lodash](https://github.com/lodash/lodash) from 4.17.21 to
4.17.23.
<details>
<summary>Commits</summary>
<ul>
<li><a
href="https://github.com/lodash/lodash/commit/dec55b7a3b382da075e2eac90089b4cd00a26cbb"><code>dec55b7</code></a>
Bump main to v4.17.23 (<a
href="https://redirect.github.com/lodash/lodash/issues/6088">#6088</a>)</li>
<li><a
href="https://github.com/lodash/lodash/commit/19c9251b3631d7cf220b43bc757eb33f1084f117"><code>19c9251</code></a>
fix: setCacheHas JSDoc return type should be boolean (<a
href="https://redirect.github.com/lodash/lodash/issues/6071">#6071</a>)</li>
<li><a
href="https://github.com/lodash/lodash/commit/b5e672995ae26929d111a6e94589f8d03fb8e578"><code>b5e6729</code></a>
jsdoc: Add -0 and BigInt zeros to _.compact falsey values list (<a
href="https://redirect.github.com/lodash/lodash/issues/6062">#6062</a>)</li>
<li><a
href="https://github.com/lodash/lodash/commit/edadd452146f7e4bad4ea684e955708931d84d81"><code>edadd45</code></a>
Prevent prototype pollution on baseUnset function</li>
<li><a
href="https://github.com/lodash/lodash/commit/4879a7a7d0a4494b0e83c7fa21bcc9fc6e7f1a6d"><code>4879a7a</code></a>
doc: fix autoLink function, conversion of source links (<a
href="https://redirect.github.com/lodash/lodash/issues/6056">#6056</a>)</li>
<li><a
href="https://github.com/lodash/lodash/commit/9648f692b0fc7c2f6a7a763d754377200126c2e8"><code>9648f69</code></a>
chore: remove <code>yarn.lock</code> file (<a
href="https://redirect.github.com/lodash/lodash/issues/6053">#6053</a>)</li>
<li><a
href="https://github.com/lodash/lodash/commit/dfa407db0bf5b200f2c7a9e4f06830ceaf074be9"><code>dfa407d</code></a>
ci: remove legacy configuration files (<a
href="https://redirect.github.com/lodash/lodash/issues/6052">#6052</a>)</li>
<li><a
href="https://github.com/lodash/lodash/commit/156e1965ae78b121a88f81178ab81632304e8d64"><code>156e196</code></a>
feat: add renovate setup (<a
href="https://redirect.github.com/lodash/lodash/issues/6039">#6039</a>)</li>
<li><a
href="https://github.com/lodash/lodash/commit/933e1061b8c344d3fc742cdc400175d5ffc99bce"><code>933e106</code></a>
ci: add pipeline for Bun (<a
href="https://redirect.github.com/lodash/lodash/issues/6023">#6023</a>)</li>
<li><a
href="https://github.com/lodash/lodash/commit/072a807ff7ad8ffc7c1d2c3097266e815d138e20"><code>072a807</code></a>
docs: update links related to Open JS Foundation (<a
href="https://redirect.github.com/lodash/lodash/issues/5968">#5968</a>)</li>
<li>Additional commits viewable in <a
href="https://github.com/lodash/lodash/compare/4.17.21...4.17.23">compare
view</a></li>
</ul>
</details>
<br />


[![Dependabot compatibility
score](https://dependabot-badges.githubapp.com/badges/compatibility_score?dependency-name=lodash&package-manager=npm_and_yarn&previous-version=4.17.21&new-version=4.17.23)](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)


Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
)

### Description
Enables file mapping of the weights as well as the overall context bin.
This feature is currently only enabled for ARM64 WIN devices.

### Motivation and Context
Currently, when reading the context bin, ORT allocates a large buffer on
the heap. Assuming the same model is used, each ORT session will
allocate a buffer for the context bin. This is incredibly wasteful when
large models are used. Instead, WIN file mapping can be leveraged to map
the context bin; then, every time a context needs to be created with the
context bin, the pointer to the mapping can be retrieved and used
instead of a pre-allocated buffer, making QNN EP more
memory-efficient. In the case of multiple ORT sessions, the context bin
will only be loaded once for all sessions, increasing memory efficiency
and overall initialization performance. This is very useful for LLM use
cases going forward.

---------

Co-authored-by: quic_calvnguy <quic_calvnguy@quic_inc.com>
@Jaswanth51 Jaswanth51 requested a review from ankitm3k January 27, 2026 03:30

@ankitm3k ankitm3k left a comment


wont merge. you have removed the crucial changes. kindly rebase properly

@Jaswanth51 Jaswanth51 closed this Jan 27, 2026
@Jaswanth51 Jaswanth51 deleted the sync_msft_27012026 branch January 27, 2026 06:41