Proposal: API to Separate Graph Building from Weight Loading to Reduce Peak Memory Usage #901

@mtavenrath

Description

Problem Statement

While analyzing the memory consumption of the Stable Diffusion demo on a dGPU system, I observed that the CPU memory used during graph building is more than 3x the actual model size. Similar behavior (peak usage of several times the model size) occurs on iGPU systems, although the exact multiplier may differ.

While this might be irrelevant for smaller models, it presents a significant overhead for larger ones. For example:

  • A 4GB model can consume over 12GB of CPU memory during its load phase on a dGPU.
  • This is already critical on systems with 16GB of total memory.

This issue is even more pronounced on UVM (Unified Memory) systems where the CPU and GPU/NPU share memory. This high peak usage during loading unnecessarily consumes memory that could otherwise be allocated for the model weights and intermediate buffers on the device.


Proposal

I'd like to propose introducing an API that separates graph creation from weight loading (i.e., providing the constant data).

The primary goal is to enable streaming weights directly into the graph during initialization. This would avoid creating copies of the weights in JavaScript or in the backend's staging memory before the graph is fully built. Ideally, the weight data would only be copied once, from its source buffer directly to its final (e.g., device) location when the data is provided to the graph after it has been built.
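
For context, this is roughly how constant data flows through the current API. The sketch below is illustrative only: the descriptor field names follow recent WebNN spec drafts, and the shapes are made up.

    // Current flow (illustrative): the weight data must be supplied up
    // front, so it is alive simultaneously in the source buffer (e.g. a
    // fetched model file), in the ArrayBufferView handed to constant(),
    // and in any staging copy the backend makes during build().
    const context = await navigator.ml.createContext();
    const builder = new MLGraphBuilder(context);

    const weightData = new Float32Array(1024 * 1024); // read from the model file
    const weights = builder.constant(
      { dataType: 'float32', shape: [1024, 1024] },
      weightData
    );
    const input = builder.input('input', { dataType: 'float32', shape: [1, 1024] });
    const output = builder.matmul(input, weights);
    const graph = await builder.build({ output });
    // weightData can only be released now, after build() has copied it.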

This raises a key question for discussion:

  • Would all current WebNN backends be able to support a (mostly) "weightless" graph creation, where all tensor shapes and data types are known, but the actual weight data is not provided until a later step?

Potential Implementation Idea

One potential solution could look like this (a usage sketch follows the list):

  1. Allow builder.constant() to be called with just an MLOperandDescriptor (defining the shape and data type), without requiring the ArrayBufferView data source.
  2. This call would return an MLOperand object that acts as a handle for this "hollow constant" (a tensor without its data).
  3. Allow builder.build() to succeed using MLOperands that represent these hollow constants, creating the executable MLGraph.
  4. Introduce a new method on the MLGraph interface, such as:
    graph.setConstantData(constantOperand, dataBuffer)
  5. After the MLGraph is built (step 3), the user must call this new setConstantData() method for every hollow constant they created. At this stage, the backend can perform any required conversions and transfer the dataBuffer directly to device memory.
  6. context.dispatch() would require all constants to be set; attempting to dispatch before every hollow constant has been supplied with data (via setConstantData()) would result in an error.
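
Putting the steps together, here is a minimal usage sketch. The data-less constant() overload and setConstantData() are the proposed additions; the hollowConstants map and fetchTensorData() loader are hypothetical app-side helpers.

    // Steps 1-3: build a "weightless" graph from descriptors only.
    const context = await navigator.ml.createContext();
    const builder = new MLGraphBuilder(context);

    // Hollow constant: shape and dataType only, no ArrayBufferView yet.
    const weights = builder.constant({ dataType: 'float32', shape: [1024, 1024] });
    const input = builder.input('input', { dataType: 'float32', shape: [1, 1024] });
    const output = builder.matmul(input, weights);
    const graph = await builder.build({ output });

    // Steps 4-5: stream the weights in one tensor at a time. Only one
    // tensor's worth of CPU memory is alive per iteration, and the backend
    // can copy it straight to its final device location.
    for (const [operand, name] of hollowConstants) { // app-side Map<MLOperand, string>
      const buffer = await fetchTensorData(name);    // hypothetical loader
      graph.setConstantData(operand, buffer);
      // buffer becomes collectable here; no whole-model CPU copy exists.
    }

    // Step 6: dispatch() is only valid once every hollow constant is set.
    // (inputs/outputs are the usual MLTensor maps, elided for brevity.)
    context.dispatch(graph, inputs, outputs);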

Expected Benefits

This change would significantly reduce peak memory pressure during initialization:

  • dGPU Systems: In an ideal scenario, this could limit the peak CPU memory overhead to 1x-2x the size of the largest single tensor (for temporary buffering during upload), rather than 3x the entire model.
  • iGPU/UVM Systems: The hope is that no temporary CPU-side storage would be needed for the "upload" (as it's shared memory). This would reduce the total peak CPU memory consumption down to roughly Model Size + Max Single Tensor Size.
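
As a rough illustration with the numbers above: if the 4GB model's largest single tensor were on the order of 100MB (a hypothetical figure), the peak CPU overhead during load would drop from over 12GB to roughly 100-200MB of staging on a dGPU, and the total peak on a UVM system to roughly 4.1GB.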
