Proposal: API to Separate Graph Building from Weight Loading to Reduce Peak Memory Usage #901

@mtavenrath

Description

Problem Statement

While analyzing the memory consumption of the Stable Diffusion demo on a dGPU system, I observed that the CPU memory used during graph building is more than 3x the actual model size. Similar behavior (peak usage of several times the model size) occurs on iGPU systems, although the exact multiplier may differ.

While this might be irrelevant for smaller models, it presents a significant overhead for larger ones. For example:

  • A 4GB model can consume over 12GB of CPU memory during its load phase on a dGPU.
  • This is already critical on systems with 16GB of total memory.

This issue is even more pronounced on UVM (Unified Memory) systems where the CPU and GPU/NPU share memory. This high peak usage during loading unnecessarily consumes memory that could otherwise be allocated for the model weights and intermediate buffers on the device.


Proposal

I'd like to propose introducing an API that separates graph creation from weight loading (i.e., providing the constant data).

The primary goal is to enable streaming weights directly into the graph during initialization. This would avoid creating copies of the weights in JavaScript or in the backend's staging memory before the graph is fully built. Ideally, the weight data would only be copied once, from its source buffer directly to its final (e.g., device) location when the data is provided to the graph after it has been built.
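
For context, this is roughly how constant data flows through the current API. The sketch below is illustrative only: the descriptor field names follow recent WebNN spec drafts, and the shapes are made up.

    // Current flow (illustrative): the weight data must be supplied up
    // front, so it is alive simultaneously in the source buffer (e.g. a
    // fetched model file), in the ArrayBufferView handed to constant(),
    // and in any staging copy the backend makes during build().
    const context = await navigator.ml.createContext();
    const builder = new MLGraphBuilder(context);

    const weightData = new Float32Array(1024 * 1024); // read from the model file
    const weights = builder.constant(
      { dataType: 'float32', shape: [1024, 1024] },
      weightData
    );
    const input = builder.input('input', { dataType: 'float32', shape: [1, 1024] });
    const output = builder.matmul(input, weights);
    const graph = await builder.build({ output });
    // weightData can only be released now, after build() has copied it.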

This raises a key question for discussion:

  • Would all current WebNN backends be able to support a (mostly) "weightless" graph creation, where all tensor shapes and data types are known, but the actual weight data is not provided until a later step?

Potential Implementation Idea

One potential solution could look like this (a usage sketch follows the list):

  1. Allow builder.constant() to be called with just an MLOperandDescriptor (defining the shape and data type), without requiring the ArrayBufferView data source.
  2. This call would return an MLOperand object that acts as a handle for this "hollow constant" (a tensor without its data).
  3. Allow builder.build() to succeed using MLOperands that represent these hollow constants, creating the executable MLGraph.
  4. Introduce a new method on the MLGraph interface, such as:
    graph.setConstantData(constantOperand, dataBuffer)
  5. After the MLGraph is built (step 3), the user must call this new setConstantData() method for every hollow constant they created. At this stage, the backend can perform any required conversions and transfer the dataBuffer directly to device memory.
  6. context.dispatch() would require all constants to be set; attempting to dispatch before every hollow constant has been supplied with data (via setConstantData()) would result in an error.
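
Putting the steps together, here is a minimal usage sketch. The data-less constant() overload and setConstantData() are the proposed additions; the hollowConstants map and fetchTensorData() loader are hypothetical app-side helpers.

    // Steps 1-3: build a "weightless" graph from descriptors only.
    const context = await navigator.ml.createContext();
    const builder = new MLGraphBuilder(context);

    // Hollow constant: shape and dataType only, no ArrayBufferView yet.
    const weights = builder.constant({ dataType: 'float32', shape: [1024, 1024] });
    const input = builder.input('input', { dataType: 'float32', shape: [1, 1024] });
    const output = builder.matmul(input, weights);
    const graph = await builder.build({ output });

    // Steps 4-5: stream the weights in one tensor at a time. Only one
    // tensor's worth of CPU memory is alive per iteration, and the backend
    // can copy it straight to its final device location.
    for (const [operand, name] of hollowConstants) { // app-side Map<MLOperand, string>
      const buffer = await fetchTensorData(name);    // hypothetical loader
      graph.setConstantData(operand, buffer);
      // buffer becomes collectable here; no whole-model CPU copy exists.
    }

    // Step 6: dispatch() is only valid once every hollow constant is set.
    // (inputs/outputs are the usual MLTensor maps, elided for brevity.)
    context.dispatch(graph, inputs, outputs);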

Expected Benefits

This change would significantly reduce peak memory pressure during initialization:

  • dGPU Systems: In an ideal scenario, this could limit the peak CPU memory overhead to 1x-2x the size of the largest single tensor (for temporary buffering during upload), rather than 3x the entire model.
  • iGPU/UVM Systems: The hope is that no temporary CPU-side storage would be needed for the "upload" (as it's shared memory). This would reduce the total peak CPU memory consumption down to roughly Model Size + Max Single Tensor Size.
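
As a rough illustration with the numbers above: if the 4GB model's largest single tensor were on the order of 100MB (a hypothetical figure), the peak CPU overhead during load would drop from over 12GB to roughly 100-200MB of staging on a dGPU, and the total peak on a UVM system to roughly 4.1GB.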
