Problem Statement
While analyzing the memory consumption of the Stable Diffusion demo on a dGPU system, I observed that the CPU memory used during graph building is more than 3x the actual model size. Similar behavior (peak usage of multiple times the model size) is also observed on iGPU systems, although the exact multiplier may differ.
While this might be irrelevant for smaller models, it presents a significant overhead for larger ones. For example:
- A 4GB model can consume over 12GB of CPU memory during its load phase on a dGPU.
- This is already critical on systems with 16GB of total memory.
This issue is even more pronounced on UVM (Unified Memory) systems where the CPU and GPU/NPU share memory. This high peak usage during loading unnecessarily consumes memory that could otherwise be allocated for the model weights and intermediate buffers on the device.
Proposal
I'd like to propose introducing an API that splits graph creation from weight passing (i.e., loading the constant data).
The primary goal is to enable streaming weights directly into the graph during initialization. This would avoid creating copies of the weights in JavaScript or in the backend's staging memory before the graph is fully built. Ideally, the weight data would only be copied once, from its source buffer directly to its final (e.g., device) location when the data is provided to the graph after it has been built.
This raises a key question for discussion:
- Would all current WebNN backends be able to support a (mostly) "weightless" graph creation, where all tensor shapes and data types are known, but the actual weight data is not provided until a later step?
Potential Implementation Idea
One potential solution could look like this:
1. Allow `builder.constant()` to be called with just an `MLOperandDescriptor` (defining the shape and data type), without requiring the `ArrayBufferView` data source.
2. This call would return an `MLOperand` object that acts as a handle for this "hollow constant" (a tensor without its data).
3. Allow `builder.build()` to succeed using `MLOperand`s that represent these hollow constants, creating the executable `MLGraph`.
4. Introduce a new method on the `MLGraph` interface, such as `graph.setConstantData(constantOperand, dataBuffer)`.
5. After the `MLGraph` is built (step 3), the user must call this new `setConstantData()` method for every hollow constant they created. At this stage, the backend can perform any required conversions and transfer the `dataBuffer` directly to device memory. `context.dispatch()` would require all constants to be set: attempting to call `dispatch()` before all hollow constants have been supplied with data (via `setConstantData()`) would result in an error.
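To make the intended semantics concrete, here is a minimal, runnable mock of the proposed flow. The class names, method signatures, and error behavior below are hypothetical stand-ins for the real WebNN objects, mirroring the steps above rather than implementing an actual backend:

```javascript
// Mock of the proposed "hollow constant" flow. MockGraphBuilder/MockGraph
// are stand-ins for MLGraphBuilder/MLGraph; setConstantData() and the
// dispatch() precondition are the proposed additions.

class MockGraphBuilder {
  constructor() { this.hollowConstants = []; }

  // Step 1: constant() with only a descriptor creates a hollow constant.
  constant(descriptor, data) {
    const operand = { descriptor, data: data ?? null };
    if (operand.data === null) this.hollowConstants.push(operand);
    return operand; // Step 2: a handle for the (possibly hollow) constant.
  }

  // Step 3: build() succeeds even though some constants have no data yet.
  build(outputs) {
    return new MockGraph(this.hollowConstants, outputs);
  }
}

class MockGraph {
  constructor(hollowConstants, outputs) {
    this.hollow = new Set(hollowConstants);
    this.outputs = outputs;
  }

  // Step 4: stream the real weight bytes in after the graph exists.
  // A real backend would convert and upload straight to device memory here.
  setConstantData(operand, dataBuffer) {
    if (!this.hollow.has(operand)) {
      throw new Error("not a hollow constant of this graph");
    }
    operand.data = dataBuffer;
    this.hollow.delete(operand);
  }

  // Step 5: dispatch() must fail while any constant is still hollow.
  dispatch() {
    if (this.hollow.size > 0) {
      throw new Error(`${this.hollow.size} constant(s) still missing data`);
    }
    return "dispatched";
  }
}

// Usage: build first, then feed weights one tensor at a time.
const builder = new MockGraphBuilder();
const w = builder.constant({ dataType: "float32", shape: [1024, 1024] });
const graph = builder.build({ out: w });

let earlyDispatchFailed = false;
try { graph.dispatch(); } catch { earlyDispatchFailed = true; }

graph.setConstantData(w, new Float32Array(1024 * 1024).buffer);
const result = graph.dispatch();
```

The key property the mock demonstrates is that the weight buffer only needs to exist during the `setConstantData()` call for that one tensor, so source buffers can be released (or streamed from the network) one at a time instead of being held alive for the whole build.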
Expected Benefits
This change would significantly reduce peak memory pressure during initialization:
- dGPU Systems: In an ideal scenario, this could limit the peak CPU memory overhead to 1x-2x the size of the largest single tensor (for temporary buffering during upload), rather than 3x the entire model.
- iGPU/UVM Systems: The hope is that no temporary CPU-side storage would be needed for the "upload" (as it's shared memory), reducing the total peak CPU memory consumption to roughly `Model Size + Max Single Tensor Size`.
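As a back-of-envelope illustration of the two bullets above (the concrete sizes here are illustrative assumptions, not measurements):

```javascript
// Rough peak-CPU-memory estimates, in GB. Numbers are illustrative:
// a 4 GB model whose largest single tensor is 0.25 GB.
const modelSize = 4;
const maxTensor = 0.25;

// Today on dGPU: observed peak of more than 3x the model size.
const currentDGpuPeak = 3 * modelSize;

// Proposed dGPU: at most ~2x the largest tensor for upload staging.
const proposedDGpuPeak = 2 * maxTensor;

// Proposed iGPU/UVM: weights live once in shared memory, plus one
// tensor-sized source buffer while it is handed to the graph.
const proposedUvmPeak = modelSize + maxTensor;
```

Even with conservative staging assumptions, the dGPU peak drops from being proportional to the whole model to being proportional to a single tensor.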