[Tensor Parallel] Enable Auto parameter fitting in split-mode tensor #22950
gaugarg-nv wants to merge 1 commit into
Conversation
JohannesGaessler left a comment:
I don't think this is going to work correctly; how exactly did you do the testing? The problem with the meta backend is that it reports the memory use incorrectly, so `-fit` will not reduce the context size by the right amount.
I tested this on two 5090 GPUs using … My understanding is that the meta backend returns the aggregated total and free memory across the simple devices. Based on this, the approach should work when all simple devices are homogeneous and have identical total and free memory. If this assumption does not hold, the approach will break, and we would need to implement …
Hi @JohannesGaessler, any suggestions on the approach? Am I missing some other limitation that may cause this to fail on homogeneous GPUs? Regarding heterogeneous GPUs, we can try calculating …
IIRC the memory returned by the meta device is currently the memory per device rather than the total memory; that's why the calculation of the context size should be somewhat correct but ultimately wrong. More generally, the point of …
Are you referring to this part of the code, or something else? https://github.com/ggml-org/llama.cpp/blob/927dada6c9143ba4c940b72004d2698fa5e4e930/ggml/src/ggml-backend-meta.cpp#L98:L109
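For reference, here is a minimal, hypothetical sketch of the two interpretations being discussed. This is not the actual `ggml-backend-meta.cpp` code; the struct and function names are illustrative only. Which of these the meta device implements determines whether `-fit` sees a per-device budget or an aggregated one.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Illustrative stand-in for per-device memory info.
struct simple_dev_mem {
    size_t free;
    size_t total;
};

// Interpretation A: aggregate across all simple devices.
// Only meaningful to -fit if the fitting logic also knows the
// allocation will be spread across the devices.
simple_dev_mem meta_memory_aggregated(const std::vector<simple_dev_mem> & devs) {
    simple_dev_mem out = {0, 0};
    for (const auto & d : devs) {
        out.free  += d.free;
        out.total += d.total;
    }
    return out;
}

// Interpretation B: report the memory of the most constrained device.
// With homogeneous GPUs this equals any single device's memory;
// with heterogeneous GPUs the min() is the safe bound.
simple_dev_mem meta_memory_per_device(const std::vector<simple_dev_mem> & devs) {
    simple_dev_mem out = {SIZE_MAX, SIZE_MAX};
    for (const auto & d : devs) {
        out.free  = std::min(out.free,  d.free);
        out.total = std::min(out.total, d.total);
    }
    return out;
}
```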
Overview
- First, reduce the context length; then reduce `n_gpu_layers` (see the sketch after this list).
- The meta backend expects each full decoder layer to be either on CPU or GPU, so the partial-layer `tensor_buft_overrides` patterns are skipped.
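The following sketch illustrates the fitting order described above. It is not the llama.cpp implementation: the function names and the cost model are assumptions, and the constants are placeholders for whatever estimate the backend actually reports.

```cpp
#include <cstddef>
#include <cstdint>

struct fit_params {
    uint32_t n_ctx;         // context length
    int32_t  n_gpu_layers;  // resolved, non-negative layer count
};

// Placeholder cost model (assumed): per-layer weights plus a KV cache
// that grows linearly with context length.
size_t estimate_vram(const fit_params & p) {
    const size_t bytes_per_layer          = 512ull * 1024 * 1024; // assumed
    const size_t kv_bytes_per_token_layer = 64ull * 1024;         // assumed
    return (size_t) p.n_gpu_layers * bytes_per_layer
         + (size_t) p.n_gpu_layers * p.n_ctx * kv_bytes_per_token_layer;
}

fit_params fit(fit_params p, size_t free_vram, uint32_t n_ctx_min) {
    // Step 1: shrink the context until the estimate fits or the minimum is hit.
    while (estimate_vram(p) > free_vram && p.n_ctx > n_ctx_min) {
        p.n_ctx /= 2;
    }
    // Step 2: if still over budget, offload fewer whole decoder layers.
    // Whole layers only: the meta backend expects each decoder layer to
    // live entirely on CPU or entirely on GPU.
    while (estimate_vram(p) > free_vram && p.n_gpu_layers > 0) {
        p.n_gpu_layers--;
    }
    return p;
}
```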
Future work:
This PR assumes homogeneous GPUs. For heterogeneous GPUs, we also need to support `tensor_split`, similar to `-sm layer` (a sketch follows).
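One possible shape for that future work, sketched below: derive a `tensor_split` from per-device free memory, similar in spirit to what `-sm layer` does. This is an assumption about the approach, not the llama.cpp implementation; the function name is hypothetical.

```cpp
#include <cstddef>
#include <vector>

// Assign each device a share proportional to its free memory.
std::vector<float> tensor_split_from_free_mem(const std::vector<size_t> & free_bytes) {
    size_t total = 0;
    for (size_t f : free_bytes) {
        total += f;
    }
    std::vector<float> split;
    split.reserve(free_bytes.size());
    for (size_t f : free_bytes) {
        split.push_back(total > 0 ? (float) f / (float) total : 0.0f);
    }
    return split;
}
```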
Requirements
`-fit` on … code.