[Tensor Parallel] Enable Auto parameter fitting in split-mode tensor #22950

Open
gaugarg-nv wants to merge 1 commit into ggml-org:master from gaugarg-nv:tp_fit_support

Conversation

@gaugarg-nv (Contributor) commented May 11, 2026

Overview

Parameter fitting first reduces the context length and then reduces n_gpu_layers.

The meta backend expects a full decoder layer to be either on the CPU or the GPU, so the partial-layer tensor_buft_overrides patterns are skipped.
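A minimal sketch of this fitting order, for illustration only; `fit_params`, `estimate_vram_use`, and `MIN_CTX` are hypothetical placeholders, not the actual llama.cpp implementation:

```cpp
// Illustrative sketch only; these names are placeholders, not llama.cpp APIs.
#include <cstddef>
#include <cstdint>

struct fit_params {
    uint32_t n_ctx;        // context length
    int32_t  n_gpu_layers; // whole decoder layers offloaded to the GPU
};

// Placeholder: estimated VRAM needed for the given parameters.
size_t estimate_vram_use(const fit_params & p);

constexpr uint32_t MIN_CTX = 512;

fit_params fit_for_split_mode_tensor(fit_params p, size_t free_vram) {
    // Step 1: shrink the context until the model fits or a floor is reached.
    while (estimate_vram_use(p) > free_vram && p.n_ctx > MIN_CTX) {
        p.n_ctx /= 2;
    }
    // Step 2: if still too large, move whole decoder layers back to the CPU.
    // Partial-layer tensor_buft_overrides are skipped because the meta backend
    // expects an entire decoder layer to be on either the CPU or the GPU.
    while (estimate_vram_use(p) > free_vram && p.n_gpu_layers > 0) {
        p.n_gpu_layers--;
    }
    return p;
}
```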

Future work:

This PR assumes homogeneous GPUs. For heterogeneous GPUs, we also need to support tensor_split, similar to -sm layer.


@JohannesGaessler (Contributor) left a comment


I don't think this is going to work correctly; how exactly did you do the testing? The problem with the meta backend is that it is reporting the memory use incorrectly, so -fit will not reduce the context size by the right amount.

@github-actions bot added the documentation (Improvements or additions to documentation) label on May 11, 2026
@gaugarg-nv (Contributor, Author) commented May 12, 2026

> I don't think this is going to work correctly; how exactly did you do the testing? The problem with the meta backend is that it is reporting the memory use incorrectly, so -fit will not reduce the context size by the right amount.

I tested this on two 5090 GPUs using llama-fit-params and llama-cli, varying -fitt across dense and MoE models. With sufficiently large -fitt values, the context is reduced, or both the context and -ngl are reduced.

My understanding is that the meta backend returns aggregated total and free memory across simple devices. Based on this, the approach should work when all simple devices are homogeneous and have identical total and free memory.

If this assumption does not hold, the approach will break, and we would need to implement tensor_split, similar to -sm layer. While not ideal, an alternative is to check free memory on each device and report aggregate free memory as n_devices × min(free memory across all devices). This ensures that the maximum memory allocated on each device does not exceed the overall minimum free memory. This way, splits will be uniform across all devices.
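A small sketch of that conservative aggregation, for illustration; this is not an existing ggml-backend function:

```cpp
// Illustrative sketch of reporting aggregate free memory as
// n_devices * min(free across all devices); not an existing ggml-backend API.
#include <algorithm>
#include <cstddef>
#include <vector>

size_t aggregate_free_memory(const std::vector<size_t> & free_per_device) {
    if (free_per_device.empty()) {
        return 0;
    }
    // A uniform split then never asks any single device for more memory than
    // the most constrained device can provide.
    const size_t min_free = *std::min_element(free_per_device.begin(), free_per_device.end());
    return free_per_device.size() * min_free;
}
```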

@gaugarg-nv (Contributor, Author)

Hi @JohannesGaessler, any suggestions on the approach? Am I missing some other limitation that may cause this to fail on homogeneous GPUs?

Regarding heterogeneous GPUs, we can try calculating tensor_split, but it will require exposing new ggml-backend APIs to query the number of simple devices and their total/free memory.
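For illustration, a proportional split could be derived from per-device free memory roughly like this; the helper is hypothetical and assumes such query APIs existed:

```cpp
// Hypothetical sketch: derive per-device split ratios from free memory,
// analogous to what -sm layer does. The device-query APIs this would rely on
// are not currently exposed by the meta backend.
#include <cstddef>
#include <vector>

std::vector<float> compute_tensor_split(const std::vector<size_t> & free_per_device) {
    size_t total = 0;
    for (const size_t f : free_per_device) {
        total += f;
    }
    std::vector<float> split;
    split.reserve(free_per_device.size());
    for (const size_t f : free_per_device) {
        // Each device gets a share proportional to its free memory.
        split.push_back(total > 0 ? (float) f / (float) total : 0.0f);
    }
    return split;
}
```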

@JohannesGaessler (Contributor)

IIRC the memory returned by the meta device is currently the memory per device rather than the total memory, which is why the calculation of the context size should be somewhat correct but ultimately wrong. More generally, the point of --fit is that it is supposed to be robust, as it is intended as the default for naive users. Any -sm tensor support that is not a hard error should work properly from the initial PR.

@gaugarg-nv (Contributor, Author)

> IIRC the memory returned by the meta device is currently the memory per device rather than the total memory

Are you referring to this part of the code, or something else? https://github.com/ggml-org/llama.cpp/blob/927dada6c9143ba4c940b72004d2698fa5e4e930/ggml/src/ggml-backend-meta.cpp#L98:L109
