[Tensor Parallel] Enable Auto parameter fitting in split-mode tensor #22950

Open
gaugarg-nv wants to merge 1 commit into ggml-org:master from gaugarg-nv:tp_fit_support

Conversation

@gaugarg-nv (Contributor) commented May 11, 2026

Overview

Parameter fitting first reduces the context length and then reduces n_gpu_layers.

The meta backend expects a full decoder layer to be either on the CPU or the GPU, so the partial-layer tensor_buft_overrides patterns are skipped.
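A minimal sketch of this fitting order, for illustration only; `fit_params`, `estimate_vram_use`, and `MIN_CTX` are hypothetical placeholders, not the actual llama.cpp implementation:

```cpp
// Illustrative sketch only; these names are placeholders, not llama.cpp APIs.
#include <cstddef>
#include <cstdint>

struct fit_params {
    uint32_t n_ctx;        // context length
    int32_t  n_gpu_layers; // whole decoder layers offloaded to the GPU
};

// Placeholder: estimated VRAM needed for the given parameters.
size_t estimate_vram_use(const fit_params & p);

constexpr uint32_t MIN_CTX = 512;

fit_params fit_for_split_mode_tensor(fit_params p, size_t free_vram) {
    // Step 1: shrink the context until the model fits or a floor is reached.
    while (estimate_vram_use(p) > free_vram && p.n_ctx > MIN_CTX) {
        p.n_ctx /= 2;
    }
    // Step 2: if still too large, move whole decoder layers back to the CPU.
    // Partial-layer tensor_buft_overrides are skipped because the meta backend
    // expects an entire decoder layer to be on either the CPU or the GPU.
    while (estimate_vram_use(p) > free_vram && p.n_gpu_layers > 0) {
        p.n_gpu_layers--;
    }
    return p;
}
```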

Future work:

This PR assumes homogeneous GPUs. For heterogeneous GPUs, we also need to support tensor_split, similar to -sm layer.


@JohannesGaessler (Contributor) left a comment


I don't think this is going to work correctly; how exactly did you do the testing? The problem with the meta backend is that it is reporting the memory use incorrectly, so -fit will not reduce the context size by the right amount.

@github-actions bot added the documentation (Improvements or additions to documentation) label on May 11, 2026
@gaugarg-nv (Contributor, Author) commented May 12, 2026

> I don't think this is going to work correctly; how exactly did you do the testing? The problem with the meta backend is that it is reporting the memory use incorrectly, so -fit will not reduce the context size by the right amount.

I tested this on two 5090 GPUs using llama-fit-params and llama-cli, varying -fitt across dense and MoE models. With sufficiently large -fitt values, the context is reduced, or both the context and -ngl are reduced.

My understanding is that the meta backend returns aggregated total and free memory across simple devices. Based on this, the approach should work when all simple devices are homogeneous and have identical total and free memory.

If this assumption does not hold, the approach will break, and we would need to implement tensor_split, similar to -sm layer. While not ideal, an alternative is to check free memory on each device and report aggregate free memory as n_devices × min(free memory across all devices). This ensures that the maximum memory allocated on each device does not exceed the overall minimum free memory. This way, splits will be uniform across all devices.
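A small sketch of that conservative aggregation, for illustration; this is not an existing ggml-backend function:

```cpp
// Illustrative sketch of reporting aggregate free memory as
// n_devices * min(free across all devices); not an existing ggml-backend API.
#include <algorithm>
#include <cstddef>
#include <vector>

size_t aggregate_free_memory(const std::vector<size_t> & free_per_device) {
    if (free_per_device.empty()) {
        return 0;
    }
    // A uniform split then never asks any single device for more memory than
    // the most constrained device can provide.
    const size_t min_free = *std::min_element(free_per_device.begin(), free_per_device.end());
    return free_per_device.size() * min_free;
}
```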

@gaugarg-nv (Contributor, Author)

Hi @JohannesGaessler, any suggestions on the approach? Am I missing some other limitation that may cause this to fail on homogeneous GPUs?

Regarding heterogeneous GPUs, we can try calculating tensor_split, but it will require exposing new ggml-backend APIs to query the number of simple devices and their total/free memory.
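For illustration, a proportional split could be derived from per-device free memory roughly like this; the helper is hypothetical and assumes such query APIs existed:

```cpp
// Hypothetical sketch: derive per-device split ratios from free memory,
// analogous to what -sm layer does. The device-query APIs this would rely on
// are not currently exposed by the meta backend.
#include <cstddef>
#include <vector>

std::vector<float> compute_tensor_split(const std::vector<size_t> & free_per_device) {
    size_t total = 0;
    for (const size_t f : free_per_device) {
        total += f;
    }
    std::vector<float> split;
    split.reserve(free_per_device.size());
    for (const size_t f : free_per_device) {
        // Each device gets a share proportional to its free memory.
        split.push_back(total > 0 ? (float) f / (float) total : 0.0f);
    }
    return split;
}
```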

@JohannesGaessler (Contributor)

IIRC the memory returned by the meta device is currently the memory per device rather than the total memory, which is why the calculation of the context size should be somewhat correct but ultimately wrong. More generally, the point of --fit is that it is supposed to be robust, as it is intended as the default for naive users. Any -sm tensor support that is not a hard error should work properly from the initial PR.

@gaugarg-nv (Contributor, Author)

> IIRC the memory returned by the meta device is currently the memory per device rather than the total memory

Are you referring to this part of the code, or something else? https://github.com/ggml-org/llama.cpp/blob/927dada6c9143ba4c940b72004d2698fa5e4e930/ggml/src/ggml-backend-meta.cpp#L98:L109
