Conversation

@ggerganov
Member

@ggerganov ggerganov commented Nov 9, 2025

fix #16657
ref #16276 (review)

This fixes the RPC inference when Metal backend is involved.

Testing:

# server
make -j && ./bin/rpc-server

# cli
make -j && ./bin/llama-cli -m ../models/gemma-3-4b-it/ggml-model-f16.gguf --rpc localhost:50052 -ngl 99 --no-mmap -no-cnv -p "Hello" --top-k 1 -n 32 -fa on

TODO:

  • Check performance impact
  • Cache the responses to avoid extra RPC calls?

@github-actions github-actions bot added the "ggml" (changes relating to the ggml tensor library for machine learning) and "Apple Metal" labels Nov 9, 2025
@ggerganov
Member Author

@jukofyork I think you were doing some experiments with RPC recently - could you check if this change significantly affects performance in your RPC use cases?

Collaborator

@rgerganov rgerganov left a comment

Looking at the debug logs on the server side, I believe this change will only increase TTFT and will not affect PP and TG speeds. Do I have this right?

@@ -8,4 +7,4 @@
#endif

#define RPC_PROTO_MAJOR_VERSION 3
Collaborator


Let's bump the protocol version, as this is a breaking change.

@ggerganov
Member Author

Looking at the debug logs on the server side, I believe this change will only increase TTFT and will not affect PP and TG speeds. Do I have this right?

Hm, maybe that's correct. Likely because of the graph reuse logic. If you disable the graph reuse, do you see more RPC calls?

LLAMA_GRAPH_REUSE_DISABLE=1 llama-cli ...

@rgerganov
Collaborator

Looking at the debug logs on the server side, I believe this change will only increase TTFT and will not affect PP and TG speeds. Do I have this right?

Hm, maybe that's correct. Likely because of the graph reuse logic. If you disable the graph reuse, do you see more RPC calls?

LLAMA_GRAPH_REUSE_DISABLE=1 llama-cli ...

Yes, disabling graph reuse results in a large number of RPC_CMD_GET_ALLOC_SIZE calls for each graph computation.

@ggerganov
Member Author

Great, I didn't realize until now that graph reuse saves us almost all of the extra RPC calls. That likely makes caching the RPC calls redundant, since I don't expect a significant impact on PP.

@ggerganov ggerganov marked this pull request as ready for review November 10, 2025 19:34
@ggerganov ggerganov requested a review from slaren as a code owner November 10, 2025 19:34
@ggerganov
Member Author

Some benchmarks would be useful to confirm that performance is not significantly affected. After that, I think this should be good to merge.

@jukofyork
Collaborator

@jukofyork I think you were doing some experiments with RPC recently - could you check if this change significantly affects performance in your RPC use cases?

No problem, but it will likely be Thursday before I can run any tests.

@slaren
Member

slaren commented Nov 11, 2025

I think eventually the proper solution to this and other per-tensor calls such as init_tensor will be to batch all the tensors into a single call. For example, ggml-alloc could be modified to build a list of tensors and obtain all the tensor alloc sizes first, and use that data rather than calling get_alloc_size every time. This way, instead of thousands of calls (each with a network roundtrip), it would only require one. The RPC backend could also cache the results so that calls with identical tensors don't require the network at all.

@ggerganov ggerganov force-pushed the gg/rpc-fix-alloc-size branch from 590a805 to 4953693 on November 28, 2025 15:34


Development

Successfully merging this pull request may close these issues.

Eval bug: Incorrect outputs when running inference in multiple nodes (Mac)
