Conversation

@ggerganov
Member

@ggerganov ggerganov commented Nov 9, 2025

fix #16657
ref #16276 (review)

This fixes the RPC inference when Metal backend is involved.

Testing:

# server
make -j && ./bin/rpc-server

# cli
make -j && ./bin/llama-cli -m ../models/gemma-3-4b-it/ggml-model-f16.gguf --rpc localhost:50052 -ngl 99 --no-mmap -no-cnv -p "Hello" --top-k 1 -n 32 -fa on

TODO:

  • Check performance impact
  • Cache the responses to avoid extra RPC calls?

@github-actions github-actions bot added the "ggml" (changes relating to the ggml tensor library for machine learning) and "Apple Metal" labels Nov 9, 2025
@ggerganov
Member Author

@jukofyork I think you were doing some experiments with RPC recently - could you check if this change significantly affects performance in your RPC use cases?

Collaborator

@rgerganov rgerganov left a comment

Looking at the debug logs on the server side, I believe this change will only increase TTFT and will not affect PP and TG speeds. Do I have this right?

@@ -8,4 +7,4 @@
#endif

#define RPC_PROTO_MAJOR_VERSION 3
Collaborator


Let's bump the protocol version, as this is a breaking change.

@ggerganov
Member Author

Looking at the debug logs on the server side, I believe this change will only increase TTFT and will not affect PP and TG speeds. Do I have this right?

Hm, maybe that's correct. Likely because of the graph reuse logic. If you disable the graph reuse, do you see more RPC calls?

LLAMA_GRAPH_REUSE_DISABLE=1 llama-cli ...

@rgerganov
Collaborator

Looking at the debug logs on the server side, I believe this change will only increase TTFT and will not affect PP and TG speeds. Do I have this right?

Hm, maybe that's correct. Likely because of the graph reuse logic. If you disable the graph reuse, do you see more RPC calls?

LLAMA_GRAPH_REUSE_DISABLE=1 llama-cli ...

Yes, disabling graph reuse results in a large number of RPC_CMD_GET_ALLOC_SIZE calls for each graph computation.

@ggerganov
Member Author

Great, I didn't realize until now that graph reuse saves us almost all of the extra RPC calls. That likely makes caching the RPC calls redundant, since I don't expect a significant impact on PP.

@ggerganov ggerganov marked this pull request as ready for review November 10, 2025 19:34
@ggerganov ggerganov requested a review from slaren as a code owner November 10, 2025 19:34
@ggerganov
Member Author

Some benchmarks would be useful to confirm that performance is not significantly affected. After that, I think this should be good to merge.

@jukofyork
Collaborator

@jukofyork I think you were doing some experiments with RPC recently - could you check if this change significantly affects performance in your RPC use cases?

No problem, but it will likely be Thursday before I can run any tests.

@slaren
Member

slaren commented Nov 11, 2025

I think eventually the proper solution to this and other per-tensor calls such as init_tensor will be to batch all the tensors into a single call. For example, ggml-alloc could be modified to build a list of tensors and obtain all the tensor alloc sizes first, and use that data rather than calling get_alloc_size every time. This way, instead of thousands of calls (each with a network roundtrip), it would only require one. The RPC backend could also cache the results so that calls with identical tensors don't require the network at all.

@ggerganov ggerganov force-pushed the gg/rpc-fix-alloc-size branch from 590a805 to 4953693 on November 28, 2025 15:34


Development

Successfully merging this pull request may close these issues.

Eval bug: Incorrect outputs when running inference in multiple nodes (Mac)
