Conversation

@jdemeule commented Dec 8, 2025

With #15906, I noticed an important regression when using the Metal backend on an eGPU.
This commit restores the previous behavior and adds an option to force its activation.

Before #15906, llama-bench on Gemma 3 gave me this kind of result:

$ ./llama-bench --model ggml-org_gemma-3-4b-it-GGUF_gemma-3-4b-it-Q4_K_M.gguf -r 1 --no-warmup
| model                          |       size |     params | backend    | threads |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | --------------: | -------------------: |
| gemma3 4B Q4_K - Medium        |   2.31 GiB |     3.88 B | Metal,BLAS |       6 |           pp512 |         48.72 ± 0.00 |
| gemma3 4B Q4_K - Medium        |   2.31 GiB |     3.88 B | Metal,BLAS |       6 |           tg128 |          5.95 ± 0.00 |

build: 33daece86 (6440)

So above 45 t/s on the pp test, and more than 5 t/s on the tg test.

After #15906, the pp test improved, but the tg test was cut in half:

$ ./llama-bench --model ggml-org_gemma-3-4b-it-GGUF_gemma-3-4b-it-Q4_K_M.gguf -r 1 --no-warmup
| model                          |       size |     params | backend    | threads |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | --------------: | -------------------: |
| gemma3 4B Q4_K - Medium        |   2.31 GiB |     3.88 B | Metal,BLAS |       6 |           pp512 |         60.66 ± 0.00 |
| gemma3 4B Q4_K - Medium        |   2.31 GiB |     3.88 B | Metal,BLAS |       6 |           tg128 |          2.84 ± 0.00 |

build: 0f0a3c285 (6441)

Running the benchmark under "Metal System Trace" in Instruments.app reveals usage of the DMA1 channel, which introduces a lot of latency (at least, this is how I interpret it).

With this PR, performance on the eGPU is back to its previous level, and other configurations (dGPU and M1–M5) should not be affected.

$ ./llama-bench --model ggml-org_gemma-3-4b-it-GGUF_gemma-3-4b-it-Q4_K_M.gguf -r 1 --no-warmup
| model                          |       size |     params | backend    | threads |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | --------------: | -------------------: |
| gemma3 4B Q4_K - Medium        |   2.31 GiB |     3.88 B | Metal,BLAS |       6 |           pp512 |         47.24 ± 0.00 |
| gemma3 4B Q4_K - Medium        |   2.31 GiB |     3.88 B | Metal,BLAS |       6 |           tg128 |          6.07 ± 0.00 |

build: b0db6483b (7327)

@jdemeule jdemeule requested a review from ggerganov as a code owner December 8, 2025 18:04
@github-actions bot added the labels "ggml" (changes relating to the ggml tensor library for machine learning) and "Apple Metal" (https://en.wikipedia.org/wiki/Metal_(API)) on Dec 8, 2025
@ggerganov
Member

I'm not familiar with the concept of eGPU - is this running on an Intel Mac?

@taronaeo
Collaborator

I'm not familiar with the concept of eGPU - is this running on an Intel Mac?

Looks like it: an external GPU connected via Thunderbolt.

https://support.apple.com/en-sg/102363

@jdemeule
Author

I'm not familiar with the concept of eGPU - is this running on an Intel Mac?

Yes, this is specific to Intel Macs where a desktop GPU is connected via Thunderbolt.
I've tried to make the smallest possible modification without introducing deprecated APIs or any new special-case code path.
