Replies: 2 comments 1 reply
Yea, mainline's implementation kinda sucks compared to vLLM or ExLlama. But under row I get 50% GPU utilization, and under layer I only get 25%.

[screenshots: Row / Layer GPU utilization]

Context should be lower, but not this low. For a similar model like Mistral-Large 5-bit, I get:

[screenshot]

This is at least 80-90% utilization on all GPUs. Mind you, I disable turbo mode on my 3090s so they only get up to 250 W for LLM use; they usually sit around 200 W a piece. Uncorked, the numbers might go higher.
OK, so with 4 GPUs you get 3x+ lower PP and 20% better TG. That's still massively worse performance in my book. Somewhere I read that with "typical" LLM usage the prompt length is about 10 times the number of generated tokens. So, let's apply this here for a prompt of 10,000 tokens and 1,000 generated tokens. That will take

You objected to me showing results for an 8B model. Here are results for

So, with 2 GPUs, even a 70B dense model is worse with split mode "row". It looks even worse for a MoE model, but I don't want to fill up this discussion with too many comparisons. My main objective was to establish if anyone has gotten meaningfully better performance with mainline
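To make the 10,000-prompt / 1,000-generated estimate concrete, here is a small sketch of the arithmetic. The absolute speeds are invented placeholders; only the ratios (3x lower PP, 20% better TG for row) are taken from the comparison above.

```python
# Hypothetical baseline speeds for split mode "layer" (made-up numbers;
# only the row/layer ratios come from the measurements in this thread).
layer_pp = 2000.0  # prompt processing, tokens/s
layer_tg = 50.0    # token generation, tokens/s

# Split mode "row": roughly 3x lower PP, 20% better TG.
row_pp = layer_pp / 3.0
row_tg = layer_tg * 1.2

def total_time(pp_speed, tg_speed, n_prompt=10_000, n_gen=1_000):
    """End-to-end time for one request: prompt eval plus generation."""
    return n_prompt / pp_speed + n_gen / tg_speed

print(f"layer: {total_time(layer_pp, layer_tg):.1f} s")  # 25.0 s
print(f"row:   {total_time(row_pp, row_tg):.1f} s")      # 31.7 s
```

With a long prompt the PP loss dominates, so row comes out slower end-to-end even with the better TG.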
I want to start looking into implementing tensor parallel (TP) CUDA inference in ik_llama.cpp. I decided that my first step is to see what is the current status in ik_llama.cpp and llama.cpp. Split mode "row" is the option that is currently supposed to achieve at least some tensor parallelism. Here is what I find for `-sm row`:

- ik_llama.cpp is not functional with `-sm row` at all (it crashes). With `-no-fug` it does run, but the result is wrong.
- llama.cpp works, but as far as I can tell, it is much slower than split mode "layer".

Here is an example on a 2x3090 system for a Q4_0 quantized 8B parameter dense model:

I do remember some people claiming that split mode "row" leads to performance gains in llama.cpp, so I'm wondering if I'm doing something wrong. My command line is

I have tried adding `-ts 50/50`, but this does not change the result. So, my question is: am I missing something?
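For reference, the idea split mode "row" is meant to exploit can be illustrated with a toy pure-Python sketch (this is not llama.cpp code, and whether the implementation shards rows or columns of the stored tensor is a layout detail I'm glossing over): each hypothetical GPU holds a shard of a layer's weight matrix and computes its slice of the output in parallel, after which the slices must be gathered. That per-matmul synchronization is the overhead that can eat the parallelism gains.

```python
# Toy sketch of tensor-parallel weight sharding across 2 hypothetical "GPUs".

def matmul(x, w):
    # x: m x k, w: k x n  ->  m x n
    return [[sum(xi * wij for xi, wij in zip(row, col))
             for col in zip(*w)] for row in x]

x = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0]]          # activations, 2 x 3
w = [[1.0, 0.0, 2.0, 1.0],
     [0.0, 1.0, 1.0, 2.0],
     [1.0, 1.0, 0.0, 0.0]]     # layer weights, 3 x 4

# Single device: one full matmul.
full = matmul(x, w)

# Two devices: each owns half of the output dimension and computes its slice;
# the slices are then concatenated (the gather/synchronization step).
w0 = [row[:2] for row in w]
w1 = [row[2:] for row in w]
sharded = [a + b for a, b in zip(matmul(x, w0), matmul(x, w1))]

assert sharded == full
```

The two partial matmuls are mathematically free to run concurrently, which is where the TP speedup would come from if the gather were cheap enough.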