I haven't been contributing to `llama.cpp` since I left the project in March of 2024, but apparently I'm still one of the top contributors there 20 months later, so I got pinged in an RFC and subsequent PR that integrates ZenDNN into `llama.cpp`. ZenDNN is a matrix multiplication library specifically optimized for AMD CPUs. It supports `bf16` and `f32` GEMM. I haven't put a lot of effort into optimizing inference with floating point models (for me "Inference at the Edge" basically means using quantized models), so I decided to check if ZenDNN could be something for `ik_llama.cpp` to use for handling `bf16` and `f32` models.
The RFC and PR provide benchmark results for a big-iron, 96-core Zen4 CPU. I don't have that, but I do have a 16-core Ryzen-7950X, which is also Zen4, so ZenDNN should be optimized for it.
So, I pulled and built the PR (it required a minor modification in the `CMakeLists.txt` file), and here is what we get with `llama-bench` on the 7950X for `bf16` LLaMA-3-8B:
| model | size | params | backend | threads | test | t/s |
| ------------- | --------: | -----: | ------- | ------: | ----- | ------------: |
| llama 8B BF16 | 14.96 GiB | 8.03 B | ZenDNN  | 16 | pp512 | 218.88 ± 1.06 |
| llama 8B BF16 | 14.96 GiB | 8.03 B | ZenDNN  | 2  | tg128 | 2.02 ± 0.00 |
| llama 8B BF16 | 14.96 GiB | 8.03 B | ZenDNN  | 4  | tg128 | 2.08 ± 0.00 |
| llama 8B BF16 | 14.96 GiB | 8.03 B | ZenDNN  | 8  | tg128 | 2.19 ± 0.01 |
| llama 8B BF16 | 14.96 GiB | 8.03 B | ZenDNN  | 16 | tg128 | 2.57 ± 0.01 |
I used `ZENDNNL_MATMUL_ALGO=2` as recommended. The default (whatever it is) gives a PP performance of 163 t/s.
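
For completeness, the numbers above come from invocations along these lines (a sketch only; the model filename is illustrative, and the ZenDNN backend is assumed to have been enabled at build time as per the PR):

```bash
# pp512 only: prompt processing with 16 threads
ZENDNNL_MATMUL_ALGO=2 ./llama-bench -m llama-3-8b-bf16.gguf -t 16 -p 512 -n 0
# tg128 only: token generation, sweeping the thread count
ZENDNNL_MATMUL_ALGO=2 ./llama-bench -m llama-3-8b-bf16.gguf -t 2,4,8,16 -p 0 -n 128
```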
In comparison, here is what we get with the `llama.cpp` CPU backend:
| model | size | params | backend | threads | test | t/s |
| ------------- | --------: | -----: | ------- | ------: | ----- | ------------: |
| llama 8B BF16 | 14.96 GiB | 8.03 B | CPU | 16 | pp512 | 113.34 ± 0.07 |
| llama 8B BF16 | 14.96 GiB | 8.03 B | CPU | 1  | tg128 | 3.07 ± 0.00 |
| llama 8B BF16 | 14.96 GiB | 8.03 B | CPU | 2  | tg128 | 3.78 ± 0.00 |
| llama 8B BF16 | 14.96 GiB | 8.03 B | CPU | 4  | tg128 | 4.03 ± 0.00 |
| llama 8B BF16 | 14.96 GiB | 8.03 B | CPU | 8  | tg128 | 3.95 ± 0.00 |
| llama 8B BF16 | 14.96 GiB | 8.03 B | CPU | 16 | tg128 | 3.94 ± 0.00 |
Aha. ZenDNN nearly doubles `llama.cpp` PP performance, but that's not really hard. TG, on the other hand, is almost 2X lower.
How does `ik_llama.cpp` compare? I ran the equivalent benchmark (sketched below), and here is what we get:
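
A hedged reconstruction of the invocation (model filename illustrative; `-rtr 1` enables run-time repacking of the model weights, which is what the `rtr` column below reports):

```bash
# ik_llama.cpp llama-bench with run-time repacking enabled
./llama-bench -m llama-3-8b-bf16.gguf -rtr 1 -t 16 -p 512 -n 0          # pp512
./llama-bench -m llama-3-8b-bf16.gguf -rtr 1 -t 1,2,4,8,16 -p 0 -n 128  # tg128
```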
| model | size | params | backend | threads | rtr | test | t/s |
| ------------- | --------: | -----: | ------- | ------: | --: | ----- | ------------: |
| llama 8B BF16 | 14.96 GiB | 8.03 B | CPU | 16 | 1 | pp512 | 276.45 ± 0.34 |
| llama 8B BF16 | 14.96 GiB | 8.03 B | CPU | 1  | 1 | tg128 | 3.44 ± 0.00 |
| llama 8B BF16 | 14.96 GiB | 8.03 B | CPU | 2  | 1 | tg128 | 3.95 ± 0.00 |
| llama 8B BF16 | 14.96 GiB | 8.03 B | CPU | 4  | 1 | tg128 | 3.93 ± 0.00 |
| llama 8B BF16 | 14.96 GiB | 8.03 B | CPU | 8  | 1 | tg128 | 3.87 ± 0.00 |
| llama 8B BF16 | 14.96 GiB | 8.03 B | CPU | 16 | 1 | tg128 | 3.92 ± 0.00 |
So, for PP `ik_llama.cpp` is 1.27X faster than ZenDNN and 2.44X faster than `llama.cpp`. TG is faster than `llama.cpp` for 1 and 2 threads, almost fully saturating the memory bandwidth with just 2 threads. `llama.cpp` somehow manages to saturate at a slightly higher TG speed at 4 threads. Both are faster with just a single thread than ZenDNN is with 16 threads (so more than 16X better energy efficiency when generating tokens)!
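
As a rough sanity check of the "saturating BW" claim (a back-of-the-envelope estimate, assuming all 14.96 GiB of `bf16` weights are read once per generated token):

$$
3.95\ \text{t/s} \times 14.96\ \text{GiB/token} \approx 59\ \text{GiB/s},
$$

which is in the ballpark of what dual-channel DDR5 on a 7950X can sustain in practice, i.e., TG is memory-bandwidth bound well before all 16 cores are in use.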