I haven't been contributing to `llama.cpp` since I left the project in March of 2024, but apparently I'm still one of the top contributors there 20 months later, so I got pinged in an RFC and subsequent PR that integrates ZenDNN into `llama.cpp`. ZenDNN is a matrix multiplication library specifically optimized for AMD CPUs. It supports `bf16` and `f32` GEMM. I haven't put a lot of effort into optimizing inference with floating point models (for me "Inference at the Edge" basically means using quantized models), so I decided to check if ZenDNN could be something for `ik_llama.cpp` to use for handling `bf16` and `f32` models.
The RFC and PR provide benchmark results for a big-iron, 96-core Zen4 CPU. I don't have that, but I do have a 16-core Ryzen-7950X, which is also Zen4, so ZenDNN should be optimized for it.
So, I pulled and built the PR (it required a minor modification in the `CMakeLists.txt` file), and here is what we get with `llama-bench` on the 7950X for `bf16` LLaMA-3-8B:
| model | size | params | backend | threads | test | t/s |
| ------------- | --------: | -----: | ------- | ------: | ----- | ------------: |
| llama 8B BF16 | 14.96 GiB | 8.03 B | ZenDNN  | 16 | pp512 | 218.88 ± 1.06 |
| llama 8B BF16 | 14.96 GiB | 8.03 B | ZenDNN  | 2  | tg128 | 2.02 ± 0.00 |
| llama 8B BF16 | 14.96 GiB | 8.03 B | ZenDNN  | 4  | tg128 | 2.08 ± 0.00 |
| llama 8B BF16 | 14.96 GiB | 8.03 B | ZenDNN  | 8  | tg128 | 2.19 ± 0.01 |
| llama 8B BF16 | 14.96 GiB | 8.03 B | ZenDNN  | 16 | tg128 | 2.57 ± 0.01 |
I used `ZENDNNL_MATMUL_ALGO=2` as recommended. The default (whatever it is) gives a PP performance of 163 t/s.
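
For completeness, the numbers above come from invocations along these lines (a sketch only; the model filename is illustrative, and the ZenDNN backend is assumed to have been enabled at build time as per the PR):

```bash
# pp512 only: prompt processing with 16 threads
ZENDNNL_MATMUL_ALGO=2 ./llama-bench -m llama-3-8b-bf16.gguf -t 16 -p 512 -n 0
# tg128 only: token generation, sweeping the thread count
ZENDNNL_MATMUL_ALGO=2 ./llama-bench -m llama-3-8b-bf16.gguf -t 2,4,8,16 -p 0 -n 128
```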
In comparison, here is what we get with the `llama.cpp` CPU backend:
| model | size | params | backend | threads | test | t/s |
| ------------- | --------: | -----: | ------- | ------: | ----- | ------------: |
| llama 8B BF16 | 14.96 GiB | 8.03 B | CPU | 16 | pp512 | 113.34 ± 0.07 |
| llama 8B BF16 | 14.96 GiB | 8.03 B | CPU | 1  | tg128 | 3.07 ± 0.00 |
| llama 8B BF16 | 14.96 GiB | 8.03 B | CPU | 2  | tg128 | 3.78 ± 0.00 |
| llama 8B BF16 | 14.96 GiB | 8.03 B | CPU | 4  | tg128 | 4.03 ± 0.00 |
| llama 8B BF16 | 14.96 GiB | 8.03 B | CPU | 8  | tg128 | 3.95 ± 0.00 |
| llama 8B BF16 | 14.96 GiB | 8.03 B | CPU | 16 | tg128 | 3.94 ± 0.00 |
Aha. ZenDNN nearly doubles `llama.cpp` PP performance, but that's not really hard. TG, on the other hand, is almost 2X lower.
How does `ik_llama.cpp` compare? I ran the equivalent benchmark (sketched below), and here is what we get:
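
A hedged reconstruction of the invocation (model filename illustrative; `-rtr 1` enables run-time repacking of the model weights, which is what the `rtr` column below reports):

```bash
# ik_llama.cpp llama-bench with run-time repacking enabled
./llama-bench -m llama-3-8b-bf16.gguf -rtr 1 -t 16 -p 512 -n 0          # pp512
./llama-bench -m llama-3-8b-bf16.gguf -rtr 1 -t 1,2,4,8,16 -p 0 -n 128  # tg128
```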
| model | size | params | backend | threads | rtr | test | t/s |
| ------------- | --------: | -----: | ------- | ------: | --: | ----- | ------------: |
| llama 8B BF16 | 14.96 GiB | 8.03 B | CPU | 16 | 1 | pp512 | 276.45 ± 0.34 |
| llama 8B BF16 | 14.96 GiB | 8.03 B | CPU | 1  | 1 | tg128 | 3.44 ± 0.00 |
| llama 8B BF16 | 14.96 GiB | 8.03 B | CPU | 2  | 1 | tg128 | 3.95 ± 0.00 |
| llama 8B BF16 | 14.96 GiB | 8.03 B | CPU | 4  | 1 | tg128 | 3.93 ± 0.00 |
| llama 8B BF16 | 14.96 GiB | 8.03 B | CPU | 8  | 1 | tg128 | 3.87 ± 0.00 |
| llama 8B BF16 | 14.96 GiB | 8.03 B | CPU | 16 | 1 | tg128 | 3.92 ± 0.00 |
So, for PP `ik_llama.cpp` is 1.27X faster than ZenDNN and 2.44X faster than `llama.cpp`. TG is faster than `llama.cpp` for 1 and 2 threads, almost fully saturating the memory bandwidth with just 2 threads. `llama.cpp` somehow manages to saturate at a slightly higher TG speed at 4 threads. Both are faster with just a single thread than ZenDNN is with 16 threads (so more than 16X better energy efficiency when generating tokens)!
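
As a rough sanity check of the "saturating BW" claim (a back-of-the-envelope estimate, assuming all 14.96 GiB of `bf16` weights are read once per generated token):

$$
3.95\ \text{t/s} \times 14.96\ \text{GiB/token} \approx 59\ \text{GiB/s},
$$

which is in the ballpark of what dual-channel DDR5 on a 7950X can sustain in practice, i.e., TG is memory-bandwidth bound well before all 16 cores are in use.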