
Commit dfb155d

Manoj Kumar authored and z-vishal committed
ggml-zendnn: add ZenDNN backend support
1 parent 03d9a77 commit dfb155d

File tree

12 files changed (+19728 −109 lines)


README.md

Lines changed: 1 addition & 0 deletions
@@ -276,6 +276,7 @@ Instructions for adding support for new models: [HOWTO-add-model.md](docs/develo
| [MUSA](docs/build.md#musa) | Moore Threads GPU |
| [CUDA](docs/build.md#cuda) | Nvidia GPU |
| [HIP](docs/build.md#hip) | AMD GPU |
| [ZenDNN](docs/build.md#zendnn) | AMD CPU |
| [Vulkan](docs/build.md#vulkan) | GPU |
| [CANN](docs/build.md#cann) | Ascend NPU |
| [OpenCL](docs/backend/OPENCL.md) | Adreno GPU |

docs/backend/ZenDNN.md

Lines changed: 253 additions & 0 deletions
@@ -0,0 +1,253 @@
# llama.cpp for AMD ZenDNN

- [Background](#background)
- [OS](#os)
- [Hardware](#hardware)
- [Supported Operations](#supported-operations)
- [DataType Supports](#datatype-supports)
- [Linux](#linux)
- [Environment Variable](#environment-variable)
- [Performance Optimization](#performance-optimization)
- [Known Issues](#known-issues)
- [Q&A](#qa)
- [TODO](#todo)
## Background

**ZenDNN** (Zen Deep Neural Network Library) is AMD's high-performance deep learning inference library optimized for AMD EPYC™ CPUs. It provides optimized implementations of key deep learning primitives and operations, delivering significant performance improvements for neural network workloads on AMD Zen-based processor architectures.

**Llama.cpp + ZenDNN**

The llama.cpp ZenDNN backend leverages AMD's optimized matrix multiplication primitives to accelerate inference on AMD CPUs. It utilizes ZenDNN's **LowOHA (Low Overhead Hardware Accelerated)** MatMul operator for efficient GEMM operations with minimal execution overhead, built-in weight caching, and direct access to backend libraries (AOCL BLIS, LibXSMM, OneDNN).

For more information about ZenDNN, visit: https://www.amd.com/en/developer/zendnn.html

## OS

| OS    | Status  | Verified                   |
|:-----:|:-------:|:--------------------------:|
| Linux | Support | Ubuntu 20.04, 22.04, 24.04 |

For the latest list of supported operating systems, see the [ZenDNN Supported OS](https://github.com/amd/ZenDNN/blob/zendnnl/README.md#15-supported-os).

## Hardware

### AMD CPUs

**Recommended Processors**

ZenDNN is optimized for AMD EPYC™ and AMD Ryzen™ processors based on the "Zen" microarchitecture and newer.

| CPU Family                      | Status  | Notes                              |
|:-------------------------------:|:-------:|:----------------------------------:|
| AMD EPYC™ 9005 Series (Turin)   | Support | 5th Gen - Zen 5 architecture       |
| AMD EPYC™ 9004 Series (Genoa)   | Support | 4th Gen - Zen 4 architecture       |
| AMD EPYC™ 7003 Series (Milan)   | Support | 3rd Gen - Zen 3 architecture       |
| AMD Ryzen™ AI MAX (Strix Halo)  | Support | High-performance mobile processors |

*Notes:*

- Best performance is achieved on AMD EPYC™ processors with high core counts (e.g., EPYC 9005 series).
- ZenDNN leverages AMD's advanced CPU features, including the AVX2 and AVX-512 instruction sets.
- For optimal performance, ensure your system has sufficient memory bandwidth.
## Supported Operations

The ZenDNN backend currently accelerates **matrix multiplication (MUL_MAT)** operations only. Other operations are handled by the standard CPU backend.

| Operation | Status  | Notes                                |
|:----------|:-------:|:------------------------------------:|
| MUL_MAT   | Support | Accelerated via ZenDNN LowOHA MatMul |

*Note:* Since only MUL_MAT is accelerated, models will benefit most from ZenDNN when matrix multiplications dominate the computational workload (which is typical for transformer-based LLMs).
## DataType Supports

| DataType | Status  | Notes                                      |
|:--------:|:-------:|:------------------------------------------:|
| FP32     | Support | Full precision floating point              |
| BF16     | Support | BFloat16 (best performance on Zen 4/Zen 5) |

*Notes:*

- **BF16** provides best performance on Zen 4 and Zen 5 EPYC™ processors (Genoa, Turin).
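If your model is only available as an FP16/FP32 Hugging Face checkpoint, a BF16 GGUF can be produced with llama.cpp's `convert_hf_to_gguf.py` script. The sketch below is a minimal example; the checkpoint path and output filename are placeholders, not values fixed by this backend.

```sh
# Minimal sketch: convert a local HF checkpoint to a BF16 GGUF (paths are placeholders)
python convert_hf_to_gguf.py /path/to/Llama-3.1-8B-Instruct \
    --outtype bf16 \
    --outfile models/Llama-3.1-8B-Instruct.BF16.gguf
```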
## Linux

### I. Setup Environment

You have two options to set up ZenDNN:

#### Option 1: Automatic Download and Build (Recommended)

CMake will automatically download and build ZenDNN for you:

```sh
# Build llama.cpp - ZenDNN will be automatically downloaded and built
cmake -B build -DGGML_ZENDNN=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j $(nproc)
```

No manual ZenDNN installation is required; CMake handles everything automatically.

#### Option 2: Use a Custom ZenDNN Installation

If you want to build ZenDNN yourself or use a specific version:

**Step 1: Build ZenDNN from source**

```sh
# Clone the ZenDNN repository
git clone https://github.com/amd/ZenDNN.git
cd ZenDNN
git checkout zendnnl

# Build and install (requires CMake >= 3.25)
mkdir build && cd build
cmake ..
cmake --build . --target all
```

Default installation path: `ZenDNN/build/install`

For detailed build instructions, refer to the [ZenDNN README](https://github.com/amd/ZenDNN/blob/zendnnl/README.md).

**Step 2: Build llama.cpp with the custom ZenDNN path**

```sh
# Using an environment variable
export GGML_ZENDNN_PATH=/path/to/ZenDNN/build/install
cmake -B build -DGGML_ZENDNN=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j $(nproc)

# OR specify the path directly in CMake
cmake -B build -DGGML_ZENDNN=ON -DGGML_ZENDNN_PATH=/path/to/ZenDNN/build/install -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j $(nproc)
```
### II. Run the Server

#### 1. Download Model

Download the LLaMA 3.1 8B Instruct BF16 model:

```sh
# Download from Hugging Face
huggingface-cli download meta-llama/Llama-3.1-8B-Instruct-GGUF --local-dir models/
```

#### 2. Start Server

Run the llama.cpp server with ZenDNN acceleration:

```sh
# Set optimal configuration
export OMP_NUM_THREADS=64      # Adjust to your CPU core count
export ZENDNNL_MATMUL_ALGO=2   # Blocked AOCL BLIS for best performance

# Start server
./build/bin/llama-server \
    -m models/Llama-3.1-8B-Instruct.BF16.gguf \
    --host 0.0.0.0 \
    --port 8080 \
    -t 64
```

Access the server at `http://localhost:8080`.
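To confirm the server is responding, you can query its health endpoint and its OpenAI-compatible chat endpoint; the request body below is only an illustrative example, not a required format for this backend.

```sh
# Quick liveness check
curl http://localhost:8080/health

# Minimal chat request against the OpenAI-compatible endpoint
curl http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"messages": [{"role": "user", "content": "Say hello in one sentence."}], "max_tokens": 32}'
```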
**Performance tips**:

- Set `OMP_NUM_THREADS` to match your physical core count.
- Use `ZENDNNL_MATMUL_ALGO=2` for optimal performance.
- For NUMA systems, bind the process to one node, e.g. `numactl --cpunodebind=0 --membind=0 ./build/bin/llama-server ...` (see the sketch below).
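As a rough sketch (the node index and core count are machine-specific assumptions), you can read the socket/NUMA layout with `lscpu` and then pin both threads and memory to a single node before launching the server:

```sh
# Inspect topology: sockets, cores per socket, NUMA nodes
lscpu | grep -E 'Socket\(s\)|Core\(s\) per socket|NUMA node\(s\)'

# Example: pin threads and memory to NUMA node 0 (assumes 64 physical cores on that node)
export OMP_NUM_THREADS=64
numactl --cpunodebind=0 --membind=0 \
    ./build/bin/llama-server -m models/Llama-3.1-8B-Instruct.BF16.gguf -t 64
```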
## Environment Variable

### Build Time

| Name             | Value                       | Function                          |
|------------------|-----------------------------|-----------------------------------|
| GGML_ZENDNN      | ON/OFF                      | Enable ZenDNN backend support     |
| GGML_ZENDNN_PATH | Path to ZenDNN installation | Set ZenDNN installation directory |
| GGML_OPENMP      | ON/OFF (recommended: ON)    | Enable OpenMP for multi-threading |
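For example, a configure line that sets all three build-time options explicitly (the installation path is a placeholder) might look like:

```sh
# Explicitly set every build-time option from the table above
cmake -B build \
    -DGGML_ZENDNN=ON \
    -DGGML_ZENDNN_PATH=/path/to/ZenDNN/build/install \
    -DGGML_OPENMP=ON \
    -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j $(nproc)
```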
### Runtime

| Name                      | Value             | Function                                                         |
|---------------------------|-------------------|------------------------------------------------------------------|
| OMP_NUM_THREADS           | Number (e.g., 64) | Set number of OpenMP threads (recommended: physical core count)  |
| ZENDNNL_MATMUL_ALGO       | 0-5               | Select MatMul backend algorithm (see Performance Optimization)   |
| ZENDNNL_PROFILE_LOG_LEVEL | 0-4               | Profiling log level (0=disabled, 4=verbose)                      |
| ZENDNNL_ENABLE_PROFILER   | 0 or 1            | Enable detailed profiling (1=enabled)                            |
| ZENDNNL_API_LOG_LEVEL     | 0-4               | API log level (0=disabled, 4=verbose)                            |

**Example**:

```sh
export OMP_NUM_THREADS=64
export ZENDNNL_MATMUL_ALGO=2   # Use Blocked AOCL BLIS for best performance
./build/bin/llama-cli -m models/llama-2-7b.Q4_0.gguf -p "Test" -n 100
```
## Performance Optimization

### MatMul Algorithm Selection

ZenDNN's LowOHA MatMul supports multiple backend algorithms. For **best performance**, use the **Blocked AOCL BLIS** algorithm:

```sh
export ZENDNNL_MATMUL_ALGO=2   # Blocked AOCL BLIS (recommended)
```

**Available algorithms**:

| Value | Algorithm         | Description                           |
|:-----:|:------------------|:--------------------------------------|
| 0     | Dynamic Dispatch  | Automatic backend selection (default) |
| 1     | AOCL BLIS         | AOCL BLIS backend                     |
| 2     | AOCL BLIS Blocked | **Blocked AOCL BLIS (recommended)**   |
| 3     | OneDNN            | OneDNN backend                        |
| 4     | OneDNN Blocked    | Blocked OneDNN                        |
| 5     | LibXSMM           | LibXSMM backend                       |
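Since the best algorithm can depend on model shape and core count, a quick way to compare them is to sweep `ZENDNNL_MATMUL_ALGO` with `llama-bench` (built alongside the other llama.cpp binaries). The model path and thread count below are placeholders; treat this as a sketch, not a required procedure.

```sh
# Hypothetical sweep: benchmark each MatMul backend and compare tokens/s
for algo in 0 1 2 3 4 5; do
    echo "ZENDNNL_MATMUL_ALGO=$algo"
    ZENDNNL_MATMUL_ALGO=$algo OMP_NUM_THREADS=64 \
        ./build/bin/llama-bench -m models/Llama-3.1-8B-Instruct.BF16.gguf -t 64 -p 512 -n 128
done
```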
### Profiling and Debugging

For detailed profiling and logging options, refer to the [ZenDNN Logging Documentation](https://github.com/amd/ZenDNN/blob/zendnnl/docs/logging.md).
## Known Issues

- **Limited operation support**: Currently only matrix multiplication (MUL_MAT) is accelerated via ZenDNN. Other operations fall back to the standard CPU backend.
- **BF16 support**: BF16 operations require AMD Zen 4 or Zen 5 architecture (EPYC 9004/9005 series). On older CPUs, operations will use FP32 (a quick capability check is sketched below).
- **NUMA awareness**: For multi-socket systems, manual NUMA binding may be required for optimal performance.
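To check whether the host CPU exposes native BF16 support (the `avx512_bf16` CPU flag on Zen 4/Zen 5), one possible check is:

```sh
# Prints the flag if the CPU supports AVX-512 BF16; otherwise reports the fallback
grep -o -m1 'avx512_bf16' /proc/cpuinfo || echo "No native BF16 - FP32 paths will be used"
```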
## Q&A

**Q: How do I verify that the ZenDNN backend is being used?**

A: Check the log output when running llama.cpp. You should see messages indicating the ZenDNN backend is initialized. You can also check the backend name in the output.
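One low-effort way to check is to run a short generation and filter the logs for the backend name; the exact log strings depend on the build, so treat this as a sketch rather than a guaranteed output format.

```sh
# Run a tiny generation and look for ZenDNN mentions in the logs
./build/bin/llama-cli -m models/Llama-3.1-8B-Instruct.BF16.gguf -p "Hello" -n 8 2>&1 | grep -i zendnn
```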
**Q: What performance improvement can I expect?**

A: Performance gains vary depending on the model size, batch size, and CPU architecture. On AMD EPYC processors, you can typically expect a 1.1x-2x speedup compared to standard CPU inference for matrix multiplication operations.

**Q: Can I use ZenDNN on non-AMD processors?**

A: ZenDNN is optimized specifically for AMD processors. While it may work on other x86-64 CPUs, performance benefits are only guaranteed on AMD Zen-based architectures.

**Q: Does ZenDNN support quantized models?**

A: Currently, ZenDNN primarily supports the FP32 and BF16 data types. Quantized model support is not available at this time.

**Q: Why is my inference not faster with ZenDNN?**

A: Check the following:

1. You are using an AMD EPYC or Ryzen processor (Zen 2 or newer).
2. `OMP_NUM_THREADS` is set appropriately (physical core count).
3. `ZENDNNL_MATMUL_ALGO=2` is set for best performance (Blocked AOCL BLIS).
4. You are using a sufficiently large model (small models may not benefit as much).
5. Profiling is enabled to verify that the ZenDNN MatMul is actually being called (see the sketch below).
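For point 5, the runtime profiling variables documented above can be combined into a short run. Where the profiler writes its output depends on the ZenDNN build and logging configuration, so this is only a sketch:

```sh
# Enable the ZenDNN profiler and verbose profiling logs for a short run
export ZENDNNL_ENABLE_PROFILER=1
export ZENDNNL_PROFILE_LOG_LEVEL=4
./build/bin/llama-cli -m models/Llama-3.1-8B-Instruct.BF16.gguf -p "Test" -n 32
```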
### GitHub Contribution

Please add the **[ZenDNN]** prefix/tag to issue and PR titles so the ZenDNN team can triage and address them without delay.

## TODO

- Expand operation support beyond MUL_MAT (attention operations, activations, etc.)

docs/build.md

Lines changed: 32 additions & 0 deletions
@@ -495,6 +495,38 @@ llama_new_context_with_model: CANN compute buffer size = 1260.81 MiB
For detailed info, such as model/device supports, CANN install, please refer to [llama.cpp for CANN](./backend/CANN.md).

## ZenDNN

ZenDNN provides optimized deep learning primitives for AMD EPYC™ CPUs. It accelerates matrix multiplication operations for inference workloads.

### Compilation

- Using `CMake` on Linux (automatic build):

```bash
cmake -B build -DGGML_ZENDNN=ON
cmake --build build --config Release
```

The first build will automatically download and build ZenDNN, which may take 5-10 minutes. Subsequent builds will be much faster.

- Using `CMake` with a custom ZenDNN installation:

```bash
cmake -B build -DGGML_ZENDNN=ON -DGGML_ZENDNN_PATH=/path/to/zendnn/install
cmake --build build --config Release
```

### Testing

You can test with:

```bash
./build/bin/llama-cli -m PATH_TO_MODEL -p "Building a website can be done in 10 steps:" -n 50
```

For detailed information about hardware support, setup instructions, and performance optimization, refer to [llama.cpp for ZenDNN](./backend/ZenDNN.md).

## Arm® KleidiAI™

KleidiAI is a library of optimized microkernels for AI workloads, specifically designed for Arm CPUs. These microkernels enhance performance and can be enabled for use by the CPU backend.