vlut.cpp (lookup table-based) vs. llama.cpp (dequantization-based) running Llama3-8B-1.58-100B-tokens on Intel Core Ultra 7 258V (see run_batched_decode.sh):
llama.demo.mp4
vlut.cpp vs. llama.cpp vs. T-MAC in GeMM kernel benchmark (see Evaluation.md for a detailed evaluation guide):
vlut.cpp is a lightweight extension of llama.cpp that implements Vec-LUT (Vector Table Lookup for Parallel Ultra-Low-Bit LLM Inference on Edge Devices) and targets parallel ultra-low-bit LLM inference. Parallel scenarios include:
- Prefilling (parallel input, most common).
- Serving (mixed parallel input and output).
- Parallel test-time scaling and speculative decoding (parallel output).
The Vec-LUT kernel is fast thanks to three design choices (sketched in code after this list):
- Lookup table (LUT)-based design that replaces dequantization and multiplication with efficient table lookup.
- Vector LUT paradigm that performs efficient 1→N lookup and turns random lookup into contiguous vector addition.
- Vector LUT-centric tensor layout and cache-aware streamed lookup that optimize memory access patterns.
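To make the first two points concrete, below is a minimal scalar sketch of the general LUT-based technique, not the actual vlut.cpp kernel; the 2-bit ternary encoding, the group size of 4, and all names are assumptions for illustration. A 256-entry table of partial sums is precomputed once per group of four activations and reused by every output row, so per-weight dequantization and multiplication collapse into a single table lookup per packed byte.

```cpp
// Illustrative sketch only (not the vlut.cpp kernel): LUT-based GeMV for
// ternary weights packed 4-per-byte with an assumed 2-bit encoding
// (0 -> -1, 1 -> 0, 2 -> +1).
#include <cstdint>
#include <vector>

static inline float decode(uint8_t c) { return static_cast<float>(static_cast<int>(c) - 1); }

// y[r] = sum_k W[r][k] * x[k], with W stored as packed 2-bit ternary codes.
void lut_gemv(const uint8_t* W_packed,  // rows * (n / 4) bytes, row-major
              const float* x, float* y, int rows, int n) {
    const int groups = n / 4;
    std::vector<float> tbl(static_cast<size_t>(groups) * 256);

    // 1) Precompute, once per group of 4 activations, the partial sum for
    //    every possible packed weight byte (256 patterns).
    for (int g = 0; g < groups; ++g) {
        for (int b = 0; b < 256; ++b) {
            tbl[g * 256 + b] = decode(b & 3)        * x[4 * g + 0]
                             + decode((b >> 2) & 3) * x[4 * g + 1]
                             + decode((b >> 4) & 3) * x[4 * g + 2]
                             + decode((b >> 6) & 3) * x[4 * g + 3];
        }
    }

    // 2) Every output row reuses the tables: one lookup per packed byte
    //    replaces 4 dequantizations and 4 multiplications.
    for (int r = 0; r < rows; ++r) {
        float acc = 0.0f;
        for (int g = 0; g < groups; ++g) {
            acc += tbl[g * 256 + W_packed[r * groups + g]];
        }
        y[r] = acc;
    }
}
```

The table-build cost is amortized across all output rows, and in the parallel setting across all tokens in the batch; the actual Vec-LUT kernel further turns these lookups into 1→N vectorized, contiguous accesses with its own tensor layout.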
Built on top of the Vec-LUT kernel, vlut.cpp is efficient and easy to use thanks to:
- llama.cpp-compatible kernel integration and similar usage.
- Heuristic tiling strategy without costly tuning.
vlut.cpp supports all mainstream CPUs (Intel, AMD, ARM) and operating systems (Linux, Android, macOS, Windows), so you can build and test it on almost any platform.
We recommend using the Windows Subsystem for Linux (WSL) on Windows, and Termux on Android. They provide Linux-like development environments.
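On Termux, for example, a working toolchain can usually be set up with the following packages (package names are an assumption based on common Termux usage, not a vlut.cpp requirement list):

```bash
pkg update
pkg install -y git cmake clang python
```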
Please refer to Evaluation.md for recommended specifications to run the evaluation.
vlut.cpp now supports a rich set of ternary (1.58-bit) LLMs:
- HF BitNet family (example: 1bitLLM/bitnet_b1_58-3B)
- Llama family, 1.58-bit variants (example: HF1BitLLM/Llama3-8B-1.58-100B-tokens)
- Falcon3 family (example: tiiuae/Falcon3-1B-Instruct-1.58bit)
- TriLM family (example: SpectraSuite/TriLM_3.9B_Unpacked)
This section walks you through the minimum steps required to run a ternary LLM with vlut.cpp:
- Install and build vlut.cpp.
- Convert a HuggingFace model into vlut-compatible GGUF.
- Quantize the model using Vec-LUT packings (I1 / I2).
- Run inference with llama-cli or benchmark with llama-bench.
For a more detailed evaluation pipeline (GeMM, prefill, batched decoding, multi-framework comparison), see Evaluation.md.
vlut.cpp follows the same build process as llama.cpp (CPU build); see the llama.cpp build guide for details.
Run the following commands to build vlut.cpp with 4 parallel jobs:
cmake -B build
cmake --build build --config Release -j 4
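If the build succeeds, the binaries used in the rest of this guide should appear under build/bin (the standard llama.cpp CMake layout):

```bash
ls build/bin/llama-cli build/bin/llama-quantize build/bin/llama-bench
```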
Before quantization, HuggingFace models (safetensors) must be converted to a vlut-compatible GGUF.

Install the Python dependencies:
pip install -r requirements.txt
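If the model weights are not already on disk, one way to fetch them is with the Hugging Face CLI (shown here as an assumption; it is not part of vlut.cpp's requirements, and any download method that produces the safetensors checkpoint works):

```bash
pip install -U "huggingface_hub[cli]"
huggingface-cli download 1bitLLM/bitnet_b1_58-3B --local-dir ~/models/bitnet_b1_58-3B
```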
Convert a model (BitNet 3B, for example):

python ./convert_hf_to_gguf_vlut.py ~/models/bitnet_b1_58-3B --outfile ~/models/bitnet_b1_58-3B/bitnet_b1_58-3B.vlut.gguf

vlut.cpp provides lossless ternary packings I1 and I2, with optional K-tiling variants (e.g., I1_V_2, I2_V_4).
Quantize the converted GGUF:
./build/bin/llama-quantize ~/models/bitnet_b1_58-3B/bitnet_b1_58-3B.vlut.gguf I1_V_2
./build/bin/llama-quantize ~/models/bitnet_b1_58-3B/bitnet_b1_58-3B.vlut.gguf I2_V_8

The quantized model is saved as ggml-model-{quant_type}.gguf.
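To produce several packings for comparison, the command above can be wrapped in a small loop; the variant names below are taken from the examples in this section, and the set actually supported may differ by build:

```bash
for q in I1_V_2 I2_V_4 I2_V_8; do
    ./build/bin/llama-quantize ~/models/bitnet_b1_58-3B/bitnet_b1_58-3B.vlut.gguf "$q"
done
```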
Use llama-cli to perform a quick functional check:
./build/bin/llama-cli -m model.gguf -p "I believe the meaning of life is" -no-cnv

llama-bench lets you measure inference performance for various parameter settings.

Example:
./build/bin/llama-bench -m model.gguf -t 4 -p 128 -n 0
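llama-bench also accepts comma-separated value lists (as in upstream llama.cpp's llama-bench), so a few thread counts and prefill sizes can be swept in one invocation; the combination below is only an example:

```bash
./build/bin/llama-bench -m model.gguf -t 4,8 -p 128,512 -n 0
```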
This project is built on top of llama.cpp. Thanks to all the contributors for their valuable work!

The LUT-based idea is inspired by T-MAC, which is primarily optimized for non-parallel scenarios (e.g., single-batch decoding).
If you find this project useful, please cite our paper:
@article{li2025veclut,
title={Vec-LUT: Vector Table Lookup for Parallel Ultra-Low-Bit LLM Inference on Edge Devices},
author={Li, Xiangyu and Yin, Chengyu and Wang, Weijun and Wei, Jianyu and Cao, Ting and Liu, Yunxin},
journal={arXiv preprint arXiv:2512.06443},
year={2025},
url={https://arxiv.org/abs/2512.06443}
}