vlut.cpp

License: MIT · arXiv:2512.06443

vlut.cpp (lookup table-based) vs. llama.cpp (dequantization-based) running Llama3-8B-1.58-100B-tokens on Intel Core Ultra 7 258V (see run_batched_decode.sh):

(Demo video: llama.demo.mp4)

vlut.cpp vs. llama.cpp vs. T-MAC in GeMM kernel benchmark (see Evaluation.md for a detailed evaluation guide):

(Figure: Vec-LUT kernel benchmark)

Introduction

vlut.cpp is a lightweight extension of llama.cpp that implements Vec-LUT: Vector Table Lookup for Parallel Ultra-Low-Bit LLM Inference on Edge Devices. It targets parallel ultra-low-bit LLM inference. Parallel scenarios include:

  • Prefilling (parallel input, most common).
  • Serving (mixed parallel input and output).
  • Parallel test-time scaling and speculative decoding (parallel output).

The Vec-LUT kernel is fast with:

  • Lookup table (LUT)-based design that replaces dequantization and multiplication with efficient table lookup (a sketch of this idea follows the list).
  • Vector LUT paradigm that performs efficient 1→N lookup and turns random lookup into contiguous vector addition.
  • Vector LUT-centric tensor layout and cache-aware streamed lookup that optimize the memory access patterns.
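
To make the first point concrete, here is a minimal C++ sketch of the general LUT-based matrix-vector product behind this family of kernels, not the actual vlut.cpp implementation: for each slice of the activation vector, the partial dot products against every possible ternary pattern of a small weight group are precomputed once, so the per-row inner loop becomes table lookups and additions instead of dequantization and multiplication. The group size, base-3 weight codes, and layout below are illustrative assumptions and do not correspond to the real I1 / I2 formats or the vector-LUT layout.

// A minimal sketch (not the vlut.cpp kernel) of the general LUT-based idea:
// precompute, for each slice of the activation vector, the partial dot
// products against every possible ternary pattern of a small weight group,
// then replace dequantize-and-multiply in the inner loop with table lookups.
#include <cstdint>
#include <vector>

constexpr int G = 4;        // weights per group (illustrative choice)
constexpr int TABLE = 81;   // 3^G possible ternary patterns per group

// lut[idx] = dot(act[0..G), pattern(idx)), where pattern digits are base-3
// codes mapped to {-1, 0, +1}.
static void build_lut(const float *act, float *lut) {
    for (int idx = 0; idx < TABLE; ++idx) {
        int code = idx;
        float s = 0.0f;
        for (int k = 0; k < G; ++k) {
            int trit = code % 3 - 1;   // decode k-th base-3 digit to -1/0/+1
            code /= 3;
            s += act[k] * trit;
        }
        lut[idx] = s;
    }
}

// GEMV y = W * x where each row of W is stored as one base-3 code per group
// of G ternary weights. codes has shape [rows][cols / G]; x has length cols.
static void lut_gemv(const std::vector<std::vector<uint8_t>> &codes,
                     const float *x, float *y, int rows, int cols) {
    const int groups = cols / G;
    std::vector<float> lut(static_cast<size_t>(groups) * TABLE);
    for (int g = 0; g < groups; ++g) {
        build_lut(x + g * G, lut.data() + g * TABLE);  // one table per slice
    }
    for (int r = 0; r < rows; ++r) {
        float acc = 0.0f;
        for (int g = 0; g < groups; ++g) {
            acc += lut[g * TABLE + codes[r][g]];       // lookup replaces mul-add
        }
        y[r] = acc;
    }
}

The saving comes from building the tables once per activation vector and reusing them across every output row; the Vec-LUT kernel then goes further by making the lookup itself a 1→N vector operation and laying out the tables so that lookups become contiguous loads, as described in the bullets above.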

Based on the Vec-LUT kernel, vlut.cpp is efficient and easy to use with:

  • llama.cpp-compatible kernel integration and similar usage.
  • Heuristic tiling strategy without costly tuning.

Supported Platforms

vlut.cpp supports all mainstream CPUs (Intel, AMD, ARM) and operating systems (Linux, Android, macOS, Windows). You can build and test vlut.cpp on almost any platform.

We recommend using the Windows Subsystem for Linux (WSL) on Windows, and Termux on Android. They provide Linux-like development environments.

Please refer to Evaluation.md for recommended specifications to run the evaluation.

Supported Models

vlut.cpp supports a range of ternary (1.58-bit) LLMs, including BitNet b1.58 models (e.g., bitnet_b1_58-3B) and Llama3-8B-1.58-100B-tokens.

Quick Start

This section walks you through the minimum steps required to run a ternary LLM with vlut.cpp:

  1. Install and build vlut.cpp.
  2. Convert a HuggingFace model into vlut-compatible GGUF.
  3. Quantize the model using Vec-LUT packings (I1 / I2).
  4. Run inference using llama-cli or benchmark with llama-bench.

For a more detailed evaluation pipeline (GeMM, prefill, batched decoding, multi-framework comparison), see Evaluation.md.

1. Installation

vlut.cpp follows the same build process as llama.cpp (CPU build); see how to build.

Run the following commands to build vlut.cpp with 4 parallel jobs:

cmake -B build
cmake --build build --config Release -j 4

2. Convert a HuggingFace model to GGUF

Before quantization, HuggingFace models (safetensors) must be converted to vlut GGUF.

Install dependencies:

pip install -r requirements.txt

Convert a model (BitNet 3B for example):

python ./convert_hf_to_gguf_vlut.py ~/models/bitnet_b1_58-3B --outfile ~/models/bitnet_b1_58-3B/bitnet_b1_58-3B.vlut.gguf

3. Quantize the model with Vec-LUT packings

vlut.cpp provides lossless ternary packings I1 and I2, with optional K-tiling variants (e.g., I1_V_2, I2_V_4).

Quantize the converted GGUF:

./build/bin/llama-quantize ~/models/bitnet_b1_58-3B/bitnet_b1_58-3B.vlut.gguf I1_V_2

./build/bin/llama-quantize ~/models/bitnet_b1_58-3B/bitnet_b1_58-3B.vlut.gguf I2_V_8

The quantized model will be saved as ggml-model-{quant_type}.gguf.
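
As context for the I1 / I2 packings named above, here is a minimal C++ sketch of how a lossless 2-bit ternary packing works in principle: each weight from {-1, 0, +1} maps to a 2-bit code, four weights per byte, and unpacking recovers the weights exactly. The encoding and flat layout below are illustrative assumptions only; the actual I1 / I2 formats and their K-tiling variants (e.g., I1_V_2) arrange data differently to suit the Vec-LUT kernel.

// Illustrative lossless 2-bit ternary packing (not the actual I2 format):
// four weights from {-1, 0, +1} are stored per byte, and the round trip
// pack -> unpack is exact.
#include <cstdint>
#include <vector>

// Pack ternary weights (values in {-1, 0, +1}) into 2-bit codes, 4 per byte.
static std::vector<uint8_t> pack_ternary(const std::vector<int8_t> &w) {
    std::vector<uint8_t> out((w.size() + 3) / 4, 0);
    for (size_t i = 0; i < w.size(); ++i) {
        uint8_t code = static_cast<uint8_t>(w[i] + 1);   // -1/0/+1 -> 0/1/2
        out[i / 4] |= code << (2 * (i % 4));
    }
    return out;
}

// Recover the original ternary weights; the packing is lossless.
static std::vector<int8_t> unpack_ternary(const std::vector<uint8_t> &packed,
                                          size_t n) {
    std::vector<int8_t> w(n);
    for (size_t i = 0; i < n; ++i) {
        uint8_t code = (packed[i / 4] >> (2 * (i % 4))) & 0x3;
        w[i] = static_cast<int8_t>(code) - 1;            // 0/1/2 -> -1/0/+1
    }
    return w;
}

A denser I1-style packing can presumably approach the information-theoretic ~1.58 bits per ternary weight (log2 3), but the exact layouts are defined by vlut.cpp itself.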

4. Run inference

Use llama-cli to perform a quick functional check (the -no-cnv flag disables conversation mode so the prompt is completed directly):

./build/bin/llama-cli -m model.gguf -p "I believe the meaning of life is" -no-cnv

5. Benchmark performance

llama-bench lets you measure inference performance for various parameters.

Example:

./build/bin/llama-bench -m model.gguf -t 4 -p 128 -n 0

Here -t sets the number of threads, -p 128 benchmarks prefill on a 128-token prompt, and -n 0 skips the text-generation benchmark.

Acknowledgement

This project is built atop llama.cpp. Thanks to all the contributors for their valuable work!

The LUT-based idea is inspired by T-MAC, which is primarily optimized for non-parallel scenarios (e.g., single-batch decoding).

Citation

If you find this project useful, please cite our paper:

@article{li2025veclut,
  title={Vec-LUT: Vector Table Lookup for Parallel Ultra-Low-Bit LLM Inference on Edge Devices},
  author={Li, Xiangyu and Yin, Chengyu and Wang, Weijun and Wei, Jianyu and Cao, Ting and Liu, Yunxin},
  journal={arXiv preprint arXiv:2512.06443},
  year={2025},
  url={https://arxiv.org/abs/2512.06443}
}
