vlut.cpp

License: MIT · arXiv:2512.06443

vlut.cpp (lookup table-based) vs. llama.cpp (dequantization-based) running Llama3-8B-1.58-100B-tokens on Intel Core Ultra 7 258V (see run_batched_decode.sh):

(Demo video: llama.demo.mp4)

vlut.cpp vs. llama.cpp vs. T-MAC in GeMM kernel benchmark (see Evaluation.md for a detailed evaluation guide):

(Figure: Vec-LUT kernel benchmark)

Introduction

vlut.cpp is a lightweight extension of llama.cpp that implements Vec-LUT: Vector Table Lookup for Parallel Ultra-Low-Bit LLM Inference on Edge Devices. It targets parallel ultra-low-bit LLM inference. Parallel scenarios include:

  • Prefilling (parallel input, most common).
  • Serving (mixed parallel input and output).
  • Parallel test-time scaling and speculative decoding (parallel output).

The Vec-LUT kernel is fast with:

  • Lookup table (LUT)-based design that replaces dequantization and multiplication with efficient table lookup (a sketch of this idea follows the list).
  • Vector LUT paradigm that performs efficient 1→N lookup and turns random lookup into contiguous vector addition.
  • Vector LUT-centric tensor layout and cache-aware streamed lookup that optimize the memory access patterns.
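
To make the first point concrete, here is a minimal C++ sketch of the general LUT-based matrix-vector product behind this family of kernels, not the actual vlut.cpp implementation: for each slice of the activation vector, the partial dot products against every possible ternary pattern of a small weight group are precomputed once, so the per-row inner loop becomes table lookups and additions instead of dequantization and multiplication. The group size, base-3 weight codes, and layout below are illustrative assumptions and do not correspond to the real I1 / I2 formats or the vector-LUT layout.

// A minimal sketch (not the vlut.cpp kernel) of the general LUT-based idea:
// precompute, for each slice of the activation vector, the partial dot
// products against every possible ternary pattern of a small weight group,
// then replace dequantize-and-multiply in the inner loop with table lookups.
#include <cstdint>
#include <vector>

constexpr int G = 4;        // weights per group (illustrative choice)
constexpr int TABLE = 81;   // 3^G possible ternary patterns per group

// lut[idx] = dot(act[0..G), pattern(idx)), where pattern digits are base-3
// codes mapped to {-1, 0, +1}.
static void build_lut(const float *act, float *lut) {
    for (int idx = 0; idx < TABLE; ++idx) {
        int code = idx;
        float s = 0.0f;
        for (int k = 0; k < G; ++k) {
            int trit = code % 3 - 1;   // decode k-th base-3 digit to -1/0/+1
            code /= 3;
            s += act[k] * trit;
        }
        lut[idx] = s;
    }
}

// GEMV y = W * x where each row of W is stored as one base-3 code per group
// of G ternary weights. codes has shape [rows][cols / G]; x has length cols.
static void lut_gemv(const std::vector<std::vector<uint8_t>> &codes,
                     const float *x, float *y, int rows, int cols) {
    const int groups = cols / G;
    std::vector<float> lut(static_cast<size_t>(groups) * TABLE);
    for (int g = 0; g < groups; ++g) {
        build_lut(x + g * G, lut.data() + g * TABLE);  // one table per slice
    }
    for (int r = 0; r < rows; ++r) {
        float acc = 0.0f;
        for (int g = 0; g < groups; ++g) {
            acc += lut[g * TABLE + codes[r][g]];       // lookup replaces mul-add
        }
        y[r] = acc;
    }
}

The saving comes from building the tables once per activation vector and reusing them across every output row; the Vec-LUT kernel then goes further by making the lookup itself a 1→N vector operation and laying out the tables so that lookups become contiguous loads, as described in the bullets above.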

Based on the Vec-LUT kernel, vlut.cpp is efficient and easy to use with:

  • llama.cpp-compatible kernel integration and similar usage.
  • Heuristic tiling strategy without costly tuning.

Supported Platforms

vlut.cpp supports all mainstream CPUs (Intel, AMD, ARM) and operating systems (Linux, Android, macOS, Windows). You can build and test vlut.cpp on almost any platform.

We recommend using the Windows Subsystem for Linux (WSL) on Windows, and Termux on Android. They provide Linux-like development environments.

Please refer to Evaluation.md for recommended specifications to run the evaluation.

Supported Models

vlut.cpp supports a range of ternary (1.58-bit) LLMs, including BitNet b1.58 models (e.g., bitnet_b1_58-3B) and Llama3-8B-1.58-100B-tokens.

Quick Start

This section walks you through the minimum steps required to run a ternary LLM with vlut.cpp:

  1. Install and build vlut.cpp.
  2. Convert a HuggingFace model into vlut-compatible GGUF.
  3. Quantize the model using Vec-LUT packings (I1 / I2).
  4. Run inference using llama-cli or benchmark with llama-bench.

For a more detailed evaluation pipeline (GeMM, prefill, batched decoding, multi-framework comparison), see Evaluation.md.

1. Installation

vlut.cpp follows the same build process as llama.cpp (CPU build); see how to build.

Run the following commands to build vlut.cpp with 4 parallel jobs:

cmake -B build
cmake --build build --config Release -j 4

2. Convert a HuggingFace model to GGUF

Before quantization, HuggingFace models (safetensors) must be converted to vlut GGUF.

Install dependencies:

pip install -r requirements.txt

Convert a model (BitNet 3B for example):

python ./convert_hf_to_gguf_vlut.py ~/models/bitnet_b1_58-3B --outfile ~/models/bitnet_b1_58-3B/bitnet_b1_58-3B.vlut.gguf

3. Quantize the model with Vec-LUT packings

vlut.cpp provides lossless ternary packings I1 and I2, with optional K-tiling variants (e.g., I1_V_2, I2_V_4).

Quantize the converted GGUF:

./build/bin/llama-quantize ~/models/bitnet_b1_58-3B/bitnet_b1_58-3B.vlut.gguf I1_V_2

./build/bin/llama-quantize ~/models/bitnet_b1_58-3B/bitnet_b1_58-3B.vlut.gguf I2_V_8

The quantized model will be saved as ggml-model-{quant_type}.gguf.
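
As context for the I1 / I2 packings named above, here is a minimal C++ sketch of how a lossless 2-bit ternary packing works in principle: each weight from {-1, 0, +1} maps to a 2-bit code, four weights per byte, and unpacking recovers the weights exactly. The encoding and flat layout below are illustrative assumptions only; the actual I1 / I2 formats and their K-tiling variants (e.g., I1_V_2) arrange data differently to suit the Vec-LUT kernel.

// Illustrative lossless 2-bit ternary packing (not the actual I2 format):
// four weights from {-1, 0, +1} are stored per byte, and the round trip
// pack -> unpack is exact.
#include <cstdint>
#include <vector>

// Pack ternary weights (values in {-1, 0, +1}) into 2-bit codes, 4 per byte.
static std::vector<uint8_t> pack_ternary(const std::vector<int8_t> &w) {
    std::vector<uint8_t> out((w.size() + 3) / 4, 0);
    for (size_t i = 0; i < w.size(); ++i) {
        uint8_t code = static_cast<uint8_t>(w[i] + 1);   // -1/0/+1 -> 0/1/2
        out[i / 4] |= code << (2 * (i % 4));
    }
    return out;
}

// Recover the original ternary weights; the packing is lossless.
static std::vector<int8_t> unpack_ternary(const std::vector<uint8_t> &packed,
                                          size_t n) {
    std::vector<int8_t> w(n);
    for (size_t i = 0; i < n; ++i) {
        uint8_t code = (packed[i / 4] >> (2 * (i % 4))) & 0x3;
        w[i] = static_cast<int8_t>(code) - 1;            // 0/1/2 -> -1/0/+1
    }
    return w;
}

A denser I1-style packing can presumably approach the information-theoretic ~1.58 bits per ternary weight (log2 3), but the exact layouts are defined by vlut.cpp itself.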

4. Run inference

Use llama-cli to perform a quick functional check (the -no-cnv flag disables conversation mode so the prompt is completed directly):

./build/bin/llama-cli -m model.gguf -p "I believe the meaning of life is" -no-cnv

5. Benchmark performance

llama-bench lets you measure inference performance for various parameters.

Example:

./build/bin/llama-bench -m model.gguf -t 4 -p 128 -n 0

Here -t sets the number of threads, -p 128 benchmarks prefill on a 128-token prompt, and -n 0 skips the text-generation benchmark.

Acknowledgement

This project is built atop llama.cpp. Thanks to all the contributors for their valuable work!

The LUT-based idea is inspired by T-MAC, which is primarily optimized for non-parallel scenarios (e.g., single-batch decoding).

Citation

If you find this project useful, please cite our paper:

@article{li2025veclut,
  title={Vec-LUT: Vector Table Lookup for Parallel Ultra-Low-Bit LLM Inference on Edge Devices},
  author={Li, Xiangyu and Yin, Chengyu and Wang, Weijun and Wei, Jianyu and Cao, Ting and Liu, Yunxin},
  journal={arXiv preprint arXiv:2512.06443},
  year={2025},
  url={https://arxiv.org/abs/2512.06443}
}
