Per-row Quantization #832
Replies: 4 comments
-
As this is not an issue, I took the liberty to convert the question to a discussion. Can you be more specific?
-
Given this information, I would like to ask whether this project has tried to add vector-wise quantization blocks as leveraged by most quantization methods targeting NVIDIA GPUs. If so, was there any speed overhead from the fine-grained group size in CPU SIMD inference?

Most quantization papers that quantize both weights and activations to reduce memory demand and increase parallel computing throughput, such as QuaRot, SpinQuant, and QServe, target NVIDIA GPU hardware. As far as I know, these methods use per-channel quantization for weights and per-token quantization for activations. The issue that arises on NVIDIA GPUs when weights and activations are quantized group-wise, as in llama.cpp, is that dequantizing the partial sums has to be executed on CUDA cores, which is much slower than the Tensor cores, so the maximum computing performance is underutilized. However, llama.cpp quantizes both weights and activations group-wise. So I was wondering whether the inference overhead from dequantizing each partial sum in CPU SIMD instructions is less severe than it is on NVIDIA GPUs.

I asked ChatGPT why group-wise quantization is used on the CPU, and its answer was that cache access alignment matters for CPU SIMD instructions, and that implementing per-channel (per-token) quantization might break that alignment. The reason I am asking you specifically is that in an open discussion in the llama.cpp GitHub repository in January 2024, you said that you had implemented per-row quantization in your modified version of llama.cpp. So I would like to ask whether there was an actual speed improvement with per-row quantization compared to the preexisting group-wise quantization. Thanks in advance.
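For concreteness, here is a minimal sketch (illustrative C only, not code from llama.cpp or from the papers above) of the difference between group-wise int8 quantization with one scale per 32 weights and per-row (per-channel) quantization with a single scale for the whole row:

```c
#include <math.h>
#include <stdint.h>

// Illustrative sketch only: 8-bit quantization of one weight row of size n,
// either with one float scale per group of 32 values ("group-wise", as in
// llama.cpp) or with a single float scale for the whole row ("per-row").

static float absmax(const float *x, int n) {
    float m = 0.0f;
    for (int i = 0; i < n; ++i) {
        float a = fabsf(x[i]);
        if (a > m) m = a;
    }
    return m;
}

// Group-wise: one float scale per 32 weights (n assumed to be a multiple of 32).
void quantize_groupwise(const float *x, int8_t *q, float *scales, int n) {
    const int group = 32;
    for (int g = 0; g < n / group; ++g) {
        float s = absmax(x + g * group, group) / 127.0f;
        scales[g] = s;
        for (int i = 0; i < group; ++i) {
            q[g * group + i] = (int8_t) roundf(x[g * group + i] / (s > 0.0f ? s : 1.0f));
        }
    }
}

// Per-row: a single float scale amortized over the whole row.
void quantize_per_row(const float *x, int8_t *q, float *scale, int n) {
    float s = absmax(x, n) / 127.0f;
    *scale = s;
    for (int i = 0; i < n; ++i) {
        q[i] = (int8_t) roundf(x[i] / (s > 0.0f ? s : 1.0f));
    }
}

int main(void) {
    float x[64];
    for (int i = 0; i < 64; ++i) x[i] = (float) (i - 32) / 7.0f;

    int8_t q[64];
    float  group_scales[64 / 32];
    float  row_scale;

    quantize_groupwise(x, q, group_scales, 64);
    quantize_per_row  (x, q, &row_scale,   64);
    return 0;
}
```

The quantized int8 data is the same in both cases; what changes is how many float scales accompany it, and therefore how often partial sums have to be rescaled during the dot products.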
-
So, there are several quantization types in this project.

In CUDA, token generation performance is heavily memory bound, so performance is mainly determined by the number of bits per weight and is not really dependent on block size. CUDA prompt processing performance (a.k.a. prefill) is mostly determined by block size: quants using blocks of 32 tend to be 10-30% faster than quants using blocks of 16. This is easily understandable, given the type of SIMD instructions we have available.

CPU token generation speed is even more heavily memory bound, so performance is again determined by the number of bits per weight. But because the CPU does not have as much computing power as a GPU, there is also some dependence on the complexity of the bit packing. CPU prompt processing speed is, to first order, independent of the quantization type. At least this is the case here.

Usage of a per-row float scale has a negligible impact on performance compared to super-blocks. The main advantage is that we can save 0.0625 or 0.125 bits per model weight, thus having a slightly smaller quantized model size.

Does this answer your question?
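To make the arithmetic behind those numbers concrete (this is one reading, assuming the usual super-block size of 256 weights): a 16-bit float scale stored once per super-block costs 16/256 = 0.0625 bits per weight, and a 32-bit one costs 32/256 = 0.125 bits per weight, whereas the same scale stored once per row is amortized over thousands of weights:

```c
#include <stdio.h>

int main(void) {
    // Back-of-the-envelope check (assumes a super-block of 256 weights and,
    // for the per-row case, a hypothetical row length of 4096 weights).
    const double super_block = 256.0;
    const double row_len     = 4096.0;

    printf("fp16 scale per super-block: %.4f bits/weight\n", 16.0 / super_block); // 0.0625
    printf("fp32 scale per super-block: %.4f bits/weight\n", 32.0 / super_block); // 0.1250
    printf("fp32 scale per row        : %.4f bits/weight\n", 32.0 / row_len);     // ~0.0078
    return 0;
}
```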
-
Usage of a per-row float scale has a negligible impact on performance compared to super-blocks. The main advantage is that we can save 0.0625 or 0.125 bits per model weight, thus having a slightly smaller quantized model size.

Yes, this answered my question. If possible, may I ask about the underlying reason behind this?
-
Hi, I'm curious about the quantization in llama.cpp. In the struct data types used for quantization in llama.cpp, such as q4_0 (sketched below), this project has limited the quantization block size to at most 256. In addition, I've found that Int8 SIMD instructions are utilized to accelerate inference.

Given this information, I would like to ask whether this project has tried to add vector-wise quantization blocks as leveraged by most quantization methods targeting NVIDIA GPUs. If so, was there any speed overhead from the fine-grained group size in CPU SIMD inference?
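For reference, a legacy quantization block looks roughly like the sketch below (modeled on ggml's block_q4_0; the exact field names and fp16 type may differ between versions), which is where the per-block scale discussed in this thread lives:

```c
#include <stdint.h>

// Simplified sketch of a legacy quantization block (modeled on ggml's block_q4_0,
// not the verbatim definition). 32 weights share one fp16 scale and are packed
// as 4-bit values, two per byte.
#define QK4_0 32

typedef struct {
    uint16_t d;             // per-block scale, stored as fp16 bits
    uint8_t  qs[QK4_0 / 2]; // 32 x 4-bit quantized weights (two nibbles per byte)
} block_q4_0_sketch;

// The k-quants extend this to super-blocks of 256 weights, with small integer
// scales per sub-block plus a floating-point scale per super-block; this is the
// block size of "at most 256" mentioned above.
```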