[Opt] Manipulation of Q8_0 tensors with Tornado ByteArrays
#79
Conversation
Commits:
- …n in TornadoVM acceleration.
- …el loaders for consistent tensor loading. (Conflicts: src/main/java/org/beehive/gpullama3/model/loader/ModelLoader.java)
- …ray.fromSegmentShallow
- …0Byte kernels for Q8_0 matrix-vector computations
- …trix-vector computations
- …thSiLUAndGLUActivationQ8_0Byte kernels for byte-based Q8_0 computations
- … compute kernels
Pull request overview
This work-in-progress PR refactors Q8_0 quantized tensor handling to use Tornado's ByteArray type instead of separate arrays for quantized values and scales. The new approach stores Q8_0 blocks (2-byte FP16 scale + 32-byte quantized values) contiguously in ByteArrays, with new kernels that dequantize on-the-fly during computation. The changes are currently functional for Llama models, with other models still under development.
Key Changes:
- New Q8_0 kernel implementations using the ByteArray format with inline dequantization (sketched below)
- Addition of `modelType()` to the Configuration interface to distinguish FP16 vs Q8_0 models
- New activation conversion layer supporting FP16-to-FP32 and Q8_0-to-FP32 transformations
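
To make the inline dequantization concrete, here is a minimal CPU-side sketch, assuming the block layout described above (a 2-byte FP16 scale followed by 32 signed quantized bytes per 34-byte block); the method name and plain `byte[]` input are illustrative, not the PR's actual TornadoVM kernels:

```java
// Illustrative sketch of Q8_0 dequantization for the contiguous block layout
// this PR adopts; the real kernels operate on TornadoVM ByteArrays instead.
static float dequantizeQ8_0(byte[] q8Blocks, int elementIndex) {
    final int BLOCK_ELEMS = 32;
    final int BLOCK_BYTES = 2 + BLOCK_ELEMS;               // FP16 scale + quants
    int base = (elementIndex / BLOCK_ELEMS) * BLOCK_BYTES;
    // GGUF stores the FP16 scale little-endian in the first two block bytes.
    short scaleBits = (short) ((q8Blocks[base] & 0xFF) | (q8Blocks[base + 1] << 8));
    float scale = Float.float16ToFloat(scaleBits);         // Java 20+
    byte quantized = q8Blocks[base + 2 + (elementIndex % BLOCK_ELEMS)];
    return scale * quantized;
}
```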
Reviewed changes
Copilot reviewed 29 out of 29 changed files in this pull request and generated 32 comments.
| File | Description |
|---|---|
| `TransformerComputeKernelsLayered.java` | Adds new Q8_0Byte kernel variants for matrix operations with inline dequantization |
| `TransformerComputeKernels.java` | Implements conversion kernels for FP16 and Q8_0 to FP32 format |
| `Q8_0TornadoTensor.java` | Adds ByteArray constructor and factory method; removes old unpacking methods |
| `TornadoTensor.java` | Adds `asByteArray()` method for Q8_0 tensor access |
| `Configuration.java` + implementations | Adds `modelType()` method to distinguish FP16 vs Q8_0 models |
| `AbstractModelLoader.java` | Implements `readModelType()` to map GGUF file types to model type strings |
| `ModelLoader.java` | Simplifies tensor loading by removing the FP32 conversion helper |
| `State.java` + implementations | Adds `embeddingX` field and buffer allocation methods for quantized embeddings |
| `Activation.java` | Refactors to perform format conversion based on model type |
| `InferenceCore.java` | Updates token embedding copying to handle FP16 and Q8_0 formats |
| Various FFN layer files | Updates to use the new ByteArray-based kernel APIs |
| `LogitsQ8_0Layer.java` | Updates to use the new ByteArray-based kernel API |
| Various loader files | Removes `loadTornadoTensorAsFP32` usage in favor of unified loading |
Comments suppressed due to low confidence (1)
src/main/java/org/beehive/gpullama3/tensor/tornado/Q8_0TornadoTensor.java:49
- The method `getSize()` returns `size`, which will be `-1` if the tensor was created using the new `Q8_0TornadoTensor(ByteArray)` constructor. This will cause incorrect behavior for any code calling this method. The size should be calculated from the ByteArray if `tornadoNativeArray` is not null.

```java
public int getSize() {
    return size;
}
```
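
A hedged sketch of the suggested fix, assuming the 34-byte block layout used throughout this PR (2-byte FP16 scale + 32 quantized bytes covering 32 elements) and that `tornadoNativeArray` is the backing ByteArray:

```java
public int getSize() {
    // Fall back to deriving the element count from the backing ByteArray
    // when the tensor was built via the ByteArray constructor (size == -1).
    if (size == -1 && tornadoNativeArray != null) {
        int bytesPerBlock = 2 + 32;                        // FP16 scale + quants
        int numBlocks = tornadoNativeArray.getSize() / bytesPerBlock;
        return numBlocks * 32;                             // 32 elements per block
    }
    return size;
}
```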
Review threads (resolved; details collapsed on the page):
- src/main/java/org/beehive/gpullama3/tensor/tornado/Q8_0TornadoTensor.java (2 threads)
- src/main/java/org/beehive/gpullama3/inference/state/Qwen3State.java (outdated)
- src/main/java/org/beehive/gpullama3/tornadovm/kernels/TransformerComputeKernelsLayered.java (2 threads)
- src/main/java/org/beehive/gpullama3/inference/state/Phi3State.java (outdated)
- src/main/java/org/beehive/gpullama3/model/phi3/Phi3Configuration.java (outdated)
- src/main/java/org/beehive/gpullama3/tornadovm/kernels/TransformerComputeKernels.java (2 threads)
- src/main/java/org/beehive/gpullama3/inference/InferenceCore.java (outdated)
/rerun all
🚀 Workflow rerun started
✅ Workflow rerun success
… associated usages.
Force-pushed from fc1cc89 to 6b66a59.
```java
return switch (modelQuantizationAsInt) {
    case 1 -> "FP16";
    case 7 -> "Q8_0";
```
What are these magic numbers 1 & 7?
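
One hedged way to resolve the comment: the values follow the GGUF/llama.cpp file-type convention, where 1 denotes mostly-F16 and 7 mostly-Q8_0. A sketch with named constants (the constant names are illustrative, not from the PR):

```java
// GGUF file-type codes per the llama.cpp ftype convention; constant names
// are hypothetical, the values match the cases in the snippet above.
private static final int GGUF_FILE_TYPE_MOSTLY_F16 = 1;
private static final int GGUF_FILE_TYPE_MOSTLY_Q8_0 = 7;

return switch (modelQuantizationAsInt) {
    case GGUF_FILE_TYPE_MOSTLY_F16 -> "FP16";
    case GGUF_FILE_TYPE_MOSTLY_Q8_0 -> "Q8_0";
    default -> throw new IllegalArgumentException(
            "Unsupported GGUF file type: " + modelQuantizationAsInt);
};
```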
Description

This update integrates the new TornadoVM Q8_0 `ByteArray` kernels into GPULlama3, enabling a unified memory layout for quantized transformer inference. The changes replace separate scale and quantized-value arrays with a single `ByteArray` representation that matches the GGUF Q8_0 format, improving memory efficiency and performance.

Key Features:
- `ByteArray` support: integration of TornadoVM's new `ByteArray` HalfFloat methods for Q8_0 quantized weights
- `convertQ8_0toFP32()` kernel for efficient Q8_0 → FP32 dequantization
- Compute kernels that operate directly on the `ByteArray` Q8_0 format
- Direct GGUF-to-`ByteArray` mapping without intermediate conversions

Problem Description

The previous implementation required converting GGUF Q8_0 data into separate `Int8Array` (quantized values) and `HalfFloatArray` (scales) structures, adding memory overhead and conversion passes at load time. The GGUF Q8_0 format stores data as interleaved blocks (a 2-byte HalfFloat scale followed by 32 quantized bytes), which maps naturally to a single `ByteArray`.