@orionpapadakis (Collaborator) commented Dec 4, 2025

Description

This update integrates the new TornadoVM Q8_0 ByteArray kernels into GPULlama3, enabling a unified memory layout for quantized transformer inference. The changes replace the separate scale and quantized-value arrays with a single ByteArray representation that matches the GGUF Q8_0 format, improving both memory efficiency and performance.

Key Features:

  • Unified Q8_0 ByteArray Support: Integration of TornadoVM's new ByteArray HalfFloat methods for Q8_0 quantized weights
  • Q8_0 Conversion Kernels: New convertQ8_0toFP32() kernel for efficient Q8_0 → FP32 dequantization
  • Matrix-Vector Q8_0 Kernels: Updated transformer compute kernels supporting ByteArray Q8_0 format
  • Memory Layout Optimization: Direct GGUF → ByteArray mapping without intermediate conversions

Problem Description

The previous implementation required converting GGUF Q8_0 data into separate Int8Array (quantized values) and HalfFloatArray (scales) structures, causing:

  1. A performance penalty during model loading from the data transformation, plus the cost of accessing two arrays per tensor at compute time
  2. Added complexity in managing multiple array types for a single logical tensor

The GGUF Q8_0 format stores data as interleaved blocks (2-byte HalfFloat scale + 32 quantized bytes), which maps naturally to a single ByteArray.
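
To make the layout concrete, here is a minimal scalar sketch of dequantizing one such block (a plain byte[] stands in for TornadoVM's ByteArray, and the names are illustrative rather than the PR's actual convertQ8_0toFP32() kernel):

    // Q8_0 layout: consecutive 34-byte blocks, each holding a 2-byte FP16 scale
    // followed by 32 int8 quantized values.
    static final int QK8_0 = 32;                  // quantized values per block
    static final int BYTES_PER_BLOCK = 2 + QK8_0; // 34 bytes

    // Dequantize block `b` of `q8` into out[b*32 .. b*32+31].
    static void dequantizeBlock(byte[] q8, int b, float[] out) {
        int base = b * BYTES_PER_BLOCK;
        // GGUF is little-endian: assemble the FP16 scale from the first two bytes.
        short bits = (short) ((q8[base] & 0xFF) | ((q8[base + 1] & 0xFF) << 8));
        float scale = Float.float16ToFloat(bits); // requires Java 20+
        for (int i = 0; i < QK8_0; i++) {
            out[b * QK8_0 + i] = q8[base + 2 + i] * scale;
        }
    }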

mikepapadim and others added 25 commits December 4, 2025 19:44
Copilot finished reviewing on behalf of mikepapadim December 5, 2025 12:47
Copilot AI (Contributor) left a comment

Pull request overview

This work-in-progress PR refactors Q8_0 quantized tensor handling to use Tornado's ByteArray type instead of separate arrays for quantized values and scales. The new approach stores Q8_0 blocks (2-byte FP16 scale + 32-byte quantized values) contiguously in ByteArrays, with new kernels that dequantize on-the-fly during computation. The changes are currently functional for Llama models, with other models still under development.

Key Changes:

  • New Q8_0 kernel implementations using ByteArray format with inline dequantization
  • Addition of modelType() to the Configuration interface to distinguish FP16 vs Q8_0 models
  • New activation conversion layer supporting FP16-to-FP32 and Q8_0-to-FP32 transformations
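
As a rough illustration of the inline dequantization these kernels perform, here is a scalar sketch of one output element of a Q8_0 matrix-vector product (hypothetical helper with plain arrays; the actual kernels operate on TornadoVM ByteArrays in parallel):

    // Dot product of row `row` of a Q8_0 weight matrix with an FP32 vector x.
    // Each row holds x.length/32 blocks of 34 bytes (2-byte FP16 scale + 32 int8 values).
    static float matVecRowQ8_0(byte[] w, int row, float[] x) {
        int blocksPerRow = x.length / 32;
        int rowBase = row * blocksPerRow * 34;
        float acc = 0f;
        for (int b = 0; b < blocksPerRow; b++) {
            int base = rowBase + b * 34;
            short bits = (short) ((w[base] & 0xFF) | ((w[base + 1] & 0xFF) << 8));
            float scale = Float.float16ToFloat(bits);
            float sum = 0f;
            for (int i = 0; i < 32; i++) {
                sum += w[base + 2 + i] * x[b * 32 + i]; // int8 value used directly
            }
            acc += scale * sum; // apply the per-block scale once per block
        }
        return acc;
    }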

Reviewed changes

Copilot reviewed 29 out of 29 changed files in this pull request and generated 32 comments.

Summary per file:

  • TransformerComputeKernelsLayered.java: Adds new Q8_0Byte kernel variants for matrix operations with inline dequantization
  • TransformerComputeKernels.java: Implements conversion kernels for FP16 and Q8_0 to FP32 format
  • Q8_0TornadoTensor.java: Adds ByteArray constructor and factory method; removes old unpacking methods
  • TornadoTensor.java: Adds asByteArray() method for Q8_0 tensor access
  • Configuration.java + implementations: Adds modelType() method to distinguish FP16 vs Q8_0 models
  • AbstractModelLoader.java: Implements readModelType() to map GGUF file types to model type strings
  • ModelLoader.java: Simplifies tensor loading by removing the FP32 conversion helper
  • State.java + implementations: Adds embeddingX field and buffer allocation methods for quantized embeddings
  • Activation.java: Refactors to perform format conversion based on model type
  • InferenceCore.java: Updates token embedding copying to handle FP16 and Q8_0 formats
  • Various FFN layer files: Updates to use the new ByteArray-based kernel APIs
  • LogitsQ8_0Layer.java: Updates to use the new ByteArray-based kernel API
  • Various loader files: Removes loadTornadoTensorAsFP32 usage in favor of unified loading
Comments suppressed due to low confidence (1)

src/main/java/org/beehive/gpullama3/tensor/tornado/Q8_0TornadoTensor.java:49

  • The method getSize() returns size, which will be -1 if the tensor was created with the new Q8_0TornadoTensor(ByteArray) constructor. This causes incorrect behavior for any code that calls the method. The size should be calculated from the ByteArray when tornadoNativeArray is not null.
    public int getSize() {
        return size;
    }
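
A possible fix along the lines suggested above, assuming 34-byte blocks of 32 logical elements each and a backing ByteArray that reports its length in bytes (a sketch, not the merged code):

    public int getSize() {
        if (size == -1 && tornadoNativeArray != null) {
            // Derive the logical element count from the raw Q8_0 byte length:
            // each 34-byte block (2-byte FP16 scale + 32 int8 values) holds 32 elements.
            int numBlocks = asByteArray().getSize() / 34;
            return numBlocks * 32;
        }
        return size;
    }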


@mikepapadim (Member) commented

/rerun all

github-actions bot commented Dec 5, 2025

🚀 Workflow rerun started

Mode: all
Triggered by: @mikepapadim

View Actions

github-actions bot commented Dec 5, 2025

Workflow rerun success

View Actions

@orionpapadakis orionpapadakis changed the title [WIP] Manipulation of Q8_0 tensors with Tornado ByteArrays [Opt] Manipulation of Q8_0 tensors with Tornado ByteArrays Dec 8, 2025
Comment on lines +40 to +42
return switch (modelQuantizationAsInt) {
case 1 -> "FP16";
case 7 -> "Q8_0";
A Member commented:

what are these magic numbers 1 & 7?
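
For context, these values appear to be GGUF general.file_type IDs as defined by llama.cpp's llama_ftype enumeration (MOSTLY_F16 = 1, MOSTLY_Q8_0 = 7). A sketch of how named constants could document that (hypothetical constant names, with an added default case for completeness):

    // GGUF general.file_type IDs (per the llama.cpp llama_ftype enumeration).
    private static final int GGUF_FTYPE_MOSTLY_F16 = 1;
    private static final int GGUF_FTYPE_MOSTLY_Q8_0 = 7;

    return switch (modelQuantizationAsInt) {
        case GGUF_FTYPE_MOSTLY_F16 -> "FP16";
        case GGUF_FTYPE_MOSTLY_Q8_0 -> "Q8_0";
        default -> throw new IllegalArgumentException("Unsupported GGUF file type: " + modelQuantizationAsInt);
    };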

@mikepapadim mikepapadim merged commit edc8fac into beehive-lab:main Dec 8, 2025
4 checks passed
