
Conversation

@cacaview

Make sure to read the contributing guidelines before submitting a PR
This is the current progress:
#16930 (comment)

cacaview and others added 5 commits November 28, 2025 23:42
- Implement KDA layer (linear attention with gates and decay)
- Implement MLA layer (multi-head latent attention with KV compression)
- Support MoE FFN with shared experts
- Add TikToken tokenizer support for Kimi models
- Fix vocab loading for large vocabularies
- Model loads and runs inference (27 layers, 603 tensors)
- Add missing MoE metadata to GGUF conversion:
  - moe_intermediate_size (1024)
  - num_shared_experts (1)
  - first_k_dense_replace (1)
  - routed_scaling_factor (2.446)
  - expert_gating_func (sigmoid)
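For reference, below is a minimal NumPy sketch of how sigmoid-gated MoE routing with renormalization and a routed scaling factor generally works, using the metadata values above where they apply. The expert count, hidden size, and top-k are made-up values, and this is an illustration of the technique, not the ggml/llama.cpp implementation.

```python
# Hypothetical sketch of sigmoid expert gating with renormalization and a
# routed scaling factor. Not the build_moe_ffn code; shapes are illustrative.
import numpy as np

def route_experts(hidden, gate_w, top_k=8, scale=2.446, renormalize=True):
    logits = hidden @ gate_w                   # [n_expert] router logits
    scores = 1.0 / (1.0 + np.exp(-logits))     # expert_gating_func = sigmoid (not softmax)
    top = np.argsort(scores)[-top_k:]          # indices of the top-k experts
    weights = scores[top]
    if renormalize:                            # corresponds to norm_w=true
        weights = weights / weights.sum()
    return top, weights * scale                # routed_scaling_factor applied to routed experts

rng = np.random.default_rng(0)
hidden = rng.standard_normal(2048, dtype=np.float32)        # hidden size is made up
gate_w = rng.standard_normal((2048, 256), dtype=np.float32) # expert count is made up
experts, weights = route_experts(hidden, gate_w)
# The single shared expert (num_shared_experts = 1) would run on every token,
# with its output added to the weighted sum of the routed experts' outputs.
```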

- Fix MoE gating function default to SIGMOID (was SOFTMAX)
- Add expert_weights_scale loading with default 2.446
- Enable moe_renormalize (norm_w=true) in build_moe_ffn
- Add fallback for exp_probs_b tensor suffix compatibility
- Add KDA (Kimi Delta Attention) CUDA kernel (kda-scan.cu)
- Fix recurrence order: decay first, then retrieval
- Verify CPU/CUDA implementation consistency
- Support head_dim=128, L2 normalization for Q/K
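The recurrence-order fix ("decay first, then retrieval") refers to the order of operations inside one step of a gated delta-rule scan. Below is a minimal NumPy sketch of such a step, assuming KDA follows the usual gated delta-rule pattern; it is illustrative only and not the kda-scan.cu kernel.

```python
# Hypothetical sketch of one gated delta-rule step in the order described above:
# decay the recurrent state first, then retrieve, with L2-normalized Q/K and
# head_dim = 128. Illustrative only; not the CUDA kernel.
import numpy as np

def l2norm(x, eps=1e-6):
    return x / (np.linalg.norm(x) + eps)

def kda_step(S, q, k, v, g, beta):
    """S: [d_k, d_v] recurrent state; g: per-key decay in (0, 1); beta: write gate."""
    q, k = l2norm(q), l2norm(k)                  # L2 normalization for Q/K
    S = S * g[:, None]                           # 1) apply decay to the state first
    retrieved = S.T @ k                          # 2) then retrieve the stored value for k
    S = S + np.outer(k, beta * (v - retrieved))  # delta-rule correction toward v
    return S, S.T @ q                            # per-step output

d_k = d_v = 128                                  # head_dim = 128
rng = np.random.default_rng(0)
S = np.zeros((d_k, d_v), dtype=np.float32)
q, k, v = (rng.standard_normal(d, dtype=np.float32) for d in (d_k, d_k, d_v))
S, out = kda_step(S, q, k, v, g=np.full(d_k, 0.9, dtype=np.float32), beta=0.5)
```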
@github-actions github-actions bot added labels on Nov 29, 2025: model (Model specific), Nvidia GPU (Issues specific to Nvidia GPUs), python (python script changes), ggml (changes relating to the ggml tensor library for machine learning)
Comment on lines +2729 to 2733
# KimiLinearModel is defined later in this file (line ~5140) as a TextModel subclass
# This old definition has been removed to avoid conflicts


@ModelBase.register(
Collaborator

Suggested change
# KimiLinearModel is defined later in this file (line ~5140) as a TextModel subclass
# This old definition has been removed to avoid conflicts
@ModelBase.register(
@ModelBase.register(

(self.format_tensor_name(gguf.MODEL_TENSOR.ATTN_Q, bid), q),
(self.format_tensor_name(gguf.MODEL_TENSOR.ATTN_K, bid), k),
(self.format_tensor_name(gguf.MODEL_TENSOR.ATTN_V, bid), v),
]
Collaborator

Suggested change
]
]
else:
return [(self.map_tensor_name(name), data_torch)]

@ModelBase.register("KimiLinearModel", "KimiLinearForCausalLM")
class KimiLinearModel(TextModel):
"""Kimi-Linear model with hybrid MLA+KDA architecture"""
model_arch = gguf.MODEL_ARCH.KIMI
Collaborator

Suggested change
model_arch = gguf.MODEL_ARCH.KIMI
model_arch = gguf.MODEL_ARCH.KIMI_LINEAR

_experts: list[dict[str, Tensor]] | None = None

def set_gguf_parameters(self):
self.gguf_writer.add_vocab_size(self.hparams["vocab_size"])
Collaborator

Suggested change
self.gguf_writer.add_vocab_size(self.hparams["vocab_size"])
super().set_gguf_parameters()
self.gguf_writer.add_vocab_size(self.hparams["vocab_size"])

Comment on lines +5131 to +5139
# Use find_hparam for context length
# Kimi uses model_max_length
n_ctx = self.find_hparam(["max_position_embeddings", "model_max_length", "n_ctx", "n_positions"], optional=True)
if n_ctx is not None:
    self.gguf_writer.add_context_length(n_ctx)
else:
    # Default to 4096 if not found
    logger.warning("No context length found in config, defaulting to 4096")
    self.gguf_writer.add_context_length(4096)
Collaborator

Add model_max_length to TextModel.set_gguf_parameters instead; the fallback is not necessary.
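A rough sketch of what that suggestion could look like, assuming the context length in TextModel.set_gguf_parameters is resolved via find_hparam as in the snippet above (hypothetical placement, not the final patch):

```python
# Hypothetical sketch, inside TextModel.set_gguf_parameters: also check Kimi's
# "model_max_length" when resolving the context length, so KimiLinearModel no
# longer needs its own 4096 fallback.
n_ctx = self.find_hparam(
    ["max_position_embeddings", "model_max_length", "n_ctx", "n_positions"],
    optional=True,
)
if n_ctx is not None:
    self.gguf_writer.add_context_length(n_ctx)
```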

@cacaview
Author

I have fixed these errors in the commit at cacaview@780dd78

@CISC
Collaborator

CISC commented Nov 30, 2025

> I have fixed these errors in the commit at cacaview@780dd78

Please address the remaining unresolved ones as well.

@cacaview
Author

cacaview commented Dec 1, 2025

I conducted some simple tests and encountered some issues. The root causes are still unclear.

Test Environment

  • Model: E:\llama\Kimi-Linear-48B-A3B-Instruct\Kimi-Linear-48B-A3B-Instruct-Q4_K_M.gguf

  • Quantization: Q4_K_M (27.65 GiB, 4.83 BPW)

  • GPU: NVIDIA GeForce RTX 4070 (12GB VRAM)

  • Parameters: --temp 0 -ngl 99 -no-cnv --no-warmup

  • Prompt Format: user: <question>\nassistant:

Test Logs

Test 1: Simple Greeting


user: Hello!

  

assistant: Hello! How can I help you today? [end of text]

  

common_perf_print: prompt eval time =   10517.48 ms /    23 tokens (  457.28 ms per token,     2.19 tokens per second)

common_perf_print:        eval time =    8790.42 ms /     9 runs   (  976.71 ms per token,     1.02 tokens per second)

common_perf_print:        total time =   19359.92 ms /    32 tokens

Test 2: Simple Math


user: What is 25 + 37?

  

assistant: Let's add the numbers step by step:

- 25 + 25 = 50

  

So the answer is **50**. [end of text]

  

common_perf_print: prompt eval time =    1644.97 ms /    33 tokens (   49.85 ms per token,    20.06 tokens per second)

common_perf_print:        eval time =    4486.75 ms /    25 runs   (  179.47 ms per token,     5.57 tokens per second)

common_perf_print:        total time =    6195.25 ms /    58 tokens

Incorrect calculation: 25 + 37 = 62, not 50.

Test 3: Knowledge Q&A


user: What is the capital of France?

  

assistant: The capital of France is Paris. [end of text]

  

common_perf_print: prompt eval time =    1561.97 ms /    47 tokens (   33.23 ms per token,    30.09 tokens per second)

common_perf_print:        eval time =    1179.26 ms /     7 runs   (  168.47 ms per token,     5.94 tokens per second)

common_perf_print:        total time =    2761.82 ms /    54 tokens

Test 4: Chinese Test


user: 请用中文介绍一下你自己

  

assistant: I see that you've copied the entire text. This is a very important message. I will now...

Chinese input encountered encoding issues in PowerShell, and the model failed to process Chinese correctly.

Test 5: Code Generation


user: Write a Python function to calculate factorial

  

assistant: Here's a simple and efficient implementation using recursion:

  

```python
def factorial(n):
    if n == 0:
        return 1
    else:
        return n * factorial(n - 1)

# Example usage
print(factorial(5))  # Output: 120
```

  

---

### Explanation:

- **Base Case**: The function checks if `n` is 0, in which case it returns 1.

- **Recursive Case**: It calls itself with `n-1` to compute the factorial.

  

This is a classic example of a recursive function. [end of text]

  

common_perf_print: prompt eval time =    2516.11 ms /    63 tokens (   39.94 ms per token,    25.04 tokens per second)

common_perf_print:        eval time =   20525.99 ms /   117 runs   (  175.44 ms per token,     5.70 tokens per second)

common_perf_print:        total time =   23294.02 ms /   180 tokens

Test 6: Concept Explanation (Repetitive Output)


user: Explain the concept of recursion in programming with an example.

  

assistant:# The user is asking for an explanation of recursion in programming. The user is asking for a simple explanation of recursion.

  

The user is asking: "What is recursion?" or "What is recursion?" or "What is recursion?" or "What is recursion?" or " " " " " " " " " " " " " " " " " " " "

Severe repetitive output issue occurred.

Test 7: Logical Reasoning


user: If all roses are flowers and some flowers fade quickly, can we conclude that some roses fade quickly?

  

assistant: Yes, we can conclude that some roses fade quickly. Let's break it down step by step.

  

1. **Understanding the Problem**:  

  

   - **Premise**: All roses are flowers.  

  

   - **Conclusion**: Some roses are flowers.  

  

2. **Logical Form**:  

  

   - **Premise**: All roses are flowers.  

  

   - **Conclusion**: Therefore, roses are flowers.  

  

3. **Alternative Approach**:  

  

   - **Logical Form**:  

  

     - **

  

common_perf_print: prompt eval time =    4170.31 ms /   118 tokens (   35.34 ms per token,    28.30 tokens per second)

common_perf_print:        eval time =   16314.01 ms /    99 runs   (  164.79 ms per token,     6.07 tokens per second)

common_perf_print:        total time =   20693.48 ms /   217 tokens

Incorrect logical reasoning. The correct answer should be "Cannot be determined".

@engrtipusultan

> I conducted some simple tests and encountered some issues. The root causes are still unclear.

@CISC is this a valid method to check the correctness of the model implementation?
Also, for llama.cpp, even when the following parameters are used, don't the default top-p and top-k values add variance to the responses?
Parameters: `--temp 0 -ngl 99 -no-cnv --no-warmup`
