Experimental FP8 KV Cache #25

createthis · 2025-12-06T17:02:26Z

Chose to use a 656-byte layout similar to VLLM's, rather than a more ggml-friendly layout, in the hope of better performance with modified tilelang and VLLM kernels.
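The description only gives the 656-byte total. As a rough sketch of how such an entry could be divided, assuming it mirrors VLLM's `fp8_ds_mla` layout as I understand it (512 FP8 latent values, one float32 scale per 128-value tile, and 64 bf16 RoPE values), the byte offsets work out as follows. All names in this snippet are hypothetical and are not identifiers from this PR. The 512 + 64 split is at least consistent with the `D=576` and `Dv=512` values in the profile lines.

```python
# Sketch of a 656-byte per-token cache entry.
# ASSUMPTION: the split below mirrors vLLM's fp8_ds_mla layout; the PR text
# only states the 656-byte total. All names are hypothetical.
NOPE_DIM = 512   # latent values stored as 1-byte FP8
ROPE_DIM = 64    # RoPE values stored as 2-byte bf16
TILE = 128       # latent values that share one float32 scale


def entry_layout():
    """Return ({field: (byte_offset, byte_size)}, total_bytes)."""
    sizes = [
        ("nope_fp8", NOPE_DIM * 1),
        ("scales_f32", (NOPE_DIM // TILE) * 4),
        ("rope_bf16", ROPE_DIM * 2),
    ]
    layout, offset = {}, 0
    for name, size in sizes:
        layout[name] = (offset, size)
        offset += size
    return layout, offset


def cache_bytes(n_tokens):
    """Bytes needed to hold n_tokens entries in one cache buffer."""
    return n_tokens * entry_layout()[1]


layout, total = entry_layout()
print(layout, total)
print(cache_bytes(163840) / 2**20, "MiB")
```

Under these assumptions, the `kv=163840` context seen in the profiles needs 102.5 MiB per cache buffer.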

This is a beautiful implementation of an FP8 MLA latent KV Cache. Unfortunately, that wasn't what I intended to build at all. What I intended to build was a llama.cpp clone of VLLM's DeepseekV32IndexerCache.

I'm still just seeing 5.6 to 5.8 tok/s because the main kernels are not actually consuming the new FP8 K cache yet.

Here it is with LLAMA_SPARSE_PROF=1 LLAMA_DEEPSEEK32_FP8_K=1 getting 5.6 tok/s:

[PROFILE_WMMA_HGRP_ONLY] TILELANG_INDEXER D=128 H=64 Tc=1 kv=163840 avg_ms=0.249 over 50 calls
[PROFILE] IDX_TILE CUDA D=128 H=64 Tc=1 kv=163840 avg_ms=0.256 over 50 calls
[PROFILE] SPARSE_TOPK_RADIX N=163840 T=1 k=256 avg_ms=0.346 over 50 calls
[PROFILE] SPARSE_MLA_DECODE D=576 Hq=128 Hkv=1 Dv=512 Nkv=163840 K=256 avg_ms=0.241 over 50 calls

And here it is with LLAMA_SPARSE_PROF=1 LLAMA_SPARSE_TOPK_TL=1 LLAMA_DEEPSEEK32_FP8_K=1 getting 5.8 tok/s:

[PROFILE_WMMA_HGRP_ONLY] TILELANG_INDEXER D=128 H=64 Tc=1 kv=163840 avg_ms=0.249 over 50 calls
[PROFILE] IDX_TILE CUDA D=128 H=64 Tc=1 kv=163840 avg_ms=0.256 over 50 calls
[PROFILE_TL_ONLY] TILELANG_TOPK N=163840 T=1 k=256 avg_ms=0.019 over 50 calls
[PROFILE] SPARSE_TOPK_RADIX N=163840 T=1 k=256 avg_ms=0.057 over 50 calls
[PROFILE] SPARSE_MLA_DECODE D=576 Hq=128 Hkv=1 Dv=512 Nkv=163840 K=256 avg_ms=0.242 over 50 calls
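To compare the two runs, the profile lines can be parsed mechanically. The following is a minimal sketch: the regular expression and the helper name `parse_profile` are my own, written against the log format shown above, and are not part of the PR. It shows that the reported `SPARSE_TOPK_RADIX` average drops by roughly 6x with LLAMA_SPARSE_TOPK_TL=1, even though throughput only moves from 5.6 to 5.8 tok/s.

```python
import re

# Matches lines such as:
# [PROFILE] SPARSE_TOPK_RADIX N=163840 T=1 k=256 avg_ms=0.346 over 50 calls
LINE_RE = re.compile(
    r"^\[(?P<tag>[A-Z0-9_]+)\]\s+(?P<kernel>[A-Z0-9_]+)\s.*?"
    r"avg_ms=(?P<avg_ms>[0-9.]+)\s+over\s+(?P<calls>\d+)\s+calls"
)


def parse_profile(text):
    """Map kernel name -> avg_ms for every profile line found in text."""
    result = {}
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            result[match["kernel"]] = float(match["avg_ms"])
    return result


# One line from each run, copied from the logs above.
radix_run = parse_profile(
    "[PROFILE] SPARSE_TOPK_RADIX N=163840 T=1 k=256 avg_ms=0.346 over 50 calls"
)
tilelang_run = parse_profile(
    "[PROFILE] SPARSE_TOPK_RADIX N=163840 T=1 k=256 avg_ms=0.057 over 50 calls"
)

speedup = radix_run["SPARSE_TOPK_RADIX"] / tilelang_run["SPARSE_TOPK_RADIX"]
print(round(speedup, 2))
```

Both averages are well under a millisecond, while a token at 5.6 tok/s takes about 179 ms, so a faster top-k on its own cannot move the overall rate much.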

And finally, because the tilelang indexer is a toy, way less than 1 tok/s with LLAMA_SPARSE_PROF=1 LLAMA_INDEXER_TL_PORT=1 LLAMA_SPARSE_TOPK_TL=1 LLAMA_DEEPSEEK32_FP8_K=1:

[PROFILE_TL_ONLY] TILELANG_INDEXER D=128 H=64 Tc=1 kv=163840 avg_ms=535.981 over 50 calls
[PROFILE] IDX_TILE CUDA D=128 H=64 Tc=1 kv=163840 avg_ms=536.282 over 50 calls
[PROFILE_TL_ONLY] TILELANG_TOPK N=163840 T=1 k=256 avg_ms=0.019 over 50 calls
[PROFILE] SPARSE_TOPK_RADIX N=163840 T=1 k=256 avg_ms=0.061 over 50 calls
[PROFILE] SPARSE_MLA_DECODE D=576 Hq=128 Hkv=1 Dv=512 Nkv=163840 K=256 avg_ms=0.240 over 50 calls
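A quick back-of-the-envelope check ties the 535.981 ms indexer time to the observed throughput. This is only a sketch: the function name is hypothetical, and the number of indexer calls per generated token is not stated in the logs, so the per-layer figure below is an assumption used for illustration.

```python
def max_tok_per_s(avg_ms, calls_per_token=1):
    """Upper bound on tok/s if each token spends at least
    calls_per_token * avg_ms milliseconds in one kernel."""
    return 1000.0 / (avg_ms * calls_per_token)


# A single 535.981 ms indexer call per token already caps throughput
# below 2 tok/s.
print(max_tok_per_s(535.981))

# ASSUMPTION: one indexer call per layer, with 61 layers. The layer count
# is not given in this PR and is used here purely as an example.
print(max_tok_per_s(535.981, calls_per_token=61))
```

If the indexer does run once per layer, the bound falls far below 1 tok/s, which matches the "way less than 1 tok/s" observation.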


createthis · 2025-12-08T03:09:42Z

Closing in favor of #26

Experimental FP8 KV Cache. Not wired into anything yet.

0c5c04c

createthis self-assigned this Dec 6, 2025

createthis added 4 commits December 6, 2025 21:31

Wired into the build, but not yet wired into DeepSeek V3.2

952cc97

More FP8 KV changes

513ea61

Flesh out get_k

8981f5a

Add a test for the fp8 kv cache

a523479

github-actions bot added the testing label Dec 7, 2025

createthis added 4 commits December 7, 2025 03:15

Test passing

49a0651

LLAMA_DEEPSEEK32_FP8_K=1 env var and wiring up.

5112464

Add the FP8 pack custom op hook and replacing the unsafe pointer‑based wiring.

e952efa

Add GGML_OP_KV_DSMLA_PACK

3f17d34

github-actions bot added ggml Nvidia GPU labels Dec 7, 2025

createthis changed the title ~~Experimental FP8 KV Cache. Not wired into anything yet.~~ Experimental FP8 KV Cache. Dec 7, 2025

createthis changed the title ~~Experimental FP8 KV Cache.~~ Experimental FP8 KV Cache Dec 7, 2025

FP8 K is inferring again.

78da439

createthis closed this Dec 8, 2025
