@createthis commented Dec 6, 2025

I chose a 656-byte layout similar to VLLM's, rather than a more ggml-friendly layout, in the hope of getting better performance from modified tilelang and VLLM kernels.
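
For readers wondering where 656 bytes comes from, here is a minimal sketch of one way the number can be reached. The split below is an assumption modeled on my understanding of the VLLM FP8 MLA cache format (512 FP8 latent values, one float32 scale per 128-value tile, 64 RoPE values kept as 16-bit floats); this PR does not spell out the field order, and `split_entry` is a hypothetical helper, not a function from llama.cpp or VLLM.

```python
import struct

# ASSUMED per-token layout (not confirmed by this PR):
NOPE_DIM = 512   # latent (non-RoPE) values, stored as 1-byte FP8
TILE = 128       # number of FP8 values sharing one scale
ROPE_DIM = 64    # RoPE values, stored as 2-byte 16-bit floats

NOPE_BYTES = NOPE_DIM * 1                # 512
SCALE_BYTES = (NOPE_DIM // TILE) * 4     # 4 float32 scales -> 16
ROPE_BYTES = ROPE_DIM * 2                # 128
ENTRY_BYTES = NOPE_BYTES + SCALE_BYTES + ROPE_BYTES


def split_entry(entry: bytes):
    """Split one cache entry into (fp8_payload, scales, rope_payload).

    Hypothetical helper: it only slices bytes, it does not decode FP8.
    """
    if len(entry) != ENTRY_BYTES:
        raise ValueError(f"expected {ENTRY_BYTES} bytes, got {len(entry)}")
    fp8_payload = entry[:NOPE_BYTES]
    scales = struct.unpack("<4f", entry[NOPE_BYTES:NOPE_BYTES + SCALE_BYTES])
    rope_payload = entry[NOPE_BYTES + SCALE_BYTES:]
    return fp8_payload, scales, rope_payload


# Build a dummy entry: zeroed payloads with four recognizable scales.
dummy = bytes(NOPE_BYTES) + struct.pack("<4f", 1.0, 2.0, 0.5, 4.0) + bytes(ROPE_BYTES)
fp8_payload, scales, rope_payload = split_entry(dummy)
print(ENTRY_BYTES, scales)
```

Under this assumption the 512 + 64 split also lines up with `D=576` and `Dv=512` in the `SPARSE_MLA_DECODE` profile lines.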

This is a beautiful implementation of an FP8 MLA latent KV Cache. Unfortunately, that wasn't what I intended to build at all. What I intended to build was a llama.cpp clone of VLLM's DeepseekV32IndexerCache.

I'm still just seeing 5.6 to 5.8 tok/s because the main kernels are not actually consuming the new FP8 K cache yet.

Here it is with LLAMA_SPARSE_PROF=1 LLAMA_DEEPSEEK32_FP8_K=1 getting 5.6 tok/s:

[PROFILE_WMMA_HGRP_ONLY] TILELANG_INDEXER D=128 H=64 Tc=1 kv=163840 avg_ms=0.249 over 50 calls
[PROFILE] IDX_TILE CUDA D=128 H=64 Tc=1 kv=163840 avg_ms=0.256 over 50 calls
[PROFILE] SPARSE_TOPK_RADIX N=163840 T=1 k=256 avg_ms=0.346 over 50 calls
[PROFILE] SPARSE_MLA_DECODE D=576 Hq=128 Hkv=1 Dv=512 Nkv=163840 K=256 avg_ms=0.241 over 50 calls

And here it is with LLAMA_SPARSE_PROF=1 LLAMA_SPARSE_TOPK_TL=1 LLAMA_DEEPSEEK32_FP8_K=1 getting 5.8 tok/s:

[PROFILE_WMMA_HGRP_ONLY] TILELANG_INDEXER D=128 H=64 Tc=1 kv=163840 avg_ms=0.249 over 50 calls
[PROFILE] IDX_TILE CUDA D=128 H=64 Tc=1 kv=163840 avg_ms=0.256 over 50 calls
[PROFILE_TL_ONLY] TILELANG_TOPK N=163840 T=1 k=256 avg_ms=0.019 over 50 calls
[PROFILE] SPARSE_TOPK_RADIX N=163840 T=1 k=256 avg_ms=0.057 over 50 calls
[PROFILE] SPARSE_MLA_DECODE D=576 Hq=128 Hkv=1 Dv=512 Nkv=163840 K=256 avg_ms=0.242 over 50 calls
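
To make the comparison between the two runs concrete, here is a small sketch that parses the profile lines and compares the `SPARSE_TOPK_RADIX` timings. The regex and the `parse_profile` helper are illustrative assumptions about the log format as it appears in this comment; they are not part of llama.cpp.

```python
import re

# Matches lines such as:
# [PROFILE] SPARSE_TOPK_RADIX N=163840 T=1 k=256 avg_ms=0.346 over 50 calls
PROFILE_RE = re.compile(
    r"^\[(?P<tag>[A-Z_]+)\]\s+(?P<name>[A-Z_]+)\b.*?"
    r"avg_ms=(?P<avg_ms>\d+(?:\.\d+)?)\s+over\s+(?P<calls>\d+)\s+calls"
)


def parse_profile(text):
    """Return {(tag, kernel_name): avg_ms} for every profile line in text."""
    result = {}
    for line in text.splitlines():
        m = PROFILE_RE.match(line.strip())
        if m:
            result[(m["tag"], m["name"])] = float(m["avg_ms"])
    return result


baseline = parse_profile("""
[PROFILE] IDX_TILE CUDA D=128 H=64 Tc=1 kv=163840 avg_ms=0.256 over 50 calls
[PROFILE] SPARSE_TOPK_RADIX N=163840 T=1 k=256 avg_ms=0.346 over 50 calls
[PROFILE] SPARSE_MLA_DECODE D=576 Hq=128 Hkv=1 Dv=512 Nkv=163840 K=256 avg_ms=0.241 over 50 calls
""")

with_tl_topk = parse_profile("""
[PROFILE] IDX_TILE CUDA D=128 H=64 Tc=1 kv=163840 avg_ms=0.256 over 50 calls
[PROFILE_TL_ONLY] TILELANG_TOPK N=163840 T=1 k=256 avg_ms=0.019 over 50 calls
[PROFILE] SPARSE_TOPK_RADIX N=163840 T=1 k=256 avg_ms=0.057 over 50 calls
[PROFILE] SPARSE_MLA_DECODE D=576 Hq=128 Hkv=1 Dv=512 Nkv=163840 K=256 avg_ms=0.242 over 50 calls
""")

key = ("PROFILE", "SPARSE_TOPK_RADIX")
topk_speedup = baseline[key] / with_tl_topk[key]

# End-to-end time per token implied by the reported 5.6 and 5.8 tok/s.
per_token_ms_saved = 1000 / 5.6 - 1000 / 5.8

print(f"top-k speedup: {topk_speedup:.1f}x")
print(f"per-token time saved: {per_token_ms_saved:.2f} ms")
```

The top-k stage gets roughly six times faster per call, while the indexer and decode timings are essentially unchanged between the two runs.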

And finally, because the tilelang indexer is a toy, it gets way less than 1 tok/s with LLAMA_SPARSE_PROF=1 LLAMA_INDEXER_TL_PORT=1 LLAMA_SPARSE_TOPK_TL=1 LLAMA_DEEPSEEK32_FP8_K=1:

[PROFILE_TL_ONLY] TILELANG_INDEXER D=128 H=64 Tc=1 kv=163840 avg_ms=535.981 over 50 calls
[PROFILE] IDX_TILE CUDA D=128 H=64 Tc=1 kv=163840 avg_ms=536.282 over 50 calls
[PROFILE_TL_ONLY] TILELANG_TOPK N=163840 T=1 k=256 avg_ms=0.019 over 50 calls
[PROFILE] SPARSE_TOPK_RADIX N=163840 T=1 k=256 avg_ms=0.061 over 50 calls
[PROFILE] SPARSE_MLA_DECODE D=576 Hq=128 Hkv=1 Dv=512 Nkv=163840 K=256 avg_ms=0.240 over 50 calls
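
A quick back-of-the-envelope check shows why this configuration lands below 1 tok/s. The timings are copied from the profile lines in this comment; `calls_per_token` is a hypothetical parameter, since the comment does not say how many indexer calls happen per generated token.

```python
# Timings (ms per call) copied from the profile output in this comment.
tl_indexer_ms = 535.981    # [PROFILE_TL_ONLY] TILELANG_INDEXER
idx_tile_ms = 536.282      # [PROFILE] IDX_TILE CUDA
wmma_indexer_ms = 0.249    # [PROFILE_WMMA_HGRP_ONLY] TILELANG_INDEXER (earlier runs)

# How much slower the tilelang indexer port is per call.
slowdown = tl_indexer_ms / wmma_indexer_ms


def max_tok_per_s(calls_per_token):
    """Upper bound on throughput if only IDX_TILE time counted.

    calls_per_token is an ASSUMPTION (not stated in the PR).
    """
    return 1000.0 / (idx_tile_ms * calls_per_token)


print(f"slowdown per call: about {slowdown:.0f}x")
for calls in (1, 2, 4):
    print(f"{calls} call(s) per token -> at most {max_tok_per_s(calls):.2f} tok/s")
```

Even a single indexer call per token would cap throughput below 2 tok/s, and two or more calls per token already push the bound under 1 tok/s.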

@createthis self-assigned this Dec 6, 2025
@createthis changed the title from "Experimental FP8 KV Cache. Not wired into anything yet." to "Experimental FP8 KV Cache." Dec 7, 2025
@createthis changed the title from "Experimental FP8 KV Cache." to "Experimental FP8 KV Cache" Dec 7, 2025
@createthis

Closing in favor of #26

@createthis closed this Dec 8, 2025