DeepSeek-V3.2-Exp #9

createthis · 2025-10-01T00:12:29Z

Don't merge. WIP.

instead always view the full kv_size:

- Before: returned [D_index, n_kv, ns] where n_kv could be 256 during decode with flash-attn.
- Now: always returns [D_index, kv_size, ns].
- If multiple streams are active, the code reshapes to merge the stream dimension and then returns a 3D view to keep the KV axis contiguous for the indexer:
  - Ensures [D_index, kv_size, ns] semantics regardless of stream count.

idx_compute_scores_tile) with the k_indexer_logits_tiled_f32 kernel. In the process, identified and fixed two bugs in idx_compute_scores_tile. The next step is, hopefully, to use this new test as a pattern for properly unit testing the tilelang kernel.

**File:** `ggml/src/ggml-fp8.cpp`
**Function:** `template<int E> inline float fp8_to_float(const FP8<E>& in)`

Problem: For E4M3 (E=4), GGML's generic decoder treated all 7 value bits as a finite encoding, so codes `0x7f` and `0xff` decoded to `+480` and `-480` instead of NaN. CUTLASS's `float_e4m3_t` (and the CPU helper in `tests/fp8-e4m3-cpu.h`) interpret the pattern:

- exponent = 0xF (all 4 exponent bits)
- mantissa = 0x7 (all 3 mantissa bits)

as **NaN**, independent of sign, i.e. for both `0x7f` and `0xff`.
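To make the NaN rule concrete, here is a minimal standalone sketch of an E4M3 decoder that special-cases the all-ones pattern. It assumes the usual layout of 1 sign bit, 4 exponent bits with bias 7, and 3 mantissa bits. The name `e4m3_to_float` is hypothetical and this is not the `fp8_to_float` template from `ggml/src/ggml-fp8.cpp`.

```cpp
#include <cmath>
#include <cstdint>
#include <limits>

// Illustrative decoder for one FP8 E4M3 byte (hypothetical helper, not GGML code).
// Assumed layout: 1 sign bit, 4 exponent bits (bias 7), 3 mantissa bits.
inline float e4m3_to_float(std::uint8_t bits) {
    const int sign = bits >> 7;
    const int exp  = (bits >> 3) & 0xF;
    const int man  = bits & 0x7;

    // Exponent all ones AND mantissa all ones is NaN for either sign (0x7f, 0xff).
    // Without this check the generic formula below would yield +/-480.
    if (exp == 0xF && man == 0x7) {
        return std::numeric_limits<float>::quiet_NaN();
    }

    float mag;
    if (exp == 0) {
        // Subnormal: (man / 8) * 2^-6 == man * 2^-9
        mag = std::ldexp(static_cast<float>(man), -9);
    } else {
        // Normal: (1 + man / 8) * 2^(exp - 7)
        mag = std::ldexp(1.0f + static_cast<float>(man) / 8.0f, exp - 7);
    }
    return sign ? -mag : mag;
}
```

With this rule the largest finite magnitude is `0x7e` / `0xfe`, i.e. 448, which is the constant that appears in the `amax/448` scale used elsewhere in this PR.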

GGML_CUDA_DISABLE_GRAPHS=1 \
LLAMA_SPARSE_PROF=1 \
LLAMA_SPARSE_PROF_EACH=1 \
LLAMA_INDEXER_TL_FP8_DEBUG=1 \
LLAMA_INDEXER_TL_PORT=1 \
LLAMA_TL_FP8=1 \
./scripts/debug-test.sh test-indexer-fused-op-cuda 0

passes. This compares the tilelang vendored kernel with our CPU reference kernel, `idx_compute_scores_tile`.

restores a fair bit (if not all) of the lost performance:

- The inner-most `d` loop does only `dot += qv[d] * kvp[d];`
- All FP8 work has been hoisted into the Qq/Kh precomputation loops, which are O(D * H * Tc + D * kv) instead of O(D * H * Tc * kv).
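The effect of the hoisting can be sketched on the CPU. This is only an illustration under assumed layouts (`q` as `[Tc][H][D]`, `k` as `[kv][D]`, both row-major), not the WMMA kernel itself. `Converter`, `scores_naive`, and `scores_hoisted` are hypothetical names, and the converter is a placeholder that counts calls in place of real FP8 work.

```cpp
#include <cstddef>
#include <vector>

// Placeholder for per-element FP8 work (hypothetical); it only counts calls.
struct Converter {
    std::size_t calls = 0;
    float operator()(float x) { ++calls; return x * 0.5f; }
};

// Naive form: conversion happens inside the innermost d loop.
std::vector<float> scores_naive(const std::vector<float>& q, const std::vector<float>& k,
                                std::size_t D, std::size_t H, std::size_t Tc, std::size_t kv,
                                Converter& cv) {
    std::vector<float> out(Tc * H * kv, 0.0f);
    for (std::size_t th = 0; th < Tc * H; ++th) {
        for (std::size_t j = 0; j < kv; ++j) {
            float dot = 0.0f;
            for (std::size_t d = 0; d < D; ++d) {
                dot += cv(q[th * D + d]) * cv(k[j * D + d]);
            }
            out[th * kv + j] = dot;
        }
    }
    return out;
}

// Hoisted form: convert Q and K once, then the inner loop is a plain dot product.
std::vector<float> scores_hoisted(const std::vector<float>& q, const std::vector<float>& k,
                                  std::size_t D, std::size_t H, std::size_t Tc, std::size_t kv,
                                  Converter& cv) {
    std::vector<float> qq(q.size()), kh(k.size());
    for (std::size_t i = 0; i < q.size(); ++i) qq[i] = cv(q[i]);  // D * H * Tc conversions
    for (std::size_t i = 0; i < k.size(); ++i) kh[i] = cv(k[i]);  // D * kv conversions

    std::vector<float> out(Tc * H * kv, 0.0f);
    for (std::size_t th = 0; th < Tc * H; ++th) {
        const float* qv = &qq[th * D];
        for (std::size_t j = 0; j < kv; ++j) {
            const float* kvp = &kh[j * D];
            float dot = 0.0f;
            for (std::size_t d = 0; d < D; ++d) {
                dot += qv[d] * kvp[d];
            }
            out[th * kv + j] = dot;
        }
    }
    return out;
}
```

For D=4, H=2, Tc=3, kv=5 the naive form performs 2 * D * H * Tc * kv = 240 conversions while the hoisted form performs D * H * Tc + D * kv = 44, with identical scores.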

… in the optimized path) now:

- Uses FP8 E4M3 quantization for Q and K,
- Uses the same per-row K scale `K_sf = amax/448` as the CPU reference,
- Uses `k_scale * K_sf` as the final scaling, and
- Matches the CPU `idx_compute_scores_tile` FP8 Lightning Indexer to ~3e-6 for the test shape.
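As a rough illustration of the per-row scale, the sketch below computes `amax/448` for a row and rounds the scaled values onto the E4M3 value grid. The helper names are hypothetical, and two details are assumptions rather than facts about the CPU reference: an all-zero row gets a scale of 1, and out-of-range values saturate at 448.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

constexpr float E4M3_MAX = 448.0f;  // largest finite E4M3 magnitude

// Per-row scale so the largest magnitude in the row maps to 448.
// Assumption: an all-zero row uses scale 1 to avoid dividing by zero.
inline float e4m3_row_scale(const std::vector<float>& row) {
    float amax = 0.0f;
    for (float v : row) amax = std::max(amax, std::fabs(v));
    return amax > 0.0f ? amax / E4M3_MAX : 1.0f;
}

// Round x to the nearest value on the E4M3 grid (3 mantissa bits, minimum
// normal exponent -6). Assumption: magnitudes above 448 saturate to 448.
inline float round_to_e4m3(float x) {
    if (x == 0.0f) return 0.0f;
    const float a = std::min(std::fabs(x), E4M3_MAX);
    const int e = std::max(std::ilogb(a), -6);      // subnormals share exponent -6
    const float quantum = std::ldexp(1.0f, e - 3);  // spacing between neighbours
    const float q = std::min(std::nearbyint(a / quantum) * quantum, E4M3_MAX);
    return std::copysign(q, x);
}

// Quantize then dequantize one row, returning the scale through k_sf.
inline std::vector<float> fake_quant_row(const std::vector<float>& row, float& k_sf) {
    k_sf = e4m3_row_scale(row);
    std::vector<float> out;
    out.reserve(row.size());
    for (float v : row) out.push_back(round_to_e4m3(v / k_sf) * k_sf);
    return out;
}
```

Because each dequantized K value still carries `K_sf`, any additional per-key factor can be folded into one multiplier, which is what a combined `k_scale * K_sf` final scale expresses.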

createthis self-assigned this Oct 1, 2025

github-actions bot added the python label Oct 1, 2025

github-actions bot added testing ggml Nvidia GPU labels Oct 13, 2025

github-actions bot added the examples label Oct 24, 2025

createthis added 23 commits October 24, 2025 11:47

Add cb logging for indexer_k_cache_head

782c946

More cb logs for indexer

1d5a878

arguing with sentient rocks

c761a8f

Fix crash (hopefully)

e53b2ed

Another fix

5ef5a53

Add clamp and get_k_full/get_v_full (for future use, apparently)

9bbd46e

Fix warning

d38859a

Hide sparse attention logging behind env var LLAMA_SPARSE_DEBUG for performance reasons.

3d862a2

Don't modify get_k_indexer. Create get_k_indexer_full instead.

5194bfa

Fix crash (hopefully)

73b35db

Switch to using get_k_indexer_full

472a41d

Fix crash

be3ef9e

Fix crash

04bb17f

Build and use full kq_mask for sparse attention.

0500521

Merge branch 'deepseek_v3_2_exp' of github.com:createthis/llama.cpp into deepseek_v3_2_exp

aec054d

Trying to track down why we aren't always getting the full width kq_mask.

ba4780c

Fix compile error

16a9d43

Attempt to fix crash

568c5e3

Remove rotate_activation.

6097468

Keep validity masking (causal window) in the indexer/top-k path, but remove any score-shaping bias (e.g., ALiBi) from that step. Apply ALiBi only in the final attention computation.

ef9b177

Bump LLAMA_SPARSE_TOPK default to 2048 to be in line with vllm and tilelang implementations.

48ba041

set_input_kq_mask_full_2d: restore ALiBi in the full-width mask

d2ac39e

compute_indexer_triplet: lower epsilon in K-indexer normalization

createthis added 30 commits November 16, 2025 20:52

Restore performance for PROFILE_TL_ONLY and PROFILE_TL_TMA_FP8_KONLY by ensuring dStarts and dEnds are populated.

f1964b8

Remove unused functions.

7d6fa9c

Vendor the tilelang fp8 indexer kernel. Not a port. The kernel itself, compiled to cuda c. Note that this creates a dependency on cutlass, so this may forever stay on my private branch.

f1a567f

Rename env var

53635b0

Change profile name

e0ff9d6

Wire up the tilelang lightning indexer. Something is wrong though as the unit test is failing. Working on it.

576871d

Remove unused k_tl_mqa_attn_return_logits_tma_fp8_full

1042e34

Attempt to fix test by making fp8like cpu reference and correctly doing fp8 in cuda too. Still not working though.

c46ba47

Test passes, but only with these specific settings.

11b757a

Move idx_compute_scores_tile() out of src/llama-sparse-topk.cpp into src/llama-sparse-indexer.cpp where it belongs.

2574f0b

Add test for CPU indexer. Currently failing.

c404075

Add test/test-fp8-e4m3-cutlass-vs-cpu.cpp so we can have confidence that our FP8 CPU code is correct.

f65ee12

Yoink FP8 code from ggml-org#10055

f72c04e

Wire GGML fp8 into our unit test. Currently failing.

71d3b73

Fix warnings

481e4bd

Random test changes

7c63a89

Make the FP8 behavior the only behavior.

729e044

WMMA HGRP kernel changed to use FP8 internally. Output now matches tilelang vendored kernel mathematically. Slightly slower. 4.82 tok/s vs 6.0 tok/s on my hardware.

d25442a

Restore 6.23 tok/s performance in WMMA HGRP kernel while retaining FP8 accuracy.

82ed1e6

Add LLAMA_SPARSE_PROF profiling to the idx_compute_scores_tile CPU path.

3350c3c

Remove old unused CPU graph building path from idx_compute_scores_tile.

d495b26

k_indexer_logits_wmma16_bf16 is now doing real NVIDIA FP8 E4M3 math internally and matches the CPU FP8 reference, so the H=4 WMMA path test passes

8431a4f

Remove tests/fp8-e4m3-cpu.h and all references to it.

a5ae544

Remove unused use_fp16 argument to idx_compute_scores_tile

e68b55b

Add warmup to tests/test-indexer-fused-op-cuda.cpp so we don't pollute the timings.

7289478
