Implement SparseK Attention mechanism — new GGML operator with CPU backend (GPU planned next) #16817
Conversation
|
Hi @CISC and @NeoZhangJianyu, we’d appreciate it if you could review our PR implementing the new SparseK Attention operator. This contribution was developed jointly by both of us (@yael-works and @GittyBurstein). Thanks in advance for your time and feedback! |
|
We are talking about this SparseK, right? |
|
yes! @CISC |
|
You need to rebase to fix the Server CI failures; also, please fix the whitespace issues: |
|
Hi @CISC, I’d really appreciate it if you could review the code itself so we can move forward with the merge — Thanks! |
Yes, as mentioned, it will be resolved if you rebase; it's ok. :)
So, my main challenge is where/what/when will SparseK be used? I can't recall seeing any actual implementation being used in the wild. This also means we don't really have any reference to test it against... |
|
@CISC Once this PR is merged, the operator can be connected to higher-level use cases such as:
Thank you!! |
|
I think @ggerganov will have to weigh in on this. |
|
Sparse attention implementations such as DSA and SparseK should leverage the existing FA implementations and mask filtering logic. No need to introduce new operators and duplicate all the existing work that already went into optimizing FA. |
Force-pushed from 77f4088 to 22c063e
|
Hi @ggerganov and @CISC, |
Force-pushed from 16d7eee to 556ab36
|
Hi @ggerganov and @CISC, |
ggerganov left a comment:
My idea was more along the following lines:
- Sparse attention implementations should somehow compute a sparse KQ mask. Depending on the specifics (e.g. local windows, top-k product, deepseek lightning stuff, etc.) this can be done in different ways, but generally it should require some extra logic when constructing the compute graph
- Then we pass the sparse KQ mask (i.e. a normal mask but with extra -INF values where we don't have to compute the attention) to ggml_flash_attn_ext and we delegate the filtering logic to the backend implementation. For example, the Metal backend will already skip a large amount of the filtered values depending on the KQ mask contents (#16372). Similar or better logic can be added to the other backend implementations.
I think that, at most, the only change to the existing ggml_flash_attn_ext API would be to provide a "mask hint" that would inform the backend what kind of mask to expect (causal, sparse, etc.). And the rest of the changes should be at the compute graph level and at the backend implementation for filtering the -INF values. Let me know if this makes sense.
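To make the idea concrete, here is a rough host-side sketch (not code from this PR): the sparsity lives entirely in the mask contents, so the existing ggml_flash_attn_ext call stays exactly as it is. The helper name, the window-only rule and the F32 mask layout are assumptions for the example.

```cpp
// Hypothetical sketch: write extra -INF values into an existing KQ mask and
// then hand the refined mask to the unchanged ggml_flash_attn_ext call.
// The window-only rule, F32 layout and position indexing are assumptions.
#include <cmath>
#include <cstdint>

// mask holds one row of n_kv entries per query token: 0.0f = attend, -INF = skip.
static void sparsify_kq_mask(float * mask, int64_t n_tokens, int64_t n_kv, int64_t window) {
    for (int64_t iq = 0; iq < n_tokens; ++iq) {
        for (int64_t ik = 0; ik < n_kv; ++ik) {
            const int64_t d = ik - iq;
            if (d < -window || d > window) {
                mask[iq*n_kv + ik] = -INFINITY; // extra filtering on top of the causal mask
            }
        }
    }
}

// Later, the mask tensor built from this buffer is passed as-is, e.g.:
//   ggml_flash_attn_ext(ctx, q, k, v, kq_mask, scale, max_bias, logit_softcap);
```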
|
@ggerganov And if that’s the case, where exactly should the mask implementation be added — inside the compute graph logic, or only for testing (e.g., in test-backend-ops)? |
In llama.cpp, the mask is already being created and passed here: llama.cpp/src/llama-kv-cache.cpp, lines 1223 to 1306 (afd3532).
I think that the sparse attention implementations should augment this static mask through some extra logic. This extra logic should be implemented, for example, in the mask-building code linked above. From there, the FA implementations will deal with the provided mask in their own way (i.e. by skipping computations when possible).
For testing, you can already take a look at how we create KQ masks with blocks of -INF values here: llama.cpp/tests/test-backend-ops.cpp, lines 134 to 176 (afd3532).
I imagine that we would need tests that create various sorts of sparse masks and simply run ggml_flash_attn_ext with them. |
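As an illustration of what such a test input could look like, here is a small standalone generator for a sparse KQ mask with random blocks of -INF; the function name, block size and drop probability are made up for the example, and this is not the test-backend-ops code.

```cpp
// Illustrative generator for a sparse KQ mask used as test input: random
// rectangular blocks are set to -INF, everything else stays 0. The block
// size and drop probability are arbitrary choices for this example.
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <random>
#include <vector>

static std::vector<float> make_sparse_kq_mask(int64_t n_tokens, int64_t n_kv,
                                              float block_prob, std::mt19937 & rng) {
    std::vector<float> mask((size_t)(n_tokens*n_kv), 0.0f);
    std::bernoulli_distribution drop(block_prob);
    const int64_t bs = 8; // block size
    for (int64_t bq = 0; bq < n_tokens; bq += bs) {
        for (int64_t bk = 0; bk < n_kv; bk += bs) {
            if (!drop(rng)) continue;
            for (int64_t iq = bq; iq < std::min(bq + bs, n_tokens); ++iq) {
                for (int64_t ik = bk; ik < std::min(bk + bs, n_kv); ++ik) {
                    mask[(size_t)(iq*n_kv + ik)] = -INFINITY;
                }
            }
        }
    }
    // keep the diagonal unmasked so every query still attends to at least one key
    for (int64_t iq = 0; iq < std::min(n_tokens, n_kv); ++iq) {
        mask[(size_t)(iq*n_kv + iq)] = 0.0f;
    }
    return mask;
}
```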
Force-pushed from 7c5f85a to 627bd45
|
Hi @ggerganov @NeoZhangJianyu @CISC
Summary of Updates
Performance (CODEGEMMA — Sparse-K vs Baseline)
Additional Notes
Benchmark Command Used:
./build/bin/llama-cli \
-m /home/gitty/models/codegemma-sparsek-Q4_K_M.gguf \
-ngl 0 \
-c 256 \
-n 400 \
-t 8 |
It's good to see the improvement from this PR and the shared results. Could you also update the PR description with this info? |
|
Thank you for the feedback! |
Co-authored-by: Gitty Burstein <@GittyBurstein> Co-authored-by: Yael Shuker <@yael-works>
|
Hi @ggerganov, |
As it should, since it's skipping a lot of computation, but at what cost? It's far more interesting to know how much accuracy is lost in doing so. At the very least, show some perplexity numbers.
Hopefully this will not sound too harsh, but this PR is a bit of a broken mess right now. It seems most attempts to steer this in the right direction lead to backtracking on earlier course corrections. If we disregard the minor issues for now, your main issue is that you have a test that not only does not test what you are implementing, but also crashes on every single CI run. |
|
@CISC I have the impression that you are reviewing AI generated code. Sincerely. |
Oh, I know I am, that's fine though. It's the human errors I'm concerned about. |
|
@yael-works Also, please don't change the unit-test cases of other ops. Thank you! |
|
@NeoZhangJianyu @CISC @ggerganov We updated the test-backend-ops file with a more precise and limited adjustment that applies only to our SparseK-related cases, without affecting any other operator.
Accuracy testing
We spent a significant amount of time trying to validate output equivalence between runs, but this approach turned out to be extremely difficult. For example, we ran the model once with SparseK enabled and once without it, and the generated outputs varied between runs. Because of this inherent variability, we currently don’t see a practical way to construct deterministic correctness tests for generative outputs.
What our backend tests verify
Our backend tests focus on:
As part of this, we explicitly marked all SparseK test cases as NOT_SUPPORTED on Vulkan backends. |
If you do not provide a static seed (using --seed), the outputs will not be reproducible between runs.
Or better yet just force greedy decoding with |
I think that is not proper greedy decoding in llama.cpp; rather, use ... --sampling-seq k --top-k 1. Reference:
Accuracy Validation for the SparseK Integration
Below is a full summary of the controlled tests conducted to ensure that adding SparseK does not affect output correctness.
Testing Methodology
We performed parallel runs of the baseline model and the SparseK-enhanced model, using identical parameters:
--temp 0
-c 4096
--simple-io
Prompts Used
We tested three very long technical prompts (3000–4000 words), chosen to generate large, stable, and highly comparable outputs.
Similarity Results
To compare outputs, we used the SequenceMatcher similarity metric.
Summary
Across all evaluations, the output remains almost unchanged, indicating that integrating SparseK does not compromise correctness. |
|
Hi @ggerganov,
Sparse-K: Usage, Accuracy Impact, and Operational Notes
1. Overview
Sparse-K has been reintegrated into the Flash Attention architecture via a dynamic mask, eliminating the need for a dedicated operator.
2. Enabling Sparse-K
Example Run Command:
./build/bin/llama-cli \
-m /path/to/model/codegemma-sparsek-Q4_K_M.gguf \
-ngl 0 \
-c 256 \
-n 400 \
-t 8
Optional Deterministic Settings (for accuracy testing):
--seed 1234
--temp 0
--sampling-seq k
--top-k 1
--simple-io
--no-cnv
--ignore-eos
3. Parameter Source
4. Performance Improvements
Sparse-K provides a dramatic speedup in Prompt Evaluation.
Example: CODEGEMMA model
5. Accuracy Impact
Parallel runs with and without Sparse-K, using the same parameters, fixed seed, and deterministic settings, show high similarity.
Similarity by SequenceMatcher
Conclusion
Output is nearly identical.
6. Backend Tests Performed
|
New Attention Mechanism: SparseK Dynamic Attention (CPU, Graph-Level Prototype)
PR Description
This PR integrates an experimental SparseK dynamic attention mechanism into the llama.cpp compute graph on the CPU execution path, built entirely from existing GGML operations.
No new GGML operator or low-level CPU kernels are introduced in this PR.
The purpose of this PR is to establish correct graph logic, metadata handling, and test coverage, before adding optimized kernels or GPU/SYCL support in a follow-up PR.
Overview
SparseK introduces selective sparsity into attention using:
- top-k selection of the strongest attention positions (sparsek_topk)
- a local attention window (sparsek_window)
- a stride pattern (sparsek_stride)
At runtime, SparseK refines the base KQ mask (causal / cross / SWA) by selectively allowing only the strongest or relevant attention positions.
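As a rough way to picture the refinement, the reference-style sketch below keeps a KQ position only if it lies inside the local window or among the top-k scores for that query row; the exact selection rule, the way window and top-k are combined, and the names used here are assumptions for illustration, not the graph code from this PR.

```cpp
// Reference-style sketch of the refinement semantics for one query row:
// a position stays enabled only if it is inside the local window OR among
// the top-k scores, and it was not already masked by the base mask.
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

static void sparsek_refine_row(const float * scores, // [n_kv] raw KQ scores for this query
                               float       * mask,   // [n_kv] base mask row (0 or -INF), updated in place
                               int64_t n_kv, int64_t q_pos, int64_t top_k, int64_t window) {
    // rank the positions that are not already masked
    std::vector<int64_t> idx;
    for (int64_t ik = 0; ik < n_kv; ++ik) {
        if (mask[ik] == 0.0f) idx.push_back(ik);
    }
    const int64_t k = std::min<int64_t>(top_k, (int64_t) idx.size());
    std::partial_sort(idx.begin(), idx.begin() + k, idx.end(),
                      [&](int64_t a, int64_t b) { return scores[a] > scores[b]; });

    std::vector<char> keep(n_kv, 0);
    for (int64_t i = 0; i < k; ++i) keep[idx[i]] = 1;          // strongest positions
    for (int64_t ik = std::max<int64_t>(0, q_pos - window); ik <= q_pos && ik < n_kv; ++ik) {
        keep[ik] = 1;                                          // local window around the query
    }
    for (int64_t ik = 0; ik < n_kv; ++ik) {
        if (!keep[ik]) mask[ik] = -INFINITY;                   // filter everything else
    }
}
```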
What This PR Actually Implements
1. Graph-level SparseK mask building
Files:
llama-graph.cpp, llama-graph.h

This PR adds two new graph functions:
- build_sparsek_mask(q, k, base_mask, il)
- maybe_apply_sparsek_mask(base_mask, q, k, n_kv, n_rows, n_stream, il)

Implemented behavior (see the sketch below):
- ggml_mul_mat(k, q)
- ggml_top_k
- ggml_get_rows
- ggml_set_rows
- ggml_flash_attn_ext once with the final mask

Important: no new GGML operator or low-level CPU kernel is introduced; the mask is built entirely from existing GGML operations.
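A very rough sketch of the graph-level part of this pipeline is shown below. It only covers the score and top-k steps; the tensor shapes in the comments are illustrative, and the scatter of the selected indices back into the mask via ggml_get_rows/ggml_set_rows is only described in a comment. This is an illustrative sketch, not the actual implementation in llama-graph.cpp.

```cpp
// Illustrative sketch of the score/top-k portion of the mask-building
// pipeline. Shapes in the comments are examples; the real build_sparsek_mask
// also handles streams, padding and the scatter back into the base mask.
#include "ggml.h"

static ggml_tensor * sparsek_scores_topk(ggml_context * ctx,
                                         ggml_tensor  * q,   // e.g. [d_head, n_tokens, n_head, 1]
                                         ggml_tensor  * k,   // e.g. [d_head, n_kv,     n_head, 1]
                                         int            top_k) {
    // raw attention scores: [n_kv, n_tokens, n_head, 1]
    ggml_tensor * kq = ggml_mul_mat(ctx, k, q);

    // I32 indices of the top_k strongest positions per query row: [top_k, n_tokens, n_head, 1]
    ggml_tensor * idx = ggml_top_k(ctx, kq, top_k);

    // build_sparsek_mask then combines these indices with the base mask
    // (using ggml_get_rows/ggml_set_rows) so that non-selected positions
    // become -INF, and ggml_flash_attn_ext is called once with that mask.
    return idx;
}
```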
2. SparseK metadata & hyperparameters
Files:
llama-hparams.h, llama-model.cpp, llama-model-loader.cpp

Added optional HParams:
- sparsek_enable
- sparsek_topk
- sparsek_window
- sparsek_stride

These are read from GGUF keys if present:
- llama.sparsek.enable
- llama.sparsek.top_k
- llama.sparsek.window
- llama.sparsek.stride

The runtime graph receives these values and applies them consistently (see the sketch below).
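For reference, here is a minimal standalone sketch of probing these optional keys with the public gguf API; the PR itself reads them through llama.cpp's model loader, and the value types assumed here (bool for enable, u32 for the others) are illustrative.

```cpp
// Minimal sketch using the public gguf API to probe the optional SparseK keys.
// The value types (bool / u32) are assumptions; the PR reads the keys through
// llama.cpp's model loader instead of doing this by hand.
#include "gguf.h"
#include <cstdint>
#include <cstdio>

static void print_sparsek_keys(const char * path) {
    gguf_init_params params = { /*no_alloc =*/ true, /*ctx =*/ nullptr };
    gguf_context * gctx = gguf_init_from_file(path, params);
    if (!gctx) return;

    const int64_t i_enable = gguf_find_key(gctx, "llama.sparsek.enable");
    const int64_t i_top_k  = gguf_find_key(gctx, "llama.sparsek.top_k");

    const bool     enable = i_enable >= 0 ? gguf_get_val_bool(gctx, i_enable) : false;
    const uint32_t top_k  = i_top_k  >= 0 ? gguf_get_val_u32 (gctx, i_top_k)  : 0;

    printf("sparsek: enable=%d top_k=%u\n", enable ? 1 : 0, top_k);
    gguf_free(gctx);
}
```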
3. HF → GGUF converter support
File:
convert_hf_to_gguf.py

If the HF config provides SparseK parameters, they are written into GGUF:
- llama.sparsek.enable
- llama.sparsek.top_k
- llama.sparsek.window
- llama.sparsek.stride

This enables end-to-end metadata flow for models that choose to expose SparseK defaults.
4. Backend tests
File:
tests/test-backend-ops.cpp

Added deterministic test: test_sparsek_kq_mask

This validates correctness of the mask-building pipeline using:
- ggml_new_tensor_*
- ggml_reshape_3d
- ggml_get_rows
- ggml_set_rows
- ggml_reshape_2d

Registered in make_test_cases_eval() so CI covers the mask logic.

Co-Authors
Co-authored-by: Yael Shuker (yaelshuker100@gmail.com)
Co-authored-by: Gitty Burstein (g0534163997@gmail.com)