Mamba2 SSD #16982

gabe-l-hart · 2025-11-03T22:40:34Z

DRAFT STATUS

This PR will remain in Draft until the items in the discussion section are resolved.

Description

This PR is a draft implementation of the Structured State Space Duality (SSD) described in the original Mamba2 paper, which reframes the SSM_SCAN op as a pseudo-attention operation. The paper describes it in great detail, but the short version is that when performing a multi-token scan, the recurrent formulation of SSM_SCAN is inefficient because it cannot parallelize over the sequence dimension the way an attention calculation can. With the SSD formulation, the logical attention matrix is decomposed into chunks and the state is updated at the chunk boundaries, allowing prefill to "jump" by the size of the chunk rather than proceed one token at a time.
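To make the chunk "jump" concrete, here is a minimal pure-Python sketch for a scalar state. It is an illustration of the idea only, not the code in this PR: the names `recurrent_scan` and `chunked_scan` are hypothetical, and the per-token decay is passed as `log_a` (assumed non-positive). The chunked version builds a lower-triangular decay mask from a cumulative sum inside each chunk and only carries the state across chunk boundaries.

```python
import math
import random

def recurrent_scan(log_a, b, c, x, h0=0.0):
    """Token-by-token reference: h_t = a_t*h_{t-1} + b_t*x_t, y_t = c_t*h_t."""
    h, ys = h0, []
    for la, bt, ct, xt in zip(log_a, b, c, x):
        h = math.exp(la) * h + bt * xt
        ys.append(ct * h)
    return ys, h

def chunked_scan(log_a, b, c, x, chunk, h0=0.0):
    """Chunked form: masked pseudo-attention inside a chunk, state carried between chunks."""
    h, ys = h0, []
    for start in range(0, len(x), chunk):
        la = log_a[start:start + chunk]
        n = len(la)
        # Inclusive cumulative sum of the log-decays within this chunk.
        cum, total = [], 0.0
        for v in la:
            total += v
            cum.append(total)
        bx = [b[start + s] * x[start + s] for s in range(n)]
        for t in range(n):
            # Lower-triangular decay mask: exp(cum[t] - cum[s]) for s <= t.
            intra = sum(math.exp(cum[t] - cum[s]) * bx[s] for s in range(t + 1))
            # Contribution of the state that entered the chunk.
            inter = math.exp(cum[t]) * h
            ys.append(c[start + t] * (intra + inter))
        # The state is updated once per chunk boundary.
        h = math.exp(cum[-1]) * h + sum(math.exp(cum[-1] - cum[s]) * bx[s] for s in range(n))
    return ys, h

random.seed(0)
n_tokens = 37
log_a = [-random.random() for _ in range(n_tokens)]
b = [random.uniform(-1, 1) for _ in range(n_tokens)]
c = [random.uniform(-1, 1) for _ in range(n_tokens)]
x = [random.uniform(-1, 1) for _ in range(n_tokens)]
ys_ref, h_ref = recurrent_scan(log_a, b, c, x)
ys_chk, h_chk = chunked_scan(log_a, b, c, x, chunk=8)
```

In plain Python both versions are sequential loops, so this says nothing about speed. The point is that the inner sums in `chunked_scan` are matrix products over the chunk, which is the part a tensor backend can parallelize.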

Reference Links

Original Paper: https://arxiv.org/pdf/2405.21060
Optimized Triton implementation by paper authors: https://github.com/state-spaces/mamba/blob/main/mamba_ssm/ops/triton/ssd_combined.py
Unified implementation in mlx-lm: https://github.com/ml-explore/mlx-lm/blob/main/mlx_lm/models/ssm.py

Changes

Introduce new primitive operations in ggml:
- ggml_cumsum / ggml_cumsum_0: Perform a cumulative sum along a given dimension
  - NOTE: This adds the ability to specify dimension on top of the implementation in metal: TRI, FILL, EXPM1, SOFTPLUS #16623 and relaxes the need for contiguous rows
- ggml_tri_dims / ggml_tri / ggml_tri_keep: Apply a triangular mask to the given matrix
  - NOTE: This adds the ability to specify dimension on top of the implementation in metal: TRI, FILL, EXPM1, SOFTPLUS #16623 and relaxes the need for contiguous rows with ggml_tri_dims
- ggml_softplus: Perform the unary softplus operation
Implement an alternate path through llm_graph_context_mamba::build_mamba2_layer when a multi-token update is detected
- This path is the core of the SSD implementation and avoids calling SSM_SCAN in favor of the chunked pseudo-attention formulation
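For readers unfamiliar with the three primitives, the following sketch shows the math each one computes on plain Python lists. The function names and signatures here are hypothetical and are not the ggml API. In particular, the triangular mask below is just one possible variant (lower triangle, diagonal optionally kept).

```python
import math

def cumsum(row):
    """Inclusive running sum along one dimension."""
    out, total = [], 0.0
    for v in row:
        total += v
        out.append(total)
    return out

def tri_lower(mat, keep_diag=True):
    """Zero everything above the diagonal (and the diagonal too if keep_diag is False)."""
    return [[v if (j < i or (keep_diag and j == i)) else 0.0
             for j, v in enumerate(row)]
            for i, row in enumerate(mat)]

def softplus(x):
    """log(1 + exp(x)), written so that large |x| does not overflow."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

running = cumsum([1.0, 2.0, 3.0])
masked = tri_lower([[1.0, 2.0], [3.0, 4.0]])
sp_zero = softplus(0.0)
```

In the SSD path these pieces combine naturally: softplus is the usual activation for the step size, the cumulative sum accumulates log-decays, and the triangular mask keeps the pseudo-attention causal.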

Discussion

There are a number of outstanding discussion points on this work that need to be resolved before moving it forward:

- Performance: Currently, this implementation appears to be significantly slower than simply using SSM_SCAN, which roundly defeats the purpose of the change! I suspect that the performance issues are due to the number of ggml_permute / ggml_cont ops that are added to the graph, but could use assistance figuring out how to eliminate them or identifying other sources of slowness.
- To chunk or not to chunk: In this PR I have sub-ubatch chunking implemented. I had it mostly working before the corresponding discussion on Qwen3Next. The inter-chunk update would be needed anyway, so I didn't strip it out, but it would be fairly trivial to do so and might offer some performance improvements.
- Handling of repeat_interleave: Similar to the issue that came up when initially implementing NemotronH support, I believe that ggml_repeat behaves differently than mx.repeat, resulting in incorrect results for models with n_groups > 1 (tested with NemotronH).

Testing

I've tested this locally with various members of the Granite 4 family and with nvidia/NVIDIA-Nemotron-Nano-9B-v2. For the Granite 4 models with n_groups == 1, I get nearly identical results to running with purely SSM_SCAN, but NemotronH still struggles due to repeat_interleave issues (see above). I'll flesh out more testing results once we've worked through some of the above issues.

cc @compilade since I know this has been on your TODO list since the original mamba2 implementation.

pwilkin · 2025-11-03T23:04:34Z

Yeah, I had an issue with repeat_interleave too. Technically, repeat_interleave is equivalent to permute + repeat, but of course it introduces additional operations.
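A small sketch of the difference, using Python lists as stand-ins for the group dimension. The assumption here is that the mismatch described above is a tile-style repeat versus an element-wise (interleaved) repeat. All function names are hypothetical, and the reshape/transpose is written out by hand to mirror the "permute + repeat" equivalence.

```python
def repeat_tile(groups, n):
    """Tile-style repeat: [g0, g1] -> [g0, g1, g0, g1]."""
    return list(groups) * n

def repeat_interleave(groups, n):
    """Interleaved repeat: [g0, g1] -> [g0, g0, g1, g1]."""
    return [g for g in groups for _ in range(n)]

def interleave_via_tile(groups, n):
    """Same result as repeat_interleave, built from tile + reshape + transpose."""
    k = len(groups)
    tiled = repeat_tile(groups, n)
    rows = [tiled[i * k:(i + 1) * k] for i in range(n)]       # view as n rows of k groups
    return [rows[r][g] for g in range(k) for r in range(n)]   # transpose, then flatten

two_groups = ["g0", "g1"]
tiled = repeat_tile(two_groups, 2)
interleaved = repeat_interleave(two_groups, 2)
rebuilt = interleave_via_tile(two_groups, 2)
```

With a single group the two orderings are identical, which would be consistent with models that have n_groups == 1 matching SSM_SCAN while models with n_groups > 1 do not.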

pwilkin · 2025-11-03T23:06:59Z

Regarding the chunking: won't this explode the graph a lot?

In case of Delta Net attention, since you have to use triangular solve there, you don't want the chunk size over 64 or performance drops drastically. But that means that you're going to go up to 8 chunks for a typical ubatch size of 512.

The graph for Qwen3 Next already has 9000 nodes. I'm a bit afraid of doing chunking this way (and I know @ggerganov had strong objections too).

gabe-l-hart · 2025-11-03T23:12:07Z

> chunking: won't this explode the graph a lot?

Yep, it sure will. I also suspect this as one of the reasons this is slower currently. I don't think SSD has the same need for chunking based on computational complexity, so I think it's mostly there for memory overhead management.

gabe-l-hart · 2025-11-05T17:00:33Z

I've pulled the changes to llama-gguf (#17025), llama-eval-callback (#17028), and test-backend-ops (#17029) into separate PRs and will plan to update this PR once they're reviewed.

gabe-l-hart · 2025-11-05T19:16:23Z

I've been further experimenting with a few tweaks to get more performance out of this.

Add F16 and BF16 support to SSM_CONV so that we can reduce the precision of ssm_conv1d.weight when converting to GGUF
- This one seems to have a small but noticeable perf boost both with and without the SSD version of SSM_SCAN
Support F16 cache types for both r and s recurrent caches
- This requires a ggml_cast before calling the single-token ggml_ssm_scan and a cast back since the output is the merge of x and next_state
- For the SSD formulation, doing the cast only at the end gives a noticeable performance improvement
Remove sub-ubatch batching
- This one gives a fairly significant performance improvement, getting the SSD version close to raw SSM_SCAN.

Local notes using variants of the following command:

./bin/llama-batched-bench -m ~/models/ibm-granite/granite-4.0-h-1b/granite-4.0-h-1B-BF16-exp.gguf -c 2048 -b 2048 -ub 512 -npp 128,256 -ntg 128 -npl 1,2,4 -ngl 99

NOTE: granite-4.0-h-1B-BF16-exp.gguf includes my changes to allow ssm_conv1d.weight to be quantized as low as F16 during llama-quantize, so in this model, it's BF16 instead of FP32.

Baseline SSM_SCAN

F16 cache w/ F32 conv

PP	TG	B	N_KV	T_PP s	S_PP t/s	T_TG s	S_TG t/s	T s	S t/s
128	128	1	256	0.110	1167.11	2.777	46.09	2.887	88.67
128	128	2	512	0.193	1328.97	3.187	80.34	3.379	151.51
128	128	4	1024	0.369	1386.92	4.149	123.41	4.518	226.65
256	128	1	384	0.191	1343.18	2.781	46.03	2.971	129.23
256	128	2	768	0.365	1402.74	3.149	81.29	3.514	218.54
256	128	4	1536	0.737	1388.79	4.150	123.37	4.887	314.27

F32 cache / F32 conv

PP	TG	B	N_KV	T_PP s	S_PP t/s	T_TG s	S_TG t/s	T s	S t/s
128	128	1	256	0.109	1173.59	2.767	46.26	2.876	89.01
128	128	2	512	0.192	1331.47	3.193	80.18	3.385	151.25
128	128	4	1024	0.370	1384.20	4.204	121.78	4.574	223.87
256	128	1	384	0.190	1345.33	2.734	46.81	2.925	131.30
256	128	2	768	0.365	1401.15	3.206	79.85	3.572	215.03
256	128	4	1536	0.737	1389.54	4.245	120.60	4.982	308.30

F16 cache w/ BF16 conv

PP	TG	B	N_KV	T_PP s	S_PP t/s	T_TG s	S_TG t/s	T s	S t/s
128	128	1	256	0.109	1176.62	2.786	45.94	2.895	88.42
128	128	2	512	0.191	1342.10	3.106	82.41	3.297	155.28
128	128	4	1024	0.364	1406.48	4.080	125.48	4.444	230.40
256	128	1	384	0.190	1349.40	2.796	45.77	2.986	128.60
256	128	2	768	0.363	1412.04	3.198	80.06	3.560	215.71
256	128	4	1536	0.731	1401.63	4.093	125.09	4.824	318.43

With SSD

F16 cache w/ F32 conv

PP	TG	B	N_KV	T_PP s	S_PP t/s	T_TG s	S_TG t/s	T s	S t/s
128	128	1	256	0.134	957.38	2.852	44.89	2.985	85.76
128	128	2	512	0.238	1073.98	3.241	79.00	3.479	147.17
128	128	4	1024	0.449	1141.23	4.149	123.41	4.598	222.73
256	128	1	384	0.256	999.61	2.820	45.40	3.076	124.85
256	128	2	768	0.512	1000.77	3.223	79.43	3.735	205.64
256	128	4	1536	0.897	1141.60	4.130	123.97	5.027	305.54

F32 cache / F32 conv

PP	TG	B	N_KV	T_PP s	S_PP t/s	T_TG s	S_TG t/s	T s	S t/s
128	128	1	256	0.132	972.95	2.843	45.03	2.974	86.07
128	128	2	512	0.237	1079.11	3.174	80.65	3.411	150.09
128	128	4	1024	0.448	1141.80	4.211	121.58	4.660	219.76
256	128	1	384	0.255	1004.52	2.807	45.59	3.062	125.40
256	128	2	768	0.511	1001.48	3.202	79.94	3.714	206.81
256	128	4	1536	0.897	1141.03	4.201	121.87	5.099	301.26

F16 cache w/ BF16 conv

PP	TG	B	N_KV	T_PP s	S_PP t/s	T_TG s	S_TG t/s	T s	S t/s
128	128	1	256	0.130	985.88	2.840	45.07	2.970	86.21
128	128	2	512	0.236	1084.81	3.195	80.12	3.431	149.22
128	128	4	1024	0.443	1155.40	4.150	123.37	4.593	222.94
256	128	1	384	0.252	1015.24	2.820	45.39	3.072	125.00
256	128	2	768	0.507	1009.49	3.219	79.52	3.726	206.10
256	128	4	1536	0.891	1149.51	4.101	124.84	4.992	307.69

F16 cache w/ BF16 conv and SSD cast at end

PP	TG	B	N_KV	T_PP s	S_PP t/s	T_TG s	S_TG t/s	T s	S t/s
128	128	1	256	0.125	1027.02	2.810	45.56	2.934	87.25
128	128	2	512	0.228	1122.36	3.191	80.23	3.419	149.75
128	128	4	1024	0.492	1039.70	4.145	123.53	4.637	220.82
256	128	1	384	0.247	1037.08	2.814	45.49	3.061	125.46
256	128	2	768	0.502	1020.66	3.232	79.22	3.733	205.71
256	128	4	1536	0.981	1043.44	4.089	125.22	5.070	302.96

F16 cache w/ BF16 conv, SSD cast at end, and no sub-ubatch batching

PP	TG	B	N_KV	T_PP s	S_PP t/s	T_TG s	S_TG t/s	T s	S t/s
128	128	1	256	0.117	1093.26	2.754	46.48	2.871	89.17
128	128	2	512	0.211	1213.76	3.251	78.75	3.462	147.90
128	128	4	1024	0.412	1243.73	4.161	123.04	4.573	223.92
256	128	1	384	0.230	1113.91	2.806	45.61	3.036	126.48
256	128	2	768	0.467	1097.32	3.184	80.41	3.650	210.39
256	128	4	1536	0.817	1252.68	4.110	124.59	4.927	311.75
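To put a number on "close", the snippet below divides the prompt-processing throughput (S_PP t/s) of the last SSD table by the matching rows of the baseline "F16 cache w/ BF16 conv" table. The values are copied from the tables above. Pairing these two particular tables is my choice for the comparison, not something the benchmark tool reports.

```python
# S_PP t/s, rows in table order: (PP, B) = (128,1) (128,2) (128,4) (256,1) (256,2) (256,4)
baseline_bf16_conv = [1176.62, 1342.10, 1406.48, 1349.40, 1412.04, 1401.63]
ssd_no_sub_ubatch = [1093.26, 1213.76, 1243.73, 1113.91, 1097.32, 1252.68]

ratios = [ssd / base for ssd, base in zip(ssd_no_sub_ubatch, baseline_bf16_conv)]
for (pp, b), r in zip([(128, 1), (128, 2), (128, 4), (256, 1), (256, 2), (256, 4)], ratios):
    print(f"PP={pp:3d} B={b}: SSD prefill runs at {r:.1%} of the SSM_SCAN baseline")
```

Every ratio is below 1, so even the fastest SSD variant here still trails SSM_SCAN on prefill for this model and these batch sizes.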

ggerganov · 2025-11-05T21:40:28Z

> To chunk or not to chunk

Probably you have to use large ubatch and do some chunking in order to get some benefits from the SSD. But I don't have a good estimate about what the optimal sizes would be.

At the default ubatch of 512, you can do the following experiment on master:

make -j && ./bin/llama-bench -m ../models/granite-4-h-tiny/ggml-model-q8_0.gguf -fa 1 -t 1 -p 2048 -ub 512 -n 0

model	size	params	backend	threads	fa	test	t/s
granitehybrid 1B Q8_0	6.88 GiB	6.94 B	Metal,BLAS	1	1	pp2048	2116.14 ± 16.32
build: `230d116` (6962)

Now make the ssm scan a noop and run the test again:

diff --git a/ggml/src/ggml-metal/ggml-metal.metal b/ggml/src/ggml-metal/ggml-metal.metal
index 424c400f2..7881e63e0 100644
--- a/ggml/src/ggml-metal/ggml-metal.metal
+++ b/ggml/src/ggml-metal/ggml-metal.metal
@@ -2129,6 +2129,7 @@ kernel void kernel_ssm_scan_f32(
         ushort  tiisg[[thread_index_in_simdgroup]],
         ushort  sgptg[[simdgroups_per_threadgroup]],
         uint3    tgpg[[threadgroups_per_grid]]) {
+    return;
     constexpr short NW = N_SIMDWIDTH;
 
     shared[tpitg.x] = 0.0f;

model	size	params	backend	threads	fa	test	t/s
granitehybrid 1B Q8_0	6.88 GiB	6.94 B	Metal,BLAS	1	1	pp2048	2699.35 ± 1.87
build: `230d116` (6962)

This is the upper bound that you would get at this ubatch size, i.e. no SSD implementation can be faster than this.
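The two measurements can be turned into a rough budget. This is a back-of-the-envelope sketch that assumes the no-op kernel changes nothing else about the run, so the whole difference is attributed to the scan.

```python
baseline_tps = 2116.14  # pp2048 on master, from the first table
noop_tps = 2699.35      # pp2048 with kernel_ssm_scan_f32 returning immediately

# Time per token is 1/tps, so the share of prefill time spent in the scan is
# (1/baseline - 1/noop) / (1/baseline) = 1 - baseline/noop.
scan_share = 1.0 - baseline_tps / noop_tps
max_speedup = noop_tps / baseline_tps

print(f"scan share of prefill time: {scan_share:.1%}")
print(f"speedup if the scan were free: {max_speedup:.2f}x")
```

So at ubatch 512 the scan accounts for a bit over a fifth of prefill time on this setup, and an SSD path has to fit inside that budget to be a win.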

Increasing the ubatch size increases the gap, so it gives more room for a good SSD implementation to outperform the ssm scan.

In any case, the first step seems to be to reduce the number of ops, permutations, and conts in the SSD branch as much as possible.

gabe-l-hart · 2025-11-06T19:53:13Z

> Now make the ssm scan a noop and run the test again:

🤦 I feel really silly for not figuring this trick out. I've been snipping out chunks of the graph and trying to coerce the input/output tensors to the same shape to simulate this upper bound part!

> Probably you have to use large ubatch and do some chunking in order to get some benefits from the SSD. But I don't have a good estimate about what the optimal sizes would be.

This makes a lot of sense! I'll do some large-ubatch experiments to see if the current code may already be at a cross-over point where SSD can start offering better performance with larger batches. The speed advantages are very much supposed to be primarily felt at longer context which likely also means longer ubatches.

gabe-l-hart · 2025-11-06T20:00:13Z

It looks like the current code is not there yet and in fact starts to degrade further when the ubatch size gets bigger (e.g. 1024), on my machine at least.

ggerganov · 2025-11-06T20:03:45Z

I guess it's expected since it does not have the chunking logic.

pwilkin · 2025-11-06T21:09:22Z

@gabe-l-hart you can look at the discussion in the Qwen3 Next thread, but basically, if you ever use the recurrent update logic aka solve_triangular, you must chunk because, unlike almost all the operations used so far, solve_triangular is of O(n^3) complexity. That means that performance will degrade rapidly with bigger chunks. For the Qwen3 Next model, I was unable to compute a single chunk in reasonable time with ubatch size 512, while with ubatch size 64 (equal to the chunk size for the reference implementation) it was reasonably fast.
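A quick way to see why chunk size matters so much for a cubic op: splitting a ubatch of n tokens into chunks of size c costs roughly (n / c) * c^3 = n * c^2 units of work. The helper below is a hypothetical cost model in abstract units, not a measurement, and it counts a partial final chunk as a full one.

```python
def cubic_cost(ubatch, chunk):
    """Work units for an O(chunk^3) op applied once per chunk."""
    n_chunks = -(-ubatch // chunk)  # ceiling division
    return n_chunks * chunk ** 3

one_big_chunk = cubic_cost(512, 512)   # a single 512-token chunk
eight_chunks = cubic_cost(512, 64)     # 8 chunks of 64 tokens
print(one_big_chunk // eight_chunks)   # how many times more work the single chunk does
```

Under this model the single 512-token chunk does 64 times the work of eight 64-token chunks, at the price of more graph nodes per ubatch.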

gabe-l-hart · 2025-12-04T22:37:16Z

Now that we've got the underlying ops merged, I've redone the core SSD changes here. It's still quite a bit slower than with SSM_SCAN, so it still needs optimization work.

pwilkin · 2025-12-06T01:10:24Z

@gabe-l-hart regarding max-nodes you might want to also look at #17794

gabe-l-hart requested a review from compilade November 3, 2025 22:40

github-actions bot added the testing, Nvidia GPU, examples, ggml, and Apple Metal labels Nov 3, 2025

github-actions bot added the model Model specific label Nov 5, 2025

gabe-l-hart mentioned this pull request Nov 5, 2025

examples(gguf): GGUF example outputs #17025

Merged

This was referenced Nov 5, 2025

examples(eval-callback): Eval callback verbosity #17028

Open

tests(test-backend-ops): Test backend ops verbosity #17029

Open

gabe-l-hart force-pushed the Mamba2SSD branch from 4435600 to d2779ae Compare December 4, 2025 22:28

gabe-l-hart added 2 commits December 5, 2025 15:22

feat: First-pass at porting SSD impl from previous work

eb4e399

It builds but doesn't run yet Branch: Mamba2SSD Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

fix: Increase max nodes for models known to use mamba2

8ba4d39

Branch: Mamba2SSD Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

gabe-l-hart force-pushed the Mamba2SSD branch from d2779ae to 8ba4d39 Compare December 5, 2025 22:23
