CUDA: generalized (mma) FA, add Volta support #17505
Conversation
Thank you for the info, I shall work on FA for RDNA4 once this PR is merged. It looks like the logic for the transposed tile is still empty.
Testing the performance: prefill performance is greatly improved; however, TG is slower. I think it's better to use BEST_FATTN_KERNEL_VEC for TG. On the master branch (7d2add5):

```sh
./build-volta/bin/llama-bench -m /models/llm/llama/llama-2-7b.Q4_0.gguf -fa 0,1 -p 512,1024,2048,4096,8192,16384 -n 128,256,512,1024
```

With this PR merged:
Thank you for reporting this issue. The performance tuning for LLaMA 2 7b in particular was suboptimal because it's a very old model that doesn't use GQA, and I forgot to test that particular scenario.
I see this with some other models as well. For example, Qwen3 14B has slightly lower TG throughput with this PR even though other models are faster or the same. With PR:
Without PR:
OK, I pulled the latest changes; both models are faster now. Qwen3moe 30B-A3B is also slightly faster.
I can do some testing of the PR, but I probably won't be able to provide a comprehensive review. @am17an Could you help with that?
am17an left a comment:
My comments are just for better readability; I don't understand the fattn code at all at the moment.
ggml/src/ggml-cuda/fattn-mma-f16.cuh:
```cpp
    return 64;
}
static constexpr __device__ int ggml_cuda_fattn_mma_get_nthreads(const int DKQ, const int DV, const int ncols) {
    return (((ggml_cuda_fattn_mma_get_config(DKQ, DV, ncols) >> 0) & ((1 << 4) - 1)) + 1) * 32;
```
The launch_bounds not being able to handle templates makes this code quite complex and hard to understand. I much prefer the old way. Perhaps for ROCm we can hardcode some values? Or maybe something like this is better:
```cpp
struct fattn_values_DKQ_DV_ncols {
    int nbatch_fa  : 4;
    int nwarps_max : 4;
    // ...
};
static_assert(sizeof(fattn_values_DKQ_DV_ncols) == sizeof(int));
```
I'm open to alternatives but as I will be the primary maintainer for this code I have the following requirements:

- Kernel parameters need to be known at compile time.
- Consistent handling of NVIDIA and AMD GPUs. This unfortunately makes it impossible to bundle the values as templates or structs (or I just wasn't able to wrangle the compiler in the right way). So I'm instead packing the values as integers (see the sketch after this list).
- Kernel parameters for a given template specialization should be specified exactly once and in a single place. With the code on master, the fact that there are multiple functions that together define a template specialization is a significant complication if the kernel parameters are not monotonically increasing/decreasing with e.g. the number of Q columns. The only exception is e.g. the use of thin or wide `mma.cuh` tiles, where one option is always preferable if it can be used. With the old system there was also the problem that it was possible to specify inconsistent parameters for host and device, and this resulted in multiple bugs.
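To make the second point concrete, here is a minimal sketch (hypothetical names and made-up values, not the actual PR code) of the packed-integer approach: because the packed config is a plain constexpr int, the unpacked values are integral constant expressions that can be used directly in `__launch_bounds__` and reused in device code.

```cpp
// Sketch only: pack nthreads and occupancy into one constexpr int.
static constexpr __host__ __device__ int fattn_sketch_cfg_packed(int DKQ, int DV, int ncols) {
    // bits 0..3: nthreads/32 - 1, bits 4..6: occupancy - 1 (values made up)
    return ncols >= 16 ? ((256/32 - 1) << 0) | ((1 - 1) << 4)
                       : ((128/32 - 1) << 0) | ((2 - 1) << 4);
}
static constexpr __host__ __device__ int fattn_sketch_nthreads(int DKQ, int DV, int ncols) {
    return (((fattn_sketch_cfg_packed(DKQ, DV, ncols) >> 0) & ((1 << 4) - 1)) + 1) * 32;
}
static constexpr __host__ __device__ int fattn_sketch_occupancy(int DKQ, int DV, int ncols) {
    return  ((fattn_sketch_cfg_packed(DKQ, DV, ncols) >> 4) & ((1 << 3) - 1)) + 1;
}

template <int DKQ, int DV, int ncols>
__global__ void __launch_bounds__(fattn_sketch_nthreads(DKQ, DV, ncols),
                                  fattn_sketch_occupancy(DKQ, DV, ncols))
fattn_sketch_kernel_packed(float * dst) {
    // The same helpers are evaluated at compile time in device code, so the launch
    // bounds and the kernel's internal tiling are defined in exactly one place.
    constexpr int nthreads = fattn_sketch_nthreads(DKQ, DV, ncols);
    if (threadIdx.x == 0) {
        dst[0] = (float) nthreads;
    }
}
```

The drawback, as discussed below, is that packing and unpacking by bit position is harder to read than named struct fields.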
I think what I suggested should satisfy your requirements, using bit-fields to make an int?
I didn't try the use of bit fields in particular but I was not able to package all relevant values into a struct in such a way that the ROCm compiler would accept it inside __launch_bounds__. I don't think the use of bit fields is relevant here.
If you are able to fit it into a uint32_t, you can fit it into a struct with bit-fields. It's simply replacing this code:
```cpp
return ((((nthreads)       /  32) - 1) <<  0) | \
       ((((occupancy)      /   1) - 1) <<  4) | \
       ((((nbatch_fa)      /  32) - 1) <<  7) | \
       ((((nbatch_K2)      /   8) - 1) << 10) | \
       ((((nbatch_V2)      /   8) - 1) << 17) | \
       ((((nbatch_combine) /   8) - 1) << 23) | \
       ((((nstages_target) /   1) - 1) << 28) | \
       (((Q_in_reg) ? 1 : 0)           << 29); \
```
with
```cpp
struct config {
    uint32_t nthreads       : 2;
    uint32_t occupancy      : 3;
    uint32_t nbatch_fa      : 3;
    uint32_t nbatch_k2      : 3;
    uint32_t nbatch_v2      : 7;
    uint32_t nbatch_combine : 6;
    uint32_t nstages_target : 5;
    uint32_t Q_in_reg       : 1;
};
```
(bit-widths might be wrong, just for illustrative purposes)
You can just return this struct inside that macro and read the values without shifting bits.
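For illustration, here is a sketch of what that read side looks like; the field widths are taken from the shift positions of the macro above, while the lookup values and names are made up:

```cpp
#include <cstdint>

// Sketch only: same fields as the packing macro, expressed as bit-fields so the
// whole config still fits into a single 32-bit value.
struct fattn_cfg_bits {
    uint32_t nthreads       : 4;  // nthreads/32 - 1
    uint32_t occupancy      : 3;  // occupancy - 1
    uint32_t nbatch_fa      : 3;  // nbatch_fa/32 - 1
    uint32_t nbatch_K2      : 7;  // nbatch_K2/8 - 1
    uint32_t nbatch_V2      : 6;  // nbatch_V2/8 - 1
    uint32_t nbatch_combine : 5;  // nbatch_combine/8 - 1
    uint32_t nstages_target : 1;  // nstages_target - 1
    uint32_t Q_in_reg       : 1;
};
static_assert(sizeof(fattn_cfg_bits) == sizeof(uint32_t), "config must stay one int wide");

// Hypothetical lookup, values made up for illustration:
constexpr fattn_cfg_bits fattn_sketch_get_cfg_bits(int DKQ, int DV, int ncols) {
    return {128/32 - 1, 2 - 1, 64/32 - 1, 64/8 - 1, 64/8 - 1, 64/8 - 1, 2 - 1, 1};
}

// Reading a value becomes a member access instead of shift-and-mask:
constexpr int fattn_sketch_nthreads_bits = (fattn_sketch_get_cfg_bits(64, 64, 8).nthreads + 1) * 32;
static_assert(fattn_sketch_nthreads_bits == 128, "example value");
```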
It's not a matter of whether you can fit the values in some number of bits; the struct would only be used at compile time and would be optimized out anyway. It's a matter of me not being able to use structs to define launch bounds at all: I was not able to define a constexpr function that returns either a template or a struct in a way that the ROCm compiler would accept for use in __launch_bounds__.
Thank you, I did a quick port of that pattern to the tile kernel where I originally had this issue and it seems to be working correctly. I'll take this over for the mma kernel as well. I think the reason it previously wasn't working for me was that I didn't use a constexpr constructor.
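For reference, a minimal sketch of the pattern described here (hypothetical names, reduced field set, made-up values): a config struct with a constexpr constructor, returned from a constexpr function and consumed in `__launch_bounds__`. Whether this exact form is accepted by a given ROCm version is an assumption based on the discussion above.

```cpp
// Sketch only, not the actual kernel config.
struct fattn_cfg {
    int nthreads;
    int occupancy;
    constexpr __host__ __device__ fattn_cfg(int nthreads, int occupancy)
        : nthreads(nthreads), occupancy(occupancy) {}
};

static constexpr __host__ __device__ fattn_cfg fattn_sketch_cfg(int DKQ, int DV, int ncols) {
    return ncols >= 16 ? fattn_cfg(256, 1) : fattn_cfg(128, 2);  // made-up values
}

template <int DKQ, int DV, int ncols>
__global__ void __launch_bounds__(fattn_sketch_cfg(DKQ, DV, ncols).nthreads,
                                  fattn_sketch_cfg(DKQ, DV, ncols).occupancy)
fattn_sketch_kernel(float * dst) {
    // The same compile-time config is visible inside the kernel:
    constexpr fattn_cfg cfg = fattn_sketch_cfg(DKQ, DV, ncols);
    if (threadIdx.x == 0) {
        dst[0] = (float) cfg.nthreads;
    }
}
```

In this form the parameters are still specified exactly once, but they are read back by name instead of by bit position.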
```cpp
template<int DKQ, int DV, int ncols1, int ncols2, int nwarps,
         bool use_logit_softcap, bool mla, bool needs_fixup, bool is_fixup, bool last_iter, bool oob_check,
         typename T_A_KQ, typename T_B_KQ, typename T_C_KQ, typename T_A_VKQ, typename T_B_VKQ, typename T_C_VKQ>
```
I think these should still be called tile_A_KQ etc. for readability
I was previously using names like that and decided to shorten them at some point precisely because otherwise some lines in the kernel would have poor readability due to being too long.
You can still keep them as tile_* and then use the short form via T_ aliases locally. Without an IDE it is not clear what T_A_KQ means from the template parameters alone, whereas most of the other parameters are clearer. Perhaps a traits struct like tile_config would also be good. But of course this is just for readability; we can refactor later as well.
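A small sketch of that naming suggestion (illustrative signature, not the PR's actual code):

```cpp
// Keep the descriptive tile_* names in the template parameter list and only
// introduce the short T_* spellings as local aliases where long MMA expressions
// would otherwise make lines too long.
template<int DKQ, int DV, int ncols1, int ncols2, int nwarps,
         typename tile_A_KQ, typename tile_B_KQ, typename tile_C_KQ,
         typename tile_A_VKQ, typename tile_B_VKQ, typename tile_C_VKQ>
static __device__ __forceinline__ void flash_attn_sketch_iter(/* ... */) {
    using T_A_KQ  = tile_A_KQ;   // short forms for use inside the kernel body
    using T_B_KQ  = tile_B_KQ;
    using T_C_KQ  = tile_C_KQ;
    using T_A_VKQ = tile_A_VKQ;
    using T_B_VKQ = tile_B_VKQ;
    using T_C_VKQ = tile_C_VKQ;
    // ... kernel body using the T_* aliases ...
}
```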
If I'm being perfectly honest, I've very rarely received feedback from Diego w.r.t. the low-level device code for the FA or GEMM kernels. So if you review only the more high-level code, that would already be largely equivalent to how things have gone until now.
Thanks, I would also like to do a best-effort review of the FA/GEMM kernels if you don't mind. Since I do occasionally modify that code, such reviews may be useful to you, but they are definitely useful to me in gaining understanding. For this PR, I just have the comments I gave earlier.
ggerganov left a comment:
Some benchmarks on the DGX Spark - TG is improved at higher contexts:
```sh
GGML_CUDA=ON ./scripts/compare-commits.sh master ec176eef7 llama-bench -m ./models/gpt-oss-20b/ggml-model-mxfp4.gguf -m ./models/gpt-oss-120b/ggml-model-mxfp4-00001-of-00003.gguf -m ./models/qwen3-30b-a3b-coder/ggml-model-q8_0.gguf -m ~/.cache/llama.cpp/ggml-org_gemma-3-4b-it-qat-GGUF_gemma-3-4b-it-qat-Q4_0.gguf -fa 1 -d 2048,4096,8192,16384,32768 -p 0 -n 32 -ub 2048 -mmp 0 -r 10
```

| Model | Test | t/s master | t/s ec176ee | Speedup |
|---|---|---|---|---|
| gemma3 4B Q4_0 | tg32@d2048 | 71.71 | 71.84 | 1.00 |
| gemma3 4B Q4_0 | tg32@d4096 | 72.85 | 73.19 | 1.00 |
| gemma3 4B Q4_0 | tg32@d8192 | 68.27 | 68.80 | 1.01 |
| gemma3 4B Q4_0 | tg32@d16384 | 66.78 | 67.62 | 1.01 |
| gemma3 4B Q4_0 | tg32@d32768 | 58.65 | 59.76 | 1.02 |
| gpt-oss 120B MXFP4 MoE | tg32@d2048 | 48.23 | 48.27 | 1.00 |
| gpt-oss 120B MXFP4 MoE | tg32@d4096 | 47.41 | 47.46 | 1.00 |
| gpt-oss 120B MXFP4 MoE | tg32@d8192 | 44.16 | 45.60 | 1.03 |
| gpt-oss 120B MXFP4 MoE | tg32@d16384 | 41.61 | 42.80 | 1.03 |
| gpt-oss 120B MXFP4 MoE | tg32@d32768 | 36.90 | 37.74 | 1.02 |
| gpt-oss 20B MXFP4 MoE | tg32@d2048 | 79.22 | 79.20 | 1.00 |
| gpt-oss 20B MXFP4 MoE | tg32@d4096 | 76.26 | 76.51 | 1.00 |
| gpt-oss 20B MXFP4 MoE | tg32@d8192 | 71.97 | 73.02 | 1.01 |
| gpt-oss 20B MXFP4 MoE | tg32@d16384 | 66.86 | 68.17 | 1.02 |
| gpt-oss 20B MXFP4 MoE | tg32@d32768 | 57.81 | 59.56 | 1.03 |
| qwen3moe 30B.A3B Q8_0 | tg32@d2048 | 56.37 | 56.41 | 1.00 |
| qwen3moe 30B.A3B Q8_0 | tg32@d4096 | 53.67 | 53.68 | 1.00 |
| qwen3moe 30B.A3B Q8_0 | tg32@d8192 | 44.10 | 47.24 | 1.07 |
| qwen3moe 30B.A3B Q8_0 | tg32@d16384 | 36.91 | 39.42 | 1.07 |
| qwen3moe 30B.A3B Q8_0 | tg32@d32768 | 28.02 | 29.48 | 1.05 |
* CUDA: generalized (mma) FA, add Volta support
* use struct for MMA FA kernel config

Co-authored-by: Aman Gupta <aman>
This PR makes the following changes to the CUDA FlashAttention code:

- The kernel no longer requires the mask to be padded in the `mask->ne[1]` direction. This is done by applying a modulo on the mask column that is being read so no conditional statements need to be evaluated (see the sketch after this list). The impact on performance is negligible and I do not deem it necessary to compile additional template specializations. See ggml : remove KQ mask padding #16309. cc @ggerganov.
- The `tile` template in `mma.cuh` has been extended with additional, optional arguments to safely handle situations where tiles of the same shape can have different physical data layouts.
- The kernel configuration is packed in such a way that it can be used in `__launch_bounds__` when using ROCm (as of right now ROCm is not used).
- The kernel no longer requires padding of `K->ne[1]`. As with the tile kernel, because this comes at a cost to performance it is still preferable to pad the KV cache length. As of right now this padding is still required to be 256; for the currently supported GPUs it should be possible to lower this to 128 without issue once the WMMA kernel has been completely replaced. For Hopper it may still make sense to have a padding of 256, but as it is I have no idea whether the 256x64 instruction would actually have better performance than the 128x64 instruction.
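A minimal sketch of the modulo trick mentioned in the first bullet (hypothetical names and indexing, not the actual kernel code):

```cpp
#include <cuda_fp16.h>

// Sketch only: read the KQ mask without requiring padding in the mask->ne[1]
// direction. Instead of guarding the load with a conditional, the column index is
// wrapped with a modulo so the access is always in bounds; values read for
// out-of-range columns never contribute to the written output.
static __device__ __forceinline__ half fattn_sketch_load_mask(
        const half * __restrict__ mask,   // KQ mask data
        const int    col,                 // requested mask column, may be >= ne1
        const int    row,                 // mask row within the current KV tile
        const int    ne1,                 // actual (unpadded) number of mask columns
        const int    stride_col) {        // elements between consecutive mask columns
    const int col_mod = col % ne1;        // wrap instead of branching
    return mask[(size_t) col_mod*stride_col + row];
}
```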
As of right now the interface in `mma.cuh` is suboptimal and long-term I intend to refactor it to allow the use of tensor cores in a more uniform way. However, I don't know the exact requirements until we have proper support for AMD WMMA and AMD MFMA instructions. So for now I think the correct choice is to prioritize getting working support for those at the cost of maintainability and to do a refactor afterwards.

V100 performance

Other GPU performance
The performance numbers assume that the KQ mask is no longer being padded. This change is also in this PR. I don't have a good overview of which other backends may still need support for this change and whether or not it should be reverted prior to merging.