CUDA: generalized (mma) FA, add Volta support #17505
Conversation
Thank you for the info, I shall work on FA for RDNA4 once this PR is merged. It looks like the logic for the transposed tile is still empty.
Testing the performance: prefill performance is greatly improved; however, TG is slower. I think it's better to use BEST_FATTN_KERNEL_VEC for TG. On the master branch (7d2add5):

```sh
./build-volta/bin/llama-bench -m /models/llm/llama/llama-2-7b.Q4_0.gguf -fa 0,1 -p 512,1024,2048,4096,8192,16384 -n 128,256,512,1024
```

With this PR merged:
Thank you for reporting this issue. The performance tuning for LLaMA 2 7b in particular was suboptimal because it's a very old model that doesn't use GQA, and I forgot to test that particular scenario.
I see this with some other models as well. For example, Qwen3 14B has slightly lower TG throughput with this PR even though other models are faster or the same. With PR:
Without PR:
OK, I pulled the latest changes; both models are faster now. Qwen3moe 30B-A3B is also slightly faster.
I can do some testing of the PR, but I probably won't be able to provide a comprehensive review. @am17an Could you help with that?
am17an left a comment:
My comments are just for better readability; I don't understand the fattn code at all at the moment.
ggml/src/ggml-cuda/fattn-mma-f16.cuh:
```cpp
    return 64;
}
static constexpr __device__ int ggml_cuda_fattn_mma_get_nthreads(const int DKQ, const int DV, const int ncols) {
    return (((ggml_cuda_fattn_mma_get_config(DKQ, DV, ncols) >> 0) & ((1 << 4) - 1)) + 1) * 32;
```
The launch_bounds not being able to handle templates makes this code quite complex and hard to understand. I much prefer the old way. Perhaps for ROCm we can hardcode some values? Or maybe something like this is better:
```cpp
struct fattn_values_DKQ_DV_ncols {
    int nbatch_fa  : 4;
    int nwarps_max : 4;
    // ...
};
static_assert(sizeof(fattn_values_DKQ_DV_ncols) == sizeof(int));
```
I'm open to alternatives but as I will be the primary maintainer for this code I have the following requirements:

- Kernel parameters need to be known at compile time.
- Consistent handling of NVIDIA and AMD GPUs. This unfortunately makes it impossible to bundle the values as templates or structs (or I just wasn't able to wrangle the compiler in the right way). So I'm instead packing the values as integers (see the sketch after this list).
- Kernel parameters for a given template specialization should be specified exactly once and in a single place. With the code on master, the fact that there are multiple functions that together define a template specialization is a significant complication if the kernel parameters are not monotonically increasing/decreasing with e.g. the number of Q columns. The only exception is e.g. the use of thin or wide `mma.cuh` tiles, where one option is always preferable if it can be used. With the old system there was also the problem that it was possible to specify inconsistent parameters for host and device, and this resulted in multiple bugs.
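To make the second point concrete, here is a minimal sketch (hypothetical names and made-up values, not the actual PR code) of the packed-integer approach: because the packed config is a plain constexpr int, the unpacked values are integral constant expressions that can be used directly in `__launch_bounds__` and reused in device code.

```cpp
// Sketch only: pack nthreads and occupancy into one constexpr int.
static constexpr __host__ __device__ int fattn_sketch_cfg_packed(int DKQ, int DV, int ncols) {
    // bits 0..3: nthreads/32 - 1, bits 4..6: occupancy - 1 (values made up)
    return ncols >= 16 ? ((256/32 - 1) << 0) | ((1 - 1) << 4)
                       : ((128/32 - 1) << 0) | ((2 - 1) << 4);
}
static constexpr __host__ __device__ int fattn_sketch_nthreads(int DKQ, int DV, int ncols) {
    return (((fattn_sketch_cfg_packed(DKQ, DV, ncols) >> 0) & ((1 << 4) - 1)) + 1) * 32;
}
static constexpr __host__ __device__ int fattn_sketch_occupancy(int DKQ, int DV, int ncols) {
    return  ((fattn_sketch_cfg_packed(DKQ, DV, ncols) >> 4) & ((1 << 3) - 1)) + 1;
}

template <int DKQ, int DV, int ncols>
__global__ void __launch_bounds__(fattn_sketch_nthreads(DKQ, DV, ncols),
                                  fattn_sketch_occupancy(DKQ, DV, ncols))
fattn_sketch_kernel_packed(float * dst) {
    // The same helpers are evaluated at compile time in device code, so the launch
    // bounds and the kernel's internal tiling are defined in exactly one place.
    constexpr int nthreads = fattn_sketch_nthreads(DKQ, DV, ncols);
    if (threadIdx.x == 0) {
        dst[0] = (float) nthreads;
    }
}
```

The drawback, as discussed below, is that packing and unpacking by bit position is harder to read than named struct fields.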
I think what I suggested should satisfy your requirements, using bit-fields to make an int?
I didn't try the use of bit fields in particular but I was not able to package all relevant values into a struct in such a way that the ROCm compiler would accept it inside __launch_bounds__. I don't think the use of bit fields is relevant here.
If you are able to fit it into a uint32_t, you can fit it into a struct with bit-fields. It's simply replacing this code:
```cpp
return ((((nthreads)       /  32) - 1) <<  0) | \
       ((((occupancy)      /   1) - 1) <<  4) | \
       ((((nbatch_fa)      /  32) - 1) <<  7) | \
       ((((nbatch_K2)      /   8) - 1) << 10) | \
       ((((nbatch_V2)      /   8) - 1) << 17) | \
       ((((nbatch_combine) /   8) - 1) << 23) | \
       ((((nstages_target) /   1) - 1) << 28) | \
       (((Q_in_reg) ? 1 : 0)           << 29); \
```
with
```cpp
struct config {
    uint32_t nthreads       : 2;
    uint32_t occupancy      : 3;
    uint32_t nbatch_fa      : 3;
    uint32_t nbatch_k2      : 3;
    uint32_t nbatch_v2      : 7;
    uint32_t nbatch_combine : 6;
    uint32_t nstages_target : 5;
    uint32_t Q_in_reg       : 1;
};
```
(bit-widths might be wrong, just for illustrative purposes)
You can just return this struct inside that macro and read the values without shifting bits.
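For illustration, here is a sketch of what that read side looks like; the field widths are taken from the shift positions of the macro above, while the lookup values and names are made up:

```cpp
#include <cstdint>

// Sketch only: same fields as the packing macro, expressed as bit-fields so the
// whole config still fits into a single 32-bit value.
struct fattn_cfg_bits {
    uint32_t nthreads       : 4;  // nthreads/32 - 1
    uint32_t occupancy      : 3;  // occupancy - 1
    uint32_t nbatch_fa      : 3;  // nbatch_fa/32 - 1
    uint32_t nbatch_K2      : 7;  // nbatch_K2/8 - 1
    uint32_t nbatch_V2      : 6;  // nbatch_V2/8 - 1
    uint32_t nbatch_combine : 5;  // nbatch_combine/8 - 1
    uint32_t nstages_target : 1;  // nstages_target - 1
    uint32_t Q_in_reg       : 1;
};
static_assert(sizeof(fattn_cfg_bits) == sizeof(uint32_t), "config must stay one int wide");

// Hypothetical lookup, values made up for illustration:
constexpr fattn_cfg_bits fattn_sketch_get_cfg_bits(int DKQ, int DV, int ncols) {
    return {128/32 - 1, 2 - 1, 64/32 - 1, 64/8 - 1, 64/8 - 1, 64/8 - 1, 2 - 1, 1};
}

// Reading a value becomes a member access instead of shift-and-mask:
constexpr int fattn_sketch_nthreads_bits = (fattn_sketch_get_cfg_bits(64, 64, 8).nthreads + 1) * 32;
static_assert(fattn_sketch_nthreads_bits == 128, "example value");
```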
It's not a matter of whether you can fit the values in some number of bits; the struct would only be used at compile time and would be optimized out anyway. It's a matter of me not being able to use structs to define launch bounds at all: I was not able to define a constexpr function that returns either a template or a struct in a way that the ROCm compiler would accept for use in __launch_bounds__.
Thank you, I did a quick port of that pattern to the tile kernel where I originally had this issue and it seems to be working correctly. I'll take this over for the mma kernel as well. I think the reason it previously wasn't working for me was that I didn't use a constexpr constructor.
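For reference, a minimal sketch of the pattern described here (hypothetical names, reduced field set, made-up values): a config struct with a constexpr constructor, returned from a constexpr function and consumed in `__launch_bounds__`. Whether this exact form is accepted by a given ROCm version is an assumption based on the discussion above.

```cpp
// Sketch only, not the actual kernel config.
struct fattn_cfg {
    int nthreads;
    int occupancy;
    constexpr __host__ __device__ fattn_cfg(int nthreads, int occupancy)
        : nthreads(nthreads), occupancy(occupancy) {}
};

static constexpr __host__ __device__ fattn_cfg fattn_sketch_cfg(int DKQ, int DV, int ncols) {
    return ncols >= 16 ? fattn_cfg(256, 1) : fattn_cfg(128, 2);  // made-up values
}

template <int DKQ, int DV, int ncols>
__global__ void __launch_bounds__(fattn_sketch_cfg(DKQ, DV, ncols).nthreads,
                                  fattn_sketch_cfg(DKQ, DV, ncols).occupancy)
fattn_sketch_kernel(float * dst) {
    // The same compile-time config is visible inside the kernel:
    constexpr fattn_cfg cfg = fattn_sketch_cfg(DKQ, DV, ncols);
    if (threadIdx.x == 0) {
        dst[0] = (float) cfg.nthreads;
    }
}
```

In this form the parameters are still specified exactly once, but they are read back by name instead of by bit position.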
```cpp
template<int DKQ, int DV, int ncols1, int ncols2, int nwarps,
         bool use_logit_softcap, bool mla, bool needs_fixup, bool is_fixup, bool last_iter, bool oob_check,
         typename T_A_KQ, typename T_B_KQ, typename T_C_KQ, typename T_A_VKQ, typename T_B_VKQ, typename T_C_VKQ>
```
I think these should still be called tile_A_KQ etc. for readability
I was previously using names like that and decided to shorten them at some point precisely because otherwise some lines in the kernel would have poor readability due to being too long.
You can still keep them as tile_* and then use the short form via T_ aliases locally. Without an IDE it is not clear what T_A_KQ means from the template parameters alone, whereas most of the other parameters are clearer. Perhaps a traits struct like tile_config would also be good. But of course this is just for readability; we can refactor later as well.
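A small sketch of that naming suggestion (illustrative signature, not the PR's actual code):

```cpp
// Keep the descriptive tile_* names in the template parameter list and only
// introduce the short T_* spellings as local aliases where long MMA expressions
// would otherwise make lines too long.
template<int DKQ, int DV, int ncols1, int ncols2, int nwarps,
         typename tile_A_KQ, typename tile_B_KQ, typename tile_C_KQ,
         typename tile_A_VKQ, typename tile_B_VKQ, typename tile_C_VKQ>
static __device__ __forceinline__ void flash_attn_sketch_iter(/* ... */) {
    using T_A_KQ  = tile_A_KQ;   // short forms for use inside the kernel body
    using T_B_KQ  = tile_B_KQ;
    using T_C_KQ  = tile_C_KQ;
    using T_A_VKQ = tile_A_VKQ;
    using T_B_VKQ = tile_B_VKQ;
    using T_C_VKQ = tile_C_VKQ;
    // ... kernel body using the T_* aliases ...
}
```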
If I'm being perfectly honest, I've very rarely received feedback from Diego w.r.t. the low-level device code for the FA or GEMM kernels. So if you review only the more high-level code, that would already be largely equivalent to how things have gone until now.
Thanks, I would also like to do a best-effort review of the FA/GEMM kernels if you don't mind. Since I do occasionally modify that code, such reviews may be useful to you, but they are definitely useful to me in gaining understanding. For this PR, I just have the comments I gave earlier.
ggerganov left a comment:
Some benchmarks on the DGX Spark - TG is improved at higher contexts:
```sh
GGML_CUDA=ON ./scripts/compare-commits.sh master ec176eef7 llama-bench -m ./models/gpt-oss-20b/ggml-model-mxfp4.gguf -m ./models/gpt-oss-120b/ggml-model-mxfp4-00001-of-00003.gguf -m ./models/qwen3-30b-a3b-coder/ggml-model-q8_0.gguf -m ~/.cache/llama.cpp/ggml-org_gemma-3-4b-it-qat-GGUF_gemma-3-4b-it-qat-Q4_0.gguf -fa 1 -d 2048,4096,8192,16384,32768 -p 0 -n 32 -ub 2048 -mmp 0 -r 10
```

| Model | Test | t/s master | t/s ec176ee | Speedup |
|---|---|---|---|---|
| gemma3 4B Q4_0 | tg32@d2048 | 71.71 | 71.84 | 1.00 |
| gemma3 4B Q4_0 | tg32@d4096 | 72.85 | 73.19 | 1.00 |
| gemma3 4B Q4_0 | tg32@d8192 | 68.27 | 68.80 | 1.01 |
| gemma3 4B Q4_0 | tg32@d16384 | 66.78 | 67.62 | 1.01 |
| gemma3 4B Q4_0 | tg32@d32768 | 58.65 | 59.76 | 1.02 |
| gpt-oss 120B MXFP4 MoE | tg32@d2048 | 48.23 | 48.27 | 1.00 |
| gpt-oss 120B MXFP4 MoE | tg32@d4096 | 47.41 | 47.46 | 1.00 |
| gpt-oss 120B MXFP4 MoE | tg32@d8192 | 44.16 | 45.60 | 1.03 |
| gpt-oss 120B MXFP4 MoE | tg32@d16384 | 41.61 | 42.80 | 1.03 |
| gpt-oss 120B MXFP4 MoE | tg32@d32768 | 36.90 | 37.74 | 1.02 |
| gpt-oss 20B MXFP4 MoE | tg32@d2048 | 79.22 | 79.20 | 1.00 |
| gpt-oss 20B MXFP4 MoE | tg32@d4096 | 76.26 | 76.51 | 1.00 |
| gpt-oss 20B MXFP4 MoE | tg32@d8192 | 71.97 | 73.02 | 1.01 |
| gpt-oss 20B MXFP4 MoE | tg32@d16384 | 66.86 | 68.17 | 1.02 |
| gpt-oss 20B MXFP4 MoE | tg32@d32768 | 57.81 | 59.56 | 1.03 |
| qwen3moe 30B.A3B Q8_0 | tg32@d2048 | 56.37 | 56.41 | 1.00 |
| qwen3moe 30B.A3B Q8_0 | tg32@d4096 | 53.67 | 53.68 | 1.00 |
| qwen3moe 30B.A3B Q8_0 | tg32@d8192 | 44.10 | 47.24 | 1.07 |
| qwen3moe 30B.A3B Q8_0 | tg32@d16384 | 36.91 | 39.42 | 1.07 |
| qwen3moe 30B.A3B Q8_0 | tg32@d32768 | 28.02 | 29.48 | 1.05 |
* CUDA: generalized (mma) FA, add Volta support
* use struct for MMA FA kernel config

Co-authored-by: Aman Gupta <aman>
This PR makes the following changes to the CUDA FlashAttention code:

- The kernel no longer requires the mask to be padded in the `mask->ne[1]` direction. This is done by applying a modulo on the mask column that is being read so no conditional statements need to be evaluated (see the sketch after this list). The impact on performance is negligible and I do not deem it necessary to compile additional template specializations. See ggml : remove KQ mask padding #16309. cc @ggerganov.
- The `tile` template in `mma.cuh` has been extended with additional, optional arguments to safely handle situations where tiles of the same shape can have different physical data layouts.
- The kernel configuration is packed in such a way that it can be used in `__launch_bounds__` when using ROCm (as of right now ROCm is not used).
- The kernel no longer requires padding of `K->ne[1]`. As with the tile kernel, because this comes at a cost to performance it is still preferable to pad the KV cache length. As of right now this padding is still required to be 256; for the currently supported GPUs it should be possible to lower this to 128 without issue once the WMMA kernel has been completely replaced. For Hopper it may still make sense to have a padding of 256, but as it is I have no idea whether the 256x64 instruction would actually have better performance than the 128x64 instruction.
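A minimal sketch of the modulo trick mentioned in the first bullet (hypothetical names and indexing, not the actual kernel code):

```cpp
#include <cuda_fp16.h>

// Sketch only: read the KQ mask without requiring padding in the mask->ne[1]
// direction. Instead of guarding the load with a conditional, the column index is
// wrapped with a modulo so the access is always in bounds; values read for
// out-of-range columns never contribute to the written output.
static __device__ __forceinline__ half fattn_sketch_load_mask(
        const half * __restrict__ mask,   // KQ mask data
        const int    col,                 // requested mask column, may be >= ne1
        const int    row,                 // mask row within the current KV tile
        const int    ne1,                 // actual (unpadded) number of mask columns
        const int    stride_col) {        // elements between consecutive mask columns
    const int col_mod = col % ne1;        // wrap instead of branching
    return mask[(size_t) col_mod*stride_col + row];
}
```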
As of right now the interface in `mma.cuh` is suboptimal and long-term I intend to refactor it to allow the use of tensor cores in a more uniform way. However, I don't know the exact requirements until we have proper support for AMD WMMA and AMD MFMA instructions. So for now I think the correct choice is to prioritize getting working support for those at the cost of maintainability and to do a refactor afterwards.

V100 performance

Other GPU performance
The performance numbers assume that the KQ mask is no longer being padded. This change is also in this PR. I don't have a good overview of which other backends may still need support for this change and whether or not it should be reverted prior to merging.