
Conversation

@gabe-l-hart (Collaborator) commented Oct 16, 2025

Description

EDIT: This PR has now been updated to match the kernel signatures in master and #17584. CUMSUM was already added in #17305 and several more ops were added in #17063, so this PR now adds TRI, FILL, EXPM1, and SOFTPLUS for Metal.


This PR builds on some of the work by @pwilkin in #16095 and extends the CPU implementations of CUMSUM and TRI to Metal and CUDA. It also extends type support to F16 and BF16.

The goal of this PR is to establish these two ops in support of both the DELTA_NET op for Qwen3-Next and the chunked State Space Duality (SSD) implementation of SSM_SCAN for faster prefill.

I'm putting this up for review now in case it helps with the Qwen3-Next work and to get feedback on the kernels. I'm quite novice at kernel development, so I suspect others may find significant optimizations for both Metal and CUDA.
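
For reference, the scalar semantics of the ops being added here are roughly the following. This is a minimal sketch only: it ignores strides, views, and F16/BF16 handling, and is not the ggml implementation itself.

```cpp
// Scalar reference semantics of the ops listed above; illustration only,
// not the actual ggml kernels.
#include <cmath>
#include <cstdint>

static float op_expm1(float x)    { return std::expm1(x); }            // exp(x) - 1, accurate near 0
static float op_softplus(float x) { return std::log1p(std::exp(x)); }  // log(1 + exp(x)); naive form,
                                                                        // a real kernel would use a
                                                                        // numerically stable variant
static float op_fill(float c)     { return c; }                        // every output element becomes c
// TRI keeps one triangle of the trailing 2D slice and zeroes the rest,
// e.g. a lower-triangular variant:
static float op_tri_lower(float x, int64_t row, int64_t col) {
    return col <= row ? x : 0.0f;
}
```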

@github-actions bot added the testing, Nvidia GPU, examples, ggml, and Apple Metal labels Oct 16, 2025
@gabe-l-hart mentioned this pull request Oct 16, 2025
@gabe-l-hart (Collaborator, Author)

Yikes, it looks like there are some alternate platforms that will need to be handled. I'll dig through these failures.

@JohannesGaessler (Collaborator)

My opinion is that we should assert that the CPU implementation is correct before we move towards reviewing and merging this PR.

@pwilkin (Collaborator) commented Oct 17, 2025

> My opinion is that we should assert that the CPU implementation is correct before we move towards reviewing and merging this PR.

I think @gabe-l-hart moved the TRI and CUMSUM CPU implementations to this PR as well, so I guess it's a question of adding some test cases?

Those ops aren't very hard and as far as basic correctness goes I think I've verified them quite extensively during my fights with Qwen3 Next.

I've mimicked the basic logic for CUMSUM to be the same as SUM_ROWS, so it's basically always done on the first dimension.
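
For illustration, the behavior I'm describing is roughly this (contiguous F32 case only; a sketch, not the actual ggml kernel):

```cpp
// Like SUM_ROWS, CUMSUM runs along the first dimension (ne[0]) of each row.
#include <cstdint>

static void cumsum_rows_f32(const float * src, float * dst, int64_t ne0, int64_t nrows) {
    for (int64_t r = 0; r < nrows; ++r) {
        float acc = 0.0f;
        for (int64_t i = 0; i < ne0; ++i) {
            acc += src[r*ne0 + i];  // running sum within the row
            dst[r*ne0 + i] = acc;
        }
    }
}
```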

@JohannesGaessler (Collaborator)

The thing is though that as of right now those ops aren't used anywhere on master. My opinion is that the new ops should be added in tandem with the model that needs them just in case it turns out that further changes are needed (even if it's unlikely).

@gabe-l-hart (Collaborator, Author)

@JohannesGaessler That makes total sense. I'm continuing to work towards the SSD formulation for SSM_SCAN, so this PR is really just a checkpoint for the primitive ops. My goal is better performance for Granite 4 which is why I went through to Metal and CUDA here.

@giuseppe (Contributor)

> @JohannesGaessler That makes total sense. I'm continuing to work towards the SSD formulation for SSM_SCAN,

@gabe-l-hart FYI, I've worked on an implementation of SSM_SCAN for the Vulkan backend in #16463, and that indeed helped with Granite 4.

@gabe-l-hart (Collaborator, Author) commented Oct 17, 2025

@giuseppe I saw that yesterday! That support will help a ton. The SSD formulation is an additional optimization on top of the recurrent formulation that should have big benefits for prefill with long context. It's mathematically equivalent, but much more efficient. The challenge I'm working through right now is how best to decompose the problem to minimize impact. One option is to write it as a sub-graph composed of smaller ops that is used when the sequence length is > 1, but this would have two problems:

  1. It would break the ability to reuse the graph objects across generation steps

  2. It would make backends that don't support the primitive ops (CUMSUM and TRI) fall back to CPU (exactly what you're trying to fix in your PR for SSM_SCAN)

The alternative is to implement it inside a single backend's SSM_SCAN. This has the advantage of being self-contained so it can be done on a per-backend basis, but it has the inverse problem of requiring each backend to implement it separately in order to get the performance boost. It's also much harder to write as a single kernel since it involves allocating temporary tensors of different size than the input or output tensors.
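
For illustration, here is a rough, entirely hypothetical sketch of what the first option (the sub-graph of primitive ops) could look like at graph-build time; the builder helper names are placeholders, not actual llama.cpp/ggml functions:

```cpp
// Hypothetical sketch (not actual llama.cpp code): use the chunked SSD
// sub-graph only for prefill (sequence length > 1) and keep the recurrent
// SSM_SCAN op for single-token decode. Composing the SSD path out of
// primitive ops (CUMSUM, TRI, ...) is what would force a CPU fallback on
// backends that lack them, and branching on sequence length is what would
// break graph reuse across generation steps.
#include "ggml.h"

// placeholder declarations, for illustration only
struct ggml_tensor * build_ssm_scan_ssd      (struct ggml_context * ctx, struct ggml_tensor * x);
struct ggml_tensor * build_ssm_scan_recurrent(struct ggml_context * ctx, struct ggml_tensor * x);

static struct ggml_tensor * build_ssm_scan(struct ggml_context * ctx, struct ggml_tensor * x, int64_t n_seq_tokens) {
    if (n_seq_tokens > 1) {
        return build_ssm_scan_ssd(ctx, x);       // chunked SSD form for long prefill
    }
    return build_ssm_scan_recurrent(ctx, x);     // existing recurrent scan for decode
}
```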

@fat-tire (Contributor)

Dumb question, but can the CUDA gated deltanet code be optimized via Vidrial? It supposedly does autotuning by trying various configurations to find an ideal one... (?)

@JohannesGaessler (Collaborator)

One of the core goals of llama.cpp/ggml is to avoid external dependencies if at all possible. These kernels are not going to be performance-critical, so they should just be plain CUDA code no matter what. For the performance-critical kernels where I am the main developer and maintainer, my opinion is that the code should be as low-level as possible and avoid external dependencies, as that makes debugging easier.

@gabe-l-hart mentioned this pull request Nov 3, 2025
@CarlGao4

Is this PR required for Qwen3-Next CUDA support? Or have all of these been implemented in #17063?

@CISC (Collaborator) commented Nov 28, 2025

> Is this PR required for Qwen3-Next CUDA support? Or have all of these been implemented in #17063?

CUDA support is still missing for CUMSUM and TRI, so they will fall back to CPU.

@pwilkin (Collaborator) commented Nov 28, 2025

@gabe-l-hart are you planning to adapt this? If you don't have the time, I can take over and make it compatible with the current master.

@gabe-l-hart (Collaborator, Author)

@pwilkin if you're on a roll, you're welcome to go for it. I can get to it next week sometime if you don't get to it first.

@pwilkin (Collaborator) commented Nov 28, 2025

@gabe-l-hart Aight, might just do the CUDA kernels since I have neither the know-how nor the ability to test Metal kernels ;)

@gabe-l-hart (Collaborator, Author)

Deal, I can definitely tackle those

@wsbagnsv1 commented Nov 28, 2025

> @gabe-l-hart Aight, might just do the CUDA kernels since I have neither the know-how nor the ability to test Metal kernels ;)

I'm open to using my framework to optimize the CUDA ones once again after your initial implementation, though I don't have Metal hardware to test the Metal ones either 😅

@gabe-l-hart (Collaborator, Author)

It looks like @ggerganov added CUMSUM support for Metal in #17305, so for Metal I think GGML_OP_TRI is the only outstanding piece from this PR.

@gabe-l-hart (Collaborator, Author)

@pwilkin I also see ggml_fill_inplace used in qwen3next, and I don't see GGML_OP_FILL in any backends besides CPU and Vulkan. Should we get that implemented for Metal as well? It seems like it's only used at the top of the model build (here), so maybe those tensors are closer to input tensors and don't need to be computed on the device?

Commits pushed to branch ggml-cumsum-tri (Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>):
- "The kernel does not work and is not optimized, but the code compiles and runs, so this will be the starting point now that the core op has been merged."
- "This was added in the original draft, but later removed. With this, the kernel now passes tests."
- "…n kernel"
@gabe-l-hart changed the title from "ggml: CUMSUM and TRI (CPU, Metal, CUDA)" to "metal: TRI" Dec 3, 2025
@gabe-l-hart (Collaborator, Author)

@ggerganov @JohannesGaessler I've updated this to only do TRI for Metal, since CUDA is being handled in #17584 and CUMSUM was already added in #17305. I did the simplest optimization of removing the conditional from the kernel and dispatching it via a template, but I'm not sure who the deep Metal expert is who might be able to help optimize further. Is there a correct reviewer for Metal PRs?
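
For context, the template dispatch I mean looks roughly like the sketch below. It is illustrative only; the argument layout, variant names, and grid mapping (one threadgroup per row of a 2D tensor) are assumptions, not the actual kernel in this PR.

```metal
// Sketch: compile the TRI variant into the kernel instead of branching on it
// per element.
#include <metal_stdlib>
using namespace metal;

enum tri_type : int {
    TRI_LOWER = 0,
    TRI_UPPER = 1,
};

template <typename T, tri_type tt>
kernel void kernel_tri(
        device const T * src   [[buffer(0)]],
        device       T * dst   [[buffer(1)]],
        constant int32_t & ne0 [[buffer(2)]],
        uint row   [[threadgroup_position_in_grid]],
        uint tpitg [[thread_position_in_threadgroup]],
        uint ntg   [[threads_per_threadgroup]]) {
    for (int32_t col = tpitg; col < ne0; col += ntg) {
        // resolved at compile time: no runtime conditional on the tri type
        const bool keep = (tt == TRI_LOWER) ? (col <= (int32_t) row) : (col >= (int32_t) row);
        dst[row*ne0 + col] = keep ? src[row*ne0 + col] : (T) 0;
    }
}

typedef decltype(kernel_tri<float, TRI_LOWER>) kernel_tri_t;

template [[host_name("kernel_tri_f32_lower")]] kernel kernel_tri_t kernel_tri<float, TRI_LOWER>;
template [[host_name("kernel_tri_f32_upper")]] kernel kernel_tri_t kernel_tri<float, TRI_UPPER>;
```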

Commit pushed (Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>, Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>)
@gabe-l-hart (Collaborator, Author)

It looks like Metal is also missing FILL, SOFTPLUS, and EXPM1 for full support of the ops added in #17063. I'll try to add them here as well.

@gabe-l-hart (Collaborator, Author) commented Dec 3, 2025

@pwilkin I see that in #17063 FILL is added but NOT as unary. Was that done for efficiency?

Never mind, I figured it out!

Commits pushed to branch ggml-cumsum-tri (Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>)
@gabe-l-hart changed the title from "metal: TRI" to "metal: TRI, FILL, EXPM1, SOFTPLUS" Dec 3, 2025
@ggerganov (Member) left a comment


Need to adapt to the changes after #17739

Further commits pushed to branch ggml-cumsum-tri, including merges of the latest origin/master (Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>)
@gabe-l-hart (Collaborator, Author)

OK, I think I've got things squared away with #17739 now. Thanks for pointing that out, @ggerganov!

@ggerganov merged commit bde188d into ggml-org:master Dec 4, 2025
65 of 69 checks passed
@gabe-l-hart deleted the ggml-cumsum-tri branch December 4, 2025 17:31
@CISC (Collaborator) commented Dec 4, 2025

Could someone regen the BLAS and Metal CSVs/ops.md (not only to add new ops, but to clear out fusion ops)?

@gabe-l-hart (Collaborator, Author)

🤦 I even thought to do that yesterday, then got distracted. I'll get a separate PR up for it shortly.

