
Conversation

@gabe-l-hart (Collaborator) commented Oct 16, 2025

Description

EDIT: This PR has now been updated to match the kernel signatures in master and #17584. CUMSUM was already added in #17305 and several more ops were added in #17063, so this PR now adds TRI, FILL, EXPM1, and SOFTPLUS for Metal.


This PR builds on some of the work by @pwilkin in #16095 and extends the CPU implementations of CUMSUM and TRI to Metal and CUDA. It also extends type support to F16 and BF16.

The goal of this PR is to establish these two ops in support of both the DELTA_NET op for Qwen3-Next and the chunked State Space Duality (SSD) implementation of SSM_SCAN for faster prefill.

I'm putting this up for review now in case it helps with the Qwen3-Next work and to get feedback on the kernels. I'm quite novice at kernel development, so I suspect others may find significant optimizations for both Metal and CUDA.
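
For reference, the scalar semantics of the ops being added here are roughly the following. This is a minimal sketch only: it ignores strides, views, and F16/BF16 handling, and is not the ggml implementation itself.

```cpp
// Scalar reference semantics of the ops listed above; illustration only,
// not the actual ggml kernels.
#include <cmath>
#include <cstdint>

static float op_expm1(float x)    { return std::expm1(x); }            // exp(x) - 1, accurate near 0
static float op_softplus(float x) { return std::log1p(std::exp(x)); }  // log(1 + exp(x)); naive form,
                                                                        // a real kernel would use a
                                                                        // numerically stable variant
static float op_fill(float c)     { return c; }                        // every output element becomes c
// TRI keeps one triangle of the trailing 2D slice and zeroes the rest,
// e.g. a lower-triangular variant:
static float op_tri_lower(float x, int64_t row, int64_t col) {
    return col <= row ? x : 0.0f;
}
```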

@github-actions bot added the testing, Nvidia GPU, examples, ggml, and Apple Metal labels Oct 16, 2025
@gabe-l-hart mentioned this pull request Oct 16, 2025
@gabe-l-hart (Collaborator, Author)

Yikes, it looks like there are some alternate platforms that will need to be handled. I'll dig through these failures.

@JohannesGaessler (Collaborator)

My opinion is that we should assert that the CPU implementation is correct before we move towards reviewing and merging this PR.

@pwilkin (Collaborator) commented Oct 17, 2025

> My opinion is that we should assert that the CPU implementation is correct before we move towards reviewing and merging this PR.

I think @gabe-l-hart moved the TRI and CUMSUM CPU implementations to this PR as well, so I guess it's a question of adding some test cases?

Those ops aren't very hard and as far as basic correctness goes I think I've verified them quite extensively during my fights with Qwen3 Next.

I've mimicked the basic logic for CUMSUM to be the same as SUM_ROWS, so it's basically always done on the first dimension.
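
For illustration, the behavior I'm describing is roughly this (contiguous F32 case only; a sketch, not the actual ggml kernel):

```cpp
// Like SUM_ROWS, CUMSUM runs along the first dimension (ne[0]) of each row.
#include <cstdint>

static void cumsum_rows_f32(const float * src, float * dst, int64_t ne0, int64_t nrows) {
    for (int64_t r = 0; r < nrows; ++r) {
        float acc = 0.0f;
        for (int64_t i = 0; i < ne0; ++i) {
            acc += src[r*ne0 + i];  // running sum within the row
            dst[r*ne0 + i] = acc;
        }
    }
}
```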

@JohannesGaessler (Collaborator)

The thing is though that as of right now those ops aren't used anywhere on master. My opinion is that the new ops should be added in tandem with the model that needs them just in case it turns out that further changes are needed (even if it's unlikely).

@gabe-l-hart (Collaborator, Author)

@JohannesGaessler That makes total sense. I'm continuing to work towards the SSD formulation for SSM_SCAN, so this PR is really just a checkpoint for the primitive ops. My goal is better performance for Granite 4 which is why I went through to Metal and CUDA here.

@giuseppe (Contributor)

> @JohannesGaessler That makes total sense. I'm continuing to work towards the SSD formulation for SSM_SCAN,

@gabe-l-hart FYI, I've worked on an implementation of SSM_SCAN for the Vulkan backend in #16463, and that indeed helped with Granite 4.

@gabe-l-hart (Collaborator, Author) commented Oct 17, 2025

@giuseppe I saw that yesterday! That support will help a ton. The SSD formulation is an additional optimization on top of the recurrent formulation that should have big benefits for prefill with long context. It's mathematically equivalent, but much more efficient. The challenge I'm working through right now is how best to decompose the problem to minimize impact. One option is to write it as a sub-graph composed of smaller ops that is used when the sequence length is > 1, but this would have two problems:

  1. It would break the ability to reuse the graph objects across generation steps

  2. It would make backends that don't support the primitive ops (CUMSUM and TRI) fall back to CPU (exactly what you're trying to fix in your PR for SSM_SCAN)

The alternative is to implement it inside a single backend's SSM_SCAN. This has the advantage of being self-contained so it can be done on a per-backend basis, but it has the inverse problem of requiring each backend to implement it separately in order to get the performance boost. It's also much harder to write as a single kernel since it involves allocating temporary tensors of different size than the input or output tensors.
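
For illustration, here is a rough, entirely hypothetical sketch of what the first option (the sub-graph of primitive ops) could look like at graph-build time; the builder helper names are placeholders, not actual llama.cpp/ggml functions:

```cpp
// Hypothetical sketch (not actual llama.cpp code): use the chunked SSD
// sub-graph only for prefill (sequence length > 1) and keep the recurrent
// SSM_SCAN op for single-token decode. Composing the SSD path out of
// primitive ops (CUMSUM, TRI, ...) is what would force a CPU fallback on
// backends that lack them, and branching on sequence length is what would
// break graph reuse across generation steps.
#include "ggml.h"

// placeholder declarations, for illustration only
struct ggml_tensor * build_ssm_scan_ssd      (struct ggml_context * ctx, struct ggml_tensor * x);
struct ggml_tensor * build_ssm_scan_recurrent(struct ggml_context * ctx, struct ggml_tensor * x);

static struct ggml_tensor * build_ssm_scan(struct ggml_context * ctx, struct ggml_tensor * x, int64_t n_seq_tokens) {
    if (n_seq_tokens > 1) {
        return build_ssm_scan_ssd(ctx, x);       // chunked SSD form for long prefill
    }
    return build_ssm_scan_recurrent(ctx, x);     // existing recurrent scan for decode
}
```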

@fat-tire (Contributor)

Dumb question, but can the CUDA gated deltanet code be optimized via Vidrial? It supposedly does autotuning by trying various configurations to find an ideal one... (?)

@JohannesGaessler (Collaborator)

One of the core goals of llama.cpp/ggml is to avoid external dependencies if at all possible. These kernels are not going to be performance-critical, so they should just be plain CUDA code no matter what. For the performance-critical kernels where I am the main developer and maintainer, my opinion is that the code should be as low-level as possible and avoid external dependencies, as that makes debugging easier.

@gabe-l-hart mentioned this pull request Nov 3, 2025
@CarlGao4

Is this PR required for Qwen3-Next CUDA support? Or have all of these been implemented in #17063?

@CISC (Collaborator) commented Nov 28, 2025

> Is this PR required for Qwen3-Next CUDA support? Or have all of these been implemented in #17063?

CUDA support is still missing for CUMSUM and TRI, so they will fall back to CPU.

@pwilkin (Collaborator) commented Nov 28, 2025

@gabe-l-hart are you planning to adapt this? If you don't have the time, I can take over and make it compatible with the current master.

@gabe-l-hart (Collaborator, Author)

@pwilkin if you're on a roll, you're welcome to go for it. I can get to it next week sometime if you don't get to it first.

@pwilkin (Collaborator) commented Nov 28, 2025

@gabe-l-hart Aight, might just do the CUDA kernels since I have neither the know-how nor the ability to test Metal kernels ;)

@gabe-l-hart (Collaborator, Author)

Deal, I can definitely tackle those

@wsbagnsv1 commented Nov 28, 2025

> @gabe-l-hart Aight, might just do the CUDA kernels since I have neither the know-how nor the ability to test Metal kernels ;)

I'm open to using my framework to optimize the CUDA ones once again after your initial implementation, though I don't have Metal hardware to test the Metal ones either 😅

@gabe-l-hart (Collaborator, Author)

It looks like @ggerganov added CUMSUM support for Metal in #17305, so for Metal I think GGML_OP_TRI is the only outstanding piece from this PR.

@gabe-l-hart (Collaborator, Author)

@pwilkin I also see ggml_fill_inplace used in qwen3next, and I don't see GGML_OP_FILL in any backends besides CPU and Vulkan. Should we get that implemented for Metal as well? It seems like it's only used at the top of the model build (here), so maybe those tensors are closer to input tensors and don't need to be computed on the device?

Commits pushed to branch ggml-cumsum-tri (Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>):
- "The kernel does not work and is not optimized, but the code compiles and runs, so this will be the starting point now that the core op has been merged."
- "This was added in the original draft, but later removed. With this, the kernel now passes tests."
- "…n kernel"
@gabe-l-hart changed the title from "ggml: CUMSUM and TRI (CPU, Metal, CUDA)" to "metal: TRI" Dec 3, 2025
@gabe-l-hart (Collaborator, Author)

@ggerganov @JohannesGaessler I've updated this to only do TRI for Metal, since CUDA is being handled in #17584 and CUMSUM was already added in #17305. I did the simplest optimization of removing the conditional from the kernel and dispatching it via a template, but I'm not sure who the deep Metal expert is who might be able to help optimize further. Is there a correct reviewer for Metal PRs?
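
For context, the template dispatch I mean looks roughly like the sketch below. It is illustrative only; the argument layout, variant names, and grid mapping (one threadgroup per row of a 2D tensor) are assumptions, not the actual kernel in this PR.

```metal
// Sketch: compile the TRI variant into the kernel instead of branching on it
// per element.
#include <metal_stdlib>
using namespace metal;

enum tri_type : int {
    TRI_LOWER = 0,
    TRI_UPPER = 1,
};

template <typename T, tri_type tt>
kernel void kernel_tri(
        device const T * src   [[buffer(0)]],
        device       T * dst   [[buffer(1)]],
        constant int32_t & ne0 [[buffer(2)]],
        uint row   [[threadgroup_position_in_grid]],
        uint tpitg [[thread_position_in_threadgroup]],
        uint ntg   [[threads_per_threadgroup]]) {
    for (int32_t col = tpitg; col < ne0; col += ntg) {
        // resolved at compile time: no runtime conditional on the tri type
        const bool keep = (tt == TRI_LOWER) ? (col <= (int32_t) row) : (col >= (int32_t) row);
        dst[row*ne0 + col] = keep ? src[row*ne0 + col] : (T) 0;
    }
}

typedef decltype(kernel_tri<float, TRI_LOWER>) kernel_tri_t;

template [[host_name("kernel_tri_f32_lower")]] kernel kernel_tri_t kernel_tri<float, TRI_LOWER>;
template [[host_name("kernel_tri_f32_upper")]] kernel kernel_tri_t kernel_tri<float, TRI_UPPER>;
```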

Commit pushed (Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>, Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>)
@gabe-l-hart (Collaborator, Author)

It looks like Metal is also missing FILL, SOFTPLUS, and EXPM1 for full support of the ops added in #17063. I'll try to add them here as well.

@gabe-l-hart (Collaborator, Author) commented Dec 3, 2025

@pwilkin I see that in #17063 FILL is added but NOT as unary. Was that done for efficiency?

Never mind, I figured it out!

Commits pushed to branch ggml-cumsum-tri (Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>)
@gabe-l-hart changed the title from "metal: TRI" to "metal: TRI, FILL, EXPM1, SOFTPLUS" Dec 3, 2025
@ggerganov (Member) left a comment


Need to adapt to the changes after #17739

Further commits pushed to branch ggml-cumsum-tri, including merges of the latest origin/master (Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>)
@gabe-l-hart (Collaborator, Author)

OK, I think I've got things squared away with #17739 now. Thanks for pointing that out, @ggerganov!

@ggerganov merged commit bde188d into ggml-org:master Dec 4, 2025
65 of 69 checks passed
@gabe-l-hart deleted the ggml-cumsum-tri branch December 4, 2025 17:31
@CISC (Collaborator) commented Dec 4, 2025

Could someone regen the BLAS and Metal CSVs/ops.md (not only to add new ops, but to clear out fusion ops)?

@gabe-l-hart (Collaborator, Author)

🤦 I even thought to do that yesterday, then got distracted. I'll get a separate PR up for it shortly.

