
Conversation

@am17an
Collaborator

@am17an am17an commented Dec 1, 2025

GGML_CUDA_GRAPH_OPT=1 is broken with any tensor-offloading option such as n-cpu-moe, because we just copy the graph back to enable fusion within streams. This PR only re-orders nodes within streams.

Also, because we don't use CUDA graphs for hybrid inference at the moment, GGML_CUDA_GRAPH_OPT=1 is slower than not using it. This may change in the future.

@github-actions github-actions bot added Nvidia GPU Issues specific to Nvidia GPUs ggml changes relating to the ggml tensor library for machine learning labels Dec 1, 2025
@wishstudio
Contributor

wishstudio commented Dec 1, 2025

I'm trying to merge master into my PR #16548 and got hit hard by this 😭

Is it possible to avoid changing the cgraph in evaluate_and_capture_cuda_graph? The graph plan API generally takes a const struct ggml_cgraph and does not expect the backend to change the cgraph halfway through (which is good design IMO).

Do the node changes need to persist after returning from this function? If not, isn't it easier to create a temporary cgraph for use inside this function and discard it before returning? struct ggml_cgraph is just a few pointers, so duplicating it won't affect performance at all.

It is of course possible to hack around this in my PR, but I feel the current solution is already a bit hacky and I would like to avoid a hack on top of a hack.

@am17an
Collaborator Author

am17an commented Dec 1, 2025

@wishstudio for now you should just not use this path in your PR (it is not enabled by default)

@am17an
Collaborator Author

am17an commented Dec 1, 2025

BTW @wishstudio, do you still face an issue after this PR? I think it solves your particular problem.

@wishstudio
Contributor

wishstudio commented Dec 1, 2025 via email

@JohannesGaessler
Collaborator

@wishstudio Sorry, I was not aware of your efforts or that the recently merged approach would cause problems for you. As I'm sure you're aware, Diego is currently on hiatus, so pull requests such as yours that touch his work will probably not receive good attention from maintainers. I'm already operating at full capacity, but I intend to also do work on the ggml backends. When I do, I'll take a look at the ggml graph plan API and see if I can review your work (definitely no promises though). FWIW, I don't consider the current implementation for concurrent CUDA streams to be something we should use long-term, and I agree that ggml backends should not change ggml graphs if at all possible.

@am17an
Collaborator Author

am17an commented Dec 2, 2025

@wishstudio I'll take a look at your PR and try to review it as well

@am17an am17an merged commit ed32089 into ggml-org:master Dec 2, 2025
72 of 74 checks passed
@am17an am17an deleted the cuda_graph_opt_cpu_moe branch December 2, 2025 04:36
@wishstudio
Contributor

@am17an After you merged this PR I updated mine and it is working fine now. Thanks!

@JohannesGaessler Thank you for the explanation, and I appreciate your stance on code quality. Please take your time! I think that PR also enables more graph-based optimizations to be done in all the backends, but it still needs some design work. The annoying part at the moment is that I did some refactoring and the diff got all messed up, so I get conflicts every time evaluate_and_capture_cuda_graph is changed.

@am17an
Collaborator Author

am17an commented Dec 3, 2025

BTW, to be clear so there's no confusion: the nodes don't change "halfway through". They are captured by the CUDA graph and remain the same throughout, until the graph is captured again.

This PR fixed a bug that occurred when the size of the cgraph changes because splits are added for CPU MoE. I will add code to disable this when CUDA graphs cannot be used, as it leads to a performance drop anyway.

Reordering or skipping nodes is allowed in graph compute under the correct conditions. The earlier PR plus this PR do not break this paradigm.
