vulkan: Make Vulkan optional at runtime (#11493). #11494
Conversation
ggml/src/ggml-vulkan/ggml-vulkan.cpp
Outdated
Would be good for this to explicitly say something like "will fall back to CPU".
It would be a good idea, but a single backend doesn't know what an application using four different backends at the same time will do.
We could adapt `register_backend` in ggml/src/ggml-backend-reg.cpp

```cpp
void register_backend(ggml_backend_reg_t reg, dl_handle_ptr handle = nullptr) {
    if (!reg) {
        return;
    }
#ifndef NDEBUG
    GGML_LOG_DEBUG("%s: registered backend %s (%zu devices)\n",
                   __func__, ggml_backend_reg_name(reg), ggml_backend_reg_dev_count(reg));
#endif
    backends.push_back({ reg, std::move(handle) });
    for (size_t i = 0; i < ggml_backend_reg_dev_count(reg); i++) {
        register_device(ggml_backend_reg_dev_get(reg, i));
    }
}
```

so that `ggml_backend_reg_dev_count(reg)` returning 0 is interpreted as "don't use this backend". If, after trying every backend, we have registered no device at all, we could then report that we are running CPU-only.
Some backends may intentionally have zero devices; for example, the RPC backend does not have a device list by itself, its devices need to be created by the user. However, returning NULL from the reg function for backends where this is not possible can be more efficient, since it will cause the backend to be unloaded completely when using GGML_BACKEND_DL. So that would be the preferred option.
For what it's worth, when I enable GGML_BACKEND_DL (in addition to GGML_VULKAN), the Vulkan backend file disappears entirely from the installation.
I think libggml-vulkan.so moves from lib to bin (p.s. it should be lib instead, no?) and cmake install doesn't know about that, or something.
Reading through ggml-vulkan.cpp, it seems the intention is to late-bind exactly which Vulkan instance to use (defer the decision as long as possible, which right now is not long at all). There's a mysterious comment:

```cpp
// Should be changed to return device-specific host buffer type
// but that probably requires changes in llama.cpp
```
> I think libggml-vulkan.so moves from lib to bin and cmake install doesn't know about that, or something.
When GGML_BACKEND_DL is enabled, backends are built as MODULE targets instead of libraries, and one of the consequences is that they go into the RUNTIME install directory instead. It's not very clear where they should be installed: currently ggml only looks for backends in the same directory as the executable, so for this to even work, they would need to be installed in the bin directory, which is not great. So at the moment this is only useful for applications that handle backend loading themselves, not as installable libraries.
That comment is just about host buffers, not immediately relevant to this.
The intention was not to defer the decision; at the time of writing it was unclear which function would get called first, so there are a number of entry points that trigger initializing the instance. Not sure if that has changed.
I think it's a good idea to leave a message about not having found any Vulkan devices or failing to initialize the instance, but you should probably use the GGML debug macro for that instead of piping to std::cerr.
That was probably written before the device/reg interfaces were added. Now the device interface has a function to obtain a host buffer type for that device, so ideally it should be implemented so that each device returns the correct host buffer type. llama.cpp at the moment only uses the host buffer of the first device in the list of devices (which may not be the default device if the user passes the -dev argument).
ggml/src/ggml-vulkan/ggml-vulkan.cpp
Outdated
Seems like a failure here will leave things in a weird partially-initialized state.
We could special-case just the one exception, vk::IncompatibleDriverError, that happens here, under the assumption that it won't leave things in a partially-initialized state. What do you think?
The idea is that if the returned device count is 0, nobody will bother that backend again. So, partially initialized or not, it won't be used.
Now, if it left the GPU in a partially initialized state and other backends failed to use that GPU because of it, in my opinion that would be a Vulkan bug.
There are two expected cases, failure to initialize the Vulkan instance (issue with the loader) or no devices found (but an instance was created). It would probably be good to handle those inside of ggml_vk_instance_init(), to be able to clean up the instance if one was created.
Is it better? What's your use case? I'm not opposed to this in principle, but it also isn't immediately problematic that the Vulkan backend requires Vulkan and a Vulkan-compatible device. Did you check how other backends handle this case?
Better than crashing before even reading the configuration? I think so.
The use case is that distributions can package llama.cpp once, rather than creating 2^6 different packages for the different enable/disable backend combinations.
I did not check that yet. Could someone with the respective backend already compiled in please try running
The intention is to allow builds with multiple backends and let the application determine which ones to use at runtime. If a backend cannot work on the current system, it must return null from the reg function, or return zero devices; it must absolutely not crash the application.
That makes sense; I'm still used to the separated builds. Does running multiple backends together already work? That leads to another question, too: which backend takes priority? How do you avoid using the same device twice with two backends?
It does work, especially with
This is not solved yet, and that's one of the reasons we still aren't distributing unified builds with multiple backends. However, the user can manually specify which backend/devices to use with the
Force-pushed from 78610e7 to bc7f4bb
I changed it to initialize on reg. Tested it, and it still works.
slaren left a comment
I have not tested the changes, but the logic looks correct.
Co-authored-by: Jeff Bolz <jbolz@nvidia.com>
0cc4m left a comment
Thank you, LGTM
If -ngl 0 is selected, should we try to trigger this too? I'm getting reports that ggml-cpu is faster than Vulkan with -ngl 0...
No, definitely not. There are cases where that may happen, but in most cases it is advantageous to run large matmul ops on the GPU, and that is what happens when you run a GPU backend with -ngl 0.
The kinds of cases I mean are ones where the Vulkan device is actually a CPU:
Most of those cases should be covered: CPU devices do not get picked by default unless there is no other device available, and Vulkan CPU devices are rare; I'm only aware of llvmpipe on Linux. But you are right, for that case there is an argument for defaulting to no offload. We currently use llvmpipe for the CI test-backend-ops run, but that can still be done by manually selecting the device. If you want to submit a PR that changes that behaviour, that's okay with me.
Seeing containers/ramalama#1479, I now understand more clearly what you meant; please link the related issue next time. I'll take care of it. 👍
Currently, if the Vulkan backend is enabled but Vulkan is not actually available at runtime, it will crash:
Better to just fall back to CPU. That is what this PR does.