vulkan: Make Vulkan optional at runtime (#11493). #11494
Conversation
ggml/src/ggml-vulkan/ggml-vulkan.cpp
Outdated
Would be good for this to explicitly say something like "will fall back to CPU".
It would be a good idea, but a single backend doesn't know what an application using four different backends at the same time will do.
We could adapt `register_backend` in ggml/src/ggml-backend-reg.cpp

```cpp
void register_backend(ggml_backend_reg_t reg, dl_handle_ptr handle = nullptr) {
    if (!reg) {
        return;
    }
#ifndef NDEBUG
    GGML_LOG_DEBUG("%s: registered backend %s (%zu devices)\n",
                   __func__, ggml_backend_reg_name(reg), ggml_backend_reg_dev_count(reg));
#endif
    backends.push_back({ reg, std::move(handle) });
    for (size_t i = 0; i < ggml_backend_reg_dev_count(reg); i++) {
        register_device(ggml_backend_reg_dev_get(reg, i));
    }
}
```

so that `ggml_backend_reg_dev_count(reg)` returning 0 is interpreted as "don't use this backend". If, after trying every backend, we have registered no device at all, we could then report that we are running CPU-only.
Some backends may intentionally have zero devices; for example, the RPC backend does not have a device list by itself, its devices need to be created by the user. However, returning NULL from the reg function for backends where this is not possible can be more efficient, since it will cause the backend to be unloaded completely when using GGML_BACKEND_DL. So that would be the preferred option.
For what it's worth, when I enable GGML_BACKEND_DL (in addition to GGML_VULKAN), the Vulkan backend file disappears entirely from the installation.
I think libggml-vulkan.so moves from lib to bin (p.s. it should be lib instead, no?) and cmake install doesn't know about that, or something.
Reading through ggml-vulkan.cpp, it seems the intention is to late-bind exactly which Vulkan instance to use (defer the decision as long as possible, which right now is not long at all). There's a mysterious comment:

```cpp
// Should be changed to return device-specific host buffer type
// but that probably requires changes in llama.cpp
```
> I think libggml-vulkan.so moves from lib to bin and cmake install doesn't know about that, or something.
When GGML_BACKEND_DL is enabled, backends are built as MODULE targets instead of libraries, and one of the consequences is that they go into the RUNTIME install directory instead. It's not very clear where they should be installed: currently ggml only looks for backends in the same directory as the executable, so for this to even work, they would need to be installed in the bin directory, which is not great. So at the moment this is only useful for applications that handle backend loading themselves, not as installable libraries.
That comment is just about host buffers, not immediately relevant to this.
The intention was not to defer the decision; at the time of writing it was unclear which function would get called first, so there are a number of entry points that trigger initializing the instance. Not sure if that has changed.
I think it's a good idea to leave a message about not having found any Vulkan devices or failing to initialize the instance, but you should probably use the GGML debug macro for that instead of piping to std::cerr.
That was probably written before the device/reg interfaces were added. Now the device interface has a function to obtain a host buffer type for that device, so ideally it should be implemented so that each device returns the correct host buffer type. llama.cpp at the moment only uses the host buffer of the first device in the list of devices (which may not be the default device if the user passes the -dev argument).
ggml/src/ggml-vulkan/ggml-vulkan.cpp
Outdated
Seems like a failure here will leave things in a weird partially-initialized state.
We could special-case just the one exception, vk::IncompatibleDriverError, that happens here, under the assumption that it won't leave things in a partially-initialized state. What do you think?
The idea is that if the returned device count is 0, nobody will bother that backend again. So, partially initialized or not, it won't be used.
Now, if it left the GPU in a partially initialized state and other backends failed to use that GPU because of it, in my opinion that would be a Vulkan bug.
There are two expected cases, failure to initialize the Vulkan instance (issue with the loader) or no devices found (but an instance was created). It would probably be good to handle those inside of ggml_vk_instance_init(), to be able to clean up the instance if one was created.
Is it better? What's your use case? I'm not opposed to this in principle, but it also isn't immediately problematic that the Vulkan backend requires Vulkan and a Vulkan-compatible device. Did you check how other backends handle this case?
Better than crashing before even reading the configuration? I think so.
The use case is that distributions can package llama.cpp once, rather than creating 2^6 different packages for the different enable/disable backend combinations.
I did not check that yet. Could someone with the respective backend already compiled in please try running
The intention is to allow builds with multiple backends and let the application determine which ones to use at runtime. If a backend cannot work on the current system, it must return null from the reg function, or return zero devices; it must absolutely not crash the application.
That makes sense; I'm still used to the separated builds. Does running multiple backends together already work? That leads to another question, too: which backend takes priority? How do you avoid using the same device twice with two backends?
It does work, especially with
This is not solved yet, and that's one of the reasons we still aren't distributing unified builds with multiple backends. However, the user can manually specify which backend/devices to use with the
Force-pushed from 78610e7 to bc7f4bb
I changed it to initialize on reg. Tested it, and it still works.
slaren left a comment
I have not tested the changes, but the logic looks correct.
Co-authored-by: Jeff Bolz <jbolz@nvidia.com>
0cc4m left a comment
Thank you, LGTM
If -ngl 0 is selected, should we try to trigger this too? I'm getting reports that ggml-cpu is faster than Vulkan with -ngl 0...
No, definitely not. There are cases where that may happen, but in most cases it is advantageous to run large matmul ops on the GPU, and that is what happens when you run a GPU backend with -ngl 0.
The kinds of cases I mean are ones where the Vulkan device is actually a CPU:
Most of those cases should be covered: CPU devices do not get picked by default unless there is no other device available, and Vulkan CPU devices are rare; I'm only aware of llvmpipe on Linux. But you are right, for that case there is an argument for defaulting to no offload. We currently use llvmpipe for the CI test-backend-ops run, but that can still be done by manually selecting the device. If you want to submit a PR that changes that behaviour, that's okay with me.
Seeing containers/ramalama#1479, I now understand more clearly what you meant; please link the related issue next time. I'll take care of it. 👍
Currently, if the Vulkan backend is enabled but Vulkan is not actually available at runtime, it will crash:
Better to just fall back to CPU. That is what this PR does.