Description
Prerequisites
- I am running the latest code. Mention the version if possible as well.
- I carefully followed the README.md.
- I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
- I reviewed the Discussions, and have a new and useful enhancement to share.
Feature Description
Hello, thanks for all your work as always.
I was wondering if the RPC updates and changes could be ported from mainline llama.cpp into ik_llama.cpp.
Nowadays you can get pretty good performance even when offloading to RAM and using RPC at the same time.
I have examples of running GLM 4.6 fully in VRAM on llama.cpp here: ggml-org/llama.cpp#16625 (reply in thread)
I also tried, e.g., DeepSeek R1 0528 Q3_K_XL with about 25 layers offloaded to CPU, 5 layers to RPC (CUDA), and 30 layers on the main PC (CUDA), and saw only about a 10-20% performance penalty.
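For context, the kind of split described above uses mainline llama.cpp's RPC backend roughly like this. This is a sketch, not a definitive recipe: the hostnames, ports, model path, and layer counts are illustrative, and it assumes the binaries were built with `GGML_RPC=ON`.

```shell
# Build with the RPC backend enabled (mainline llama.cpp):
#   cmake -B build -DGGML_RPC=ON && cmake --build build

# On each worker machine, expose its backend (e.g. a CUDA GPU) over the network:
./rpc-server --host 0.0.0.0 --port 50052

# On the main machine, point the model loader at the workers; layers are then
# distributed across local CUDA devices, the RPC workers, and the CPU
# (model path, addresses, and -ngl value are illustrative):
./llama-cli -m DeepSeek-R1-0528-Q3_K_XL.gguf \
    --rpc 192.168.1.10:50052,192.168.1.11:50052 \
    -ngl 35 \
    -p "Hello"
```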
Motivation
This would let people use multiple PCs to offload more layers onto CUDA devices, for models that, in my case, no longer fit on a single system.
The speed penalty is low enough that it seems worth considering.
Possible Implementation
I'm not sure of the exact scope beyond porting the RPC code from mainline.
As @ubergarm mentioned here https://huggingface.co/ubergarm/Kimi-K2-Thinking-GGUF/discussions/13#691bb8df9287514645b7cc35, the relevant changes seem to start at the ggml-org/llama.cpp#16276 commit.