speculative decoding via network? #839
Replies: 5 comments 43 replies
-
If that works, it would mean it makes sense to build one machine with 512GB RAM and a separate draft machine with only 256GB RAM (to run a smaller quant). The price difference between 8x64GB 2666MT/s ECC and plain 8x32GB 3200MT/s non-ECC is quite significant. Right?
-
It is not currently supported here; there is an open request (#785) that would add support for it.
-
Ha! I was just experimenting with speculative decoding and found out that I can speed up decoding: I took Qwen3-Coder-30B-A3B IQ1_KT as the draft model for Qwen3-Coder-480B-A35B-Instruct at 5.1546 bpw. To fit everything onto the GPU with a decent context size I had to use -ctv q8_0 plus -ctkd q4_0 -ctvd q4_0 for the draft cache, and I had to reduce the batch sizes to 4k. So it does seem to be working.

Since llama-bench doesn't support speculative decoding, I had to test it manually: without speculative decoding I get 6.6 t/s TG, with it 7.81 t/s. So yeah, it's working. It would be nice to be able to run the draft model on a different network-connected device, for sure. BTW, the --seed parameter doesn't seem to be working for the draft model?

[EDIT]: Kek, the actual decode speed is even higher on the real data I am using (I am asking the LLM to rewrite code, so there are a lot of repetitive segments; the draft model does the copy/paste job and the acceptance rate is quite high). It's 10.74 t/s! Ha, what a funny technique! I wonder if the same can be applied to other models.

[EDIT2]: Kek2, 12.77 t/s in decode now. That's basically a 100% improvement in decode.
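For anyone wanting to reproduce this, here is a rough sketch of the kind of single-machine launch described above. The model paths, context size, and -ngl/-ngld values are placeholders; only the draft-cache flags and the 4k batch size come from the post itself, and flag availability may differ between llama.cpp and ik_llama.cpp builds.

```bash
# Hypothetical llama-server launch with a small draft model for speculative decoding.
# Adjust paths and values to your own setup; these are illustrative only.
./llama-server \
  -m  Qwen3-Coder-480B-A35B-Instruct-5.1546bpw.gguf \
  -md Qwen3-Coder-30B-A3B-IQ1_KT.gguf \
  -c 32768 \
  -b 4096 -ub 4096 \
  -ctv q8_0 -ctkd q4_0 -ctvd q4_0 \
  -ngl 99 -ngld 99
```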
-
Yes, there is very little one can set for the draft model, and the seed is not one of the things that one can adjust.
-
Tried something similar with Ling-1T and it's working: roughly a 10% boost in decode, with the official Q2_K quant of Ling-Flash-2.0 as the draft model.
-
Is it possible to run inference with speculative decoding for MoE models such as DeepSeek-V3.1-Terminus across two machines on a LAN? That is, say I have a THIREUS-5.4498bpw-R4 quant on one machine and, say, IQ2_KS at 2.472 bpw on the draft machine. Since the second quant is somewhat sane but more than two times smaller (and decode speed is limited by RAM bandwidth), the boost should be quite significant, right? Like 60-70%? Will it work? (A rough back-of-the-envelope estimate is sketched after the refs below.)
REFS:
ggml-org/llama.cpp#6853
ggml-org/llama.cpp#6853 (comment)
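Purely as a back-of-the-envelope check on the 60-70% guess (nothing here is measured; the values of α, γ and c below are illustrative assumptions): the standard speculative-decoding analysis (Leviathan et al. 2023) puts the expected speedup at

$$\text{speedup} \;\approx\; \frac{1-\alpha^{\gamma+1}}{(1-\alpha)\,(\gamma c + 1)}$$

where α is the per-token acceptance rate, γ the number of drafted tokens per step, and c the draft-to-target per-token cost ratio, assuming that verifying γ+1 tokens in one batch costs about as much as generating a single token (roughly true in the RAM-bandwidth-bound regime). With a draft only ~2.2x smaller (c ≈ 0.45), γ = 4 and a fairly optimistic α = 0.8, this gives (1 − 0.8^5) / ((1 − 0.8)(4·0.45 + 1)) ≈ 0.672 / 0.56 ≈ 1.2, i.e. closer to +20% than +60-70%. The larger gains reported elsewhere in this thread use a draft that is an order of magnitude smaller than the target.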