speculative decoding via network? #839
Replies: 5 comments 43 replies
-
If that works, it would mean it makes sense to build one machine with 512GB RAM and a separate draft machine with only 256GB RAM (to run a smaller quant). The price difference between 8x64GB 2666MT/s ECC and plain 8x32GB 3200MT/s non-ECC is quite significant. Right?
-
It is not currently supported here; there is an open request (#785) that would add support for it.
-
Ha! I was just experimenting with speculative decoding and found out that I can speed up decoding: I took Qwen3-Coder-30B-A3B IQ1_KT as the draft model for Qwen3-Coder-480B-A35B-Instruct at 5.1546 bpw. To fit everything onto the GPU with a decent context size I had to use -ctv q8_0 plus -ctkd q4_0 -ctvd q4_0 for the draft cache, and I had to reduce the batch sizes to 4k. So it does seem to be working.

Since llama-bench doesn't support speculative decoding, I had to test it manually: without speculative decoding I get 6.6 t/s TG, with it 7.81 t/s. So yeah, it's working. It would be nice to be able to run the draft model on a different network-connected device, for sure. BTW, the --seed parameter doesn't seem to be working for the draft model?

[EDIT]: Kek, the actual decode speed is even higher on the real data I am using (I am asking the LLM to rewrite code, so there are a lot of repetitive segments; the draft model does the copy/paste job and the acceptance rate is quite high). It's 10.74 t/s! Ha, what a funny technique! I wonder if the same can be applied to other models.

[EDIT2]: Kek2, 12.77 t/s in decode now. That's basically a 100% improvement in decode.
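For anyone wanting to reproduce this, here is a rough sketch of the kind of single-machine launch described above. The model paths, context size, and -ngl/-ngld values are placeholders; only the draft-cache flags and the 4k batch size come from the post itself, and flag availability may differ between llama.cpp and ik_llama.cpp builds.

```bash
# Hypothetical llama-server launch with a small draft model for speculative decoding.
# Adjust paths and values to your own setup; these are illustrative only.
./llama-server \
  -m  Qwen3-Coder-480B-A35B-Instruct-5.1546bpw.gguf \
  -md Qwen3-Coder-30B-A3B-IQ1_KT.gguf \
  -c 32768 \
  -b 4096 -ub 4096 \
  -ctv q8_0 -ctkd q4_0 -ctvd q4_0 \
  -ngl 99 -ngld 99
```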
-
Yes, there is very little one can set for the draft model, and the seed is not one of the things that one can adjust.
-
Tried something similar with Ling-1T and it's working: roughly a 10% boost in decode, with the official Q2_K quant of Ling-Flash-2.0 as the draft model.
-
Is it possible to run inference with speculative decoding for MoE models such as DeepSeek-V3.1-Terminus across two machines on a LAN? That is, say I have a THIREUS-5.4498bpw-R4 quant on one machine and, say, IQ2_KS at 2.472 bpw on the draft machine. Since the second quant is somewhat sane but more than two times smaller (and decode speed is limited by RAM bandwidth), the boost should be quite significant, right? Like 60-70%? Will it work? (A rough back-of-the-envelope estimate is sketched after the refs below.)
REFS:
ggml-org/llama.cpp#6853
ggml-org/llama.cpp#6853 (comment)
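Purely as a back-of-the-envelope check on the 60-70% guess (nothing here is measured; the values of α, γ and c below are illustrative assumptions): the standard speculative-decoding analysis (Leviathan et al. 2023) puts the expected speedup at

$$\text{speedup} \;\approx\; \frac{1-\alpha^{\gamma+1}}{(1-\alpha)\,(\gamma c + 1)}$$

where α is the per-token acceptance rate, γ the number of drafted tokens per step, and c the draft-to-target per-token cost ratio, assuming that verifying γ+1 tokens in one batch costs about as much as generating a single token (roughly true in the RAM-bandwidth-bound regime). With a draft only ~2.2x smaller (c ≈ 0.45), γ = 4 and a fairly optimistic α = 0.8, this gives (1 − 0.8^5) / ((1 − 0.8)(4·0.45 + 1)) ≈ 0.672 / 0.56 ≈ 1.2, i.e. closer to +20% than +60-70%. The larger gains reported elsewhere in this thread use a draft that is an order of magnitude smaller than the target.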