Help w/ Faster prefill on CPU-MoE? #718
Problem: Our first long turn (prefill) is slow on CPU-MoE: both GPUs sit at ~1–10% SM utilization during prompt digestion, only climbing once tokens start streaming. Subsequent turns are fast (the cache helps).
Goal: higher GPU utilization during prefill, without OOMs.
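(For reference, the SM numbers above come from watching per-GPU utilization during the prompt phase with something like the snippet below; the 1-second sampling interval is arbitrary.)

```bash
# Sample GPU utilization once per second while the prompt is being processed.
# Both 4090s stay in the low single digits until generation begins.
nvidia-smi dmon -s u -d 1
```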
⸻ What we’re asking:
⸻ Minimal repro (single command)
This is the smallest command that's stable for us on 2×4090 and shows the issue (host$ = Pop!_OS terminal, single-GPU server):

    MODEL_FIRST="$(ls -1v $HOME/models/Qwen3-Coder-480B-A35B-Instruct/Qwen3-480B-A35B-Instruct-IQ5_K-00001-of-*.gguf | head -n1)"
    CUDA_VISIBLE_DEVICES=1,0 $HOME/ik_llama.cpp/build/bin/llama-server

Behavior:
⸻ Context / what we’ve tried
⸻ Hardware
⸻ Thanks! Any recommended `-op` policies, `-ub`/`-amb` ranges, `-ot` patterns, and NUMA/build tips for CPU-MoE prefill on 2×4090 would be hugely appreciated. Happy to run micro-sweeps and share CSVs if that helps.
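To make the micro-sweep offer concrete, the kind of script we'd run looks roughly like this (a sketch, assuming `llama-bench` from the same ik_llama.cpp build accepts the usual `-b`/`-ub` flags; the prompt length and ubatch values are just starting points, not recommendations):

```bash
#!/usr/bin/env bash
# Prefill-only micro-sweep over ubatch sizes; results accumulate as CSV.
# Assumes llama-bench from the same ik_llama.cpp build; values are placeholders.
MODEL_FIRST="$(ls -1v "$HOME"/models/Qwen3-Coder-480B-A35B-Instruct/Qwen3-480B-A35B-Instruct-IQ5_K-00001-of-*.gguf | head -n1)"
OUT=prefill_ub_sweep.csv

for UB in 512 1024 2048 4096; do
  # -p 8192 -n 0 measures prompt processing (prefill) only, no token generation.
  # Each run appends its own CSV header; dedupe when collating.
  CUDA_VISIBLE_DEVICES=1,0 "$HOME"/ik_llama.cpp/build/bin/llama-bench \
    -m "$MODEL_FIRST" -p 8192 -n 0 -b 4096 -ub "$UB" -o csv >> "$OUT"
done
```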
I haven't tested Qwen3-Coder. Personally I would prefer DeepSeek (given the very poor performance of previous Qwen3 releases ... never mind). Regarding DeepSeek specifically, I found that 3 x RTX 3090 can handle 160k context perfectly, and with the highest-precision DeepSeek quant (6.2bpw, from @Thireus) I get up to 6 tps. #477 (reply in thread)
Another thought regarding LLM performance. It seems handy to have different LLM setups for different tasks. If the task is simple, one would prefer speed over precision, so it makes more sense to use a smaller quant for faster performance. If the task is more complex, one would rather wait longer for the best available quality (lower perplexity). In that case it makes more sense to have not one but several machines, with different quants or different hardware setups (e.g. more GPUs if longer context is preferred). And if so, why not get the cheapest hardware possible and build more machines with LLMs preloaded? For example, Sapphire Rapids Xeon engineering samples with 56 cores are going for $140 a pop right now, lol. [EDIT]: That is, if an NVIDIA Blackwell card with 96GB VRAM goes for ... 9k EUR? ... what's the point? One could instead get, for example, the 56C Xeon, a Gigabyte motherboard, 512GB of DDR5 RAM, an RTX 3090, etc. -- that would cost about 5k EUR. Alternatively, a Lenovo ThinkStation P620 with an additional PSU (for a second or third GPU) and DDR4-3200, which is possibly around 3.5k EUR. Lol, so one would have two machines able to run 120k context or better for the price of one GPU? I can't see how that makes sense.
I cannot run Qwen3-Coder-480B-A35B myself and there haven't been any discussions about this model here, so I have never seen logs. Can you post the full output from starting the server? To give suggestions, I need to see KV and compute buffer sizes, where tensors are stored, etc. Thanks!
@QuantumPlayDev
As ik mentions, having some logs would be useful. Thanks for posting your full command. I'll try to answer what you asked and add some thoughts.
Thoughts
- `-ub 4096 -b 4096` is a pretty good spot if you don't OOM.
- No need for `-amb`, as it is only for `-mla` style quants.
- Build with `-DGGML_SCHED_MAX_COPIES=1`, which can be useful for multi-GPU in general to reduce OOMs (though mostly important for MLA quants, pretty sure).
- `--threads-batch 24` or the e…
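Putting the above together, the general shape I'd start from is something like this (just a sketch: the context size, thread counts, and the `-ot` expert-offload pattern are illustrative guesses, not a tested recipe for this model):

```bash
# Sketch only: combines the suggestions above. The context size, thread counts
# and the -ot pattern are placeholders; "exps=CPU" keeps routed-expert tensors
# in system RAM while -ngl puts everything else on the two 4090s.
MODEL_FIRST="$(ls -1v "$HOME"/models/Qwen3-Coder-480B-A35B-Instruct/Qwen3-480B-A35B-Instruct-IQ5_K-00001-of-*.gguf | head -n1)"

CUDA_VISIBLE_DEVICES=1,0 "$HOME"/ik_llama.cpp/build/bin/llama-server \
  -m "$MODEL_FIRST" \
  -c 32768 -ngl 99 \
  -b 4096 -ub 4096 \
  --threads 24 --threads-batch 24 \
  -ot exps=CPU
```

From there you could add further `-ot` rules that map specific layers' expert tensors to `CUDA0`/`CUDA1` as VRAM allows.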