Replies: 2 comments 1 reply
Yea, mainline's implementation kinda sucks compared to vLLM or ExLlama. But under row I get 50% GPU utilization, and under layer I only get 25%.

[screenshots: Row / Layer GPU utilization]

Context should be lower, but not this low. For a similar model like Mistral-Large 5-bit, I get:

[screenshot]

This is at least 80-90% utilization on all GPUs. Mind you, I disable turbo mode on my 3090s so they only get up to 250 W for LLM use; they usually sit around 200 W a piece. Uncorked, the numbers might go higher.
OK, so with 4 GPUs you get 3x+ lower PP and 20% better TG. That's still massively worse performance in my book. Somewhere I read that with "typical" LLM usage the prompt length is about 10 times the number of generated tokens. So, let's apply this here for a prompt of 10,000 tokens and 1,000 generated tokens. That will take

You objected to me showing results for an 8B model. Here are results for

So, with 2 GPUs, even a 70B dense model is worse with split mode "row". It looks even worse for a MoE model, but I don't want to fill up this discussion with too many comparisons. My main objective was to establish if anyone has gotten meaningfully better performance with mainline
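To make the 10,000-prompt / 1,000-generated estimate concrete, here is a small sketch of the arithmetic. The absolute speeds are invented placeholders; only the ratios (3x lower PP, 20% better TG for row) are taken from the comparison above.

```python
# Hypothetical baseline speeds for split mode "layer" (made-up numbers;
# only the row/layer ratios come from the measurements in this thread).
layer_pp = 2000.0  # prompt processing, tokens/s
layer_tg = 50.0    # token generation, tokens/s

# Split mode "row": roughly 3x lower PP, 20% better TG.
row_pp = layer_pp / 3.0
row_tg = layer_tg * 1.2

def total_time(pp_speed, tg_speed, n_prompt=10_000, n_gen=1_000):
    """End-to-end time for one request: prompt eval plus generation."""
    return n_prompt / pp_speed + n_gen / tg_speed

print(f"layer: {total_time(layer_pp, layer_tg):.1f} s")  # 25.0 s
print(f"row:   {total_time(row_pp, row_tg):.1f} s")      # 31.7 s
```

With a long prompt the PP loss dominates, so row comes out slower end-to-end even with the better TG.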
I want to start looking into implementing tensor parallel (TP) CUDA inference in ik_llama.cpp. I decided that my first step is to see what is the current status in ik_llama.cpp and llama.cpp. Split mode "row" is the option that is currently supposed to achieve at least some tensor parallelism. Here is what I find for `-sm row`:

- ik_llama.cpp is not functional with `-sm row` at all (it crashes). With `-no-fug` it does run, but the result is wrong.
- llama.cpp works, but as far as I can tell, it is much slower than split mode "layer".

Here is an example on a 2x3090 system for a Q4_0 quantized 8B parameter dense model:

I do remember some people claiming that split mode "row" leads to performance gains in llama.cpp, so I'm wondering if I'm doing something wrong. My command line is

I have tried adding `-ts 50/50`, but this does not change the result. So, my question is: am I missing something?
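For reference, the idea split mode "row" is meant to exploit can be illustrated with a toy pure-Python sketch (this is not llama.cpp code, and whether the implementation shards rows or columns of the stored tensor is a layout detail I'm glossing over): each hypothetical GPU holds a shard of a layer's weight matrix and computes its slice of the output in parallel, after which the slices must be gathered. That per-matmul synchronization is the overhead that can eat the parallelism gains.

```python
# Toy sketch of tensor-parallel weight sharding across 2 hypothetical "GPUs".

def matmul(x, w):
    # x: m x k, w: k x n  ->  m x n
    return [[sum(xi * wij for xi, wij in zip(row, col))
             for col in zip(*w)] for row in x]

x = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0]]          # activations, 2 x 3
w = [[1.0, 0.0, 2.0, 1.0],
     [0.0, 1.0, 1.0, 2.0],
     [1.0, 1.0, 0.0, 0.0]]     # layer weights, 3 x 4

# Single device: one full matmul.
full = matmul(x, w)

# Two devices: each owns half of the output dimension and computes its slice;
# the slices are then concatenated (the gather/synchronization step).
w0 = [row[:2] for row in w]
w1 = [row[2:] for row in w]
sharded = [a + b for a, b in zip(matmul(x, w0), matmul(x, w1))]

assert sharded == full
```

The two partial matmuls are mathematically free to run concurrently, which is where the TP speedup would come from if the gather were cheap enough.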