[Perf] Deepgemm fused layout kernel for activations, 4.3% throughput improvement, 10.7% TTFT improvement. #29546

yewentao256 · 2025-11-26T21:24:43Z

Purpose

Fused layout transform with per token group quant to get performance

Namely, pack scales into a uint32 earlier and remove an additional kernel call

Test

vllm serve deepseek-ai/DeepSeek-V3.1 -tp 8 --enable-expert-parallel --port 9256 --enforce_eager

Acc

lm_eval --model local-completions --model_args "base_url=http://127.0.0.1:9256/v1/completions,model=deepseek-ai/DeepSeek-V3.1,num_concurrent=1024" --tasks gsm8k

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.9568|±  |0.0056|
|     |       |strict-match    |     5|exact_match|↑  |0.9568|±  |0.0056|

Perf

vllm bench serve --model deepseek-ai/DeepSeek-V3.1 --dataset-name random --host 127.0.0.1 --port 9256 --random-input-len 2 --random-output-len 256 --request-rate inf --num-prompts 1024

Now
============ Serving Benchmark Result ============
Successful requests:                     1024      
Failed requests:                         0         
Benchmark duration (s):                  27.94     
Total input tokens:                      3072      
Total generated tokens:                  262144    
Request throughput (req/s):              36.65     
Output token throughput (tok/s):         9382.68   
Peak output token throughput (tok/s):    10240.00  
Peak concurrent requests:                1024.00   
Total Token throughput (tok/s):          9492.63   
---------------Time to First Token----------------
Mean TTFT (ms):                          1013.22   
Median TTFT (ms):                        982.19    
P99 TTFT (ms):                           1157.11   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          105.02    
Median TPOT (ms):                        105.08    
P99 TPOT (ms):                           105.26    
---------------Inter-token Latency----------------
Mean ITL (ms):                           105.02    
Median ITL (ms):                         104.57    
P99 ITL (ms):                            131.10    
==================================================

Main
============ Serving Benchmark Result ============
Successful requests:                     1024      
Failed requests:                         0         
Benchmark duration (s):                  29.15     
Total input tokens:                      3072      
Total generated tokens:                  262144    
Request throughput (req/s):              35.13     
Output token throughput (tok/s):         8994.31   
Peak output token throughput (tok/s):    10132.00  
Peak concurrent requests:                1024.00   
Total Token throughput (tok/s):          9099.71   
---------------Time to First Token----------------
Mean TTFT (ms):                          1121.89   
Median TTFT (ms):                        1144.56   
P99 TTFT (ms):                           1241.25   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          109.30    
Median TPOT (ms):                        109.29    
P99 TPOT (ms):                           109.40    
---------------Inter-token Latency----------------
Mean ITL (ms):                           109.30    
Median ITL (ms):                         108.89    
P99 ITL (ms):                            125.29    
==================================================

Signed-off-by: yewentao256 <zhyanwentao@126.com>

gemini-code-assist

Code Review

This pull request introduces a fused CUDA kernel for activation quantization and scale packing, targeting performance improvements with DeepGEMM. The changes are well-motivated and backed by performance data showing significant gains. My review focuses on the correctness and maintainability of the new CUDA kernel and its Python integration. I've identified two high-severity issues: one related to obscure and fragile bit-packing logic in the CUDA kernel that should be refactored for clarity and robustness, and another in the Python wrapper which fails to use a pre-allocated output buffer, leading to unnecessary memory allocations.

csrc/quantization/w8a8/fp8/per_token_group_quant.cu

vllm/model_executor/layers/quantization/utils/fp8_utils.py

chatgpt-codex-connector

💡 Codex Review

vllm/vllm/model_executor/layers/quantization/utils/fp8_utils.py

Lines 282 to 283 in deecd2a

    
           torch.ops.vllm.fp8_gemm_nt_op( 
        
               q_input, input_scale, weight, weight_scale, output, self.use_deep_gemm_e8m0

DeepGEMM path packs UE8M0 scales but signals E8M0 off

The DeepGEMM linear path now quantizes activations with use_ue8m0=True, producing UE8M0-packed int32 scales, but the subsequent fp8_gemm_nt_op call still forwards self.use_deep_gemm_e8m0. When VLLM_USE_DEEP_GEMM_E8M0 is false (the default on supported GPUs), this flag is false, so DeepGEMM will interpret the scale buffer as its non-E8M0 float format while it actually contains packed exponents, leading to incorrect matmul results whenever DeepGEMM is used without E8M0 enabled.

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

vllm/model_executor/layers/quantization/utils/fp8_utils.py

Signed-off-by: yewentao256 <zhyanwentao@126.com>

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

vllm/model_executor/layers/quantization/utils/fp8_utils.py

deepgemm fused layout kernel

deecd2a

Signed-off-by: yewentao256 <zhyanwentao@126.com>

yewentao256 requested review from mgoin, pavanimajety, robertgshaw2-redhat and tlrmchlsmth as code owners November 26, 2025 21:24

gemini-code-assist bot reviewed Nov 26, 2025

View reviewed changes

csrc/quantization/w8a8/fp8/per_token_group_quant.cu Outdated Show resolved Hide resolved

vllm/model_executor/layers/quantization/utils/fp8_utils.py Outdated Show resolved Hide resolved

chatgpt-codex-connector bot reviewed Nov 26, 2025

View reviewed changes

vllm/model_executor/layers/quantization/utils/fp8_utils.py Show resolved Hide resolved

yewentao256 marked this pull request as draft November 26, 2025 21:52

update doc and comments

85dcdf8

Signed-off-by: yewentao256 <zhyanwentao@126.com>

yewentao256 marked this pull request as ready for review November 26, 2025 22:19

yewentao256 added the ready ONLY add when PR is ready to merge/full CI is needed label Nov 26, 2025

chatgpt-codex-connector bot reviewed Nov 26, 2025

View reviewed changes

vllm/model_executor/layers/quantization/utils/fp8_utils.py Show resolved Hide resolved

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

[Perf] Deepgemm fused layout kernel for activations, 4.3% throughput improvement, 10.7% TTFT improvement. #29546

[Perf] Deepgemm fused layout kernel for activations, 4.3% throughput improvement, 10.7% TTFT improvement. #29546

yewentao256 commented Nov 26, 2025 •

edited by github-actions bot

Loading

Uh oh!

gemini-code-assist bot left a comment

Uh oh!

Uh oh!

Uh oh!

chatgpt-codex-connector bot left a comment

Uh oh!

Uh oh!

chatgpt-codex-connector bot left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

	torch.ops.vllm.fp8_gemm_nt_op(
	q_input, input_scale, weight, weight_scale, output, self.use_deep_gemm_e8m0

Uh oh!

[Perf] Deepgemm fused layout kernel for activations, 4.3% throughput improvement, 10.7% TTFT improvement. #29546

Are you sure you want to change the base?

[Perf] Deepgemm fused layout kernel for activations, 4.3% throughput improvement, 10.7% TTFT improvement. #29546

Conversation

yewentao256 commented Nov 26, 2025 • edited by github-actions bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Purpose

Test

Acc

Perf

Uh oh!

gemini-code-assist bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

Uh oh!

Uh oh!

chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

Uh oh!

chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

yewentao256 commented Nov 26, 2025 •

edited by github-actions bot

Loading