Conversation

@yewentao256 yewentao256 commented Nov 26, 2025

Purpose

Fuse the layout transform with per-token-group quantization to improve performance.

Specifically, pack the scales into a uint32 earlier, which removes an additional kernel call.
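
For reference, a minimal PyTorch sketch of the packing idea (this is not the PR's CUDA kernel; the function name and shapes are illustrative): each per-group scale is reduced to an 8-bit power-of-two exponent (UE8M0, bias 127, no mantissa), and four consecutive exponents are packed into one 32-bit word.

```python
import torch

def pack_ue8m0_scales(scales: torch.Tensor) -> torch.Tensor:
    """Illustrative UE8M0 packing of fp32 group scales into uint32 words.

    scales: positive float32 tensor whose last dim is a multiple of 4.
    Returns int32 with one word per 4 scales; each byte holds a biased
    power-of-two exponent.
    """
    exp = torch.ceil(torch.log2(scales))           # round up so values never overflow
    biased = (exp + 127).clamp(0, 255).to(torch.int32)
    b = biased.reshape(*biased.shape[:-1], -1, 4)  # group exponents in fours
    # int32 wraps on the top byte; only the packed bit pattern matters
    return b[..., 0] | (b[..., 1] << 8) | (b[..., 2] << 16) | (b[..., 3] << 24)
```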

Test

vllm serve deepseek-ai/DeepSeek-V3.1 -tp 8 --enable-expert-parallel --port 9256 --enforce_eager

Acc

lm_eval --model local-completions --model_args "base_url=http://127.0.0.1:9256/v1/completions,model=deepseek-ai/DeepSeek-V3.1,num_concurrent=1024" --tasks gsm8k

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|   |0.9568|±  |0.0056|
|     |       |strict-match    |     5|exact_match|   |0.9568|±  |0.0056|

Perf

vllm bench serve --model deepseek-ai/DeepSeek-V3.1 --dataset-name random --host 127.0.0.1 --port 9256 --random-input-len 2 --random-output-len 256 --request-rate inf --num-prompts 1024

Now
============ Serving Benchmark Result ============
Successful requests:                     1024      
Failed requests:                         0         
Benchmark duration (s):                  27.94     
Total input tokens:                      3072      
Total generated tokens:                  262144    
Request throughput (req/s):              36.65     
Output token throughput (tok/s):         9382.68   
Peak output token throughput (tok/s):    10240.00  
Peak concurrent requests:                1024.00   
Total Token throughput (tok/s):          9492.63   
---------------Time to First Token----------------
Mean TTFT (ms):                          1013.22   
Median TTFT (ms):                        982.19    
P99 TTFT (ms):                           1157.11   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          105.02    
Median TPOT (ms):                        105.08    
P99 TPOT (ms):                           105.26    
---------------Inter-token Latency----------------
Mean ITL (ms):                           105.02    
Median ITL (ms):                         104.57    
P99 ITL (ms):                            131.10    
==================================================

Main
============ Serving Benchmark Result ============
Successful requests:                     1024      
Failed requests:                         0         
Benchmark duration (s):                  29.15     
Total input tokens:                      3072      
Total generated tokens:                  262144    
Request throughput (req/s):              35.13     
Output token throughput (tok/s):         8994.31   
Peak output token throughput (tok/s):    10132.00  
Peak concurrent requests:                1024.00   
Total Token throughput (tok/s):          9099.71   
---------------Time to First Token----------------
Mean TTFT (ms):                          1121.89   
Median TTFT (ms):                        1144.56   
P99 TTFT (ms):                           1241.25   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          109.30    
Median TPOT (ms):                        109.29    
P99 TPOT (ms):                           109.40    
---------------Inter-token Latency----------------
Mean ITL (ms):                           109.30    
Median ITL (ms):                         108.89    
P99 ITL (ms):                            125.29    
==================================================
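
Net effect vs main: ~4.3% higher output token throughput (9382.68 vs 8994.31 tok/s), ~9.7% lower mean TTFT (1013.22 vs 1121.89 ms), and ~3.9% lower mean TPOT (105.02 vs 109.30 ms).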

Signed-off-by: yewentao256 <zhyanwentao@126.com>
@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces a fused CUDA kernel for activation quantization and scale packing, targeting performance improvements with DeepGEMM. The changes are well-motivated and backed by performance data showing significant gains. My review focuses on the correctness and maintainability of the new CUDA kernel and its Python integration. I've identified two high-severity issues: the bit-packing logic in the CUDA kernel is obscure and fragile and should be refactored for clarity and robustness, and the Python wrapper fails to use a pre-allocated output buffer, leading to unnecessary memory allocations.
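
To illustrate the second issue, a hedged sketch of the wrapper shape the review is asking for (the function name and signature are hypothetical, not the PR's actual API): honor a caller-supplied output buffer instead of allocating a fresh tensor on every call.

```python
import torch

def fused_quant_wrapper(
    x: torch.Tensor,
    group_size: int = 128,
    output: torch.Tensor | None = None,  # optional pre-allocated buffer
) -> torch.Tensor:
    # Allocate only when the caller did not provide a buffer; otherwise the
    # fused kernel writes directly into `output`, avoiding a per-call
    # allocation on the hot path.
    if output is None:
        output = torch.empty_like(x, dtype=torch.float8_e4m3fn)
    # ... launch the fused quant + pack kernel into `output` here ...
    return output
```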

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

torch.ops.vllm.fp8_gemm_nt_op(
    q_input, input_scale, weight, weight_scale, output, self.use_deep_gemm_e8m0
)

P1: DeepGEMM path packs UE8M0 scales but signals E8M0 off

The DeepGEMM linear path now quantizes activations with use_ue8m0=True, producing UE8M0-packed int32 scales, but the subsequent fp8_gemm_nt_op call still forwards self.use_deep_gemm_e8m0. When VLLM_USE_DEEP_GEMM_E8M0 is false (the default on supported GPUs), this flag is false, so DeepGEMM will interpret the scale buffer as its non-E8M0 float format while it actually contains packed exponents, leading to incorrect matmul results whenever DeepGEMM is used without E8M0 enabled.
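
A minimal sketch of the fix the review implies (the helper name and its use_ue8m0 kwarg are taken from the review text; the exact signature, group size, and surrounding method shape are assumptions): derive the flag passed to the GEMM from how the scales were actually produced.

```python
import torch
from vllm.model_executor.layers.quantization.utils.fp8_utils import (
    per_token_group_quant_fp8,
)

def deepgemm_linear(self, x, weight, weight_scale, output):
    # This path always packs UE8M0 scales, so the GEMM must be told so;
    # forwarding self.use_deep_gemm_e8m0 here can disagree with the data.
    use_ue8m0 = True
    q_input, input_scale = per_token_group_quant_fp8(
        x, 128, use_ue8m0=use_ue8m0  # group size and kwarg assumed from the PR
    )
    torch.ops.vllm.fp8_gemm_nt_op(
        q_input, input_scale, weight, weight_scale, output, use_ue8m0
    )
    return output
```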


@yewentao256 yewentao256 marked this pull request as draft November 26, 2025 21:52
Signed-off-by: yewentao256 <zhyanwentao@126.com>
@yewentao256 yewentao256 marked this pull request as ready for review November 26, 2025 22:19
@yewentao256 yewentao256 added the ready ONLY add when PR is ready to merge/full CI is needed label Nov 26, 2025
@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

