[Quantization] Support compressed tensors w8a8 static and w8a8 dynamic weight #4036
base: main
Conversation
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
Code Review
This pull request adds support for w8a8 static and dynamic quantization using the compressed tensors format on Ascend hardware. The changes include a new AscendCompressedTensorsConfig, corresponding quantization schemes, and integration into the vLLM-Ascend platform and worker.
The implementation looks good overall, but I've found a few issues:
- A critical bug in AscendCompressedTensorsConfig that could lead to a runtime crash due to a missing None check.
- Some robustness issues, such as an unsafe list removal and the use of assert for configuration validation, which could cause crashes.
- A performance issue in the w8a8 static quantization scheme where a transpose operation is inefficiently performed on every forward pass.
I've provided detailed comments and suggestions to address these points.
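For illustration, here is a minimal sketch of the kind of hardening the review is asking for. The method names get_scheme and remove_from_ignore_list, and the constructor shape, are hypothetical stand-ins, not the PR's actual code.

# Hypothetical sketch of the suggested hardening; names are illustrative only.
from typing import Optional


class AscendCompressedTensorsConfig:

    def __init__(self, target_scheme_map: dict, ignore: Optional[list] = None):
        if not target_scheme_map:
            # Raise a descriptive error instead of relying on assert,
            # which is stripped when Python runs with -O.
            raise ValueError("compressed-tensors config requires at least one scheme")
        self.target_scheme_map = target_scheme_map
        self.ignore = ignore or []

    def get_scheme(self, layer_name: str):
        scheme = self.target_scheme_map.get(layer_name)
        if scheme is None:
            # Guard the None case explicitly rather than crashing later.
            raise ValueError(f"no quantization scheme found for layer {layer_name!r}")
        return scheme

    def remove_from_ignore_list(self, layer_name: str) -> None:
        # list.remove() raises ValueError if the element is absent,
        # so check membership first.
        if layer_name in self.ignore:
            self.ignore.remove(layer_name)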
if is_310p():
    # On 300I Duo platform, we need transpose again if
    # using nz. This transpose can be skipped in torchair.
    output = torch_npu.npu_quant_matmul(
        x,
        layer.weight.data.transpose(1, 0),
        layer.deq_scale,
        bias=bias,
        output_dtype=layer.params_dtype,
    )
The transpose operation on layer.weight.data is performed on every forward pass for the is_310p() case, which is inefficient. The transposed weight should be computed once and cached to improve performance. A good place for this one-time operation would be in process_weights_after_loading.
if is_310p():
    # On 300I Duo platform, we need transpose again if
    # using nz. This transpose can be skipped in torchair.
    # The transpose is cached to avoid re-computation on every forward pass.
    if not hasattr(layer, "_weight_transposed_for_310p"):
        layer._weight_transposed_for_310p = layer.weight.data.transpose(1, 0).contiguous()
    output = torch_npu.npu_quant_matmul(
        x,
        layer._weight_transposed_for_310p,
        layer.deq_scale,
        bias=bias,
        output_dtype=layer.params_dtype,
    )
Thanks for this great work! Could you please add an e2e test of w8a8 static and dynamic quantization? Unit tests are also expected, but we could add them in follow-up PRs. And are there any accuracy and performance metrics for your PR? also cc @wangxiyuan @22dimensions
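For reference, a minimal e2e smoke test could look like the sketch below; the checkpoint path is a placeholder, and the explicit quantization="compressed-tensors" argument assumes vLLM's backend name for this format (it may also be auto-detected from the checkpoint config).

# Minimal smoke-test sketch; the model path is a placeholder.
from vllm import LLM, SamplingParams


def test_w8a8_compressed_tensors_smoke():
    llm = LLM(
        model="/path/to/w8a8-compressed-tensors-checkpoint",
        quantization="compressed-tensors",
        max_model_len=1024,
    )
    outputs = llm.generate(
        ["Hello, my name is"],
        SamplingParams(temperature=0.0, max_tokens=16),
    )
    # The quantized model should produce non-empty text without crashing.
    assert outputs[0].outputs[0].text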
You can solve the DCO and lint issues by referring to the contributing doc at https://vllm-ascend.readthedocs.io/
Thanks for your reply. I’m currently running accuracy and performance tests. Once they’re complete, I’ll post them in a comment.
What this PR does / why we need it?
When quantized weights are generated with the LLM Compressor quantization tool from the vLLM community, the vLLM Ascend engine needs to be adapted to support the compressed-tensors quantization format.
Does this PR introduce any user-facing change?
No
How was this patch tested?