
Conversation


@LHXuuu LHXuuu commented Nov 6, 2025

What this PR does / why we need it?

When quantized weights are generated with the LLM Compressor quantization tool from the vLLM community, the vLLM Ascend engine needs to be adapted to support the compressed-tensors quantization format.

  1. Add AscendCompressedTensorsConfig to replace CompressedTensorsConfig in vLLM.
  2. Support CompressedTensorsW8A8 static weights (see the quantization sketch after this list):
    • weight: per-tensor, int8, symmetric; activation: per-tensor, int8, symmetric.
    • weight: per-channel, int8, symmetric; activation: per-tensor, int8, symmetric.
  3. Support CompressedTensorsW8A8Dynamic weights:
    • weight: per-channel, int8, symmetric; activation: per-token, int8, symmetric, dynamic.
  4. Modify override_quantization_method in AscendQuantConfig.
  5. Add AscendCompressedTensorsLinearMethod to the WEIGHT_LOADER_V2_SUPPORTED list.
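
For reference, a minimal sketch (not code from this PR) of the three symmetric int8 layouts listed above; shapes and function names are illustrative only:

    import torch

    def quant_per_tensor(w: torch.Tensor):
        # One scale for the whole tensor (static weight or activation scale).
        scale = w.abs().max() / 127.0
        q = torch.clamp(torch.round(w / scale), -128, 127).to(torch.int8)
        return q, scale

    def quant_per_channel(w: torch.Tensor):
        # One scale per output channel; w is [out_features, in_features].
        scale = w.abs().amax(dim=1, keepdim=True) / 127.0
        q = torch.clamp(torch.round(w / scale), -128, 127).to(torch.int8)
        return q, scale

    def quant_per_token_dynamic(x: torch.Tensor):
        # One scale per token, computed at runtime from the activation itself;
        # x is [num_tokens, hidden_size].
        scale = x.abs().amax(dim=-1, keepdim=True) / 127.0
        q = torch.clamp(torch.round(x / scale), -128, 127).to(torch.int8)
        return q, scale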

Does this PR introduce any user-facing change?

No

How was this patch tested?


github-actions bot commented Nov 6, 2025

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write the commit message by filling in the PR description to help reviewers and future developers understand.

If CI fails, you can run linting and testing checks locally according to Contributing and Testing.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request adds support for w8a8 static and dynamic quantization using the compressed tensors format on Ascend hardware. The changes include a new AscendCompressedTensorsConfig, corresponding quantization schemes, and integration into the vLLM-Ascend platform and worker.

The implementation looks good overall, but I've found a few issues:

  • A critical bug in AscendCompressedTensorsConfig that could lead to a runtime crash due to a missing None check.
  • Some robustness issues, such as an unsafe list removal and the use of assert for configuration validation, which could cause crashes.
  • A performance issue in the w8a8 static quantization scheme where a transpose operation is inefficiently performed on every forward pass.

I've provided detailed comments and suggestions to address these points.

Comment on lines +111 to +120
        if is_310p():
            # On 300I Duo platform, we need transpose again if
            # using nz. This transpose can be skipped in torchair.
            output = torch_npu.npu_quant_matmul(
                x,
                layer.weight.data.transpose(1, 0),
                layer.deq_scale,
                bias=bias,
                output_dtype=layer.params_dtype,
            )

Severity: high

The transpose operation on layer.weight.data is performed on every forward pass for the is_310p() case, which is inefficient. The transposed weight should be computed once and cached to improve performance. A good place for this one-time operation would be in process_weights_after_loading.

        if is_310p():
            # On 300I Duo platform, we need transpose again if
            # using nz. This transpose can be skipped in torchair.
            # The transpose is cached to avoid re-computation on every forward pass.
            if not hasattr(layer, "_weight_transposed_for_310p"):
                layer._weight_transposed_for_310p = layer.weight.data.transpose(1, 0).contiguous()
            output = torch_npu.npu_quant_matmul(
                x,
                layer._weight_transposed_for_310p,
                layer.deq_scale,
                bias=bias,
                output_dtype=layer.params_dtype,
            )
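
A minimal sketch of the one-time transpose in process_weights_after_loading that the review text recommends; this assumes the Ascend w8a8 scheme exposes that hook, as other vLLM quantization methods do, and the import path for is_310p is assumed:

    import torch
    from torch.nn import Parameter
    from vllm_ascend.utils import is_310p  # assumed import path

    def process_weights_after_loading(self, layer: torch.nn.Module) -> None:
        # Hedged sketch: on the 300I Duo platform, transpose the int8 weight
        # once after loading so forward() can pass layer.weight directly to
        # torch_npu.npu_quant_matmul without a per-call transpose.
        if is_310p():
            layer.weight = Parameter(
                layer.weight.data.transpose(1, 0).contiguous(),
                requires_grad=False,
            )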

Collaborator

MengqingCao commented Nov 7, 2025

Thanks for this great work! Could you please add an e2e test of w8a8 static and dynamic quant (see the sketch below)? A unit test is also expected, but we could add it in follow-up PRs.

And are there any accuracy and performance metrics for your PR?

also cc @wangxiyuan @22dimensions
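
A minimal sketch of such an e2e smoke test using the standard vLLM offline API (not part of this PR; the quantized checkpoint name is a placeholder):

    import pytest
    from vllm import LLM, SamplingParams

    # Placeholder checkpoint name; substitute a real compressed-tensors w8a8 model.
    MODELS = ["example-org/Llama-3-8B-w8a8-compressed-tensors"]

    @pytest.mark.parametrize("model", MODELS)
    def test_compressed_tensors_w8a8(model: str) -> None:
        llm = LLM(model=model, max_model_len=2048, enforce_eager=True)
        params = SamplingParams(temperature=0.0, max_tokens=16)
        outputs = llm.generate(["The capital of France is"], params)
        # Smoke check only: the quantized weights should load and the engine
        # should produce non-empty text.
        assert outputs and outputs[0].outputs[0].text.strip()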

@MengqingCao
Collaborator

You can solve the DCO and lint issues by referring to the contributing doc at https://vllm-ascend.readthedocs.io/

@LHXuuu LHXuuu closed this Nov 7, 2025
@LHXuuu LHXuuu reopened this Nov 7, 2025
Author

LHXuuu commented Nov 7, 2025

> Thanks for this great work! Could you please add an e2e test of w8a8 static and dynamic quant? A unit test is also expected, but we could add it in follow-up PRs.
>
> And are there any accuracy and performance metrics for your PR?
>
> also cc @wangxiyuan @22dimensions

Thanks for your reply. I’m currently running accuracy and performance tests. Once they’re complete, I’ll post the results in a comment.
