Conversation

@freeliuzc
Collaborator

Motivation

  1. Support static C8 quantization under MTP.
  2. Optimize GPU memory allocation.

Modifications

  1. Implement C8 quantization for the static CacheKV in Speculative Decoding (MTP); a sketch of the idea follows this list.
  2. Optimize memory usage during inference.
  3. Refactor the related data structures and kernel paths so they remain compatible with the existing inference flow.
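For reference, below is a minimal numpy sketch of the general idea behind static C8 (per-tensor int8) KV-cache quantization: the scale is calibrated offline and fixed at runtime, which is what "static" refers to. The function names, shapes, and scale layout are illustrative assumptions, not FastDeploy's actual kernel interface or cache layout.

# Illustrative only: static C8 (int8) KV-cache quantization with a
# precomputed, fixed scale. Names and shapes are assumptions.
import numpy as np

def quantize_static_c8(kv: np.ndarray, scale: float) -> np.ndarray:
    """Quantize a float KV block to int8 using a precomputed static scale."""
    q = np.rint(kv / scale)
    return np.clip(q, -128, 127).astype(np.int8)

def dequantize_static_c8(kv_int8: np.ndarray, scale: float) -> np.ndarray:
    """Recover an approximate float KV block before attention."""
    return kv_int8.astype(np.float32) * scale

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    kv = rng.standard_normal((2, 8, 64), dtype=np.float32)  # [heads, seq, head_dim]
    scale = float(np.abs(kv).max()) / 127.0                 # offline calibration
    kv_q = quantize_static_c8(kv, scale)
    kv_dq = dequantize_static_c8(kv_q, scale)
    print("max abs error:", float(np.abs(kv - kv_dq).max()))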

Usage or Command

Accuracy Tests

Checklist

  • Add at least one tag in the PR title.
    • Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
    • You can add new tags based on the PR content, but the semantics must be clear.
  • Format your code and run pre-commit before committing.
  • Add unit tests. If no unit tests are added, explain why in this PR.
  • Provide accuracy results.
  • If the current PR targets the release branch, make sure it has already been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

@paddle-bot
paddle-bot bot commented Nov 20, 2025

Thanks for your contribution!

else:
    self.scheduler_config.max_num_batched_tokens = self.model_config.max_model_len

self.scheduler_config.max_chunk_len = (
Collaborator

Suggest giving max_chunk_len a more precise, self-explanatory name; as it stands, it sounds vague.

Collaborator Author

Sure. In my understanding the most fitting name here would be max_num_batched_tokens, but that name is already taken. Which one do you think would be better?
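Whatever the final name, the field appears to act as a per-chunk token budget for chunked prefill. The sketch below, with a made-up name standing in for max_chunk_len, only illustrates what such a budget does; it is not the scheduler's actual logic.

# Hypothetical illustration of a per-chunk token budget for chunked prefill;
# "max_tokens_per_chunk" is a stand-in name, not a real config field.
def split_prefill(prompt_token_ids, max_tokens_per_chunk):
    """Split a long prompt into prefill chunks no larger than the budget."""
    return [
        prompt_token_ids[i : i + max_tokens_per_chunk]
        for i in range(0, len(prompt_token_ids), max_tokens_per_chunk)
    ]

print(split_prefill(list(range(10)), max_tokens_per_chunk=4))
# [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]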

self.pooler_config: Optional["PoolerConfig"] = field(init=False)
self.override_pooler_config: Optional[Union[dict, "PoolerConfig"]] = None
self.revision = None
self.prefix_layer_name = "layers"
Collaborator

Same concern for prefix_layer_name: there is already a prefix_name, so does prefix_layer_name include the leading ernie/model segment or not?

Collaborator Author

It does not. Since there is already a prefix_name, prefix_layer_name is used to keep the two distinct.
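A hypothetical illustration of the distinction discussed above: prefix_name carries the leading model scope while prefix_layer_name only names the layer container, so the two compose into a full parameter path. The concrete values, separator, and parameter names below are assumptions, not the model's real naming scheme.

# Hypothetical composition of parameter names; the real scheme may differ.
prefix_name = "ernie.model"    # includes the leading model scope
prefix_layer_name = "layers"   # layer container only, no model scope

def layer_param_name(layer_idx, param):
    return f"{prefix_name}.{prefix_layer_name}.{layer_idx}.{param}"

print(layer_param_name(0, "self_attn.qkv_proj.weight"))
# ernie.model.layers.0.self_attn.qkv_proj.weight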

@freeliuzc freeliuzc merged commit 2d1dade into PaddlePaddle:develop Nov 21, 2025
23 of 26 checks passed