【WIP】glm4.5 support mtp #4072
base: main
Conversation
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
- If CI fails, you can run the linting and testing checks locally according to Contributing and Testing.
Code Review
This pull request adds support for glm4_moe_mtp in speculative decoding. The changes correctly integrate the new model type across the configuration, proposer selection, and model execution logic. The fix to separate glm4_moe_mtp from the deepseek_mtp method is appropriate. My main feedback concerns code duplication in the configuration file, where adding support for the new model type has exacerbated a pre-existing maintainability issue. A refactoring to consolidate similar logic would be beneficial.
elif (self.draft_model_config.hf_config.model_type ==
        "glm4_moe_mtp"):
    self.method = "glm4_moe_mtp"
    if self.num_speculative_tokens > 1:
        logger.warning(
            "All GLM4 MTP models only have " \
            "one layer. Might need some code changes " \
            "to support multiple layers."
        )
This elif block for glm4_moe_mtp is nearly identical to the blocks for ernie_mtp, qwen3_next_mtp, and longcat_flash_mtp. This code duplication makes the code harder to maintain and can lead to inconsistencies if one block is updated but others are not.
A refactoring to group these similar MTP model types would improve maintainability. For example, you could use a dictionary to handle these model types in a single, generalized block:
# Example of a possible refactoring
MTP_MODEL_TYPES_WITH_OWN_METHOD = {
"ernie_mtp": "Ernie",
"glm4_moe_mtp": "GLM4",
"qwen3_next_mtp": "Qwen3Next",
"longcat_flash_mtp": "LongCat",
}
model_type = self.draft_model_config.hf_config.model_type
# ...
elif model_type in MTP_MODEL_TYPES_WITH_OWN_METHOD:
self.method = model_type
if self.num_speculative_tokens > 1:
model_name = MTP_MODEL_TYPES_WITH_OWN_METHOD[model_type]
logger.warning(
f"All {model_name} MTP models only have "
"one layer. Might need some code changes "
"to support multiple layers."
)
# ...

While a full refactoring is outside the scope of the current changes, it's a valuable improvement to consider for code health.
What this PR does / why we need it?
Need to modify vLLM (a sketch of these changes follows below):
- class SharedHead, __init__(): add prefix=maybe_prefix(prefix, "head") to the ParallelLMHead call
- class Glm4MoeMultiTokenPredictorLayer, __init__(): add prefix=maybe_prefix(prefix, "shared_head") to the SharedHead call
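The snippet below is a minimal sketch of those two vLLM-side changes, not the exact upstream source: the constructor signatures are simplified assumptions, and only the two prefix=maybe_prefix(...) arguments are the point.

# Hedged sketch of the vLLM changes listed above; surrounding code is simplified.
import torch.nn as nn

from vllm.model_executor.layers.layernorm import RMSNorm
from vllm.model_executor.layers.vocab_parallel_embedding import ParallelLMHead
from vllm.model_executor.models.utils import maybe_prefix


class SharedHead(nn.Module):

    def __init__(self, config, quant_config=None, prefix: str = ""):
        super().__init__()
        self.norm = RMSNorm(config.hidden_size, eps=config.rms_norm_eps)
        # Change 1: pass a prefixed name so the LM head weights resolve
        # under "<prefix>.head".
        self.head = ParallelLMHead(config.vocab_size,
                                   config.hidden_size,
                                   quant_config=quant_config,
                                   prefix=maybe_prefix(prefix, "head"))


class Glm4MoeMultiTokenPredictorLayer(nn.Module):

    def __init__(self, config, prefix: str, quant_config=None):
        super().__init__()
        # ... other submodules omitted ...
        # Change 2: forward the layer prefix down to the shared head.
        self.shared_head = SharedHead(config=config,
                                      quant_config=quant_config,
                                      prefix=maybe_prefix(prefix, "shared_head"))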
Does this PR introduce any user-facing change?

How was this patch tested?
vllm serve /home/glm4.5_w8a8_with_float_mtp/ \
  --data-parallel-size 2 \
  --tensor-parallel-size 8 \
  --seed 1024 \
  --served-model-name dsv3 \
  --max-model-len 35000 \
  --max-num-batched-tokens 16384 \
  --max-num-seqs 32 \
  --no-enable-prefix-caching \
  --quantization ascend \
  --trust-remote-code \
  --enable-expert-parallel \
  --gpu-memory-utilization 0.9 \
  --speculative-config '{"num_speculative_tokens": 1, "model":"/home/glm4.5_w8a8_with_float_mtp", "method":"glm4_moe_mtp"}' \
  --additional-config '{"ascend_scheduler_config":{"enabled":true, "enable_chunked_prefill":true}}' \
  --compilation-config '{"cudagraph_capture_sizes": [1,4,8,16,32,48,64]}'
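Once the server above is up, a request like the following can exercise the speculative-decoding path. This is a hedged sketch, not taken from the PR: the port (vLLM's default 8000), prompt, and sampling parameters are assumptions; only the served model name dsv3 comes from the command above.

# Hypothetical smoke test against the server started above; the endpoint is the
# standard vLLM OpenAI-compatible completions API, port 8000 is assumed.
import requests

resp = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "model": "dsv3",  # --served-model-name from the command above
        "prompt": "Explain speculative decoding in one sentence.",
        "max_tokens": 64,
        "temperature": 0.0,
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["text"])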