
Conversation


@njhill njhill commented Nov 26, 2025

This affected performance with async scheduling, since requests could be unscheduled before they were finished and then had to be scheduled again for the last 1-2 tokens.

Thanks @benchislett for finding.

Signed-off-by: Nick Hill <nhill@redhat.com>
@chatgpt-codex-connector

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.

@mergify mergify bot added the v1 label Nov 26, 2025

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request addresses a performance issue related to speculative decoding with asynchronous scheduling. The change correctly adjusts the calculation of max_total_tokens to account for speculative placeholder tokens. This prevents requests from being prematurely unscheduled when they are near their max_tokens limit, which previously required them to be rescheduled to generate the final tokens. The logic is sound, ensuring that enough tokens can be scheduled for speculative verification while still respecting the model's maximum length. The implementation is clear and well-commented, and it should effectively resolve the described performance bottleneck.
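The adjustment the review describes can be sketched roughly as follows. This is a hedged illustration, not the actual vLLM code: the function name max_total_tokens and its parameters are hypothetical, chosen only to show why the per-request cap must include speculative placeholder positions.

```python
def max_total_tokens(prompt_len: int, max_output_tokens: int,
                     num_spec_tokens: int) -> int:
    """Hypothetical sketch of the per-request scheduling cap.

    With speculative decoding, each step may schedule up to
    num_spec_tokens draft placeholders beyond the next sampled token.
    If the cap is computed without them, a request sitting 1-2 tokens
    short of its max_tokens limit gets unscheduled and must be
    re-queued just to emit its final tokens.
    """
    return prompt_len + max_output_tokens + num_spec_tokens


# A request with a 10-token prompt, max_tokens=5, and 2 draft tokens
# per step keeps room for the placeholders and is not kicked out early.
cap = max_total_tokens(10, 5, 2)
```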

@njhill njhill added the ready (ONLY add when PR is ready to merge/full CI is needed) label Nov 26, 2025
Signed-off-by: Nick Hill <nhill@redhat.com>
@njhill njhill requested a review from benchislett November 27, 2025 02:15
start = start_req_idx
end = end_req_idx
sliced_cu_num_generated_tokens = None
def slice_request(self, req_idx: int, num_positions: int):
Member Author


Changing how logprobs are sliced to ensure we don't include extra positions beyond the number of output tokens.
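A rough sketch of the kind of capped slicing this comment describes, with hypothetical names (only slice_request-style cumulative-count indexing is borrowed from the diff context above; the actual signature differs):

```python
def slice_request(cu_num_tokens: list[int], req_idx: int,
                  num_positions: int) -> tuple[int, int]:
    """Hypothetical sketch: return the [start, end) logprobs slice
    for request req_idx, capped at num_positions.

    cu_num_tokens holds cumulative token counts per request, so
    request i owns positions [cu_num_tokens[i], cu_num_tokens[i+1]).
    Capping end at start + num_positions ensures speculative
    placeholder positions beyond the real output tokens are dropped.
    """
    start = cu_num_tokens[req_idx]
    end = min(cu_num_tokens[req_idx + 1], start + num_positions)
    return start, end


# Two requests owning positions [0, 3) and [3, 7); the second request
# only generated 2 real output tokens, so its slice is capped at [3, 5).
cu = [0, 3, 7]
bounds = slice_request(cu, 1, 2)
```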

@DarkLight1337 DarkLight1337 merged commit 8e7a891 into vllm-project:main Nov 28, 2025
44 checks passed
@njhill njhill deleted the async-spec-dec-len branch November 28, 2025 19:00