[BugFix] Fix spec decoding max_tokens scheduling perf issue #29542
Conversation
This affected performance with async scheduling, since requests could be unscheduled before they were finished and then had to be scheduled again for the last 1-2 tokens.

Signed-off-by: Nick Hill <nhill@redhat.com>
Code Review
This pull request addresses a performance issue related to speculative decoding with asynchronous scheduling. The change correctly adjusts the calculation of max_total_tokens to account for speculative placeholder tokens. This prevents requests from being prematurely unscheduled when they are near their max_tokens limit, which previously required them to be rescheduled to generate the final tokens. The logic is sound, ensuring that enough tokens can be scheduled for speculative verification while still respecting the model's maximum length. The implementation is clear and well-commented, and it should effectively resolve the described performance bottleneck.
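The budget adjustment described above can be sketched as follows. This is an illustrative outline only, not vLLM's actual scheduler code: the function name `max_total_tokens` and its parameters are assumptions made for the example.

```python
def max_total_tokens(num_computed_tokens: int,
                     remaining_output_tokens: int,
                     num_spec_tokens: int,
                     model_max_len: int) -> int:
    """Upper bound on token positions to schedule for a request this step.

    Hypothetical sketch: padding the request's own limit with up to
    `num_spec_tokens` speculative placeholder positions lets verification
    run in the same step, so a request near its max_tokens limit is not
    unscheduled and then rescheduled just to emit its final 1-2 tokens.
    """
    # Budget derived from the request's remaining max_tokens allowance,
    # plus placeholder slots for speculative draft tokens.
    request_budget = (num_computed_tokens
                      + remaining_output_tokens
                      + num_spec_tokens)
    # Still respect the model's maximum context length.
    return min(request_budget, model_max_len)
```

For example, a request with 100 computed tokens, 2 output tokens remaining, and 2 speculative tokens gets a budget of 104 positions, while the cap at `model_max_len` still binds near the context limit.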
Signed-off-by: Nick Hill <nhill@redhat.com>
    start = start_req_idx
    end = end_req_idx
    sliced_cu_num_generated_tokens = None

    def slice_request(self, req_idx: int, num_positions: int):
Changing how logprobs are sliced to ensure we don't include extra positions beyond the number of output tokens.

This affected performance with async scheduling, since requests could be unscheduled before they were finished and then had to be scheduled again for the last 1-2 tokens.

Thanks @benchislett for finding this.
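The slicing change can be sketched like this. This is a standalone illustration, not the actual method from the diff (which is an instance method taking `self`): `cu_num_generated_tokens` is assumed to be a cumulative-count array marking each request's span in a flat logprobs buffer, and the other names are assumptions for the example.

```python
def slice_request(cu_num_generated_tokens: list[int],
                  logprobs: list[float],
                  req_idx: int,
                  num_positions: int) -> list[float]:
    """Return at most `num_positions` logprob entries for request `req_idx`.

    Hypothetical sketch: clamping the slice end to `num_positions` drops
    any extra positions beyond the number of output tokens actually
    produced (e.g. unused speculative placeholder positions).
    """
    # Span of this request's entries in the flat per-token buffer.
    start = cu_num_generated_tokens[req_idx]
    end = cu_num_generated_tokens[req_idx + 1]
    # Never return more positions than the request actually generated.
    end = min(end, start + num_positions)
    return logprobs[start:end]
```

With `cu_num_generated_tokens = [0, 3, 6]`, request 0 occupies indices 0..3; asking for only 2 positions returns the first 2 entries rather than the full span.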