[Feature] Improve the Zero-Overhead Batch Scheduler performance for the small model #2558
Comments
I conducted tests on … The possible reason is that with smaller models, GPU computations are very fast, and the overhead and synchronization costs between threads exceed the benefits of overlap. In such cases, a …
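To make the reasoning above concrete, here is a toy per-step cost model in Python. It is only a sketch under stated assumptions, not SGLang's actual scheduler logic: the function names, the single fixed `sync_ms` cost, and all the numbers are hypothetical and were not measured. It assumes that without overlap, CPU scheduling and GPU compute run back to back, and that with overlap they run concurrently but pay a fixed thread-synchronization cost per step.

```python
# Toy cost model (hypothetical, not SGLang's implementation).
# All times are per decoding step, in milliseconds.

def step_time_serial(cpu_ms: float, gpu_ms: float) -> float:
    """CPU scheduling and GPU compute run one after the other."""
    return cpu_ms + gpu_ms


def step_time_overlapped(cpu_ms: float, gpu_ms: float, sync_ms: float) -> float:
    """CPU scheduling overlaps GPU compute; a fixed sync cost is added."""
    return max(cpu_ms, gpu_ms) + sync_ms


def overlap_gain(cpu_ms: float, gpu_ms: float, sync_ms: float) -> float:
    """Time saved per step by overlapping; negative means overlap is slower.

    Algebraically this equals min(cpu_ms, gpu_ms) - sync_ms.
    """
    return step_time_serial(cpu_ms, gpu_ms) - step_time_overlapped(
        cpu_ms, gpu_ms, sync_ms
    )


if __name__ == "__main__":
    # Assumed numbers for illustration only.
    small_model = overlap_gain(cpu_ms=2.0, gpu_ms=0.5, sync_ms=1.0)
    large_model = overlap_gain(cpu_ms=2.0, gpu_ms=20.0, sync_ms=1.0)
    print(f"small model gain: {small_model} ms")
    print(f"large model gain: {large_model} ms")
```

In this simplified model the gain reduces to `min(cpu_ms, gpu_ms) - sync_ms`, so overlap only pays off when the shorter of the two phases takes longer than the synchronization cost. A very fast GPU step, as with a 0.5B model, can push the gain below zero.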
I took a quick look at the relevant code. It seems that the …
What is …
Sorry, I don't work on this part. Let's ask Lianmin for help. @merrymercy Also, it's near Christmas, so there will be several days' delay. 😂 Thanks! @libratiger @CSEEduanyu
Checklist
Motivation
Thank you for implementing the zero-overhead batch scheduler feature.
After reading about it on https://lmsys.org/blog/2024-12-04-sglang-v0-4/#zero-overhead-batch-schedule, I understand that this optimization is most effective for small models, where the largest speedups were observed.
I conducted tests using the Qwen2.5-0.5B-Instruct model on an A100 GPU with SGLang version 0.4.0.post1. However, the results were not as expected: performance was better when the overlap scheduler was disabled.
I repeated these tests multiple times, and the results were consistent.
Here are the results:
default (enable-overlap):
disable-overlap:
Related resources
No response