[Feature] Improve the Zero-Overhead Batch Scheduler performance for the small model #2558

Open · 2 tasks done
libratiger opened this issue Dec 23, 2024 · 4 comments

@libratiger (Contributor)

Motivation

Thank you for implementing the zero-overhead batch scheduler feature.

After reading about it at https://lmsys.org/blog/2024-12-04-sglang-v0-4/#zero-overhead-batch-schedule, I understand that this optimization should be particularly effective for small models, where the largest speedups were reported.

I tested Qwen2.5-0.5B-Instruct on an A100 GPU with SGLang 0.4.0.post1. However, the results were not as expected: performance was consistently better with the overlap scheduler disabled. I repeated the tests multiple times and the results are the same.

Here are the results:

Default (overlap enabled):

python3 -m sglang.launch_server --model Qwen/Qwen2.5-0.5B-Instruct
python3 -m sglang.bench_serving --backend sglang --dataset-name random --num-prompts 500 --random-input 4096 --random-output 2048
============ Serving Benchmark Result ============
Backend:                 sglang
Traffic request rate:          inf
Max request concurrency:         not set
Successful requests:           500
Benchmark duration (s):         69.61
Total input tokens:           1026288
Total generated tokens:         506842
Total generated tokens (retokenized):  506758
Request throughput (req/s):       7.18
Input token throughput (tok/s):     14743.84
Output token throughput (tok/s):     7281.38
Total token throughput (tok/s):     22025.22
----------------End-to-End Latency----------------
Mean E2E Latency (ms):          40437.78
Median E2E Latency (ms):         39943.70
---------------Time to First Token----------------
Mean TTFT (ms):             5381.80
Median TTFT (ms):            5151.43
P99 TTFT (ms):              9350.33
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):             42.32
Median TPOT (ms):            35.03
P99 TPOT (ms):              189.66
---------------Inter-token Latency----------------
Mean ITL (ms):              35.97
Median ITL (ms):             31.34
P99 ITL (ms):              60.09
==================================================

Overlap disabled (--disable-overlap):

python3 -m sglang.launch_server --model Qwen/Qwen2.5-0.5B-Instruct --disable-overlap
python3 -m sglang.bench_serving --backend sglang --dataset-name random --num-prompts 500 --random-input 4096 --random-output 2048
============ Serving Benchmark Result ============
Backend:                 sglang
Traffic request rate:          inf
Max request concurrency:         not set
Successful requests:           500
Benchmark duration (s):         59.99
Total input tokens:           1026288
Total generated tokens:         506842
Total generated tokens (retokenized):  506752
Request throughput (req/s):       8.33
Input token throughput (tok/s):     17107.33
Output token throughput (tok/s):     8448.61
Total token throughput (tok/s):     25555.94
----------------End-to-End Latency----------------
Mean E2E Latency (ms):          36476.98
Median E2E Latency (ms):         37023.41
---------------Time to First Token----------------
Mean TTFT (ms):             5194.01
Median TTFT (ms):            4901.62
P99 TTFT (ms):              8997.54
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):             38.72
Median TPOT (ms):            31.32
P99 TPOT (ms):              167.98
---------------Inter-token Latency----------------
Mean ITL (ms):              32.36
Median ITL (ms):             27.18
P99 ITL (ms):              44.47
==================================================


@libratiger (Contributor, Author)

I conducted tests on Qwen2.5-3B-Instruct and found that enabling overlap does produce better results there.

A possible reason: with smaller models, each GPU step is so fast that the cross-thread handoff and synchronization overhead exceeds the benefit of overlapping scheduling with execution. In such cases, a run-to-completion mode is more efficient.
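As a rough illustration of that trade-off, here is a minimal sketch (plain Python, not SGLang's actual scheduler; `forward`, the queues, and the 1 ms timing are hypothetical placeholders):

```python
import queue
import threading
import time

def forward(batch):
    """Stand-in for one GPU forward pass; a 0.5B model step is ~1 ms."""
    time.sleep(0.001)
    return f"result-{batch}"

def run_to_completion(batches):
    # One thread does everything: schedule, run, collect. No handoff cost.
    return [forward(b) for b in batches]

def overlapped(batches):
    # A scheduler thread feeds a worker thread through queues, so the CPU
    # scheduling of batch N+1 can overlap the execution of batch N.
    q_in, q_out = queue.Queue(), queue.Queue()

    def worker():
        while True:
            b = q_in.get()
            if b is None:
                break
            q_out.put(forward(b))

    t = threading.Thread(target=worker)
    t.start()
    for b in batches:
        q_in.put(b)          # each handoff pays queue + thread-wakeup cost;
    results = [q_out.get() for _ in batches]
    q_in.put(None)           # when forward() itself is ~1 ms, that cost can
    t.join()                 # exceed whatever latency the overlap hides
    return results
```

For a large model, `forward` dominates and the overlap wins; for a very small one, the per-batch queue and synchronization cost can dominate instead, which would match the numbers above.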

@merrymercy @zhaochenyang20

@libratiger (Contributor, Author)

I took a quick look at the relevant code. It seems that launch_done is not necessary, since synchronizing on copy_done already guarantees that launch_done has completed.
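For reference, here is the ordering argument as a minimal PyTorch sketch (hypothetical variable names, not the actual SGLang code path; it assumes both events are recorded on the same CUDA stream, which is my reading of the code):

```python
import torch

# If copy_done is recorded on the same stream *after* launch_done,
# then waiting on copy_done implies launch_done has also completed.
stream = torch.cuda.current_stream()
launch_done = torch.cuda.Event()
copy_done = torch.cuda.Event()

x = torch.randn(1024, 1024, device="cuda")
y = x @ x                              # enqueue a kernel
launch_done.record(stream)             # event A: after the kernel

out = y.to("cpu", non_blocking=True)   # enqueue the D2H copy of the result
copy_done.record(stream)               # event B: after the copy, same stream

copy_done.synchronize()                # a stream executes in FIFO order, so
assert launch_done.query()             # once B is done, A must be done too
```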

@CSEEduanyu

> A possible reason: with smaller models, each GPU step is so fast that the cross-thread handoff and synchronization overhead exceeds the benefit of overlapping scheduling with execution. In such cases, a run-to-completion mode is more efficient.

What is run-to-completion mode?

@zhaochenyang20 (Collaborator)

I'm not working on this part, sorry. Let's ask Lianmin for help. @merrymercy Also, it's near Christmas, so there will be several days' delay. 😂 Thanks! @libratiger @CSEEduanyu
