[Feature] Improve the Zero-Overhead Batch Scheduler performance for the small model #2558

Open · 2 tasks done
libratiger opened this issue Dec 23, 2024 · 4 comments

@libratiger (Contributor)

Motivation

Thank you for implementing the zero-overhead batch scheduler feature.

After reading about it at https://lmsys.org/blog/2024-12-04-sglang-v0-4/#zero-overhead-batch-schedule, I understand that this optimization should be particularly effective for small models, where the largest speedups were reported.

I tested Qwen2.5-0.5B-Instruct on an A100 GPU with SGLang 0.4.0.post1. However, the results were not as expected: performance was consistently better with the overlap scheduler disabled. I repeated the tests multiple times and the results are the same.

Here are the results:

Default (overlap enabled):

python3 -m sglang.launch_server --model Qwen/Qwen2.5-0.5B-Instruct
python3 -m sglang.bench_serving --backend sglang --dataset-name random --num-prompts 500 --random-input 4096 --random-output 2048
============ Serving Benchmark Result ============
Backend:                 sglang
Traffic request rate:          inf
Max request concurrency:         not set
Successful requests:           500
Benchmark duration (s):         69.61
Total input tokens:           1026288
Total generated tokens:         506842
Total generated tokens (retokenized):  506758
Request throughput (req/s):       7.18
Input token throughput (tok/s):     14743.84
Output token throughput (tok/s):     7281.38
Total token throughput (tok/s):     22025.22
----------------End-to-End Latency----------------
Mean E2E Latency (ms):          40437.78
Median E2E Latency (ms):         39943.70
---------------Time to First Token----------------
Mean TTFT (ms):             5381.80
Median TTFT (ms):            5151.43
P99 TTFT (ms):              9350.33
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):             42.32
Median TPOT (ms):            35.03
P99 TPOT (ms):              189.66
---------------Inter-token Latency----------------
Mean ITL (ms):              35.97
Median ITL (ms):             31.34
P99 ITL (ms):              60.09
==================================================

Overlap disabled (--disable-overlap):

python3 -m sglang.launch_server --model Qwen/Qwen2.5-0.5B-Instruct --disable-overlap
python3 -m sglang.bench_serving --backend sglang --dataset-name random --num-prompts 500 --random-input 4096 --random-output 2048
============ Serving Benchmark Result ============
Backend:                 sglang
Traffic request rate:          inf
Max request concurrency:         not set
Successful requests:           500
Benchmark duration (s):         59.99
Total input tokens:           1026288
Total generated tokens:         506842
Total generated tokens (retokenized):  506752
Request throughput (req/s):       8.33
Input token throughput (tok/s):     17107.33
Output token throughput (tok/s):     8448.61
Total token throughput (tok/s):     25555.94
----------------End-to-End Latency----------------
Mean E2E Latency (ms):          36476.98
Median E2E Latency (ms):         37023.41
---------------Time to First Token----------------
Mean TTFT (ms):             5194.01
Median TTFT (ms):            4901.62
P99 TTFT (ms):              8997.54
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):             38.72
Median TPOT (ms):            31.32
P99 TPOT (ms):              167.98
---------------Inter-token Latency----------------
Mean ITL (ms):              32.36
Median ITL (ms):             27.18
P99 ITL (ms):              44.47
==================================================


@libratiger (Contributor, Author)

I conducted tests on Qwen2.5-3B-Instruct and found that enabling overlap does produce better results there.

A possible reason: with smaller models, each GPU step is so fast that the cross-thread handoff and synchronization overhead exceeds the benefit of overlapping scheduling with execution. In such cases, a run-to-completion mode is more efficient.
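As a rough illustration of that trade-off, here is a minimal sketch (plain Python, not SGLang's actual scheduler; `forward`, the queues, and the 1 ms timing are hypothetical placeholders):

```python
import queue
import threading
import time

def forward(batch):
    """Stand-in for one GPU forward pass; a 0.5B model step is ~1 ms."""
    time.sleep(0.001)
    return f"result-{batch}"

def run_to_completion(batches):
    # One thread does everything: schedule, run, collect. No handoff cost.
    return [forward(b) for b in batches]

def overlapped(batches):
    # A scheduler thread feeds a worker thread through queues, so the CPU
    # scheduling of batch N+1 can overlap the execution of batch N.
    q_in, q_out = queue.Queue(), queue.Queue()

    def worker():
        while True:
            b = q_in.get()
            if b is None:
                break
            q_out.put(forward(b))

    t = threading.Thread(target=worker)
    t.start()
    for b in batches:
        q_in.put(b)          # each handoff pays queue + thread-wakeup cost;
    results = [q_out.get() for _ in batches]
    q_in.put(None)           # when forward() itself is ~1 ms, that cost can
    t.join()                 # exceed whatever latency the overlap hides
    return results
```

For a large model, `forward` dominates and the overlap wins; for a very small one, the per-batch queue and synchronization cost can dominate instead, which would match the numbers above.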

@merrymercy @zhaochenyang20

@libratiger (Contributor, Author)

I took a quick look at the relevant code. It seems that launch_done is not necessary, since synchronizing on copy_done already guarantees that launch_done has completed.
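For reference, here is the ordering argument as a minimal PyTorch sketch (hypothetical variable names, not the actual SGLang code path; it assumes both events are recorded on the same CUDA stream, which is my reading of the code):

```python
import torch

# If copy_done is recorded on the same stream *after* launch_done,
# then waiting on copy_done implies launch_done has also completed.
stream = torch.cuda.current_stream()
launch_done = torch.cuda.Event()
copy_done = torch.cuda.Event()

x = torch.randn(1024, 1024, device="cuda")
y = x @ x                              # enqueue a kernel
launch_done.record(stream)             # event A: after the kernel

out = y.to("cpu", non_blocking=True)   # enqueue the D2H copy of the result
copy_done.record(stream)               # event B: after the copy, same stream

copy_done.synchronize()                # a stream executes in FIFO order, so
assert launch_done.query()             # once B is done, A must be done too
```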

@CSEEduanyu

> A possible reason: with smaller models, each GPU step is so fast that the cross-thread handoff and synchronization overhead exceeds the benefit of overlapping scheduling with execution. In such cases, a run-to-completion mode is more efficient.

What is run-to-completion mode?

@zhaochenyang20 (Collaborator)

I'm not working on this part, sorry. Let's ask Lianmin for help. @merrymercy Also, it's near Christmas, so there will be several days' delay. 😂 Thanks! @libratiger @CSEEduanyu
