[Feature] DeepSeek V3 optimization #2591
Comments
Very quick response!
The overlap scheduler is model-independent, but it is not yet supported when using DP attention. We have a private branch for this and will upstream it soon.
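As a conceptual aside, the core idea of an overlap scheduler can be sketched as follows. This is a toy illustration with hypothetical helper names, not SGLang's actual implementation: CPU-side batch preparation for step N+1 runs concurrently with the (simulated) GPU work for step N, so the two costs overlap instead of adding up.

```python
# Toy sketch of overlap scheduling (NOT SGLang's implementation):
# a producer thread prepares the next batch on the CPU while the
# main thread "executes" the current batch.
import queue
import threading
import time


def cpu_prepare(i: int) -> str:
    """Simulate CPU-side scheduling/tokenization work for batch i."""
    time.sleep(0.01)
    return f"batch-{i}"


def gpu_run(batch: str) -> str:
    """Simulate GPU kernel execution for one batch."""
    time.sleep(0.01)
    return f"done-{batch}"


def overlapped(n: int) -> list[str]:
    """Run n batches, overlapping CPU preparation with GPU execution."""
    results = []
    q: queue.Queue = queue.Queue(maxsize=1)

    def producer() -> None:
        for i in range(n):
            q.put(cpu_prepare(i))  # prepare batch i+1 while batch i runs
        q.put(None)  # sentinel: no more batches

    t = threading.Thread(target=producer)
    t.start()
    while (batch := q.get()) is not None:
        results.append(gpu_run(batch))
    t.join()
    return results
```

With the overlap, n batches take roughly max(prepare, run) per step instead of prepare + run; the real scheduler applies the same principle to scheduling overhead versus forward passes.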
Is the memory sufficient for an 8-GPU instance? This model is very large.
The 671B model works on 8× H200 with FP8 weights: roughly 671 GB of weights versus 141 GB of HBM per GPU (671 < 141 × 8).
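The back-of-envelope arithmetic behind that claim can be checked directly. The assumption is that FP8 stores roughly one byte per parameter, so 671B parameters take about 671 GB of weights (ignoring KV cache and activations, which need additional headroom):

```python
# Rough memory-fit check for DeepSeek V3 on 8x H200.
# Assumption: FP8 weights ~= 1 byte/parameter, so 671B params ~= 671 GB.
params_gb = 671          # approximate weight size in GB at FP8
hbm_per_gpu_gb = 141     # H200 HBM capacity per GPU
num_gpus = 8

total_hbm_gb = hbm_per_gpu_gb * num_gpus  # aggregate HBM across the node
headroom_gb = total_hbm_gb - params_gb    # left for KV cache, activations, etc.

print(total_hbm_gb, headroom_gb)
```

The weights alone leave a few hundred GB of aggregate headroom, which is why a single 8× H200 node is feasible, while 8× H100 (80 GB each, 640 GB total) is not enough for FP8 weights alone.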
Hi @fengyang95, you can also consider a multi-node setup.
FYI: due to the tight schedule, SGLang v0.4.1 currently only provides preliminary support for DeepSeek V3. To make it run more cost-efficiently, we need to complete most of the optimizations mentioned above. If you are interested in any of them, feel free to join the SGLang Slack for discussion or contribute a PR. We hope to complete these optimizations quickly and appreciate any discussion and contributions.
Update: SGLang v0.4.1.post1 supports CUDA Graph for DeepSeek V3; please use the latest version: pip install "sglang[all]==0.4.1.post1" --find-links https://flashinfer.ai/whl/cu124/torch2.4/flashinfer
Checklist
Usage
User Guide for Existing System
https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3
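For reference, a typical single-node launch along the lines of the linked guide looks like the sketch below. The model path and flags are assumptions based on the benchmark README; verify them against your installed SGLang version before use:

```shell
# Hypothetical launch command for DeepSeek V3 on one 8-GPU node
# (check flags against the benchmark/deepseek_v3 README for your version):
python3 -m sglang.launch_server \
  --model-path deepseek-ai/DeepSeek-V3 \
  --tp 8 \
  --trust-remote-code
```

Tensor parallelism (`--tp 8`) shards the FP8 weights across all eight GPUs, which is what makes the memory arithmetic above work out.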
Features
E=256,N=256,device_name=NVIDIA_H200,dtype=fp8_w8a8.json
@BBuf
moe_align_block_size @HandH1998 @zhyncs
nextn speculative decoding @merrymercy

Related resources
No response