[Feature] DeepSeek V3 optimization #2591

Open
4 of 15 tasks
zhyncs opened this issue Dec 26, 2024 · 7 comments
Labels: enhancement (New feature or request), high priority, performance, quant (LLM Quantization)

Comments

@zhyncs (Member) commented Dec 26, 2024

Checklist

Usage

User Guide for Existing System

https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3

Features

Related resources: (none provided)

@zhyncs zhyncs added enhancement New feature or request performance quant LLM Quantization labels Dec 26, 2024
@zhyncs zhyncs pinned this issue Dec 26, 2024
@libratiger (Contributor) commented:
Very quick response! I understand that the overlap scheduler is model-independent and a general optimization that should be supported by default. Are some model-specific optimizations still needed on top of it?

@merrymercy (Contributor) commented Dec 26, 2024

The overlap scheduler is model-independent, but it is not yet supported when using DP (data-parallel) attention. We have a private branch for this and will upstream it soon.
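For readers unfamiliar with the term, the idea behind an overlap scheduler can be sketched as follows. This is a conceptual illustration, not SGLang's actual implementation: CPU-side batch preparation for step i+1 is hidden behind the (simulated) GPU execution of step i.

```python
# Conceptual sketch of overlap scheduling (hypothetical, not SGLang code):
# prepare the next batch on a worker thread while the current batch "runs".
from concurrent.futures import ThreadPoolExecutor
import time

def prepare_batch(i):
    # Stand-in for CPU work: tokenization, building attention metadata, etc.
    time.sleep(0.01)
    return f"batch-{i}"

def run_on_gpu(batch):
    # Stand-in for the GPU forward pass.
    time.sleep(0.01)
    return f"done-{batch}"

def overlapped_run(n):
    results = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        next_batch = pool.submit(prepare_batch, 0)
        for i in range(n):
            batch = next_batch.result()
            if i + 1 < n:
                # Kick off preparation of batch i+1 before running batch i,
                # so CPU prep overlaps with "GPU" execution.
                next_batch = pool.submit(prepare_batch, i + 1)
            results.append(run_on_gpu(batch))
    return results

print(overlapped_run(3))  # ['done-batch-0', 'done-batch-1', 'done-batch-2']
```

In the ideal case, per-step latency drops from prep_time + gpu_time to max(prep_time, gpu_time).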

@fengyang95 commented Dec 26, 2024

Is the memory of an 8-GPU instance sufficient? The model is very large.

@zhyncs (Member, Author) commented Dec 26, 2024

> Is the memory sufficient for an 8 GPUs instance? This model size is too large.

The 671B model fits on 8× H200 with FP8 weights: 671 GB < 141 GB × 8 = 1128 GB.
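The arithmetic behind this is a back-of-envelope weights-only estimate (it ignores KV cache, activations, and CUDA graph overhead, which eat into the headroom):

```python
# Weights-only memory check for DeepSeek V3 on 8x H200 (rough estimate).
PARAMS_B = 671          # total parameters, in billions
BYTES_PER_PARAM = 1     # FP8: one byte per parameter
GPU_MEM_GB = 141        # H200 HBM per GPU
NUM_GPUS = 8

weights_gb = PARAMS_B * BYTES_PER_PARAM   # ~671 GB of weights
total_gb = GPU_MEM_GB * NUM_GPUS          # 1128 GB aggregate HBM

print(weights_gb < total_gb)  # True: weights fit, with ~457 GB left over
```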

@zhyncs (Member, Author) commented Dec 26, 2024

Hi @fengyang95, you can also consider multi-node deployment. If you do not have GPUs with large enough memory, please try multi-node tensor parallelism (help 1, help 2).
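As a sketch of what a two-node tensor-parallel launch might look like (the IP address and TP degree are placeholders; verify the flags against the current SGLang documentation for your version):

```shell
# Hypothetical two-node setup, 8 GPUs per node, TP=16.
# Node 0 (master, assumed IP 10.0.0.1):
python -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3 \
    --tp 16 --dist-init-addr 10.0.0.1:5000 --nnodes 2 --node-rank 0 \
    --trust-remote-code

# Node 1:
python -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3 \
    --tp 16 --dist-init-addr 10.0.0.1:5000 --nnodes 2 --node-rank 1 \
    --trust-remote-code
```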

@zhyncs (Member, Author) commented Dec 26, 2024

FYI: due to the tight schedule, SGLang v0.4.1 currently provides only preliminary support for DeepSeek V3. To make it run more cost-efficiently, we need to complete most of the optimizations listed above. If you are interested in any of them, feel free to join the SGLang Slack for discussion or contribute a PR. We hope to finish these optimizations quickly and appreciate any discussion and contributions.

@zhyncs (Member, Author) commented Dec 27, 2024

Update: SGLang v0.4.1.post1 supports CUDA Graph for DeepSeek V3; please use the latest version.

```shell
pip install "sglang[all]==0.4.1.post1" --find-links https://flashinfer.ai/whl/cu124/torch2.4/flashinfer
```
