
[Distributed] [Cherry-Pick] Support fused param buffer and ipc meta for optimizer states and parameter storage #70520

Open
SylarTiaNII wants to merge 4 commits into develop from cp_fuse_optimizer_states
Conversation

@SylarTiaNII (Contributor) commented Dec 27, 2024

PR Category

Distributed Strategy

PR Types

New features

Description

CP from: #69625; #70029; #70246; #70323
This is a precondition for using Flash Checkpoints on PaddleNLP.
Fuses optimizer states and master weights into a contiguous storage format, so that those variables can be obtained through CUDA IPC via FusionStorageHelper at no copy cost. Set

sharding_parallel_config.enable_fuse_optimizer_states=True

to enable fusion for the sharding optimizer; hedged sketches of the flag setup and of the underlying IPC exchange follow this description.
[Pcard-88789]
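A minimal sketch of one plausible way to enable the flag in a dygraph hybrid-parallel setup. Only the flag name `enable_fuse_optimizer_states` comes from this PR's description; the `hybrid_configs`/`sharding_configs` attribute path and the parallel degrees below are assumptions for illustration.

```python
# Hedged sketch: enabling fused optimizer states for the sharding optimizer.
# The attribute path (sharding_configs on the fleet DistributedStrategy) is
# an assumption; the flag name comes from this PR's description.
import paddle.distributed.fleet as fleet

strategy = fleet.DistributedStrategy()
strategy.hybrid_configs = {
    "dp_degree": 1,
    "mp_degree": 1,
    "pp_degree": 1,
    "sharding_degree": 8,  # shard optimizer states across 8 ranks
}
# Assumed location of the new flag:
strategy.hybrid_configs["sharding_configs"].enable_fuse_optimizer_states = True

fleet.init(is_collective=True, strategy=strategy)
```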
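The zero-copy claim rests on CUDA IPC: once optimizer states live in one fused device buffer, a single IPC handle can expose the whole buffer to another process. This page does not show FusionStorageHelper's interface, so the sketch below illustrates only the raw CUDA runtime handle exchange via ctypes; every name in it is illustrative, not the PR's API.

```python
# Hedged sketch of the CUDA IPC flow underlying zero-copy sharing: the
# producer exports a handle for a device allocation, the consumer maps it
# and reads the same memory without any copy. NOT the PR's
# FusionStorageHelper API.
import ctypes

cudart = ctypes.CDLL("libcudart.so")

CUDA_IPC_HANDLE_SIZE = 64  # sizeof(cudaIpcMemHandle_t)


class CudaIpcMemHandle(ctypes.Structure):
    _fields_ = [("reserved", ctypes.c_ubyte * CUDA_IPC_HANDLE_SIZE)]


def export_handle(dev_ptr: int) -> bytes:
    """Producer: turn a device pointer into a 64-byte shareable handle."""
    handle = CudaIpcMemHandle()
    err = cudart.cudaIpcGetMemHandle(ctypes.byref(handle),
                                     ctypes.c_void_p(dev_ptr))
    if err != 0:
        raise RuntimeError(f"cudaIpcGetMemHandle failed with error {err}")
    return bytes(handle.reserved)  # ship over any channel (pipe, socket, ...)


def import_handle(raw: bytes) -> int:
    """Consumer: map the producer's allocation into this process, zero-copy."""
    handle = CudaIpcMemHandle()
    ctypes.memmove(handle.reserved, raw, CUDA_IPC_HANDLE_SIZE)
    dev_ptr = ctypes.c_void_p()
    flags = ctypes.c_uint(1)  # cudaIpcMemLazyEnablePeerAccess
    err = cudart.cudaIpcOpenMemHandle(ctypes.byref(dev_ptr), handle, flags)
    if err != 0:
        raise RuntimeError(f"cudaIpcOpenMemHandle failed with error {err}")
    return dev_ptr.value  # device address of the shared fused buffer
```

Read this way, a checkpoint-saving process could import the handle once and then snapshot optimizer states directly from device memory without stalling training, which appears to be the no-cost property the description refers to.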

…ddle#69625)

* [Distributed] support fuse optimizer

* [Distributed] polish fuse optimizer codes

* [Distributed] add UT for fused optimizer

* [Distributed] support training from scratch with fuse optimizer states enabled

* [Distributed] fix sharding UT

* [Distributed] fix multiprocessing cuda env in distributed usage

* [Distributed] add fused optimizer states ut

* [Distributed] fix other UT

* [Distributed] remove reduction hack fix

paddle-bot (bot) commented Dec 27, 2024

Your PR has been submitted. Thanks for your contribution!
Please wait for the CI result first. See the Paddle CI Manual for details.

@SylarTiaNII force-pushed the cp_fuse_optimizer_states branch from f94dce9 to fc42331 on Dec 27, 2024 at 10:16