[BUG]: why duplicate PID appears on rank 0 #6111
Comments
While digging into this, I found that when the optimizer is saved, the PIDs from other ranks appear on rank 0.
I observed that, following this line:
Furthermore, after reaching this line:
If
And after reaching this line:
the PIDs of the other ranks still start appearing on each rank.
Could any ColossalAI-ers help me? Thanks a lot.
Is there an existing issue for this bug?
🐛 Describe the bug
When using the GeminiPlugin to train a model, everything runs normally at the start. However, once a sharded checkpoint is saved, a duplicate PID appears on rank 0.
Start:
After saving a checkpoint:
Why does this happen, and how can I avoid it? Thanks a lot.
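For anyone trying to reproduce this, a small helper can make the symptom easier to pin down than eyeballing `nvidia-smi`. The sketch below is only a diagnostic aid, not a fix and not part of ColossalAI: it parses the output of `nvidia-smi --query-compute-apps=gpu_uuid,pid --format=csv,noheader` and reports any PID that holds a context on more than one GPU. The function names and the sample data are made up for illustration. Running it before and after the checkpoint save should show which ranks have picked up an extra context.

```python
import subprocess
from collections import defaultdict


def parse_compute_apps(csv_text):
    """Map each PID to the set of GPU UUIDs it currently has a context on.

    Expects lines of the form "<gpu_uuid>, <pid>" as printed by
    nvidia-smi --query-compute-apps=gpu_uuid,pid --format=csv,noheader
    """
    gpus_by_pid = defaultdict(set)
    for line in csv_text.strip().splitlines():
        if not line.strip():
            continue
        gpu_uuid, pid = [field.strip() for field in line.split(",")[:2]]
        gpus_by_pid[int(pid)].add(gpu_uuid)
    return dict(gpus_by_pid)


def pids_on_multiple_gpus(gpus_by_pid):
    """Return only the PIDs that show up on more than one GPU."""
    return {
        pid: sorted(gpus)
        for pid, gpus in gpus_by_pid.items()
        if len(gpus) > 1
    }


def query_nvidia_smi():
    """Run nvidia-smi and return its raw CSV output (needs an NVIDIA driver)."""
    result = subprocess.run(
        ["nvidia-smi", "--query-compute-apps=gpu_uuid,pid", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


# Made-up sample: PID 1002 (say, rank 1) also holds a context on the first GPU.
SAMPLE = """\
GPU-aaaa, 1001
GPU-bbbb, 1002
GPU-aaaa, 1002
"""

if __name__ == "__main__":
    print(pids_on_multiple_gpus(parse_compute_apps(SAMPLE)))
```

On a real machine, replace `SAMPLE` with `query_nvidia_smi()` and call it once at startup and once right after the checkpoint is written; any PID reported the second time but not the first touched another device somewhere in the save path.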
Environment
Torch: 2.1.2
ColossalAI: 0.4.2
Python: 3.8
CUDA: 12.1.0