
Single-GPU training works, but multi-GPU training fails with an NCCL timeout #2317

Open
jacksonlee02365894 opened this issue Dec 18, 2024 · 0 comments
Labels
question Further information is requested

Comments

@jacksonlee02365894

With the same dataset, single-GPU training runs normally, but multi-GPU training fails with the error below.
I have already tried progressively reducing batch_size, which did not resolve the problem.
[rank0]:[E ProcessGroupNCCL.cpp:563] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=3437, OpType=ALLREDUCE, NumelIn=4311252, NumelOut=4311252, Timeout(ms)=600000) ran for 600057 milliseconds before timing out.
[rank1]:[E ProcessGroupNCCL.cpp:563] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=3437, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600076 milliseconds before timing out.
[2024-12-18 16:44:15,305][root][INFO] - Validate epoch: 1, rank: 1

[rank1]:[E ProcessGroupNCCL.cpp:1537] [PG 0 Rank 1] Timeout at NCCL work: 3437, last enqueued NCCL work: 3437, last completed NCCL work: 3436.
[rank1]:[E ProcessGroupNCCL.cpp:577] [Rank 1] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank1]:[E ProcessGroupNCCL.cpp:583] [Rank 1] To avoid data inconsistency, we are taking the entire process down.
[rank1]:[E ProcessGroupNCCL.cpp:1414] [PG 0 Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=3437, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600076 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f2a60b7a897 in /home/data/miniconda3/envs/asr/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f2a148651b2 in /home/data/miniconda3/envs/asr/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f2a14869fd0 in /home/data/miniconda3/envs/asr/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f2a1486b31c in /home/data/miniconda3/envs/asr/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0xdbbf4 (0x7f2a602c7bf4 in /home/data/miniconda3/envs/asr/bin/../lib/libstdc++.so.6)
frame #5: + 0x94ac3 (0x7f2a61647ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: clone + 0x44 (0x7f2a616d8a04 in /lib/x86_64-linux-gnu/libc.so.6)
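In the log above, the default collective timeout is 600000 ms (10 minutes). A common first step while diagnosing such a hang is to enable NCCL diagnostics and raise the process-group timeout so the stall can be inspected instead of killing the job. The sketch below is a hedged suggestion, not part of the original report; the environment variables and the `timeout` argument to `torch.distributed.init_process_group` are standard PyTorch/NCCL settings, but the 30-minute value and the function name `init_distributed` are illustrative assumptions.

```python
import os
from datetime import timedelta

# Diagnostics: print per-rank NCCL setup and collective logs, and surface
# timeouts as catchable Python errors instead of a hard process takedown.
# (TORCH_NCCL_BLOCKING_WAIT is the name used by recent PyTorch releases;
# older releases used NCCL_BLOCKING_WAIT.)
os.environ["NCCL_DEBUG"] = "INFO"
os.environ["TORCH_NCCL_BLOCKING_WAIT"] = "1"

# The log shows Timeout(ms)=600000, i.e. the 10-minute default; extend it
# while debugging so the stalled collective can be observed.
TIMEOUT = timedelta(minutes=30)  # illustrative value, not from the issue


def init_distributed(rank: int, world_size: int) -> None:
    """Initialize the NCCL process group with a longer collective timeout."""
    # Imported lazily: requires a CUDA-enabled PyTorch build at runtime.
    import torch.distributed as dist

    dist.init_process_group(
        backend="nccl",
        rank=rank,
        world_size=world_size,
        timeout=TIMEOUT,
    )
```

Note also that rank 0 timed out on an ALLREDUCE with NumelIn=4311252 while rank 1 timed out with NumelIn=1: the ranks appear to be issuing different collectives at SeqNum=3437, which typically points to rank-dependent control flow (for example, validation running only on some ranks, or uneven batches per rank) rather than to batch size.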

@jacksonlee02365894 jacksonlee02365894 added the question Further information is requested label Dec 18, 2024