Describe the bug
Fine-tuning the model on both GPUs fails with the following error: RuntimeError: CUDA driver error: invalid argument.
Do you know what the problem is?
Reproduction
rank1: File "/home/conda_env/controlnet/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
rank1: return self._call_impl(*args, **kwargs)
rank1: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
rank1: File "/home/conda_env/controlnet/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
rank1: return forward_call(*args, **kwargs)
rank1: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
rank1: File "/home/conda_env/controlnet/lib/python3.11/site-packages/diffusers-0.32.0.dev0-py3.11.egg/diffusers/models/transformers/cogvideox_transformer_3d.py", line 148, in forward
rank1: ff_output = self.ff(norm_hidden_states)
rank1: ^^^^^^^^^^^^^^^^^^^^^^^^^^^
rank1: File "/home/conda_env/controlnet/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
rank1: return self._call_impl(*args, **kwargs)
rank1: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
rank1: File "/home/conda_env/controlnet/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
rank1: return forward_call(*args, **kwargs)
rank1: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
rank1: File "/home/conda_env/controlnet/lib/python3.11/site-packages/diffusers-0.32.0.dev0-py3.11.egg/diffusers/models/attention.py", line 1242, in forward
rank1: hidden_states = module(hidden_states)
rank1: ^^^^^^^^^^^^^^^^^^^^^
rank1: File "/home/conda_env/controlnet/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
rank1: return self._call_impl(*args, **kwargs)
rank1: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
rank1: File "/home/conda_env/controlnet/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
rank1: return forward_call(*args, **kwargs)
rank1: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
rank1: File "/home/conda_env/controlnet/lib/python3.11/site-packages/diffusers-0.32.0.dev0-py3.11.egg/diffusers/models/activations.py", line 88, in forward
rank1: hidden_states = self.proj(hidden_states)
rank1: ^^^^^^^^^^^^^^^^^^^^^^^^
rank1: File "/home/conda_env/controlnet/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
rank1: return self._call_impl(*args, **kwargs)
rank1: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
rank1: File "/home/conda_env/controlnet/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
rank1: return forward_call(*args, **kwargs)
rank1: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
rank1: File "/home/conda_env/controlnet/lib/python3.11/site-packages/torch/nn/modules/linear.py", line 125, in forward
rank1: return F.linear(input, self.weight, self.bias)
rank1: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
rank1: RuntimeError: CUDA driver error: invalid argument
Steps: 0%| | 0/133600000 [00:12<?, ?it/s]
[rank0]:[W1220 14:39:33.155016577 ProcessGroupNCCL.cpp:1250] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present, but this warning has only been added since PyTorch 2.4 (function operator())
W1220 14:39:35.723000 381051 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 381224 closing signal SIGTERM
E1220 14:39:36.039000 381051 site-packages/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: 1) local_rank: 0 (pid: 381223) of binary: /home/conda_env/controlnet/bin/python
Traceback (most recent call last):
File "/home/conda_env/controlnet/bin/accelerate", line 8, in
sys.exit(main())
^^^^^^
File "/home/conda_env/controlnet/lib/python3.11/site-packages/accelerate/commands/accelerate_cli.py", line 48, in main
args.func(args)
File "/home/conda_env/controlnet/lib/python3.11/site-packages/accelerate/commands/launch.py", line 1159, in launch_command
multi_gpu_launcher(args)
File "/home/conda_env/controlnet/lib/python3.11/site-packages/accelerate/commands/launch.py", line 793, in multi_gpu_launcher
distrib_run.run(args)
File "/home/conda_env/controlnet/lib/python3.11/site-packages/torch/distributed/run.py", line 910, in run
elastic_launch(
File "/home/conda_env/controlnet/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 138, in call
return launch_agent(self._config, self._entrypoint, list(args))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/conda_env/controlnet/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
train_controlnet.py FAILED
Failures:
<NO_OTHER_FAILURES>
Root Cause (first observed failure):
[0]:
time : 2024-12-20_14:39:35
host : robot
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 381223)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
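Aside from the crash itself, the ProcessGroupNCCL warning in the log above notes that the script never calls destroy_process_group() before exiting. That is almost certainly unrelated to the RuntimeError, but cleaning it up removes noise from the logs. Below is a minimal sketch of the usual pattern, assuming a generic train() entry point rather than the actual function name in train_controlnet.py:

import torch.distributed as dist

def train():
    # ... set up the accelerator/model and run the training loop ...
    pass

if __name__ == "__main__":
    try:
        train()
    finally:
        # Explicitly tear down the NCCL process group so any pending
        # collectives finish before the process exits, as the warning suggests.
        if dist.is_available() and dist.is_initialized():
            dist.destroy_process_group()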
Logs
No response
System Info
Ubuntu 20.04
CUDA 12.0
torch 2.5
diffusers 0.32.0.dev0
Who can help?
No response
Unable to deduce what exactly caused this error. I see it happens in the attention feed-forward projection, but there is nothing hinting at why. Could you maybe run with CUDA_LAUNCH_BLOCKING=1 and share your results? Does it also happen with PyTorch nightly? I'm able to run the CogVideoX scripts in https://github.com/a-r-r-o-w/finetrainers just fine.
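For the CUDA_LAUNCH_BLOCKING suggestion above, the variable has to be set before the first CUDA call, e.g. exported in the shell before accelerate launch or set at the very top of the training script. A minimal sketch of an isolation check is below: it sets the variable and then runs the same kind of op that failed (F.linear) on each visible GPU, outside the trainer. The shapes and the bfloat16 dtype are illustrative assumptions, not values taken from the actual run:

import os

# Must be set before the first CUDA call so kernel launches run synchronously
# and errors surface at the op that actually failed.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch
import torch.nn.functional as F

# Run a plain linear projection on every visible GPU to see whether the
# "invalid argument" error reproduces outside the training script.
for idx in range(torch.cuda.device_count()):
    device = torch.device(f"cuda:{idx}")
    x = torch.randn(2, 226, 3072, device=device, dtype=torch.bfloat16)  # illustrative shapes
    w = torch.randn(12288, 3072, device=device, dtype=torch.bfloat16)
    b = torch.randn(12288, device=device, dtype=torch.bfloat16)
    y = F.linear(x, w, b)
    torch.cuda.synchronize(device)
    print(f"cuda:{idx} OK, output shape {tuple(y.shape)}, dtype {y.dtype}")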