DataParallel crashes about "not same cuda device" when training after onnx export #1582
Comments
I have been doing some relevant work in #1584 and ran into the same issue at several points. The problem probably has to do with
The reproducer you posted actually works on the state introduced in #1584.
Thanks for the reply! I will close this issue since it is solved in the latest version 😊
Unfortunately, it still reproduces with the latest NNCF in the Optimum tests (https://github.com/huggingface/optimum-intel/blob/main/tests/openvino/test_training_examples.py): RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:1! (when checking argument for argument index in method wrapper_CUDA__index_select)
It also reproduces when only quantization-aware training is used:
@ljaljushkin, @vshampor, is it still valid?
@MaximProshin yes, it's still valid
When using `DataParallel` in PyTorch (not DDP), any training that follows a call to `compression_ctrl.export_model()` crashes with: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:1!
Scripts for reproducing the error: https://gist.github.com/yujiepan-work/964d4716902ee75bf132dc4d80c96e61
After some debugging, I found that after the ONNX export, the "cuda:1" replica in DP runs its forward pass with model parameters that live on "cuda:0".
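For anyone who wants to check the same thing, here is a minimal sketch of a helper that groups every tensor reachable from a module by device. The name `tensors_by_device` is hypothetical and not part of NNCF or PyTorch. Scanning plain tensor attributes rests on the assumption that recent PyTorch versions store replica parameters as ordinary attributes rather than registered parameters. The demo uses the `meta` device in place of a second GPU so that it runs on a CPU-only machine.

```python
from collections import defaultdict

import torch
from torch import nn


def tensors_by_device(module: nn.Module) -> dict:
    """Group the names of all tensors reachable from `module` by device."""
    found = defaultdict(list)
    for prefix, sub in module.named_modules():
        # Registered parameters and buffers owned directly by this submodule.
        named = list(sub.named_parameters(recurse=False))
        named += list(sub.named_buffers(recurse=False))
        # Assumption: DataParallel replicas keep their parameters as plain
        # tensor attributes, so scan the instance dict for tensors as well.
        named += [(k, v) for k, v in vars(sub).items() if isinstance(v, torch.Tensor)]
        for name, tensor in named:
            full_name = f"{prefix}.{name}" if prefix else name
            found[str(tensor.device)].append(full_name)
    return dict(found)


# Demo: the second layer is deliberately created on another device
# ("meta" stands in for "cuda:1" here).
model = nn.Sequential(nn.Linear(4, 4), nn.Linear(4, 2, device="meta"))
report = tensors_by_device(model)
for device, names in report.items():
    print(device, names)
```

Calling this on the module received by a forward hook inside each DP replica, before and after the export, should show whether a replica that is supposed to run on "cuda:1" still references tensors on "cuda:0".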