
How much GPU does this work usually require? #26

Closed
WangLanxiao opened this issue Mar 31, 2019 · 17 comments

@WangLanxiao

When I train the project on GPU 0 (8 GB), I get:
RuntimeError: CUDA out of memory. Tried to allocate 9.50 MiB (GPU 0; 7.93 GiB total capacity; 6.49 GiB already allocated; 14.81 MiB free; 30.49 MiB cached)

When I train the project on GPU 0 (12 GB), I get:
RuntimeError: CUDA out of memory. Tried to allocate 16.88 MiB (GPU 0; 11.90 GiB total capacity; 10.58 GiB already allocated; 18.44 MiB free; 62.29 MiB cached)

When I train the project on GPU 0 (12 GB) and GPU 1 (12 GB), I add:

    if args.cuda:
        net = nn.DataParallel(net, device_ids=[0, 1])
        net.cuda()

and I get:
RuntimeError: CUDA out of memory. Tried to allocate 16.88 MiB (GPU 0; 11.90 GiB total capacity; 10.58 GiB already allocated; 18.44 MiB free; 62.29 MiB cached)

What can I do to solve this? Thanks!


andfoy commented Apr 1, 2019

@WangLanxiao Thanks for your question. We ran all of our experiments on an NVIDIA Titan Xp, and the model did use the full 12 GB. The error may be related to newer PyTorch releases. What happens if you reduce the image size?

@WangLanxiao

Which version of PyTorch should I use? I'm currently training with PyTorch 1.0.1. Thanks for your reply!


andfoy commented Apr 1, 2019

@WangLanxiao Well, we developed this model on a very old version of PyTorch (0.2.0), but it worked up to 0.4.0/0.4.1. The error may be related to the convolution algorithm cuDNN selects during training. Please see this PyTorch forum discussion for more insight: https://discuss.pytorch.org/t/what-does-torch-backends-cudnn-benchmark-do/5936
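
If cuDNN's autotuning is picking a memory-hungry convolution algorithm, one quick experiment (untested on this model, so treat it as a sketch) is to disable benchmark mode before building the network:

    import torch

    # With benchmark mode off, cuDNN skips autotuning and uses its default
    # heuristic, which can reduce the workspace memory it allocates.
    torch.backends.cudnn.benchmark = False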

I forgot to tell you that, unfortunately, DMN is not parallelizable: training does not converge when it runs on multiple GPUs.

@WangLanxiao

What causes DMN to not be parallelizable?
Also, if I want to use visdom, what should I pass to args.visdom at the beginning of the project?

    parser.add_argument('--visdom', type=str, default=None,
                        help='visdom URL endpoint')


andfoy commented Apr 3, 2019

What causes DMN to not be parallelizable?

I think we have an error related to the learning-rate setup that prevents the gradients from being updated correctly after the allreduce across all GPUs.

Also, if I want to use visdom, what should I pass to args.visdom at the beginning of the project?

args.visdom should be the HTTP endpoint on which visdom is reachable. By default, visdom deploys at http://localhost:8097.
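
For example, once the server is running (python -m visdom.server), you can verify that the endpoint is reachable before launching training. A minimal sketch:

    from visdom import Visdom

    # Connect to the default visdom endpoint, http://localhost:8097.
    viz = Visdom(server='http://localhost', port=8097)
    assert viz.check_connection(), 'visdom server is not reachable'

That same endpoint is what you would pass as --visdom.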

@WangLanxiao

How can I use test_dmn.py? I couldn't find a main function related to testing DMN.


andfoy commented Apr 8, 2019

How can I use test_dmn.py? I couldn't find a main function related to testing DMN.

Are you referring to using DMN in evaluation mode?

@WangLanxiao

Yes. I have used the evaluation function in train.py and I get the mIoU, but it does not use test_dmn.py.


andfoy commented Apr 8, 2019

Actually, that's the expected behaviour. To evaluate a model, pass --eval-first together with --epochs 0, along with the path to the weights file you want to evaluate.
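
For example, the invocation would look roughly like this (the exact name of the weights argument depends on train.py's argparse setup, so check it there; --snapshot below is only a guess):

    python train.py --eval-first --epochs 0 --snapshot path/to/weights.pth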

@WangLanxiao

When I change the batch size to 2 and give suitable img_w and img_h values to the dataloader, I get:

Traceback (most recent call last):
  File "/home/lanxiao/DMS-master/dmn_pytorch/train.py", line 486, in <module>
    train_loss = train(epoch)
  File "/home/lanxiao/DMS-master/dmn_pytorch/train.py", line 287, in train
    for batch_idx, (imgs, masks, words) in enumerate(train_loader):
  File "/home/lanxiao/anaconda3/envs/pytorch36/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 637, in __next__
    return self._process_next_batch(batch)
  File "/home/lanxiao/anaconda3/envs/pytorch36/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 658, in _process_next_batch
    raise batch.exc_type(batch.exc_msg)
RuntimeError: Traceback (most recent call last):
  File "/home/lanxiao/anaconda3/envs/pytorch36/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 138, in _worker_loop
    samples = collate_fn([dataset[i] for i in batch_indices])
  File "/home/lanxiao/anaconda3/envs/pytorch36/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 232, in default_collate
    return [default_collate(samples) for samples in transposed]
  File "/home/lanxiao/anaconda3/envs/pytorch36/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 232, in <listcomp>
    return [default_collate(samples) for samples in transposed]
  File "/home/lanxiao/anaconda3/envs/pytorch36/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 209, in default_collate
    return torch.stack(batch, 0, out=out)
RuntimeError: invalid argument 0: Sizes of tensors must match except in dimension 0. Got 1 and 5 in dimension 1 at /opt/conda/conda-bld/pytorch_1549627089062/work/aten/src/TH/generic/THTensorMoreMath.cpp:1307

How can I solve it? Thanks!


andfoy commented Apr 8, 2019

It is not possible to increase the batch size in DMN, for the same reason that prevents the model from being parallelized.
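
As for the traceback itself: default_collate tries to stack the per-sample word tensors, and referring expressions have different lengths, so batching fails as soon as the batch size exceeds 1. A minimal reproduction of the error (the sizes 1 and 5 match the ones in your message):

    import torch

    # default_collate ultimately calls torch.stack on the samples; two 'word'
    # tensors with different token counts cannot be stacked into one batch.
    torch.stack([torch.zeros(1), torch.zeros(5)], 0)  # raises RuntimeError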


andfoy commented Apr 15, 2019

@WangLanxiao I suppose this issue has been fixed, as there has been no activity during the past 7 days. Feel free to reopen it if your issue has not been solved, or open a new one if you have a new question.

@andfoy andfoy closed this as completed Apr 15, 2019
@andfoy andfoy pinned this issue May 16, 2019
@Shivanshmundra

Hi @andfoy, I was trying to train on a custom dataset but am facing an issue. Due to the model's large GPU memory requirement, I can't train above an image size of 128 on a single 16 GB NVIDIA GPU, and since the model is not parallelizable, I can't use multiple GPUs. Can you suggest any way to train on higher-resolution data? Results at an image size of 128 are very difficult to interpret.


andfoy commented Oct 14, 2019

Hi @andfoy, I was trying to train on a custom dataset but am facing an issue. Due to the model's large GPU memory requirement, I can't train above an image size of 128 on a single 16 GB NVIDIA GPU, and since the model is not parallelizable, I can't use multiple GPUs. Can you suggest any way to train on higher-resolution data?

@Shivanshmundra Which version of PyTorch are you using? AFAIK, the model was trained on COCO images at 512 px on the largest side.

@Shivanshmundra

I am using the latest version of PyTorch. Will downgrading the PyTorch version result in less GPU memory usage?


andfoy commented Oct 14, 2019

I am using the latest version of PyTorch. Will downgrading the PyTorch version result in less GPU memory usage?

It is possible; this model was developed using a very old version of PyTorch.
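
If you want to compare memory behaviour across versions, PyTorch's allocator counters are a quick diagnostic (a sketch; on 0.4.x the reserved counter is called torch.cuda.memory_cached instead of memory_reserved):

    import torch

    # Report the caching allocator's state, e.g. right after a forward/backward pass.
    print(f'{torch.cuda.memory_allocated() / 1e9:.2f} GB allocated')
    print(f'{torch.cuda.memory_reserved() / 1e9:.2f} GB reserved by the caching allocator')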


Shivanshmundra commented Oct 15, 2019

It is possible; this model was developed using a very old version of PyTorch.

I just tried PyTorch 0.4 with CUDA 9.0, as suggested in the README, but it still gives an out-of-memory error above a resolution of 128.
