
How much GPU does this work usually require? #26

Closed
WangLanxiao opened this issue Mar 31, 2019 · 17 comments

@WangLanxiao

When I train the project on GPU 0 (8 GB), I get:
RuntimeError: CUDA out of memory. Tried to allocate 9.50 MiB (GPU 0; 7.93 GiB total capacity; 6.49 GiB already allocated; 14.81 MiB free; 30.49 MiB cached)

When I train the project on GPU 0 (12 GB), I get:
RuntimeError: CUDA out of memory. Tried to allocate 16.88 MiB (GPU 0; 11.90 GiB total capacity; 10.58 GiB already allocated; 18.44 MiB free; 62.29 MiB cached)

When I train the project on GPU 0 (12 GB) and GPU 1 (12 GB), I add:

    if args.cuda:
        net = nn.DataParallel(net, device_ids=[0, 1])
        net.cuda()

and I get:
RuntimeError: CUDA out of memory. Tried to allocate 16.88 MiB (GPU 0; 11.90 GiB total capacity; 10.58 GiB already allocated; 18.44 MiB free; 62.29 MiB cached)

What can I do to solve this? Thanks!


andfoy commented Apr 1, 2019

@WangLanxiao Thanks for your question. We ran all of our experiments on an NVIDIA Titan Xp, and the model did use the full 12 GB. The error may be related to newer PyTorch releases. What happens if you reduce the image size?

@WangLanxiao

Which version of PyTorch should I use? I'm currently training with PyTorch 1.0.1. Thanks for your reply!


andfoy commented Apr 1, 2019

@WangLanxiao Well, we developed this model on a very old version of PyTorch (0.2.0), but it worked up to 0.4.0/0.4.1. The error may be related to the convolution algorithm cuDNN selects during training. Please see this PyTorch forum discussion for more insight: https://discuss.pytorch.org/t/what-does-torch-backends-cudnn-benchmark-do/5936
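
If cuDNN's autotuning is picking a memory-hungry convolution algorithm, one quick experiment (untested on this model, so treat it as a sketch) is to disable benchmark mode before building the network:

    import torch

    # With benchmark mode off, cuDNN skips autotuning and uses its default
    # heuristic, which can reduce the workspace memory it allocates.
    torch.backends.cudnn.benchmark = False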

I forgot to tell you that, unfortunately, DMN is not parallelizable: training does not converge when it runs on multiple GPUs.

@WangLanxiao

What causes DMN to not be parallelizable?
Also, if I want to use visdom, what should I pass to args.visdom at the beginning of the project?

    parser.add_argument('--visdom', type=str, default=None,
                        help='visdom URL endpoint')


andfoy commented Apr 3, 2019

What causes DMN to not be parallelizable?

I think we have an error related to the learning-rate setup that prevents the gradients from being updated correctly after the allreduce across all GPUs.

Also, if I want to use visdom, what should I pass to args.visdom at the beginning of the project?

args.visdom should be the HTTP endpoint on which visdom is reachable. By default, visdom deploys at http://localhost:8097.
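
For example, once the server is running (python -m visdom.server), you can verify that the endpoint is reachable before launching training. A minimal sketch:

    from visdom import Visdom

    # Connect to the default visdom endpoint, http://localhost:8097.
    viz = Visdom(server='http://localhost', port=8097)
    assert viz.check_connection(), 'visdom server is not reachable'

That same endpoint is what you would pass as --visdom.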

@WangLanxiao

How can I use test_dmn.py? I couldn't find a main function related to testing DMN.


andfoy commented Apr 8, 2019

How can I use test_dmn.py? I couldn't find a main function related to testing DMN.

Are you referring to using DMN in evaluation mode?

@WangLanxiao

Yes. I have used the evaluation function in train.py and I get the mIoU, but it does not use test_dmn.py.


andfoy commented Apr 8, 2019

Actually, that's the expected behaviour. To evaluate a model, pass --eval-first together with --epochs 0, along with the path to the weights file you want to evaluate.
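
For example, the invocation would look roughly like this (the exact name of the weights argument depends on train.py's argparse setup, so check it there; --snapshot below is only a guess):

    python train.py --eval-first --epochs 0 --snapshot path/to/weights.pth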

@WangLanxiao

When I change the batch size to 2 and give suitable img_w and img_h values to the dataloader, I get:

Traceback (most recent call last):
  File "/home/lanxiao/DMS-master/dmn_pytorch/train.py", line 486, in <module>
    train_loss = train(epoch)
  File "/home/lanxiao/DMS-master/dmn_pytorch/train.py", line 287, in train
    for batch_idx, (imgs, masks, words) in enumerate(train_loader):
  File "/home/lanxiao/anaconda3/envs/pytorch36/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 637, in __next__
    return self._process_next_batch(batch)
  File "/home/lanxiao/anaconda3/envs/pytorch36/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 658, in _process_next_batch
    raise batch.exc_type(batch.exc_msg)
RuntimeError: Traceback (most recent call last):
  File "/home/lanxiao/anaconda3/envs/pytorch36/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 138, in _worker_loop
    samples = collate_fn([dataset[i] for i in batch_indices])
  File "/home/lanxiao/anaconda3/envs/pytorch36/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 232, in default_collate
    return [default_collate(samples) for samples in transposed]
  File "/home/lanxiao/anaconda3/envs/pytorch36/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 232, in <listcomp>
    return [default_collate(samples) for samples in transposed]
  File "/home/lanxiao/anaconda3/envs/pytorch36/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 209, in default_collate
    return torch.stack(batch, 0, out=out)
RuntimeError: invalid argument 0: Sizes of tensors must match except in dimension 0. Got 1 and 5 in dimension 1 at /opt/conda/conda-bld/pytorch_1549627089062/work/aten/src/TH/generic/THTensorMoreMath.cpp:1307

How can I solve it? Thanks!


andfoy commented Apr 8, 2019

It is not possible to increase the batch size in DMN, for the same reason that prevents the model from being parallelized.
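
As for the traceback itself: default_collate tries to stack the per-sample word tensors, and referring expressions have different lengths, so batching fails as soon as the batch size exceeds 1. A minimal reproduction of the error (the sizes 1 and 5 match the ones in your message):

    import torch

    # default_collate ultimately calls torch.stack on the samples; two 'word'
    # tensors with different token counts cannot be stacked into one batch.
    torch.stack([torch.zeros(1), torch.zeros(5)], 0)  # raises RuntimeError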


andfoy commented Apr 15, 2019

@WangLanxiao I suppose this issue has been fixed, as there has been no activity during the past 7 days. Feel free to reopen it if your issue has not been solved, or open a new one if you have a new question.

@andfoy andfoy closed this as completed Apr 15, 2019
@andfoy andfoy pinned this issue May 16, 2019
@Shivanshmundra

Hi @andfoy, I was trying to train on a custom dataset but am facing an issue. Due to the model's large GPU memory requirement, I can't train above an image size of 128 on a single 16 GB NVIDIA GPU, and since the model is not parallelizable, I can't use multiple GPUs. Can you suggest any way to train on higher-resolution data? Results at an image size of 128 are very difficult to interpret.


andfoy commented Oct 14, 2019

Hi @andfoy, I was trying to train on a custom dataset but am facing an issue. Due to the model's large GPU memory requirement, I can't train above an image size of 128 on a single 16 GB NVIDIA GPU, and since the model is not parallelizable, I can't use multiple GPUs. Can you suggest any way to train on higher-resolution data?

@Shivanshmundra Which version of PyTorch are you using? AFAIK, the model was trained on COCO images at 512 px on the largest side.

@Shivanshmundra

I am using the latest version of PyTorch. Will downgrading the PyTorch version result in less GPU memory usage?


andfoy commented Oct 14, 2019

I am using the latest version of PyTorch. Will downgrading the PyTorch version result in less GPU memory usage?

It is possible; this model was developed using a very old version of PyTorch.
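
If you want to compare memory behaviour across versions, PyTorch's allocator counters are a quick diagnostic (a sketch; on 0.4.x the reserved counter is called torch.cuda.memory_cached instead of memory_reserved):

    import torch

    # Report the caching allocator's state, e.g. right after a forward/backward pass.
    print(f'{torch.cuda.memory_allocated() / 1e9:.2f} GB allocated')
    print(f'{torch.cuda.memory_reserved() / 1e9:.2f} GB reserved by the caching allocator')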


Shivanshmundra commented Oct 15, 2019

It is possible; this model was developed using a very old version of PyTorch.

I just tried PyTorch 0.4 with CUDA 9.0, as suggested in the README, but it still gives an out-of-memory error above a resolution of 128.
