Can we use a GPU when running the fin_model demo? #445

YZH0216 · 2024-10-21T07:25:17Z

When I run "rdagent fin_model", it trains a GRU on my CPU without problems. How can I use a GPU device such as "cuda:0" to run this demo?
Some of the terminal output from running this script is shown below:

[1:MainThread](2024-10-21 03:13:05,144) INFO - qlib.GeneralPTNN - [pytorch_general_nn.py:74] - GeneralPTNN pytorch version...
[1:MainThread](2024-10-21 03:13:05,157) INFO - qlib.GeneralPTNN - [pytorch_general_nn.py:92] - GeneralPTNN parameters setting:
n_epochs : 100
lr : 0.001
metric : loss
batch_size : 2000
early_stop : 10
optimizer : adam
loss_type : mse
device : cpu
n_jobs : 20
use_GPU : False
weight_decay : 0.0001
seed : None
pt_model_uri: model.model_cls
pt_model_kwargs: {'num_features': 20, 'num_timesteps': 20}
[1:MainThread](2024-10-21 03:13:05,158) INFO - qlib.GeneralPTNN - [pytorch_general_nn.py:129] - model:
EnhancedDeepGRUModel(
(gru): GRU(20, 256, num_layers=5, batch_first=True, dropout=0.4)
(fc): Linear(in_features=256, out_features=1, bias=True)
)

TPLin22 · 2024-10-21T07:52:31Z

Hi,

You could first check whether the base image chosen in your Dockerfile supports GPU functionality.
The Dockerfile can be found at rdagent/scenarios/qlib/docker.
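
A quick way to see what PyTorch itself detects inside the container is a short script like the one below. This is a minimal sketch, not part of RD-Agent or qlib: `cuda_report` is a hypothetical helper name, and it assumes the script is run inside the image built from that Dockerfile. It only uses standard PyTorch calls (`torch.version.cuda`, `torch.cuda.is_available()`, `torch.cuda.device_count()`).

```python
import os


def cuda_report():
    """Collect basic facts about GPU visibility in the current process.

    Hypothetical helper for illustration; not part of RD-Agent or qlib.
    """
    report = {"CUDA_VISIBLE_DEVICES": os.environ.get("CUDA_VISIBLE_DEVICES")}
    try:
        import torch
    except ImportError:
        # PyTorch is missing from this environment, so nothing more to check.
        report["torch_installed"] = False
        return report

    report["torch_installed"] = True
    report["torch_version"] = torch.__version__
    # None here means a CPU-only PyTorch build, whatever the base image is.
    report["built_with_cuda"] = torch.version.cuda
    report["cuda_available"] = torch.cuda.is_available()
    # Valid device strings are "cuda:0" .. "cuda:<device_count - 1>".
    report["device_count"] = torch.cuda.device_count()
    return report


if __name__ == "__main__":
    for key, value in cuda_report().items():
        print(f"{key}: {value}")
```

If `built_with_cuda` is `None` or `cuda_available` is `False`, the base image or the way the container is started is the likely place to look; if `device_count` is smaller than the GPU index requested in the model config, the index itself is the problem.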

YZH0216 · 2024-10-21T12:35:26Z

I think I have the right Dockerfile; its contents are listed below.
```dockerfile
FROM pytorch/pytorch:2.2.1-cuda12.1-cudnn8-runtime

# For GPU support, please choose the proper tag from https://hub.docker.com/r/pytorch/pytorch/tags

RUN apt-get clean && apt-get update && apt-get install -y \
    curl \
    vim \
    git \
    build-essential \
    && rm -rf /var/lib/apt/lists/*

RUN git clone https://github.com/microsoft/qlib.git

WORKDIR /workspace/qlib

RUN git reset c9ed050ef034fe6519c14b59f3d207abcb693282 --hard

RUN python -m pip install --upgrade cython -i https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
RUN python -m pip install -e . -i https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple

RUN pip install catboost -i https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
RUN pip install xgboost -i https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
RUN pip install scipy==1.11.4 -i https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
```

I also successfully built a Docker image called "local_qlib". If I start it with "docker run --rm -ti --gpus all local_qlib /bin/bash", running "nvidia-smi" inside the container gives normal output.
```
(rdagent) youme@youme-System-Product-Name:~/Documents/PythonProjects/RD-Agent$ docker run --rm -ti --gpus all local_qlib /bin/bash
root@8fa2d3b4c6eb:/workspace/qlib# nvidia-smi
Mon Oct 21 12:30:08 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.01 Driver Version: 535.183.01 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce RTX 3080 Ti Off | 00000000:0A:00.0 On | N/A |
| 44% 55C P2 111W / 350W | 2724MiB / 12288MiB | 16% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
+---------------------------------------------------------------------------------------+
root@8fa2d3b4c6eb:/workspace/qlib# ^C
root@8fa2d3b4c6eb:/workspace/qlib# exit
```
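
The same check can be scripted so that the number of GPUs the container sees is available as data rather than as a table to read. The sketch below is an illustration only: `parse_gpu_list` and `list_gpus_in_image` are hypothetical helper names, the image name `local_qlib` is taken from the command above, and the UUID in `SAMPLE` is a made-up placeholder. It relies on `nvidia-smi -L`, which prints one line per GPU.

```python
import re
import subprocess

# One line of `nvidia-smi -L` looks like:
#   GPU 0: NVIDIA GeForce RTX 3080 Ti (UUID: GPU-...)
GPU_LINE = re.compile(r"^GPU (\d+): (.+?) \(UUID: [^)]+\)\s*$")


def parse_gpu_list(text):
    """Turn `nvidia-smi -L` output into a list of (index, name) tuples."""
    gpus = []
    for line in text.splitlines():
        match = GPU_LINE.match(line.strip())
        if match:
            gpus.append((int(match.group(1)), match.group(2)))
    return gpus


def list_gpus_in_image(image="local_qlib"):
    """Run `nvidia-smi -L` in a throwaway container and parse the result.

    Requires Docker and the NVIDIA container toolkit on the host.
    """
    cmd = ["docker", "run", "--rm", "--gpus", "all", image, "nvidia-smi", "-L"]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return parse_gpu_list(result.stdout)


# Placeholder UUID, for illustration only.
SAMPLE = "GPU 0: NVIDIA GeForce RTX 3080 Ti (UUID: GPU-00000000-0000-0000-0000-000000000000)\n"
print(parse_gpu_list(SAMPLE))  # [(0, 'NVIDIA GeForce RTX 3080 Ti')]
```

On a machine like the one above, `list_gpus_in_image()` would be expected to return a single entry with index 0, which means `cuda:0` is the only valid CUDA device string inside the container.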

However, when I run "rdagent fin_model", I get the error below.

[1:MainThread](2024-10-21 12:20:21,034) INFO - qlib.GeneralPTNN - [pytorch_general_nn.py:129] - model:
DeepGRUModel(
(gru): GRU(20, 128, num_layers=3, batch_first=True, dropout=0.2)
(fc): Linear(in_features=128, out_features=1, bias=True)
)
[1:MainThread](2024-10-21 12:20:21,034) INFO - qlib.GeneralPTNN - [pytorch_general_nn.py:130] - model size: 0.2440 MB
[1:MainThread](2024-10-21 12:20:21,520) INFO - qlib.timer - [log.py:127] - Time cost: 0.000s | waiting `async_log` Done
[1:MainThread](2024-10-21 12:20:21,522) ERROR - qlib.workflow - [utils.py:41] - An exception has been raised[RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
].
  File "/opt/conda/bin/qrun", line 8, in <module>
    sys.exit(run())
  File "/workspace/qlib/qlib/workflow/cli.py", line 151, in run
    fire.Fire(workflow)
  File "/opt/conda/lib/python3.10/site-packages/fire/core.py", line 135, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/opt/conda/lib/python3.10/site-packages/fire/core.py", line 468, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/opt/conda/lib/python3.10/site-packages/fire/core.py", line 684, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/workspace/qlib/qlib/workflow/cli.py", line 145, in workflow
    recorder = task_train(config.get("task"), experiment_name=experiment_name)
  File "/workspace/qlib/qlib/model/trainer.py", line 127, in task_train
    _exe_task(task_config)
  File "/workspace/qlib/qlib/model/trainer.py", line 45, in _exe_task
    model: Model = init_instance_by_config(task_config["model"], accept_types=Model)
  File "/workspace/qlib/qlib/utils/mod.py", line 180, in init_instance_by_config
    return klass(**cls_kwargs, **try_kwargs, **kwargs)
  File "/workspace/qlib/qlib/contrib/model/pytorch_general_nn.py", line 140, in __init__
    self.dnn_model.to(self.device)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1152, in to
    return self._apply(convert)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 802, in _apply
    module._apply(fn)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/rnn.py", line 216, in _apply
    ret = super()._apply(fn, recurse)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 825, in _apply
    param_applied = fn(param)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1150, in convert
    return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
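
For context, CUDA reports "invalid device ordinal" when the requested device index is not among the devices visible to the process, for example `cuda:2` on a single-GPU machine. Which index the demo requested is not shown in this log, so the following is only a sketch of a guard that makes such a mismatch explicit before the model is moved to the device. `resolve_device` is a hypothetical helper, not part of qlib or RD-Agent; in practice `visible_count` would come from `torch.cuda.device_count()`.

```python
def resolve_device(requested_index: int, visible_count: int) -> str:
    """Map a requested GPU index to a device string, or fail with a clear message.

    Hypothetical helper for illustration; not part of qlib or RD-Agent.
    A negative index or zero visible GPUs means "use the CPU".
    """
    if requested_index < 0 or visible_count <= 0:
        return "cpu"
    if requested_index >= visible_count:
        # This is the situation that surfaces as "invalid device ordinal".
        raise ValueError(
            f"GPU index {requested_index} requested, but only {visible_count} "
            f"device(s) are visible (valid indices: 0..{visible_count - 1})"
        )
    return f"cuda:{requested_index}"


print(resolve_device(0, 1))   # cuda:0
print(resolve_device(-1, 1))  # cpu
```

With one visible GPU, as in the nvidia-smi output earlier in this thread, any index other than 0 would be rejected by this guard with a readable message instead of a CUDA runtime error.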

YZH0216 · 2024-10-21T12:41:42Z

Besides, the Docker container seems to detect the GPU device correctly; the relevant log line is shown below.

2024-10-21 20:20:18.348 | INFO | rdagent.utils.env:_gpu_kwargs:269 - GPU Devices are available.

YZH0216 added the "question" (Further information is requested) label on Oct 21, 2024

JIBSIL mentioned this issue on Nov 28, 2024 in: RD-Agent fin_model fails on 1-2 GPU systems (w/ fix) #499 (Closed)
