[ROCm] add support for ROCm/HIP device #6086
base: master
Conversation
- CMakeLists.txt ROCm updates; also replace glob with explicit file list
- initial warpSize interop changes
- helpers/hipify.sh script added
- .gitignore updated to ignore generated hip source files
- disable compiler warnings
- move the PercentileDevice __device__ template function into a header
- bug fixes for __host__ __device__ and __HIP__ preprocessor symbols
Thanks for your interest in LightGBM. Since I'm not aware of any prior conversation in this project about adding support like this, we have some questions before spending time supporting this.
- what is ROCm/HIP? Where can we read to learn more?
- what is the value of this addition to LightGBM's users? What does this offer that the OpenCL-based and CUDA-based builds of LightGBM don't already offer?
- this project's OpenCL-based GPU build is already struggling from a severe lack of maintenance... I'm very skeptical of taking on a third GPU build
- how might we test this? What types of devices should we expect to be supported?
@jeffdaily Thank you, this is very exciting! @jameslamb ROCm is the counterpart of CUDA for AMD GPUs. I haven't had any prior discussion with @jeffdaily about this, but it is very exciting if we can enlarge the set of devices supported by LightGBM.
Apologies for coming out of nowhere with this. We use LightGBM; the OpenCL-based 'gpu' device already works on our AMD GPUs, but we were curious whether we could get better performance by porting the 'cuda' device to AMD GPUs. This started as a proof of concept, but it seemed useful to share even in its current state. Using the GPU-Tutorial, here are my results on our MI210.
https://rocm.docs.amd.com/en/latest/rocm.html
See the perf results from the comment above.
Here is the current list of supported AMD GPUs. To test this, you'll need to run on one of the supported AMD GPUs. How is the cuda device currently tested?
Thank you and kudos, Jeff!
We run a VM in Azure with a Tesla V100 on it, and schedule jobs onto it via GitHub Actions.
Are you aware of any free CI service supporting AMD GPUs? Otherwise, since I see you work for AMD and since merging this might further AMD's interests... would AMD maybe be willing to fund testing resources for this project? Maybe that's something you and @shiyu1994 (the only maintainer here who's employed by Microsoft) could coordinate?
Microsoft does have an AMD GPU deployment. I'm aware of it being used for onnxruntime CI purposes. I wonder if some of those resources could be used here? @shiyu1994?
Noting that the only current CI failure is unrelated to my changes; it seems to be a (perhaps temporary) environment setup issue for that job.
I have access to some AMD MI100 GPUs, but we would still need a separate budget for an agent with an AMD GPU if we want to test automatically in CI. Do you think it is acceptable if I run the tests for AMD GPUs offline, without an additional CI agent, given that the GPU code is shared by both CUDA and ROCm? @jameslamb @guolinke @jeffdaily
If you feel confident in these changes based on that, and you think the added complexity in the CUDA code is worth it, that's fine with me. I'll defer to your opinion. But without a CI job, there's a high risk that future refactorings will break this support again.
I dismissed my review, so that it doesn't block merging. My initial questions have been answered, thanks very much for those links and all that information! @shiyu1994 and @guolinke seem excited about this addition... that's good enough for me 😊 I'll defer to them to review the code, as I know very little about CUDA.
@jeffdaily Thanks for the great work! I'll review this in the next few days. |
Thanks again for the contribution. I just got a Windows server with an AMD MI25 GPU. I'm trying to use that server as a CI agent. Hopefully it won't be difficult.
It's a pity that such a wonderful PR was abandoned! 😢 Quite interesting that HIP code can be run on NVIDIA cards! I believe we'll be able to run HIP code on our NVIDIA CI machine. It's not perfect and doesn't guarantee that the code works well on AMD, but at least it guarantees that the code isn't broken.
I'm picking this up. Let's try to get this merged soon.
@shiyu1994 thanks for picking this up!
I left one quick blocking suggestion, but haven't otherwise reviewed this. Will you please @ me once CI is passing? I can give a more thorough review then.
Co-authored-by: James Lamb <[email protected]>
That's just awesome! Thanks!
@jeffdaily Thanks for your contribution. Will wait for other reviewers for more comments.
@jameslamb Hi James, you may review this now. The CI issues have been fixed.
How about enabling this in a separate PR?
@shiyu1994 Thanks a lot for pushing this PR forward. I left some initial comments about CMake and CI.
.ci/hipify.sh
Outdated
do
  find ${DIR} -name "*.${EXT}" -exec sh -c '
    echo "hipifying $1 in-place"
    hipify-perl "$1" -inplace &
Where do we get the hipify-perl script?
It is installed when installing HIP.
https://github.com/ROCm/HIP/blob/master/INSTALL.md
Removed.
.ci/hipify.sh
Outdated
@@ -0,0 +1,16 @@
#!/bin/bash
I think this file should be added in a follow-up PR, in which we'll either enable hipifying in our CI or ask users to hipify locally before submitting CUDA-related PRs.
I agree. We can postpone this to the next PR for ROCm.
Removed.
CMakeLists.txt
Outdated
message(STATUS "ALLFEATS_DEFINES: ${ALLFEATS_DEFINES}")
message(STATUS "FULLDATA_DEFINES: ${FULLDATA_DEFINES}")

function(add_histogram hsize hname hadd hconst hdir)
How does this function differ from the existing one for CUDA? Can we reuse it, or merge the two functions into one?
Line 275 in 480600b
function(add_histogram hsize hname hadd hconst hdir)
The histogram*.cu files are only used with USE_GPU=ON, so we can actually remove this. I'm not sure why they appear under USE_CUDA at the current commit. Maybe we should move this into an if(USE_GPU) clause.
I see. They are not used with the USE_GPU version; instead, they were used in the old CUDA version. Given that that version has already been dropped, we can remove this.
Removed.
CMakeLists.txt
Outdated
)
endfunction()

foreach(hsize _16_64_256)
Same question as for the add_histogram function. Can we [encapsulate this for-loop into a function and] reuse it with CUDA and HIP?
Removed.
CMakeLists.txt
Outdated
endforeach()
endif()

if(USE_HDFS)
HDFS support was dropped some time ago. This if block should be removed.
Done in 8f6600e.
CMakeLists.txt
Outdated
target_link_libraries(_lightgbm PRIVATE ${histograms})
endif()

if(USE_HDFS)
Remove this.
Done in 8f6600e.
CMakeLists.txt
Outdated
@@ -644,6 +729,20 @@ if(USE_CUDA)
target_link_libraries(_lightgbm PRIVATE ${histograms})
endif()

if(USE_ROCM)
Can we merge CUDA and HIP with if(USE_CUDA OR USE_ROCM) here?
Done in 8f6600e.
Yeah, I support splitting this into separate PRs: this one with the modifications to the CUDA files and CMake, and a follow-up PR with CI jobs for ROCm and the hipifying scripts.
Thanks for updating the code!
I think this PR is blocked by #6766.
Also, I searched the CUDA code in the repo for the literals 32 and 64 and left some comments in places where the warp size can potentially be adjusted.
@@ -4,6 +4,7 @@ option(USE_GPU "Enable GPU-accelerated training" OFF)
option(USE_SWIG "Enable SWIG to generate Java API" OFF)
option(USE_TIMETAG "Set to ON to output time costs" OFF)
option(USE_CUDA "Enable CUDA-accelerated training " OFF)
option(USE_ROCM "Enable ROCM-accelerated training " OFF)
Suggested change:
- option(USE_ROCM "Enable ROCM-accelerated training " OFF)
+ option(USE_ROCM "Enable ROCm-accelerated training " OFF)
@@ -160,6 +161,11 @@ if(USE_CUDA)
  set(USE_OPENMP ON CACHE BOOL "CUDA requires OpenMP" FORCE)
endif()

if(USE_ROCM)
  enable_language(HIP)
  set(USE_OPENMP ON CACHE BOOL "ROCM requires OpenMP" FORCE)
Suggested change:
- set(USE_OPENMP ON CACHE BOOL "ROCM requires OpenMP" FORCE)
+ set(USE_OPENMP ON CACHE BOOL "ROCm requires OpenMP" FORCE)
if(USE_ROCM)
  find_package(HIP)
  include_directories(${HIP_INCLUDE_DIRS})
  set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -D__HIP_PLATFORM_AMD__")
Is it the same as -DCMAKE_HIP_PLATFORM=amd?
https://cmake.org/cmake/help/latest/variable/CMAKE_HIP_PLATFORM.html#variable:CMAKE_HIP_PLATFORM
Should we also set HIP_ARCHITECTURES? For NVIDIA, are they reused from CUDA_ARCHITECTURES?
add_definitions(-DUSE_CUDA)

set(
Not used. See #6766 (review).
if(USE_ROCM OR USE_CUDA)
Not used. See #6766 (review).
#define __shfl_down_sync(mask, val, offset) __shfl_down(val, offset)
#define __shfl_up_sync(mask, val, offset) __shfl_up(val, offset)
// ROCm warpSize is constexpr and is either 32 or 64 depending on gfx arch.
#define WARPSIZE warpSize
Should WARPSIZE also be used here?
__shared__ score_t shared_mem_buffer[32];
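To make the question concrete, here is a minimal CUDA/HIP sketch of the pattern being discussed. It is illustrative only, with hypothetical names rather than the actual LightGBM kernels: a compile-time WARPSIZE constant (like the macro in the excerpt above) drives both the warp shuffles and the size of the per-warp shared-memory scratch buffer, so the same code works when the wavefront is 64 lanes wide on AMD hardware.

```cpp
// Hypothetical compatibility shim, mirroring the WARPSIZE idea from the diff above.
#ifdef __HIP_PLATFORM_AMD__
// On ROCm, warpSize is a compile-time constant (32 or 64 depending on the gfx arch),
// and the *_sync shuffle intrinsics are assumed to be mapped to __shfl_* as in this PR.
#define WARPSIZE warpSize
#else
#define WARPSIZE 32  // all current NVIDIA architectures
#endif

// Block-wide sum: each warp reduces with shuffles, lane 0 stages its partial sum in
// shared memory, and the first warp combines the staged values. Sizing the buffer with
// WARPSIZE (instead of a hard-coded 32) keeps it large enough for any blockDim.x up to
// WARPSIZE * WARPSIZE threads on both 32-wide and 64-wide warps.
__device__ inline float BlockReduceSumSketch(float val) {
  __shared__ float warp_sums[WARPSIZE];  // was typically [32] in the CUDA-only code
  const unsigned int lane = threadIdx.x % WARPSIZE;
  const unsigned int warp = threadIdx.x / WARPSIZE;
  for (int offset = WARPSIZE / 2; offset > 0; offset >>= 1) {
    val += __shfl_down_sync(0xffffffff, val, offset);  // mask is ignored by the ROCm macro
  }
  if (lane == 0) warp_sums[warp] = val;
  __syncthreads();
  const unsigned int num_warps = (blockDim.x + WARPSIZE - 1) / WARPSIZE;
  val = (threadIdx.x < num_warps) ? warp_sums[lane] : 0.0f;
  if (warp == 0) {
    for (int offset = WARPSIZE / 2; offset > 0; offset >>= 1) {
      val += __shfl_down_sync(0xffffffff, val, offset);
    }
  }
  return val;  // thread 0 holds the block-wide sum
}
```

Whether the LightGBM kernels should adopt exactly this sizing is the question raised in this thread; the sketch only illustrates the mechanics.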
@@ -742,7 +744,7 @@ __global__ void FixHistogramKernel(
    const int* cuda_need_fix_histogram_features,
    const uint32_t* cuda_need_fix_histogram_features_num_bin_aligned,
    const CUDALeafSplitsStruct* cuda_smaller_leaf_splits) {
-  __shared__ hist_t shared_mem_buffer[32];
+  __shared__ hist_t shared_mem_buffer[WARPSIZE];
Should WARPSIZE also be used here?
__shared__ int64_t shared_mem_buffer[32];
@@ -167,7 +169,7 @@ void CUDASingleGPUTreeLearner::LaunchReduceLeafStatKernel(

template <typename T, bool IS_INNER>
__global__ void CalcBitsetLenKernel(const CUDASplitInfo* best_split_info, size_t* out_len_buffer) {
-  __shared__ size_t shared_mem_buffer[32];
+  __shared__ size_t shared_mem_buffer[WARPSIZE];
Should we also adjust the code that relies on the warp size always being 32? For example, here:
LightGBM/src/treelearner/cuda/cuda_single_gpu_tree_learner.cu
Lines 181 to 183 in 60b0155
  len = (val / 32) + 1;
}
const size_t block_max_len = ShuffleReduceMax<size_t>(len, shared_mem_buffer, blockDim.x);
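As a hedged illustration of the distinction involved here (hypothetical helpers, not the actual cuda_single_gpu_tree_learner.cu code): a literal 32 can mean either "bits per uint32_t word" in bitset packing, which should stay 32 on any hardware, or "threads per warp", which is the case that would switch to WARPSIZE.

```cpp
#include <cstddef>
#include <cstdint>

// (1) 32 as the bit width of a uint32_t word when computing a bitset length:
//     a property of the storage type, independent of the hardware warp size.
__host__ __device__ inline size_t BitsetLenInWords(size_t max_value) {
  return max_value / 32 + 1;  // equivalently: max_value / (8 * sizeof(uint32_t)) + 1
}

// (2) 32 as the number of threads per warp, e.g. when counting warps per block or
//     sizing per-warp scratch space: this is what a WARPSIZE constant (see the sketch
//     earlier in this thread) is meant to cover.
__device__ inline unsigned int NumWarpsPerBlock() {
  return (blockDim.x + WARPSIZE - 1) / WARPSIZE;
}
```

If the `val / 32` above is really the word-packing case, it would stay as-is even on 64-wide wavefronts, and only the ShuffleReduce scratch sizing would change; whether that reading is correct is exactly the question for the maintainers.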
@@ -747,7 +749,7 @@ __global__ void AggregateBlockOffsetKernel1(
    data_size_t* block_to_right_offset_buffer, data_size_t* cuda_leaf_data_start,
    data_size_t* cuda_leaf_data_end, data_size_t* cuda_leaf_num_data, const data_size_t* cuda_data_indices,
    const data_size_t num_blocks) {
-  __shared__ uint32_t shared_mem_buffer[32];
+  __shared__ uint32_t shared_mem_buffer[WARPSIZE];
Should WARPSIZE also be used here?
__shared__ double shared_mem_buffer[32];
@@ -354,7 +358,7 @@ void CUDALambdarankNDCG::LaunchGetGradientsKernel(const double* score, score_t*
  }
} else {
  BitonicArgSortItemsGlobal(score, num_queries_, cuda_query_boundaries_, cuda_item_indices_buffer_.RawData());
-  if (num_rank_label <= 32) {
+  if (num_rank_label <= 32 && device_prop.warpSize == 32) {
Should we adjust the following code for a warp size other than 32?
LightGBM/src/objective/cuda/cuda_rank_objective.cu
Lines 407 to 408 in 60b0155
  // assert that warpSize == 32
  __shared__ double shared_buffer[32];
LightGBM/src/objective/cuda/cuda_rank_objective.cu
Lines 525 to 526 in 60b0155
  // assert that warpSize == 32, so we use buffer size 1024 / 32 = 32
  __shared__ double shared_buffer[32];
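One possible direction, sketched with hypothetical names rather than the real cuda_rank_objective.cu: instead of asserting warpSize == 32 and hard-coding a 32-slot buffer, derive the per-warp slot count from the maximum block size and the WARPSIZE constant discussed earlier in this thread.

```cpp
// Illustrative only. Assumes the WARPSIZE compile-time constant from the compatibility
// sketch above (the PR states warpSize is constexpr on ROCm) and that kernels are
// launched with at most kMaxThreadsPerBlock threads per block.
constexpr int kMaxThreadsPerBlock = 1024;
// "buffer size 1024 / 32 = 32" generalizes to 1024 / WARPSIZE (16 slots on 64-wide wavefronts).
constexpr int kWarpSlotCount = kMaxThreadsPerBlock / WARPSIZE;

__global__ void PerWarpStageSketch(const double* item_scores, double* block_out, int n) {
  __shared__ double shared_buffer[kWarpSlotCount];  // replaces the hard-coded [32]
  const int idx = blockIdx.x * blockDim.x + threadIdx.x;
  // Each warp's lane 0 stages one value; thread 0 then combines the staged values.
  if (threadIdx.x % WARPSIZE == 0) {
    shared_buffer[threadIdx.x / WARPSIZE] = (idx < n) ? item_scores[idx] : 0.0;
  }
  __syncthreads();
  if (threadIdx.x == 0) {
    double total = 0.0;
    const int num_warps = (blockDim.x + WARPSIZE - 1) / WARPSIZE;
    for (int w = 0; w < num_warps; ++w) total += shared_buffer[w];
    block_out[blockIdx.x] = total;
  }
}
```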
To build for ROCm:
CUDA source files are hipified in-place using the helper script before running cmake. The "cuda" device is reused for ROCm, so device=cuda works the same for ROCm builds.
Summary of changes:
- CMakeLists.txt ROCm updates; also replace glob with explicit file list
- initial warpSize interop changes
- helpers/hipify.sh script added
- .gitignore updated to ignore generated hip source files
- disable compiler warnings
- move the PercentileDevice __device__ template function into a header
- bug fixes for __host__ __device__ and __HIP__ preprocessor symbols
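For context on the "move the PercentileDevice __device__ template function into a header" item, here is a minimal hypothetical sketch (not the actual LightGBM PercentileDevice) of the underlying pattern: a __host__ __device__ function template has to be defined in a header, because every translation unit that instantiates it, whether compiled by nvcc or by hipcc after hipification, needs to see the full definition.

```cpp
// percentile_sketch.hpp -- hypothetical header, not the real LightGBM code.
#ifndef PERCENTILE_SKETCH_HPP_
#define PERCENTILE_SKETCH_HPP_

// Linear-interpolated percentile over an ascending-sorted array with n >= 1; q in [0, 1].
// Because this is a template with __host__ __device__ qualifiers, it must live in a
// header so its definition is visible at each instantiation site in the .cu
// (or hipified .hip) sources.
template <typename T>
__host__ __device__ inline T LinearInterpolatedPercentile(const T* sorted_values, int n, double q) {
  const double pos = q * (n - 1);
  const int lo = static_cast<int>(pos);
  const int hi = (lo + 1 < n) ? (lo + 1) : lo;
  const double frac = pos - static_cast<double>(lo);
  return static_cast<T>((1.0 - frac) * sorted_values[lo] + frac * sorted_values[hi]);
}

#endif  // PERCENTILE_SKETCH_HPP_
```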