[ROCm] add support for ROCm/HIP device #6086
base: master
Conversation
- CMakeLists.txt ROCm updates; also replace glob with explicit file list
- initial warpSize interop changes
- helpers/hipify.sh script added
- .gitignore updated to ignore generated hip source files
- disable compiler warnings
- move the PercentileDevice __device__ template function into a header
- bug fixes for __host__ __device__ and __HIP__ preprocessor symbols
Thanks for your interest in LightGBM. Since I'm not aware of any prior conversation in this project about adding support like this, we have some questions before spending time supporting this.
- what is ROCm/HIP? Where can we read to learn more?
- what is the value of this addition to LightGBM's users? What does this offer that the OpenCL-based and CUDA-based builds of LightGBM don't already offer?
- this project's OpenCL-based GPU build is already struggling from a severe lack of maintenance... I'm very skeptical of taking on a third GPU build
- how might we test this? What types of devices should we expect to be supported?
@jeffdaily Thank you, this is very exciting! @jameslamb ROCm is the counterpart of CUDA for AMD GPUs. I haven't had any prior discussion with @jeffdaily about this, but it is very exciting if we can enlarge the set of devices supported by LightGBM.
Apologies for coming out of nowhere with this. We use LightGBM; the OpenCL-based 'gpu' device already works on our AMD GPUs, but we were curious whether we could get better performance by porting the 'cuda' device to AMD GPUs. This started as a proof of concept, but it seemed useful to share even in its current state. Using the GPU-Tutorial, here are my results on our MI210.
https://rocm.docs.amd.com/en/latest/rocm.html
See the perf results from the comment above.
Here is the current list of supported AMD GPUs. To test this, you'll need to run on one of the supported AMD GPUs. How is the cuda device currently tested?
Thank you and kudos, Jeff!
We run a VM in Azure with a Tesla V100 on it, and schedule jobs onto it via GitHub Actions.
Are you aware of any free CI service supporting AMD GPUs? Otherwise, since I see you work for AMD and since merging this might further AMD's interests... would AMD maybe be willing to fund testing resources for this project? Maybe that's something you and @shiyu1994 (the only maintainer here who's employed by Microsoft) could coordinate?
Microsoft does have an AMD GPU deployment. I'm aware of it being used for onnxruntime CI purposes. I wonder if some of those resources could be used here? @shiyu1994?
Noting that the only current CI failure is unrelated to my changes; it seems to be a (perhaps temporary) environment setup issue for that job.
I have access to some AMD MI100 GPUs, but we would still need a separate budget for an agent with an AMD GPU if we want to test automatically in CI. Do you think it is acceptable if I run the tests for AMD GPUs offline, without an additional CI agent, given that the GPU code is shared by both CUDA and ROCm? @jameslamb @guolinke @jeffdaily
If you feel confident in these changes based on that, and you think the added complexity in the CUDA code is worth it, that's fine with me. I'll defer to your opinion. But without a CI job, there's a high risk that future refactorings will break this support again.
I dismissed my review, so that it doesn't block merging. My initial questions have been answered, thanks very much for those links and all that information! @shiyu1994 and @guolinke seem excited about this addition... that's good enough for me 😊 I'll defer to them to review the code, as I know very little about CUDA.
@jeffdaily Thanks for the great work! I'll review this in the next few days. |
Thanks again for the contribution. I just got a Windows server with an AMD MI25 GPU. I'm trying to use that server as a CI agent. Hopefully it won't be difficult.
It's a pity that such a wonderful PR was abandoned! 😢 Quite interesting that HIP code can be run on NVIDIA cards! I believe we'll be able to run HIP code on our NVIDIA CI machine. It's not perfect and doesn't guarantee that the code works well on AMD, but at least it guarantees that the code isn't broken.
I'm picking this up. Let's try to get this merged soon.
@shiyu1994 thanks for picking this up!
I left one quick blocking suggestion, but haven't otherwise reviewed this. Will you please @ me once CI is passing? I can give a more thorough review then.
Co-authored-by: James Lamb <[email protected]>
That's just awesome! Thanks!
@jeffdaily Thanks for your contribution. Will wait for other reviewers for more comments.
@jameslamb Hi James, you may review this now. The CI issues have been fixed.
How about enabling this in a separate PR?
@shiyu1994 Thanks a lot for pushing this PR forward. I left some initial comments about CMake and CI.
.ci/hipify.sh
Outdated
do
  find ${DIR} -name "*.${EXT}" -exec sh -c '
    echo "hipifying $1 in-place"
    hipify-perl "$1" -inplace &
Where do we get the hipify-perl script?
It is installed when installing HIP.
https://github.com/ROCm/HIP/blob/master/INSTALL.md
Removed.
.ci/hipify.sh
Outdated
@@ -0,0 +1,16 @@
#!/bin/bash
I think this file should be added in a follow-up PR, in which we'll either enable hipifying in our CI or ask users to hipify locally before submitting CUDA-related PRs.
I agree. We can postpone this to the next PR for ROCm.
Removed.
CMakeLists.txt
Outdated
message(STATUS "ALLFEATS_DEFINES: ${ALLFEATS_DEFINES}")
message(STATUS "FULLDATA_DEFINES: ${FULLDATA_DEFINES}")

function(add_histogram hsize hname hadd hconst hdir)
How does this function differ from the existing one for CUDA? Can we reuse it, or merge the two functions into one?
Line 275 in 480600b
function(add_histogram hsize hname hadd hconst hdir)
The histogram*.cu files are only used with USE_GPU=ON, so we can actually remove this. I'm not sure why they appear under USE_CUDA at the current commit. Maybe we should move this into an if(USE_GPU) clause.
I see. They are not used with the USE_GPU version; instead, they were used in the old CUDA version. Given that that version has already been dropped, we can remove this.
Removed.
CMakeLists.txt
Outdated
)
endfunction()

foreach(hsize _16_64_256)
Same question as for the add_histogram function. Can we [encapsulate this for-loop into a function and] reuse it with CUDA and HIP?
Removed.
CMakeLists.txt
Outdated
endforeach()
endif()

if(USE_HDFS)
HDFS support was dropped some time ago. This if block should be removed.
Done in 8f6600e.
CMakeLists.txt
Outdated
target_link_libraries(_lightgbm PRIVATE ${histograms})
endif()

if(USE_HDFS)
Remove this.
Done in 8f6600e.
CMakeLists.txt
Outdated
@@ -644,6 +729,20 @@ if(USE_CUDA)
target_link_libraries(_lightgbm PRIVATE ${histograms})
endif()

if(USE_ROCM)
Can we merge CUDA and HIP with if(USE_CUDA OR USE_ROCM) here?
Done in 8f6600e.
Yeah, I support splitting this into separate PRs: this one with the modifications to the CUDA files and CMake, and a follow-up PR with CI jobs for ROCm and the hipifying scripts.
Thanks for updating the code!
I think this PR is blocked by #6766.
Also, I searched the CUDA code in the repo for the literals 32 and 64 and left some comments in places where the warp size can potentially be adjusted.
@@ -4,6 +4,7 @@ option(USE_GPU "Enable GPU-accelerated training" OFF)
option(USE_SWIG "Enable SWIG to generate Java API" OFF)
option(USE_TIMETAG "Set to ON to output time costs" OFF)
option(USE_CUDA "Enable CUDA-accelerated training " OFF)
option(USE_ROCM "Enable ROCM-accelerated training " OFF)
Suggested change:
- option(USE_ROCM "Enable ROCM-accelerated training " OFF)
+ option(USE_ROCM "Enable ROCm-accelerated training " OFF)
@@ -160,6 +161,11 @@ if(USE_CUDA)
  set(USE_OPENMP ON CACHE BOOL "CUDA requires OpenMP" FORCE)
endif()

if(USE_ROCM)
  enable_language(HIP)
  set(USE_OPENMP ON CACHE BOOL "ROCM requires OpenMP" FORCE)
Suggested change:
- set(USE_OPENMP ON CACHE BOOL "ROCM requires OpenMP" FORCE)
+ set(USE_OPENMP ON CACHE BOOL "ROCm requires OpenMP" FORCE)
if(USE_ROCM)
  find_package(HIP)
  include_directories(${HIP_INCLUDE_DIRS})
  set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -D__HIP_PLATFORM_AMD__")
Is it the same as -DCMAKE_HIP_PLATFORM=amd?
https://cmake.org/cmake/help/latest/variable/CMAKE_HIP_PLATFORM.html#variable:CMAKE_HIP_PLATFORM
Should we also set HIP_ARCHITECTURES? For NVIDIA, are they reused from CUDA_ARCHITECTURES?
add_definitions(-DUSE_CUDA)

set(
Not used. See #6766 (review).
if(USE_ROCM OR USE_CUDA)
Not used. See #6766 (review).
#define __shfl_down_sync(mask, val, offset) __shfl_down(val, offset)
#define __shfl_up_sync(mask, val, offset) __shfl_up(val, offset)
// ROCm warpSize is constexpr and is either 32 or 64 depending on gfx arch.
#define WARPSIZE warpSize
Should WARPSIZE also be used here?
__shared__ score_t shared_mem_buffer[32];
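To make the question concrete, here is a minimal CUDA/HIP sketch of the pattern being discussed. It is illustrative only, with hypothetical names rather than the actual LightGBM kernels: a compile-time WARPSIZE constant (like the macro in the excerpt above) drives both the warp shuffles and the size of the per-warp shared-memory scratch buffer, so the same code works when the wavefront is 64 lanes wide on AMD hardware.

```cpp
// Hypothetical compatibility shim, mirroring the WARPSIZE idea from the diff above.
#ifdef __HIP_PLATFORM_AMD__
// On ROCm, warpSize is a compile-time constant (32 or 64 depending on the gfx arch),
// and the *_sync shuffle intrinsics are assumed to be mapped to __shfl_* as in this PR.
#define WARPSIZE warpSize
#else
#define WARPSIZE 32  // all current NVIDIA architectures
#endif

// Block-wide sum: each warp reduces with shuffles, lane 0 stages its partial sum in
// shared memory, and the first warp combines the staged values. Sizing the buffer with
// WARPSIZE (instead of a hard-coded 32) keeps it large enough for any blockDim.x up to
// WARPSIZE * WARPSIZE threads on both 32-wide and 64-wide warps.
__device__ inline float BlockReduceSumSketch(float val) {
  __shared__ float warp_sums[WARPSIZE];  // was typically [32] in the CUDA-only code
  const unsigned int lane = threadIdx.x % WARPSIZE;
  const unsigned int warp = threadIdx.x / WARPSIZE;
  for (int offset = WARPSIZE / 2; offset > 0; offset >>= 1) {
    val += __shfl_down_sync(0xffffffff, val, offset);  // mask is ignored by the ROCm macro
  }
  if (lane == 0) warp_sums[warp] = val;
  __syncthreads();
  const unsigned int num_warps = (blockDim.x + WARPSIZE - 1) / WARPSIZE;
  val = (threadIdx.x < num_warps) ? warp_sums[lane] : 0.0f;
  if (warp == 0) {
    for (int offset = WARPSIZE / 2; offset > 0; offset >>= 1) {
      val += __shfl_down_sync(0xffffffff, val, offset);
    }
  }
  return val;  // thread 0 holds the block-wide sum
}
```

Whether the LightGBM kernels should adopt exactly this sizing is the question raised in this thread; the sketch only illustrates the mechanics.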
@@ -742,7 +744,7 @@ __global__ void FixHistogramKernel(
    const int* cuda_need_fix_histogram_features,
    const uint32_t* cuda_need_fix_histogram_features_num_bin_aligned,
    const CUDALeafSplitsStruct* cuda_smaller_leaf_splits) {
-  __shared__ hist_t shared_mem_buffer[32];
+  __shared__ hist_t shared_mem_buffer[WARPSIZE];
Should WARPSIZE also be used here?
__shared__ int64_t shared_mem_buffer[32];
@@ -167,7 +169,7 @@ void CUDASingleGPUTreeLearner::LaunchReduceLeafStatKernel(

template <typename T, bool IS_INNER>
__global__ void CalcBitsetLenKernel(const CUDASplitInfo* best_split_info, size_t* out_len_buffer) {
-  __shared__ size_t shared_mem_buffer[32];
+  __shared__ size_t shared_mem_buffer[WARPSIZE];
Should we also adjust the code that relies on the warp size always being 32? For example, here:
LightGBM/src/treelearner/cuda/cuda_single_gpu_tree_learner.cu
Lines 181 to 183 in 60b0155
  len = (val / 32) + 1;
}
const size_t block_max_len = ShuffleReduceMax<size_t>(len, shared_mem_buffer, blockDim.x);
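As a hedged illustration of the distinction involved here (hypothetical helpers, not the actual cuda_single_gpu_tree_learner.cu code): a literal 32 can mean either "bits per uint32_t word" in bitset packing, which should stay 32 on any hardware, or "threads per warp", which is the case that would switch to WARPSIZE.

```cpp
#include <cstddef>
#include <cstdint>

// (1) 32 as the bit width of a uint32_t word when computing a bitset length:
//     a property of the storage type, independent of the hardware warp size.
__host__ __device__ inline size_t BitsetLenInWords(size_t max_value) {
  return max_value / 32 + 1;  // equivalently: max_value / (8 * sizeof(uint32_t)) + 1
}

// (2) 32 as the number of threads per warp, e.g. when counting warps per block or
//     sizing per-warp scratch space: this is what a WARPSIZE constant (see the sketch
//     earlier in this thread) is meant to cover.
__device__ inline unsigned int NumWarpsPerBlock() {
  return (blockDim.x + WARPSIZE - 1) / WARPSIZE;
}
```

If the `val / 32` above is really the word-packing case, it would stay as-is even on 64-wide wavefronts, and only the ShuffleReduce scratch sizing would change; whether that reading is correct is exactly the question for the maintainers.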
@@ -747,7 +749,7 @@ __global__ void AggregateBlockOffsetKernel1(
    data_size_t* block_to_right_offset_buffer, data_size_t* cuda_leaf_data_start,
    data_size_t* cuda_leaf_data_end, data_size_t* cuda_leaf_num_data, const data_size_t* cuda_data_indices,
    const data_size_t num_blocks) {
-  __shared__ uint32_t shared_mem_buffer[32];
+  __shared__ uint32_t shared_mem_buffer[WARPSIZE];
Should WARPSIZE also be used here?
__shared__ double shared_mem_buffer[32];
@@ -354,7 +358,7 @@ void CUDALambdarankNDCG::LaunchGetGradientsKernel(const double* score, score_t*
  }
} else {
  BitonicArgSortItemsGlobal(score, num_queries_, cuda_query_boundaries_, cuda_item_indices_buffer_.RawData());
-  if (num_rank_label <= 32) {
+  if (num_rank_label <= 32 && device_prop.warpSize == 32) {
Should we adjust the following code for a warp size other than 32?
LightGBM/src/objective/cuda/cuda_rank_objective.cu
Lines 407 to 408 in 60b0155
  // assert that warpSize == 32
  __shared__ double shared_buffer[32];
LightGBM/src/objective/cuda/cuda_rank_objective.cu
Lines 525 to 526 in 60b0155
  // assert that warpSize == 32, so we use buffer size 1024 / 32 = 32
  __shared__ double shared_buffer[32];
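One possible direction, sketched with hypothetical names rather than the real cuda_rank_objective.cu: instead of asserting warpSize == 32 and hard-coding a 32-slot buffer, derive the per-warp slot count from the maximum block size and the WARPSIZE constant discussed earlier in this thread.

```cpp
// Illustrative only. Assumes the WARPSIZE compile-time constant from the compatibility
// sketch above (the PR states warpSize is constexpr on ROCm) and that kernels are
// launched with at most kMaxThreadsPerBlock threads per block.
constexpr int kMaxThreadsPerBlock = 1024;
// "buffer size 1024 / 32 = 32" generalizes to 1024 / WARPSIZE (16 slots on 64-wide wavefronts).
constexpr int kWarpSlotCount = kMaxThreadsPerBlock / WARPSIZE;

__global__ void PerWarpStageSketch(const double* item_scores, double* block_out, int n) {
  __shared__ double shared_buffer[kWarpSlotCount];  // replaces the hard-coded [32]
  const int idx = blockIdx.x * blockDim.x + threadIdx.x;
  // Each warp's lane 0 stages one value; thread 0 then combines the staged values.
  if (threadIdx.x % WARPSIZE == 0) {
    shared_buffer[threadIdx.x / WARPSIZE] = (idx < n) ? item_scores[idx] : 0.0;
  }
  __syncthreads();
  if (threadIdx.x == 0) {
    double total = 0.0;
    const int num_warps = (blockDim.x + WARPSIZE - 1) / WARPSIZE;
    for (int w = 0; w < num_warps; ++w) total += shared_buffer[w];
    block_out[blockIdx.x] = total;
  }
}
```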
To build for ROCm:
CUDA source files are hipified in-place using the helper script before running cmake. The "cuda" device is reused for ROCm, so device=cuda works the same for ROCm builds.
Summary of changes:
- CMakeLists.txt ROCm updates; also replace glob with explicit file list
- initial warpSize interop changes
- helpers/hipify.sh script added
- .gitignore updated to ignore generated hip source files
- disable compiler warnings
- move the PercentileDevice __device__ template function into a header
- bug fixes for __host__ __device__ and __HIP__ preprocessor symbols
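For context on the "move the PercentileDevice __device__ template function into a header" item, here is a minimal hypothetical sketch (not the actual LightGBM PercentileDevice) of the underlying pattern: a __host__ __device__ function template has to be defined in a header, because every translation unit that instantiates it, whether compiled by nvcc or by hipcc after hipification, needs to see the full definition.

```cpp
// percentile_sketch.hpp -- hypothetical header, not the real LightGBM code.
#ifndef PERCENTILE_SKETCH_HPP_
#define PERCENTILE_SKETCH_HPP_

// Linear-interpolated percentile over an ascending-sorted array with n >= 1; q in [0, 1].
// Because this is a template with __host__ __device__ qualifiers, it must live in a
// header so its definition is visible at each instantiation site in the .cu
// (or hipified .hip) sources.
template <typename T>
__host__ __device__ inline T LinearInterpolatedPercentile(const T* sorted_values, int n, double q) {
  const double pos = q * (n - 1);
  const int lo = static_cast<int>(pos);
  const int hi = (lo + 1 < n) ? (lo + 1) : lo;
  const double frac = pos - static_cast<double>(lo);
  return static_cast<T>((1.0 - frac) * sorted_values[lo] + frac * sorted_values[hi]);
}

#endif  // PERCENTILE_SKETCH_HPP_
```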