No Improvement in Training Time with more Cores on LightGBM #6730

Open
abhishekagrawala opened this issue Nov 25, 2024 · 4 comments

@abhishekagrawala

abhishekagrawala commented Nov 25, 2024

Description

Training a 6GB dataset with LightGBM using n_jobs=70 does not result in a proportional reduction in training time. Despite utilizing a machine with 72 cores and setting a high n_jobs value, the training time remains unexpectedly high.

Environment

OS: Linux 6.1.0-27-cloud-amd64 Debian
CPU:
  Architecture:             x86_64
  CPU(s):                   72  
    - Model Name:           Intel(R) Xeon(R) Gold 6154 CPU @ 3.00GHz  
    - Cores:                72 (1 thread per core)  
    - Flags:                AVX, AVX2, AVX512, FMA, etc.  
  Cache:                    288 MB L2, 16 MB L3  
  NUMA Node(s):             1  
Memory:
                     total        used        free      shared  buff/cache   available  
      Mem:           491Gi        81Gi       399Gi       1.1Mi        15Gi       410Gi  
      Swap:           79Gi        84Mi        79Gi  
Storage:
  Filesystem      Size  Used Avail Use% Mounted on  
  udev            246G     0  246G   0% /dev  
  tmpfs            50G  1.8M   50G   1% /run  
  /dev/sda1       197G  104G   86G  55% /  
  tmpfs           246G     0  246G   0% /dev/shm  
  tmpfs           5.0M     0  5.0M   0% /run/lock  
  /dev/sda15      124M   12M  113M  10% /boot/efi  
  tmpfs            50G     0   50G   0% /run/user/10476  
  tmpfs            50G     0   50G   0% /run/user/90289  
  tmpfs            50G     0   50G   0% /run/user/1003  
VM Type: Custom VM on a cloud environment.

LightGBM Setup

  Version: 3.2.1=py38h709712a_0
  Parameters: n_estimators=325, num_leaves=512, colsample_bytree=0.2, min_data_in_leaf=80, max_depth=22, learning_rate=0.09, objective="binary", n_jobs=70, boost_from_average=True, max_bin=200, bagging_fraction=0.999, lambda_l1=0.29, lambda_l2=0.165 (see the sketch below)
Dataset:
  Size: ~6GB
  Characteristics: Binary classification problem, categorical and numerical features, preprocessed and balanced.
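
A minimal sketch of how a training call with the parameters above might look, assuming the Python scikit-learn interface (the interface actually in use is not stated in this report; `X_train` / `y_train` below are synthetic stand-ins for the real data):

```python
import numpy as np
import pandas as pd
import lightgbm as lgb

# Stand-in data only; the real run uses the reporter's prepared DataFrame,
# whose pandas "category" columns LightGBM picks up automatically.
rng = np.random.default_rng(0)
X_train = pd.DataFrame(rng.random((10_000, 64)))
y_train = rng.integers(0, 2, size=10_000)

model = lgb.LGBMClassifier(
    n_estimators=325, num_leaves=512, colsample_bytree=0.2, min_data_in_leaf=80,
    max_depth=22, learning_rate=0.09, objective="binary", n_jobs=70,
    boost_from_average=True, max_bin=200, bagging_fraction=0.999,
    lambda_l1=0.29, lambda_l2=0.165,
)
model.fit(X_train, y_train)
```
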
Performance Issues
Current Performance:
  Training time with n_jobs=32: ~25 minutes
  Training time with n_jobs=70: ~23 minutes
Expected Performance:
  Substantial reduction in training time when utilizing 70 cores, ideally below 10 minutes.
Bottleneck Symptoms:
  Minimal reduction in training time with increased cores (n_jobs).
  CPU utilization remains low, with individual threads not fully utilized.
System Metrics During Training
   CPU Utilization:
      Average utilization: ~40%  
      Peak utilization: ~55%  
      Core-specific activity: Most cores show low activity levels (<30%)  
   Memory Usage:
      Utilized during training: ~81Gi  
      Free memory: ~399Gi  
      Swap usage: ~84Mi  
  Disk I/O:
      Read: ~50MB/s  
      Write: ~30MB/s  
      I/O wait time: ~2%
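
For reproducibility, per-core figures like these could be sampled during training with a small background helper along these lines (psutil is an assumption here, not necessarily how the numbers above were gathered):

```python
import threading

import psutil  # assumed available; not part of the original setup

# Hypothetical helper: samples per-core CPU utilization every few seconds
# while training runs in the main thread.
def monitor_cpu(stop_event, interval=5):
    while not stop_event.is_set():
        per_core = psutil.cpu_percent(interval=interval, percpu=True)
        busy = sum(1 for p in per_core if p > 30)
        print(f"avg={sum(per_core) / len(per_core):.0f}%, cores above 30%: {busy}/{len(per_core)}")

stop = threading.Event()
threading.Thread(target=monitor_cpu, args=(stop,), daemon=True).start()
# ... model.fit(...) would run here ...
stop.set()
```
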

Request for Support

Explanation of why n_jobs scaling is not improving training time.

Suggestions for configurations to fully utilize 70 cores for LightGBM training.

Recommendations for debugging and monitoring specific to LightGBM threading or system-level bottlenecks.

@jameslamb changed the title from “Bug Report: No Improvement in Training Time with more Cores on LightGBM” to “No Improvement in Training Time with more Cores on LightGBM” on Nov 25, 2024
@jameslamb
Collaborator

Thanks for using LightGBM. I've attempted to reformat your post a bit to make it easier to read... if you are new to markdown / GitHub, please see https://docs.github.com/en/get-started/writing-on-github/getting-started-with-writing-and-formatting-on-github/basic-writing-and-formatting-syntax for some tips on making such changes yourself.

You haven't provided enough information yet for us to help you with this report.

  • can you provide a minimal, reproducible example (docs on that) showing the exact code you're running and how you installed LightGBM? (see the sketch after this list)
    • You haven't even told us whether you're using the Python package, R package, CLI, etc.
    • You haven't told us anything about the shape and content of the dataset, other than its total size in memory. CPU utilization is heavily dependent on the shape of the input data (e.g. number of rows and columns) and the distribution of the features (e.g. cardinality of categorical values).
  • are there any other processes running on the system?
    • If you're trying to devote all cores to LightGBM training, they'll be competing with any other work happening on the system.
  • have I understood correctly that you're using LightGBM 3.2.1?
    • if so, please try updating to the latest version (v4.5.0) and tell us if that changes the results. There have been hundreds of bug fixes and improvements in the 4+ years of development between those 2 versions
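
To illustrate the kind of minimal benchmark being asked for, something like the following (synthetic data, not the reporter's actual dataset or code) would already let thread scaling be measured:

```python
import time

import numpy as np
import lightgbm as lgb

# Synthetic stand-in data; a real report should use the actual dataset and code.
rng = np.random.default_rng(42)
X = rng.random((1_000_000, 64))
y = rng.integers(0, 2, size=1_000_000)

for n_threads in (1, 8, 32, 70):
    params = {"objective": "binary", "num_leaves": 512,
              "num_threads": n_threads, "verbosity": -1}
    start = time.perf_counter()
    lgb.train(params, lgb.Dataset(X, label=y), num_boost_round=50)
    print(f"num_threads={n_threads}: {time.perf_counter() - start:.1f}s")
```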

@abhishekagrawala
Author

Packages in Environment:
_libgcc_mutex: 0.1, build: main
_openmp_mutex: 5.1, build: 1_gnu
beautifulsoup4: 4.12.3, source: pypi
boto3: 1.35.73, source: pypi
botocore: 1.35.73, source: pypi
bzip2: 1.0.8, build: h5eee18b_6
ca-certificates: 2024.11.26, build: h06a4308_0
cachetools: 5.5.0, source: pypi
certifi: 2024.8.30, source: pypi
charset-normalizer: 3.4.0, source: pypi
cramjam: 2.9.0, source: pypi
fastparquet: 2024.11.0, source: pypi
fsspec: 2024.10.0, source: pypi
google: 3.0.0, source: pypi
google-api-core: 2.23.0, source: pypi
google-auth: 2.36.0, source: pypi
google-cloud: 0.34.0, source: pypi
google-cloud-core: 2.4.1, source: pypi
google-cloud-storage: 2.18.2, source: pypi
google-crc32c: 1.6.0, source: pypi
google-resumable-media: 2.7.2, source: pypi
googleapis-common-protos: 1.66.0, source: pypi
idna: 3.10, source: pypi
jmespath: 1.0.1, source: pypi
ld_impl_linux-64: 2.40, build: h12ee557_0
libffi: 3.4.4, build: h6a678d5_1
libgcc-ng: 11.2.0, build: h1234567_1
libgomp: 11.2.0, build: h1234567_1
libstdcxx-ng: 11.2.0, build: h1234567_1
libuuid: 1.41.5, build: h5eee18b_0
lightgbm: 4.5.0, source: pypi
lightgbmmodeloptimizer: 0.0.6, source: pypi
ncurses: 6.4, build: h6a678d5_0
numpy: 2.1.3, source: pypi
openssl: 3.0.15, build: h5eee18b_0
packaging: 24.2, source: pypi
pandas: 2.2.3, source: pypi
pip: 24.2, build: py310h06a4308_0
proto-plus: 1.25.0, source: pypi
protobuf: 5.29.0, source: pypi
pyasn1: 0.6.1, source: pypi
pyasn1-modules: 0.4.1, source: pypi
python: 3.10.15, build: he870216_1
python-dateutil: 2.9.0.post0, source: pypi
pytz: 2024.2, source: pypi
readline: 8.2, build: h5eee18b_0
requests: 2.32.3, source: pypi
rsa: 4.9, source: pypi
s3transfer: 0.10.4, source: pypi
scipy: 1.14.1, source: pypi
setuptools: 75.1.0, build: py310h06a4308_0
six: 1.16.0, source: pypi
soupsieve: 2.6, source: pypi
sqlite: 3.45.3, build: h5eee18b_0
tk: 8.6.14, build: h39e8969_0
tzdata: 2024.2, source: pypi
urllib3: 2.2.3, source: pypi
wheel: 0.44.0, build: py310h06a4308_0
xz: 5.4.6, build: h5eee18b_1
zlib: 1.2.13, build: h5eee18b_1

Data Summary:
Number of rows: 54,789,701
Number of columns: 64

(Classification model)
Data Types and Their Counts:
int64: 18
float64: 8
float32: 6
category: 26

Numerical Features Summary:
Feature 1:
Mean: 20,410.69
Std: 40,933.77
Min: 0.0
25th Percentile: 280.0
Median (50th Percentile): 2,776.0
75th Percentile: 15,535.0
Max: 222,632.0

Feature 2:
Mean: 6,563.657
Std: 18,636.79
Min: 0.0
25th Percentile: 43.0
Median (50th Percentile): 478.0
75th Percentile: 3,471.0
Max: 151,994.0
(Details for other features truncated for brevity.)

Categorical Features:
Feature 1:
Count: 54,789,701
Unique values: 17,547
Most frequent value: 772610812
Frequency of most frequent value: 3,150,663

Feature 2:
Count: 54,789,701
Unique values: 82,295
Most frequent value: 629834434
Frequency of most frequent value: 876,873

Cardinality:
Feature 2: 331,307
Feature 3: 49,931
Feature 4: 677
Feature 5: 82,295
Feature 6: 12,093
Feature 7: 10,642
Feature 8: 7,193
Feature 9: 24
Feature 10: 33
Feature 11: 304
Feature 12: 1,085
Feature 13: 2
Feature 14: 40,754
Feature 15: 5
Feature 16: 3
Feature 17: 12
Feature 18: 26
Feature 19: 2,313

Total Memory Usage:
32,113.3 MB

@jameslamb
Collaborator

Thanks for that. So it looks like you're actually using LightGBM 4.5.0, and working in Python (given all those Python libraries in the output you shared).

Still though, please:

can you provide a minimal, reproducible example (docs on that) showing the exact code you're running and how you installed LightGBM?

@shiyu1994
Collaborator

What are the numbers of training samples and features in your dataset? This could be expected: if the number of samples/features is not large, row-wise/col-wise parallelism with many threads does not improve efficiency. In addition, could you provide the training log and parameter settings so that we can identify the parallelism mode used (row-wise/col-wise)?
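
To illustrate what that log would show: with verbosity left at its default, LightGBM prints which multi-threading mode it auto-chose (a line beginning with "Auto-choosing row-wise multi-threading" or "Auto-choosing col-wise multi-threading"), and the mode can also be pinned with the force_row_wise / force_col_wise parameters to compare timings. A small sketch, with placeholder data standing in for the real training set:

```python
import numpy as np
import lightgbm as lgb

# Placeholder data; in practice this would be the real 54.8M-row dataset.
rng = np.random.default_rng(0)
X = rng.random((100_000, 64))
y = rng.integers(0, 2, size=100_000)

params = {
    "objective": "binary",
    "num_threads": 70,
    "verbosity": 1,            # keep logs so the chosen threading mode is visible
    # "force_row_wise": True,  # or pin the mode explicitly and compare timings
    # "force_col_wise": True,
}
booster = lgb.train(params, lgb.Dataset(X, label=y), num_boost_round=10)
```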
