Understanding how Sequence Works #6656

jamespinkerton · 2024-09-28T21:06:55Z

Hi. I looked at #4672 and found that the advice when your dataset is too large is to use Sequence. However, I can't find good documentation on Sequence, and I'm having trouble understanding how it works.

My use case is that I have multiple files of floating-point numbers on Google Cloud Storage. Each file has all of the features, but a different range of the samples. Because they're 4-byte floats, I can't fit the entire raw dataset in memory on my machine. However, I can fit it in memory once it's a Dataset.

I was hoping I could write a custom Sequence class that downloads these files on demand, but when I do this I get lots of random-access requests, and I can't download the data that many times.

I was hoping for some advice on how the Sequence API works.
Do I need to provide a list of sequences?
Does the batch size refer to the number of samples returned at each index, or does it refer to the requested total number of samples at a time?
Is there a way to download the data in one stream, or does the dataset have to see the data multiple times to be constructed?
Does Sequence not actually solve my problem?

Thanks so much.

jameslamb · 2024-09-30T03:42:37Z

Thanks for using LightGBM.

A minimal, reproducible example of what you tried would be helpful (here are some docs on how to do that). For example, you didn't tell us what version of lightgbm you're using, what operating system, etc. Please do that in future reports.

The lightgbm.Sequence class is an abstract class, which shows the API that you need to implement. Here are some resources you might find helpful:

the source code: LightGBM/python-package/lightgbm/basic.py, line 889 in 59a3432 (class Sequence(abc.ABC):)
an example in this repo's examples, using multiple HDF5 files: https://github.com/microsoft/LightGBM/blob/master/examples/python-guide/dataset_from_multi_hdf5.py
an in-memory NumPy implementation used in unit tests: LightGBM/tests/python_package_test/test_basic.py, line 98 in 59a3432 (class NumpySequence(lgb.Sequence):)

I think Sequence is a good way to accomplish what you're trying to accomplish.
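
To make the shape of that API concrete, here is a minimal sketch of a Sequence subclass, not a tested recipe for this use case. It relies on what the abstract class asks for: `__len__`, a `__getitem__` that accepts an integer, a slice, or a list of integers, and a `batch_size` attribute giving how many rows are requested per slice. The names `ChunkedSequence`, `load_chunk`, `rows_per_chunk`, and `loads` are hypothetical, and the sketch assumes every chunk has the same number of rows and that indices are non-negative. The guarded import is only there so the sketch also runs without LightGBM installed.

```python
import numpy as np

try:
    import lightgbm as lgb
    _Base = lgb.Sequence
except ImportError:  # lets the sketch run without LightGBM installed
    _Base = object


class ChunkedSequence(_Base):
    """Rows stored in equal-sized chunks; only one chunk is kept in memory."""

    def __init__(self, load_chunk, n_chunks, rows_per_chunk, batch_size=4096):
        self.load_chunk = load_chunk  # hypothetical callable: chunk id -> 2-D array
        self.n_chunks = n_chunks
        self.rows_per_chunk = rows_per_chunk
        self.batch_size = batch_size  # rows requested per slice
        self._cached_id = None
        self._cached = None
        self.loads = 0  # number of chunk fetches, for inspection only

    def _chunk(self, chunk_id):
        # Fetch a chunk only when it is not the one already cached.
        if chunk_id != self._cached_id:
            self._cached = np.asarray(self.load_chunk(chunk_id))
            self._cached_id = chunk_id
            self.loads += 1
        return self._cached

    def __getitem__(self, idx):
        if isinstance(idx, (int, np.integer)):
            chunk_id, offset = divmod(int(idx), self.rows_per_chunk)
            return self._chunk(chunk_id)[offset]
        if isinstance(idx, slice):
            ids = np.arange(*idx.indices(len(self)))
        else:
            ids = np.asarray(idx, dtype=np.int64)
        if ids.size == 0:
            return self._chunk(0)[:0]
        chunk_ids, offsets = np.divmod(ids, self.rows_per_chunk)
        out = None
        # Visit each needed chunk once, writing rows back in request order.
        for c in np.unique(chunk_ids):
            mask = chunk_ids == c
            rows = self._chunk(int(c))[offsets[mask]]
            if out is None:
                out = np.empty((len(ids),) + rows.shape[1:], dtype=rows.dtype)
            out[mask] = rows
        return out

    def __len__(self):
        return self.n_chunks * self.rows_per_chunk
```

An instance (or a list of instances) would then be passed as the data argument, for example `lgb.Dataset(seq, label=y)`, with the labels supplied separately as an in-memory array. In a real version, `load_chunk` would wrap the download from cloud storage.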

But it's been a while since I personally worked with this API, so I can't provide a reproducible example right now. When I find time, if no one else has answered your questions by then, I'll try to create one with the publicly-available data on S3 from https://github.com/ContinuumIO/anaconda-package-data to demonstrate how to do what you're trying to do.

jamespinkerton · 2024-09-30T04:01:00Z

I think you found some documentation I couldn't find. This is very helpful, thank you. Given that I have to randomly sample the data, it looks like Sequence presents some challenges. My data comes in chunks of 1 million samples each, and I have about 200 chunks. I'm not sure if there's an easy way to do the random sampling part given this constraint?
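
One possible way to handle random access over chunked data is to translate each requested row index into a (chunk, offset) pair and group the requests by chunk, so that every chunk is fetched at most once per request. The sketch below shows only that bookkeeping, using the 1 million rows per chunk mentioned above; `plan_reads` is a hypothetical helper, not part of LightGBM. As an assumption worth checking against the installed version: my reading of the Python package source is that the sampling pass (its size is set by the `bin_construct_sample_cnt` parameter) asks for rows in increasing index order, in which case even a single-chunk cache would avoid repeated downloads during sampling.

```python
from collections import defaultdict

ROWS_PER_CHUNK = 1_000_000  # from the post: 1 million samples per chunk


def plan_reads(row_indices, rows_per_chunk=ROWS_PER_CHUNK):
    """Group global row indices by chunk so each chunk is fetched once.

    Returns {chunk_id: [(position_in_request, offset_in_chunk), ...]},
    with chunk ids in increasing order.
    """
    plan = defaultdict(list)
    for position, row in enumerate(row_indices):
        chunk_id, offset = divmod(row, rows_per_chunk)
        plan[chunk_id].append((position, offset))
    return dict(sorted(plan.items()))


# Five rows requested in arbitrary order touch only three chunks.
plan = plan_reads([2_500_000, 7, 1_000_001, 2_000_003, 999_999])
for chunk_id, reads in plan.items():
    print(chunk_id, reads)
```

The `position_in_request` values make it possible to put the fetched rows back in the order they were asked for after reading chunk by chunk.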

jameslamb added the question label Sep 30, 2024
