Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Enhancement] Improve performance/scaling by enabling an optional install of modin. #281

Open
kenibrewer opened this issue May 11, 2023 · 2 comments

Comments

@kenibrewer
Copy link
Member

Certain pycytominer functions run into scaling issues for extremely large datasets. This can be solved by enabling an optional install of modin which is a drop-in replacement for pandas that leverages either ray, dask, or unidist backend engines to automatically shard data and parallelize operations across many cpus or instances.

When I first started using pycytominer, I didn't realize that feature-selection was supposed to happen against well-aggregated profiles and I attempted to run feature-selection on a profiles dataframe with millions of rows. I let the feature_selection step run for 4 hours before killing it. To solve the "problem" I forked pycytominer and did simple "import modin.pandas as pd" replacements in all the files. When I tried to re-run feature_selection it completed in under 60 seconds after it properly scaled to all 16 cpus of the instance I was using. I later realized my mistake in approach and left the branch by the wayside.

There were some problems with my quickly hacked together solution (especially with annotate) and there were problems with certain functions that manually detect datatypes (e.g. if type(obj) == pd.Dataframe checks). But it may be worth systematically tackling those issues to get this functionality to work.

@kenibrewer kenibrewer changed the title [Enhancement] Improve performance/scaling by enabling an optional install of the pandas drop-in replacement modin. [Enhancement] Improve performance/scaling by enabling an optional install of modin. May 11, 2023
@gwaybio
Copy link
Member

gwaybio commented May 15, 2023

sounds great @kenibrewer! I think this would be a very much welcomed enhancement.

We have also considered adding pyarrow support (via pandas 2.0.1 https://pandas.pydata.org/docs/user_guide/pyarrow.html#i-o-reading)

Do you have any thoughts on how to proceed with this development? For example, maybe we could use a specific develop branch (e.g. develop-parallel) and all work toward this goal could be placed there?

I think this development is also quite timely, given the single-cell emphasis @d33bs and others are implementing over in CytoTable. https://github.com/cytomining/CytoTable

@d33bs d33bs moved this to Todo in SET Projects Jun 21, 2023
@d33bs
Copy link
Member

d33bs commented Jun 21, 2023

I think Modin could be very potent depending on how it's used within Pycytominer. One thing I recommend keeping in mind is the variance of support for especially the pd.DataFrame API through Modin. Many methods have only Ray or Dask support, or somewhat different implementations depending on the engine which is used. Some of the implementations might also involve rewriting existing code to match the expectations of Modin (or Dask / Ray)(for example, see merge on this page).

Pandas over Arrow perspective: to my knowledge based on some digging, there's no easy "drop-in" replacement for arrow with Pandas dataframes right now. My hope was there'd be an option, something like pd.options.dtype_backend = "pyarrow". This doesn't seem to exist yet.

Instead this appears to be dataframe-to-dataframe via Pandas reader/writer settings, meaning we could do something like overload readers and writers with new defaults for dtype_backend or engine. This feels complex, and is I feel likely to evolve in the near future.

Some related links below:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants