[Enhancement] Improve performance/scaling by enabling an optional install of modin. #281

kenibrewer · 2023-05-11T16:19:45Z

Certain pycytominer functions run into scaling issues on extremely large datasets. One way to address this is to enable an optional install of modin, a drop-in replacement for pandas that uses a Ray, Dask, or unidist backend engine to automatically shard data and parallelize operations across many CPUs or instances.

When I first started using pycytominer, I didn't realize that feature-selection was supposed to run against well-aggregated profiles, and I attempted to run it on a profiles dataframe with millions of rows. I let the feature_selection step run for 4 hours before killing it. To solve the "problem" I forked pycytominer and did simple "import modin.pandas as pd" replacements in all the files. When I re-ran feature_selection, it scaled to all 16 CPUs of the instance I was using and completed in under 60 seconds. I later realized my mistake in approach and left the branch by the wayside.
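For illustration, the "optional install" idea could be sketched as an import fallback rather than a hard replacement in every file. This is only a minimal sketch, not code from the fork: the helper name import_first_available is hypothetical, and it assumes pandas itself is always installed while modin may or may not be.

```python
import importlib


def import_first_available(module_names):
    """Return the first module in module_names that can be imported."""
    for name in module_names:
        try:
            return importlib.import_module(name)
        except ImportError:
            continue  # optional dependency missing, try the next candidate
    raise ImportError(f"none of {list(module_names)} could be imported")


# Intended use at the top of a module (assumes pandas is always installed):
# pd = import_first_available(["modin.pandas", "pandas"])
```

With something like this, users who never install modin keep the plain pandas behavior, and the choice of backend lives in one place instead of in every import line.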

There were some problems with my quickly hacked-together solution (especially with annotate), and there were problems with certain functions that manually detect datatypes (e.g. if type(obj) == pd.DataFrame checks). But it may be worth systematically tackling those issues to get this functionality working.
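As a rough sketch of how the datatype checks might be made backend-agnostic: exact comparisons against pandas' DataFrame class fail for modin objects because, as far as I know, modin.pandas.DataFrame is not a subclass of pandas.DataFrame. The helper below is hypothetical (is_dataframe is not an existing pycytominer function) and matches on class name and top-level package instead of on type identity.

```python
def is_dataframe(obj, allowed_packages=("pandas", "modin")):
    """Return True if obj is (or subclasses) a DataFrame from an allowed package.

    Walks the class hierarchy and compares names only, so neither pandas
    nor modin has to be imported for the check to work.
    """
    for cls in type(obj).__mro__:
        top_level_package = cls.__module__.split(".")[0]
        if cls.__name__ == "DataFrame" and top_level_package in allowed_packages:
            return True
    return False
```

A name-based check is deliberately loose. If both libraries are guaranteed to be importable, an isinstance check against a tuple of the two DataFrame classes would be stricter.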


gwaybio · 2023-05-15T18:38:54Z

Sounds great @kenibrewer! I think this would be a very welcome enhancement.

We have also considered adding pyarrow support (via pandas 2.0.1 https://pandas.pydata.org/docs/user_guide/pyarrow.html#i-o-reading)

Do you have any thoughts on how to proceed with this development? For example, maybe we could use a specific develop branch (e.g. develop-parallel) and all work toward this goal could be placed there?

I think this development is also quite timely, given the single-cell emphasis @d33bs and others are implementing over in CytoTable. https://github.com/cytomining/CytoTable

d33bs · 2023-06-21T16:57:04Z

I think Modin could be very potent depending on how it's used within Pycytominer. One thing I recommend keeping in mind is how much support varies through Modin, especially for the pd.DataFrame API. Many methods have only Ray or Dask support, or somewhat different implementations depending on which engine is used. Some of the implementations might also involve rewriting existing code to match the expectations of Modin (or Dask / Ray); for example, see merge on this page.

Pandas over Arrow perspective: to my knowledge, based on some digging, there's no easy "drop-in" way to back Pandas dataframes with Arrow right now. My hope was there'd be a global option, something like pd.options.dtype_backend = "pyarrow". This doesn't seem to exist yet.

Instead, this appears to be set dataframe by dataframe via Pandas reader/writer settings, meaning we could do something like overload readers and writers with new defaults for dtype_backend or engine. This feels complex, and I feel it is likely to evolve in the near future.
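To make the "overload readers with new defaults" idea concrete, here is a minimal sketch. The helper with_defaults is hypothetical, and the commented usage assumes pandas 2.0 or later with pyarrow installed, where pd.read_csv accepts dtype_backend and engine keyword arguments.

```python
import functools


def with_defaults(reader, **defaults):
    """Wrap reader so that defaults apply unless the caller overrides them."""

    @functools.wraps(reader)
    def wrapper(*args, **kwargs):
        # Caller-supplied keyword arguments take priority over the defaults.
        merged = {**defaults, **kwargs}
        return reader(*args, **merged)

    return wrapper


# Intended use (assumes pandas >= 2.0 and pyarrow are installed):
# import pandas as pd
# read_csv = with_defaults(pd.read_csv, dtype_backend="pyarrow", engine="pyarrow")
# profiles = read_csv("profiles.csv")  # "profiles.csv" is a placeholder path
```

This keeps the Arrow-specific settings in one wrapper, so they can be changed or dropped if a global pandas option appears later.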

Some related links below:

kenibrewer changed the title ~~[Enhancement] Improve performance/scaling by enabling an optional install of the pandas drop-in replacement modin.~~ [Enhancement] Improve performance/scaling by enabling an optional install of modin. May 11, 2023

d33bs added this to SET Projects Jun 21, 2023

d33bs moved this to Todo in SET Projects Jun 21, 2023
