-
Notifications
You must be signed in to change notification settings - Fork 36
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Enhancement] Improve performance/scaling by enabling an optional install of modin. #281
Comments
sounds great @kenibrewer! I think this would be a very much welcomed enhancement. We have also considered adding pyarrow support (via pandas 2.0.1 https://pandas.pydata.org/docs/user_guide/pyarrow.html#i-o-reading) Do you have any thoughts on how to proceed with this development? For example, maybe we could use a specific I think this development is also quite timely, given the single-cell emphasis @d33bs and others are implementing over in CytoTable. https://github.com/cytomining/CytoTable |
I think Modin could be very potent depending on how it's used within Pycytominer. One thing I recommend keeping in mind is the variance of support for especially the pd.DataFrame API through Modin. Many methods have only Ray or Dask support, or somewhat different implementations depending on the engine which is used. Some of the implementations might also involve rewriting existing code to match the expectations of Modin (or Dask / Ray)(for example, see Pandas over Arrow perspective: to my knowledge based on some digging, there's no easy "drop-in" replacement for arrow with Pandas dataframes right now. My hope was there'd be an option, something like Instead this appears to be dataframe-to-dataframe via Pandas reader/writer settings, meaning we could do something like overload readers and writers with new defaults for Some related links below: |
Certain pycytominer functions run into scaling issues for extremely large datasets. This can be solved by enabling an optional install of modin which is a drop-in replacement for pandas that leverages either ray, dask, or unidist backend engines to automatically shard data and parallelize operations across many cpus or instances.
When I first started using pycytominer, I didn't realize that feature-selection was supposed to happen against well-aggregated profiles and I attempted to run feature-selection on a profiles dataframe with millions of rows. I let the feature_selection step run for 4 hours before killing it. To solve the "problem" I forked pycytominer and did simple "import modin.pandas as pd" replacements in all the files. When I tried to re-run feature_selection it completed in under 60 seconds after it properly scaled to all 16 cpus of the instance I was using. I later realized my mistake in approach and left the branch by the wayside.
There were some problems with my quickly hacked together solution (especially with
annotate
) and there were problems with certain functions that manually detect datatypes (e.g.if type(obj) == pd.Dataframe
checks). But it may be worth systematically tackling those issues to get this functionality to work.The text was updated successfully, but these errors were encountered: