-
-
Notifications
You must be signed in to change notification settings - Fork 18.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Make pyarrow a required dependency #52509
Comments
I'd be in favour I've heard people raise objections because pyarrow is too heavy of a dependency - personally, I'd require it anyway, I don't think we should hold the project back because of this, just bringing it up for completeness |
+1 from me - it's time |
+1 I think main objections were not to do it yet as it was just fine to have it as a an optional dependency (see #50285). But if having it optional is an impediment. Maybe PyArrow dependencies aren't ideal, but this should only be a temporary problem if PyArrow packages are expected to become more modular. |
Are we planning on using Arrow Cython integration/C API stuff? If not, requiring should be trivial on packaging side should be trivial (so no objections from me there). Since from the few comments so far, most people seem to be in favor,
|
As long as it is not actually required (at the moment, it's strictly speaking an optional dependency, as you can perfectly fine use pandas without needing pyarrow, if you don't use any IO method relying on it and if you don't use the arrow-based dtypes), I don't see a good reason to make it an actual hard dependency. The mentioned reasons that make it "awkward" to have it as an optional dependency don't sound that convincing to me, personally. I think it is quite easy to have helper methods that can check for pyarrow arrays and scalars (I commented on the mentioned PR: https://github.com/pandas-dev/pandas/pull/52076/files#r1160653924). That should already cover the first two points in the top post.
Is there some discussion about somewhere? I don't directly see why this would give a big slowdown. When choosing to make it a hard requirement now, we have to be aware what the consequences are installation wise. For example, you can currently install pandas from source on most platforms as long as you have a compiler available (apart from that, we only have python dependencies / dependencies that can be installed by python in the same way as pandas). With pyarrow, that would no longer be true, as it cannot be installed from source. A typical example are docker images based on alpine linux (musl), for which there are no wheels at the moment. |
+1 |
PDEP-1 first revision (scope) #51417 looks like it has moreless been accepted. The new text adds "New required dependency." as "Clear examples for potential PDEPs". However, IIUC this does not explicitly say a PDEP is required. There is the following... "On the other hand, if an issue becomes controversial, i.e. it generated a significant While this issue has not (here) "generated a significant discussion" a PDEP would be useful to "making it easier for the wider community to participate" |
agree - that, and the time bound on the voting period ensures we'll actually reach a decision rather than leaving the issue hanging around indefinitely. I think it'd be a good candidate |
Making a PDEP shifts decision-making away from people who actually work on development and towards people willing to throw up walls of text. We have five +1s, one -1, and one "no objections". Lets do this. |
-1. I don't agree that we should dismiss @jorisvandenbossche concerns. |
I would be +1 on making pyarrow a required dependency. Though not strictly necessary for default functionality at the moment, I think it would be a great "signal" for pandas to vouch for the arrow format & ecosystem. I can see this change being significant enough for a PDEP, but as @MarcoGorelli alluded too I expect we could vote on this rather quickly given all the support in this thread already |
yes. I think @jorisvandenbossche has convincing arguments to counter the issues in the OP and also partly the issues in older issue. from the PDEP guidelines... "It is not always trivial to know which issue has enough scope to require the full PDEP process. there does appear to be "sufficient consensus among the core team" but since requiring a new dependancy does not satisfy "minimal impact on the community.", I think a PDEP should be required for this change. |
I agree with @simonjayhawkins, PDEP-1 seems to indicate this is in scope. |
Just to clarify my comment, I do think it is a good idea that this is a PDEP such that we can lay out the pros/cons and discuss and use voting to make a decision with clarity than let the decision be ambiguous and linger here in the issue. Post the dev meeting on 2023-04-12, I will draft this PDEP |
As the person who wrote PDEP-1, I'd personally wouldn't use it to decide if something requires a PDEP or not. I'm happy with both having this as a PDEP or not. But I think more important than how we interpret PDEP-1 exact sentences is what's useful, and in this case I'm not sure what part of the text of the possible PDEP would be to comment on. Seems like the decision is more of a preference. Depending on pyarrow makes our life easier. And makes users who 1) want to install pandas from source 2) are not interested in the pyarrow feature, not be able to do it anymore. I don't think Joris comment is a -1 (maybe that's his position but the comment is just a be aware of the consequences). So, feels to me that we're all more or less in the same page. It's just a question of each person preference/priority. My vote is a clear +1, advantages of ensuring pyarrow is always available are more valuable than the limitations to the installation that I'm unsure any user will really face. This seems to be the most popular opinion based on above answers. I'd say, if there is no proper objection let's just move forward. If people think it's still not clear enough what are tha advantages and disadvantages let's write a PDEP, so we're all in the same page, and then make a decision (is the new voting system already in place, or should we vote for that too?). |
The other reason to write a PDEP is that it makes it possible to get user feedback. Since requiring pyarrow will impact users, we should see if our user community has an opinion on the matter. |
Do you think a PDEP has a much more visibility than this issue? We have them listed in the roadmap, I guess maybe a bit, but wouldn't an email to pandas-dev about this be a better option if we want more feedback? I'm surely happy with a PDEP, the only thing is the time it takes to write it. But I see that a PDEP/PR adds a lot of value when feedback about the specific text written in the PDEP is expected. And in this case, assuming everybody has already an understanding and an opinion about requiring pyartow, I don't expect so many useful comments for particular parts of the text, so discussing in normal issue comments seem as fine. The blocker here seems to be how we make the final decision (votes...) and I don't think a PDEP is going to help, so I'd address that directly. |
I don't think this should be a PDEP. It would make this process way more drawn out than it should be, and there's really not a lot to put in the PDEP (we would really just be talked about min. version of pyarrow, how many versions to support, etc., basically the stuff I listed above ). I missed the dev meeting where this was agreed upon, but I think I would much rather have a PDEP explaining the rationale behind adding support for pyarrow to everything in general (which would explain this by proxy). |
FWIW, there are no numpy musllinux wheels either. While numpy can be installed with no external (non-Python) deps other than the C compiler, I don't think anyone would actually install numpy like that, since you'll be missing things like the BLAS. (I would probably guess that people are building a numpy wheel for themselves that is compiled with the BLAS, and then installing that). |
The PDEP would have a discussion period, and be broadcast to the pandas user community as "open for discussion". We agreed at the dev meeting today that @mroeschke will draft a PDEP. We can have discussion there, and then formalize the vote. I have a concern about package bloat. Adding pyarrow increases the sizes of containers used in Kubernetes that are dependent on pandas. Not necessarily a reason to not require |
if u r using pandas you almost always are using parquet so likely you have pyarrow installed anyways |
Ah the age old problem - to include or not to include? As things stand today, it doesn't appear (according to pypistats, for as good as that can be) that pyarrow and pandas are always bundled together. And from our user polls I vaguely recall excel / html being pretty popular formats, maybe moreso than parquet. So I am a bit wary of using "likeliness of bundled usage" as a guiding factor As things stand now I am -1 on the issue and agree this should be a PDEP, for the following reasons:
There is a lot of nuance to this discussion and I agree a PDEP would make it easier to communicate and collaborate versus a huge wall of responses in this issue. Not everything needs to be a PDEP so definitely want to be cognizant of using it as a blocker to progress, but at the same time we just launched that process so I think we should be heavy on PDEPs the next few months, as frustrating as that can be |
On today's call there was a mostly-joking suggestion of making numpy optional. If someone finds a reasonable path to get there that would be neat (in large part bc making the numpy import lazy would be awesome). Until then, I think we should shy away from the term "backend", as I've seen multiple discussions (maybe reddit, HN, not sure) where people are confused by the proliferation of dtype options. I don't have links on hand, but I recall the word "backend" coming up frequently in these threads.
I've been assuming run-time. Other misc thoughts: The OP referenced gymnastics-free isinstance checks for construction. I'm also eager to have If we don't require pyarrow, I'd like to make the import lazy. It would also be nice to prune down to Just One String Dtype, and AFAICT pyarrow is the obvious choice there. |
Agreed by @jbrockmendel, we have to be careful communication wise. There is a lot of confusion out there regarding our arrow dtypes.
Sorry for not clarifying, yes run-time for now. Build-dependency would be nice at some point I guess, but there are a couple of hurdles to clear to get there.
Having something like bytes inferring as
This certainly isn't ideal, but it's not a blocker imo, especially if there are no NumPy wheels either. |
Another consideration of hard dependencies is their speed in releasing wheels on new platforms. I just checked and from what I can see pyarrow only released the first wheels for 3.11 in late November, which was a full month after pandas had a release with 3.11 support baked in. Building pyarrow on some platforms is very hard, at least compared to building pandas and its dependencies (incl NumPy which is easy to configure for OpenBlas even on Windows, and can be trivially compiled without Blas if not needed). |
Quick search brought the following issue: https://issues.apache.org/jira/browse/ARROW-17487 This looks more like a timing problem not like a general Python 3.11 support problem (like numba has right now) |
Thanks @jbrockmendel for some further justification as only the arguments in the OP at first did not appear strong enough and had IMO been countered by @jorisvandenbossche. Maybe the choice of/ future plans for string dtype warrants a detailed discussion (separate issue) and consensus what to do here? As I could potentially see this being a strong justification. (If we were to somehow able to remove object dtype from pandas)
My thinking was that we should not let preferences sway a (hasty) decision and have clear documented justification. Hence +1 for a PDEP.
Fair point. A more objective, maybe quantative, definition of "more valuable" would surely strengthen the argument in favor? For instance @WillAyd mentioned pypistats and my thinking, related to PDEP-9, is how many packages (and how important to the ecosystem) would say just use pandas to provide interoperability and not functionality? |
Going into more detail on my comment above (#52509 (comment)) about "consequences installation wise". Installation platform support But if, for some reason, there is no wheel available, a Some example cases I currently think of (but there are quitely more) for which there are no wheels and installation from source is needed:
And then there is also a part of the world that doesn't use pip/pypi/wheels (corporate environments with their own index, people using other (system) package managers (eg various linux package managers that have numpy/pandas but potentially not pyarrow), ..). But not going into detail here, because that's something I don't know much about. Installation size (as a comparison, polars (which also includes an arrow implementation) is 55MB) I don't know to what extent this is an actual problem (for example, is this nowadays still an issue for AWS Lambda?), but at least something to be aware of. It's also something where we would like to improve pyarrow, and there are some preliminary efforts to make the wheel smaller / split off optional components in separate wheels. |
xref #52429 |
Then adding a Jax backend would be awesome. |
We recently enabled jupyter lite on our examples on the scikit-learn side, and by default we're including pandas. Looking at the size of the wheels for pyarrow, they're somewhere between 20-35MB based on the platform. That's a significant increase for users who would be loading the example in their browser. We also have been having trouble with the latest pandas when pip installed in a mamba environment, since conda-lock fails in this case, and that's because pandas=2.0 requires tzdata, and the env ends up with two tzdata. It's not a pandas bug per say, but it complicates workflows (we're sticking with an older pandas for now for that env). That is to say even a small dependency like tzdata which should be present in most environments anyway, complicates things. Pandas is a library on which, other libraries depend, and any dependency added here adds to the dependency tree to those environments, and pyarrow is by far not a trivial dependency. All of that is to say, from my perspective, it would be a pity if pyarrow becomes a mandatory dependency. |
thanks @adrinjalali for sharing your perspective! the conversation has mostly moved to #52711, I'd suggest commenting there for visibility |
Maybe something like "pyarrow compatible interface is runtime required" would make more sense? Eg using the Go Arrow implementation, Arrow Rust or Arrow2 Rust implementation as a drop in replacement could help with some of the issues (platform support, binary sizes). There is already an Array API and dataframe API. Specifying a "ColumnStorage API" which different backends can implement doesn't sound bad (while we can also mark the different Arrow backends so polars and other tools can have assumptions about the data layout over the more generic ColumnStorage API) My understanding was that it is the design goal of Arrow to set a few reasonable assumptions (columnar storage, masked arrays) and provide a data format + data contract. Pyarrow provides definitely more than this: datasets, pyarrow.compute, pyarrow.parquet. Perhaps Pandas should start relying on the core functionality, but not the rest? This would require a |
Part of the rationale for using pyarrow as a runtime dependency is to reduce pandas core development burdens - sure using just a core part of arrow might be nice but it's not that helpful in addressing one of the most important reasons to have a required dependency. The point is to take advantage of the eco system w/o have even more in built functionality in pandas. |
In the PDEP #52711 (where we really should be having this discussion), I have asked here to get a better understanding of that burden (see #52711 (review)) |
(My comment was moved to #54466 (comment).) |
Right now we run into a bunch of problems because pyarrow is potentially missing.
This is the next step in supporting arrow fully in pandas. There are some things we can’t implement properly without requiring arrow as dependency.
So I think we should do it soonish, so that we can move forward.
cc @pandas-dev/pandas-core
The text was updated successfully, but these errors were encountered: