DISC: Consider not requiring PyArrow in 3.0 #57073
TLDR: I am +1 on not making pyarrow a required dependency in pandas 3.0. I am -1 on making NumPy 2.0 the minimum version and NumPy's StringDType the default in pandas 3.0. Keep the status quo in 3.0. A few thoughts:
I think we should still stick with PDEP-10 as is; even if user benefit 1 wasn't as drastic as envisioned, I still think benefits 2 and 3 help immensely. Generally the story around the pandas type system is very confusing; I am hopeful that moving towards the Arrow type system solves that over time.
Personally I am in favor of keeping pyarrow optional (although I voted for the PDEP, because I find it more important to have a proper string dtype). But I also agree with Matt that it seems too fast to require numpy >= 2 for pandas (not only because the string dtype is very new, but also just because this will be annoying for the ecosystem to require such a new version of numpy that many other packages will not yet be compatible with). If we want a simple alternative to keep pyarrow optional, I don't think we need to use numpy's new string dtype, though. We already have an object-dtype based StringDtype that can be the fallback when pyarrow is not installed. Users still get the benefit of a new default, proper string dtype. I would also like to suggest another potential idea to consider: we could adopt Arrow's type (memory model) for strings, but without requiring pyarrow the package. Building on @WillAyd's work in #54506 using nanoarrow to use bitmaps in pandas, we could implement a basic StringArray in a similar way, and implement the basic required features in pandas itself (things like getitem, take, isna, unique/factorize), and then for the string-specific methods either use pyarrow if installed, or fall back to Python's string methods otherwise (or, if we could vendor some code for this, progressively implement some string methods ourselves).
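A minimal sketch of that last idea, assuming nothing about pandas internals (the function name and structure here are hypothetical, purely for illustration): use pyarrow's compute kernels when available, otherwise fall back to Python's string methods element-wise.

```python
# Hypothetical sketch of the "pyarrow if installed, else Python" fallback idea.
try:
    import pyarrow as pa
    import pyarrow.compute as pc

    HAS_PYARROW = True
except ImportError:
    HAS_PYARROW = False


def str_upper(values):
    """Upper-case a sequence of strings/None, preferring pyarrow's kernel."""
    if HAS_PYARROW:
        # Vectorized UTF-8 kernel; nulls propagate automatically.
        return pc.utf8_upper(pa.array(values, type=pa.string())).to_pylist()
    # Fallback: plain Python string methods, preserving None as missing.
    return [v.upper() if v is not None else None for v in values]


print(str_upper(["foo", None, "bar"]))  # ['FOO', None, 'BAR']
```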
+1 on this as well. IMO, it's too early to require numpy 2.0 (since it's pretty hard to adapt to the changes). cc @pandas-dev/pandas-core
+1 on not requiring numpy 2 for pandas 3. I'm fine to continue as planned with the PDEP. If we consider the option of another Arrow implementation replacing PyArrow, it feels like using Arrow-rs is a better option than nanoarrow to me (at least an option also worth considering). Last time this was discussed it wasn't clear what would happen with the two Rust implementations, but now everybody (except Polars for now) is settled on Arrow-rs and Arrow2 is discontinued. So, things are stable. If there is interest, I can research further and work on a prototype.
I think we should wait for more feedback in #54466. pandas 2.2 was released only 11 days ago. I say we give it a month, or maybe until the end of February, and then make a decision. The whole point of gathering feedback was to give us a chance to revisit the decision to make pyarrow a required dependency.
Given the feedback so far and the arguments that @MarcoGorelli gives above and the other comments, I'm leaning towards (3), but I'd like to see more feedback from the community at large.
IMO, we should make a decision by the next dev call (Feb. 7th, I think?). I'm probably going to release 2.2.1 at most 2 weeks after numpy releases the 2.0rc (so probably around Feb. 14, assuming numpy 2.0 releases on schedule on Feb 1), and I think we should decide whether to roll back the warning for 2.2.1, to avoid confusion.
I did a quick test of how big a binary using Arrow-rs (Rust) would be. In general, only static linking is used in Rust, so just one self-contained binary needs to be shipped. Something that could make this happen quicker and with less effort is implementing the same PyArrow API for Arrow-rs for the parts we need. In theory, that would allow us to simply replace PyArrow with the new package and update the imports. If there is interest in giving this a try, I'd personally change my vote here from requiring PyArrow in pandas 3 to keeping the status quo for now.
I assume that the decision would be whether we plan to revise the PDEP and then go through the PDEP process again for the revised PDEP? The PDEP process was created not only so that decisions have sufficient discussion and visibility, but also so that, once agreed, people could then work towards the agreed changes/improvements without being vetoed by individual maintainers. In this case, however, it may be that several maintainers would vote differently now. Does our process allow us to re-vote on the existing PDEP? (given that the PDEP did include the provision to collect feedback from the community) Does the outcome of any discussions/decisions on this affect whether the next pandas version is 3.0 or 2.3?
Agree with Simon, this concern was discussed as part of the original PDEP (#52711 (comment)) with some timelines discussed, and the vote was still approved. I somewhat expected some of the pushback from developers of web apps, so am supportive of this new proposal and my original vote, but it needs to fit in with the governance established, and should possibly also be cautious of any development that has taken place in H2 '23 in anticipation of the implementation of the PDEP. I would expect the approved PDEP to continue to steer the development until formally agreed otherwise. I don't see a reason why a new PDEP could not be proposed to alter/amend the previous one, particularly if there already seemed to be enough support to warrant one.
With @jorisvandenbossche's idea in mind, I wanted to try to implement an ExtensionArray-compatible StringArray using nanoarrow. Some Python idioms like negative indexing aren't yet implemented, and there is a limitation around classmethods I haven't worked around, but otherwise I did get this implemented here: https://github.com/WillAyd/nanopandas/tree/7e333e25b1b4027e49b9d6ad2465591abf0c9b27 I also implemented some of the optional interface items. Of course benchmarking this would take some effort, but I think most of the algorithms we would need are pretty simple.
I too was keen to keep pyarrow optional but voted for the PDEP for the benefits to other dtypes. From the PDEP... "Starting in pandas 3.0, the default type inferred for string data will be ArrowDtype with pyarrow.string instead of object. Additionally, we will infer all dtypes that are listed below as well instead of storing as object."
IIRC I also made this point in the original discussion, but there was pushback to having the object-backed StringDtype as the default if pyarrow is not installed, which included not only concerns about performance but also regarding different behavior depending on whether a dependency was installed. (The timelines for NumPy's StringDType precluded that as an option to address the performance concerns.) However, I did not push this point once the proposal was expanded to dtypes other than strings.
Didn't we also discuss using, say, nanoarrow? (Or am I mixing up the discussion on requiring pyarrow for the I/O interface?) If this wasn't discussed, then a new/further discussion around this option would add value (#57073 (comment)), especially since @WillAyd is actively working on this.
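For reference, the inference change the quoted PDEP text describes can be illustrated with dtype spellings that already exist in released pandas (this example is added here for clarity; it is not from the original comment):

```python
import pandas as pd
import pyarrow as pa

# What PDEP-10 describes: string data backed by Arrow memory
# instead of NumPy object arrays. Explicit construction works today:
s = pd.Series(["a", "b", None], dtype=pd.ArrowDtype(pa.string()))
print(s.dtype)  # string[pyarrow]
```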
Another advantage of building on top of nanoarrow is we would have the ability to implement our own algorithms to fit the needs of pandas. Here is a quick benchmark of the nanopandas isna implementation:

```python
In [3]: import nanopandas as nanopd

In [4]: import pandas as pd

In [5]: arr = nanopd.StringArray([None, "foo"] * 1_000_000)

In [6]: ser = pd.Series([None, "foo"] * 1_000_000, dtype="string[pyarrow]")

In [7]: arr.isna().to_pylist() == list(ser.isna())
Out[7]: True

In [8]: %timeit arr.isna()
10.7 µs ± 45 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

In [9]: %timeit ser.isna()
2 ms ± 43.8 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
```

That's about a 200x speedup. Of course it's not an entirely fair comparison, because the pandas arrow extension implementation goes through more layers per call.
Going down this path is a tremendous amount of work - effectively replicating pyarrow. This should not be taken lightly - the purpose of having pyarrow as a required dependency is to reduce overall project requirements.
The added workload is a very valid concern, though it's definitely not on the scale of replicating pyarrow. We should just be using the nanoarrow API and not even managing memory, since the nanoarrow C++ helpers can do that for you.
While I'm surely +1 on replacing PyArrow with a better implementation, I guess the proposal is not to implement the string algorithms in nanoarrow and make this the default for pandas 3.0, right? So, I think in some months we can have pandas strings based on:
besides the existing ones based on NumPy objects and PyArrow. To narrow the discussion, I think we need to decide somewhat independently:
Yes, I too think that full implementations may be too ambitious and may not even be necessary (performance-wise). I would think that these implementations would only be needed as fallbacks if we were to u-turn on the pyarrow requirement, so that we could move forward with defaulting to Arrow-memory-backed arrays for the dtypes listed in the PDEP for pandas 3.0. The feedback from users is not unexpected and was discussed (other than the noise regarding warnings). As @attack68 said, "I would expect the approved PDEP to continue to steer the development until formally agreed otherwise.", i.e. through the PDEP revision procedure.
Good idea to distinguish the shorter and longer term. But on the short term options:
No, as I mentioned above, IMO we already have an obvious fallback for pyarrow, which is the object-dtype backed StringDtype. So I think making that the default (if pyarrow is not installed) is another very reasonable option for pandas 3.0. (Personally, the most important thing for me is that we get a proper string dtype by default in 3.0.) Simon brought up the point of having two implementations with slight differences in behaviour:
And that's certainly a valid downside of this (I seem to remember we have had issues with this in the past with different behaviour when numexpr or bottleneck was installed and "silently" used by default). This might also be unavoidable in general, for other data types as well. It seems likely that for numeric data too we will have a numpy-based and pyarrow-based implementation side by side for some time, and there will likely be slight differences in behaviour there as well.
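To make the existing fallback concrete (a small illustration using dtype spellings from released pandas, added here for reference):

```python
import pandas as pd

data = ["a", None, "c"]

# object-backed StringDtype: works without pyarrow installed
s_py = pd.Series(data, dtype="string[python]")
# pyarrow-backed StringDtype: requires pyarrow
s_pa = pd.Series(data, dtype="string[pyarrow]")

# Same user-facing behavior (pd.NA, .str accessor); only the storage differs.
print(s_py.str.upper().tolist())  # ['A', <NA>, 'C']
print(s_pa.str.upper().tolist())  # ['A', <NA>, 'C']
```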
Yes, it is all too easy for the discussion to go off at tangents. This issue was opened with the suggestion of requiring NumPy 2.0+; it appears there is no support for this at all? The other question raised was whether anyone would vote differently now. For this, however, it does appear that several maintainers would. For those that would, it would be interesting to know explicitly what changes to the PDEP would be expected. Or, to keep the status quo, we would somehow need a re-vote on the current PDEP revision. To be clear, without any changes to the PDEP, I would not change my vote. I do not regret the decision, since my decision was based on better data types other than just strings, and discussions around a better string data type do not fully address this.
Unless we want to change the timeline for 3.0 (or delay the introduction of the string dtype to a later pandas release), I think it's not very realistic. To start, this change hasn't even landed yet on numpy main. I think it will also be annoying for pandas to strictly require numpy >= 2.0 that soon (given numpy 2.0 itself is also a breaking release). Further, numpy only implements a subset of string kernels (for now), so even when using numpy for the memory layout, we will still need a fallback to Python objects for quite a few of our string methods.
Sorry, I missed that option @jorisvandenbossche. I personally don't like using the object-backed string dtype as the fallback.
For the long term we have:
While I don't think we need a full rewrite of PyArrow, I think we need the following things in any Arrow implementation we use in order to be functional (string operations alone don't seem useful to me, as it'd still require PyArrow anyway to have a dataframe with Arrow columns):
I think for the nanoarrow approach we need all of this, which does feel like almost rewriting PyArrow from scratch. Also, do we have a reason to think the new implementation will be smaller than Arrow? What do you think @WillAyd? Maybe I'm missing something here. While Arrow-rs doesn't have Python bindings, Datafusion does. It seems to provide all or most of what we need to fully replace PyArrow. The Python package doesn't have dependencies and it requires 43 MB. Quite big, but less than half of PyArrow. The build should be just standard building, which I think was another issue with PyArrow. I think it's an option worth exploring.
Don't worry ;) But can you clarify which cumbersome things users would have to learn? For the average user, whether we use a pyarrow string array or a numpy object array under the hood, that's exactly the same and you shouldn't notice that (except for performance differences, of course).
I was rereading your original comment and I realize now that your initial proposal is to make the PyArrow string type the default, except when PyArrow is not installed, right? Your last comment sounded like you wanted to always default to the string object type, which is what I find complex to learn (considering what users already know about object...). The string PyArrow type as default and string object as fallback seems like a reasonable trade-off to me.
Yes. We had this discussion in the original PDEP starting around #52711 (comment), following on from a discussion around the NumPy string type, and yet still voted (as a team) on requiring PyArrow as a required dependency. What I see as new to the discussion is considering using nanoarrow to instead provide some sort of fallback option, not the default. I see this could potentially address some of the concerns around data types other than strings, e.g. #52711 (comment)
To be clear, at no point was I suggesting we rewrite pyarrow; what I showed is simply an extension array that uses Arrow-native storage. That is a much smaller scope than what some of the discussions here have veered towards. I don't think any of the arguments raised in this discussion are a surprise, and I still vote to stick with the PDEP. I think if that got abandoned but we still wanted Arrow strings without pyarrow, then the extension array I showcased above may be a reasonable fallback, and may even be easier to integrate than a "string[object]" fallback, because at the raw storage level it fits the same mold as a pyarrow string array.
Thanks @WillAyd for elaborating. I think if the PDEP were revised to include something like this (not requiring pyarrow, but if not installed defaulting to a pyarrow-memory-backed array on construction with limited functionality, and advising users to install pyarrow), I would perhaps vote differently now. So I agree that, at this point in time, these alternatives perhaps only need discussion if enough people are strongly against requiring pyarrow as planned.
Lukewarm is warm enough for me if that allows us to move forward ;) To note, compared to 2.0 there is no step back though: for the many users having pyarrow, pandas 3.0 will be a big step forward for string handling, for users without pyarrow it still is a nice step forward in having a proper and strict, dedicated string dtype (I know you are comparing it to the case of requiring Arrow, but I think the comparison from where users are right now is more important)
That will be interesting to see! As I mentioned above, I am very interested in further exploring alternatives on the longer term, and we certainly should consider the numpy string dtype there as well. But for the purpose of this decision right now for 3.0, IMO we can't take it that much into account (I personally think it is not that likely it will be fully ready for 3.0, but even if ready on time then we cannot use it as the sole alternative given its numpy version requirement, so if we want to make pyarrow only a soft dependency for the string dtype, we still need the numpy object-dtype based alternative anyway short term).
Ah sorry - I should have been more clear that I am -0.5 on yet another string type. I really think our string type system is starting to get confusing...
To be clear, this will probably be the eventual replacement for the object-dtype-based numpy string array, since the new numpy string ufuncs should match the semantics of the Python string methods. So we'll still end up with 2 string dtypes (eventually). As long as numpy is a hard dependency, we will probably want some sort of native numpy string dtype (since it wouldn't be ideal to copy numpy strings supplied by a user to object dtype or an Arrow dtype).
I agree. There seemed to be hesitation/concern about deviating from (revoking parts of) the agreed PDEP when it came to revoking the warning (#57424 (comment)), and yet IMHO there has been a total deviation from the agreed PDEP with the implementation of a new string dtype with NumPy semantics (#54792). I think this change was worthy of a PDEP in itself. Surely, the use of PyArrow and our extension dtypes was to improve on (deviate away from) NumPy NaN semantics towards a more consistent missing value indicator. I fail to see how this new string dtype (and the new fallback) is a long-term benefit or aligns with one of the original goals of PDEP-10, which was claimed to reduce the maintenance burden.
It is unfortunate that this wasn't properly discussed at the time of the actual PDEP (I don't remember any discussion about it (but I can certainly be misremembering, didn't check), and the PDEP text itself also just says to use "string[pyarrow]" / "the more efficient pyarrow string type" that has been available since 1.2, without mentioning anything about the consequences of this choice). I understand that others might find this a stretch, but given our complete lack of even mentioning null semantics at the time, personally I am interpreting the PDEP as using "a string dtype backed by pyarrow". Otherwise this means a silent and implicit change of the default null semantics for our users, while that is a change that would definitely warrant its own PDEP (and which is coming). For lack of specifying this, preserving the default null semantics seems the better default position to me. I think the discussion in #54792 already gives some context around why we think it is important to preserve the default dtypes and null semantics for now (and specifically in #54792 (comment), where I try to explain that this is meant to make the change less confusing for users)
That was also my understanding: that we would use the dtype that had been tried and tested, had been available to users for a while, and that conformed to what I like to call the pandas string API (distinguishing the python-backed and pyarrow-backed extension arrays from the default object dtype), which included the pandas missing value indicator and returned pandas nullable dtypes where possible.
Yes, I can see that this omission is maybe another reason why we may need to revise (re-vote on) PDEP-10, if people think they would have voted differently with this being more explicitly outlined.
Maintaining the status quo is probably a good position when there isn't consensus on how to proceed. However, if we are not requiring PyArrow for 3.0, then maybe we are not yet ready to change the default string dtype. There is also the better support for other dtypes that requiring PyArrow would have given. I think that PDEP-13 maybe expects PyArrow to be required?
Can you explain why you think this is the case?
Well, where on the roadmap do we transition to the pandas string API, or is this new string dtype considered a long-term solution? Also, why do we need a fallback (#58451)? I assume this is only needed if PyArrow is not a required dependency, and if that is the case PDEP-10 is voided, including the change in string dtype, as we cannot deliver the other dtype improvements either?
The current StringDtype enabled by the future.infer_string option requires pyarrow; the fallback in #58451 is what lets that default also work when pyarrow is not installed.
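For context, this is the opt-in flag as it exists in released pandas 2.x (shown here for illustration; the output assumes pyarrow is installed):

```python
import pandas as pd

# Opt in to the future default: string data is inferred as the new
# string dtype instead of object (available since pandas 2.1).
pd.set_option("future.infer_string", True)

s = pd.Series(["a", "b", None])
print(s.dtype)  # the pyarrow-backed string dtype with NaN semantics
```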
So this new string dtype (with NumPy semantics) is now the long-term solution for pyarrow-backed arrays in pandas?
I would personally prefer that the numpy string semantics be (eventually) left to the numpy StringDType and that Arrow strings should have Arrow semantics.
No, it is the current, potentially temporary solution for pyarrow-backed arrays which are enabled by default (so right now this is only the pyarrow-backed string dtype). And the "potentially temporary" part is dependent on a separate discussion about how to move forward to nullable dtypes in general (not specific to strings).
To be clear this is not about "string semantics". The string semantics for the different variants of our StringDtype are all the same. It is about the missing value sentinel, which is not string specific. And I will argue that long term we want to have the same missing value semantics regardless of whether a dtype is backed by numpy or pyarrow. But again that is for a separate discussion (beyond 3.0). It is unfortunate that we haven't resolved the missing values discussion before doing the string dtype, causing those additional dtypes and not-completely-orthogonal discussions that are hard to separate, but that is the reality we have to live with.
Ah ok. So the plan is still to move towards a StringDtype backed by pd.NA, if I'm understanding you correctly? Is part of the hesitation to go directly to a StringDtype with pd.NA as the missing value because we still support doing string operations (e.g. using the str accessor) on object dtype? If I'm following correctly, in terms of string arrays/dtypes we currently have:

- the existing StringArray (uses object dtype, essentially "python"-backed; uses pd.NA, returns nullable boolean/integer arrays in some operations)
- the existing pyarrow-backed variant ("string[pyarrow]"; also uses pd.NA and returns nullable dtypes)
- the proposed new variant ("string[pyarrow_numpy]"; pyarrow-backed but uses NaN and returns NumPy arrays)

If the plan long term is to move to pd.NA everywhere, is this new variant just a transitional step?
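Spelled out with the dtype strings available in released pandas (an illustration added for reference, not part of the original comment):

```python
import pandas as pd

data = ["a", None]

# object-backed, pd.NA semantics
print(pd.array(data, dtype="string[python]"))
# pyarrow-backed, pd.NA semantics (requires pyarrow)
print(pd.array(data, dtype="string[pyarrow]"))
# pyarrow-backed, NumPy NaN semantics -- the variant under discussion
print(pd.array(data, dtype="string[pyarrow_numpy]"))  # pandas >= 2.1
```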
Thanks for clarifying. This is why I think that maybe we are not yet ready. We are making a breaking change that is considered temporary. I'm not really in favor of this, but could have lived with it as a transition if we were making PyArrow a required dependency in 3.0. (IIRC it was also introduced without a PDEP and without cc'ing the core team on the discussions until after the work had been done and some parts were already merged.) I can also see that without PyArrow as a required dependency, continuing with this new string data type gives some significant performance improvements equivalent to those that were initially used to partially justify having PyArrow as a required dependency. I'm not sure if others would agree whether a retrospective PDEP for this dtype change would be beneficial, so that the change has more eyes on it. Without a PDEP, and without having PyArrow as a required dependency, maybe the new string type could remain behind the future flag for now?
Yes (users might still have to do some code changes just because it is no longer object dtype but a string dtype, but users shouldn't need to make code changes related to missing values because of the new string dtype)
If we get agreement on that long term plan for pd.NA, then yes (but the same applies to all our default dtypes right now that use NaN/NaT as missing value sentinel, also all those other dtypes will need a transition, so this is not specific to those new string dtypes)
The "breaking" aspect (the fact that it will be a string dtype instead of object) is the intended change, not the temporary part. We are not ready to make a string dtype + NA the default, but IMO we are perfectly ready to just make a string dtype the default (without the missing value change).
This seems reasonable, on the proviso that the fallback (#58451) behaves exactly the same (and this PR is not yet ready; in your own words, "There are a bunch more of such changes needed ... to properly handle the case without pyarrow installed."), otherwise we don't address @jbrockmendel's original concerns about different behaviors. We should also probably remove the pyarrow_numpy naming, which I know you were uncomfortable with. This very naming suggests some sort of Frankenstein dtype, and as @WillAyd states in #54792 (comment), the additional dtype could create more user confusion. So the user should probably only see "string", and we keep the pyarrow implementation a hidden detail. I guess this would make the current breaking change (to a pandas-native string dtype) and a future breaking change (to using pd.NA, either in conjunction with or separate from returning pandas nullable types from operations such as str.len) more palatable. Either way, I really think a PDEP was needed, and maybe it is not too late to do this? There are significant design decisions that have to date only been discussed between a small group, and the more eyes, the more likely a better (or the best) solution will be delivered.
The new string dtype (PyArrow-backed with NumPy semantics) that is being proposed for 3.0 (#57073 (comment)) was originally incorporated into pandas when it was assumed that PyArrow was going to become a required dependency. The logic for this was that the consequences of having it use the pandas missing value indicator and returning other pandas nullable types had not been discussed properly in the PDEP (#57073 (comment)). The idea of a fallback, so that PyArrow did not need to be a required dependency, had some pushback in the original PDEP discussion, due to both behavioral differences and performance concerns. "Offering it to users is doing them a disservice." was one comment and "Personally, I don't see a point for the string[python] implementation anymore. Non expert users end up with this dtype more often than not and wonder why it's slow." was another. This issue was originally opened with the title "DISC: Don't require PyArrow in pandas 3.0, require NumPy 2.0+, use NumPy's StringDType". Having NumPy 2.0 as a required dependency was dismissed, but at the time the fallback option was not on the table, so this option was rejected. Time has moved on and we are now again proposing a fallback option. As @jorisvandenbossche mentions, we need an agreement on the long-term plan for pd.NA for all data types, so that the default pandas types are the pandas nullable types. There are some comments here and elsewhere regarding the confusion of mixing the NumPy and PyArrow semantics. (The term PyArrow is to some extent interchangeable with pandas nullable types, and NumPy is to some extent interchangeable with pandas' current type system.) If we cannot get away from having a fallback option, then a native string type for pandas using the current type system (NumPy semantics) could maybe best be achieved using the new NumPy string dtype. Finally, #57073 (comment) does not address the other dtype improvements that we planned to deliver. So personally, I disagree that the proposal honors the gist of the PDEP.
Assuming that the fallback option could be removed once PyArrow becomes a required dependency (if using the proposed new string dtype), or once the minimum required version of NumPy is 2.0 (if we used the new NumPy string dtype for a default pandas-native dtype): in general, what do people expect these timeframes to be? If we introduce a temporary transitional dtype, there is a risk that it somehow remains the default for the longer term. Also, once we have agreed on moving forward with the pandas nullable types as the defaults, do we expect the current pandas type system to remain for backwards compatibility? Do we then deprecate the pyarrow_numpy string dtype, keep it, or replace it with the new NumPy-native string dtype?
I guess the next steps here would be to: A) Formally reject/accept PDEP-10. I will open a PR to change the status of PDEP-10 to rejected, and we should vote on that. If that vote fails, I guess we'll just have to go with accepting pyarrow as a dependency. B) Specifically open a PDEP about untangling the string dtypes, and decide what we should cover in that PDEP.
C) Push nullable dtypes to be the default in some future version of pandas. This might be a topic fit for PDEP-13. Did I summarize the conversation here accurately?
If my attempt to find consensus for a compromise about how to move forward on the short term for 3.0 (#57073 (comment)) doesn't work out, then yes, a PDEP will help, and I am happy to write one, specifically for what I proposed above. Personally I would only put in the effort of writing it if 1) the scope is limited to just what to do for a default string dtype in pandas 3.0, and 2) we can actually move forward with the implementation in parallel, so that if the PDEP gets approved we are better prepared to do a 3.0 release (we have plenty of prior cases where we have had features behind an option flag before it was officially approved to make it the default, and we actually already have this in released pandas for the pyarrow-backed version of the dtype).
@lithomas1 sorry, our posts crossed. I am personally fine with your top-level items, but as I mentioned in my previous post, I would personally keep a PDEP on the string dtype more limited (deprecating string methods for object dtype can be considered later, what to do when NA becomes the default can be left for the PDEP about NA, and I don't think we need to make a final decision on the usage of np.StringDType for pandas 3.0 (as IMO that also doesn't influence what we should do for pandas 3.0)).
My attempt at writing a PDEP for this -> #58551
Sure. That's a great idea. We do not need to agree on, just give due consideration to, the bigger picture, and we can also start reviewing the fallback PR and getting parts merged without any blocking, as the final decision is gated behind the PDEP. Thanks @jorisvandenbossche for putting the PDEP together. I am much more comfortable now, especially since PDEP-1 explicitly mentions "Adding a new data type has impact on a variety of places that need to handle the data type. Such wide-ranging impact would require a PDEP." I did feel that PDEP-10 had not made provision for deviating from the existing string types, and also that if we do not make PyArrow a required dependency we would need to revoke PDEP-10 and so would not have any formal agreement for implementing a native string type in 3.0.
TL;DR: Don't make PyArrow required - instead, set minimum NumPy version to 2.0 and use NumPy's StringDType.
Background
In PDEP-10, it was proposed that PyArrow become a required dependency. Several reasons were given, but the most significant reason was to adopt a proper string data type, as opposed to object. This was voted on and agreed upon, but there have been some important developments since then, so I think it's warranted to reconsider.
StringDType in NumPy
There's a proposal in NumPy to add a StringDType to NumPy itself. This was brought up in the PDEP-10 discussion, but at the time was not considered significant enough to delay the PyArrow requirement because:
Let's tackle these in turn:
I caught up with Nathan Goldbaum (author of the StringDType proposal) today, and he's said that NEP55 will be accepted (although technically still in draft status, it has several supporters and no objectors and so realistically is going to change to "accepted" very soon).
The second concern was the algorithms. Here's an excerpt of the NEP I'd like to draw attention to:
So, NEP55 not only provides a NumPy StringDType, but also efficient string algorithms.
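A small taste of what NEP 55 provides (requires NumPy >= 2.0; a sketch based on the released API):

```python
import numpy as np  # requires numpy >= 2.0

# The variable-width UTF-8 string dtype from NEP 55
arr = np.array(["foo", "bar", "baz"], dtype=np.dtypes.StringDType())

# String operations run as vectorized ufuncs, no object-dtype round trip
print(np.strings.upper(arr))    # ['FOO' 'BAR' 'BAZ']
print(np.strings.str_len(arr))  # [3 3 3]
```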
There's a pandas fork implementing this in pandas, which Nathan has been keeping up-to-date. Once the NumPy StringDType is merged into NumPy main (likely next week) it'll be much easier for pandas devs to test it out. Note: some parts of the fork don't yet use the ufuncs, but they will do soon, it's just a matter of updating things.
For any ufunc that's missing, Nathan's said that now that the string ufuncs framework exists in NumPy, it's relatively straightforward to add new ones (e.g. for .str.partition). There is real funding behind this work, so it's likely to keep moving quite fast. Nathan's said he doesn't have timings to hand for this comparison, and is about to go on holiday 🌴 He'll be able to provide timings in 1-2 weeks' time though.
Personally, I'd be fine with requiring NumPy 2.0 as the minimum NumPy version for pandas, if it means efficient string handling by default without the need for PyArrow. Also, Nathan Goldbaum's fork already implements this for pandas. So, no need to wait 2 years, it should just be a matter of months.
Feedback
The feedback issue makes for an interesting read: #54466.
Complaints seem to come mostly (as far as I can tell) from other package maintainers who are considering moving away from pandas (e.g. fairlearn).
This one surprised me; I don't think anyone had considered it before? One could argue that it's VirusTotal's issue, but still, I just wanted to bring visibility to it.
Tradeoffs
In the PDEP-10 PR it was mentioned that PyArrow could help reduce some maintenance work (which, despite some funding, still seems to be mostly volunteer-driven). Has this been investigated further? Is it still likely to be the case?
Furthermore, not requiring PyArrow would mean not being able to infer list and struct dtypes by default (at least, not without significant further work).
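For concreteness, these are the kinds of Arrow-backed dtypes that pyarrow enables (dtype spellings from released pandas; added here for illustration):

```python
import pandas as pd
import pyarrow as pa

# Nested Arrow types with no NumPy-native equivalent
s_list = pd.Series([[1, 2], [3]], dtype=pd.ArrowDtype(pa.list_(pa.int64())))
s_struct = pd.Series(
    [{"a": 1}, {"a": 2}],
    dtype=pd.ArrowDtype(pa.struct([("a", pa.int64())])),
)
print(s_list.dtype)    # list<item: int64>[pyarrow]
print(s_struct.dtype)  # struct<a: int64>[pyarrow]
```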
"No is temporary, yes is forever"
I'm not saying "never require PyArrow". I'm just saying, at this point in time, I don't think the requirement is justified. Of the proposed benefits, the most salient one is strings, and now there's a realistic alternative which doesn't require taking on an extra massive dependency.
I acknowledge that lately I've been more focused on other projects, and so don't want to come across as "I'm telling pandas what to do because I know best!" (I certainly don't).
Circumstances have changed since the PDEP-10 PR and vote, and personally I regret voting the way I did. Does anyone else feel the same?