
DISC: Consider not requiring PyArrow in 3.0 #57073

Open
MarcoGorelli opened this issue Jan 25, 2024 · 68 comments
Labels
Needs Discussion (Requires discussion from core team before further action)

Comments

@MarcoGorelli
Member

MarcoGorelli commented Jan 25, 2024

TL;DR: Don't make PyArrow required - instead, set minimum NumPy version to 2.0 and use NumPy's StringDType.

Background

In PDEP-10, it was proposed that PyArrow become a required dependency. Several reasons were given, but the most significant reason was to adopt a proper string data type, as opposed to object.
This was voted on and agreed upon, but there have been some important developments since then, so I think it's warranted to reconsider.

StringDType in NumPy

There's a proposal in NumPy to add a StringDType to NumPy itself. This was brought up in the PDEP-10 discussion, but at the time was not considered significant enough to delay the PyArrow requirement because:

  1. NumPy itself might not accept its StringDType proposal.
  2. NumPy's StringDType might not come with the algorithms pandas needs.
  3. pyarrow's strings might still be significantly faster.
  4. Because pandas typically supports older NumPy versions (in addition to the latest release), it would be 2+ years until pandas could use NumPy's strings.

Let's tackle these in turn:

  1. I caught up with Nathan Goldbaum (author of the StringDType proposal) today, and he's said that NEP55 will be accepted (although technically still in draft status, it has several supporters and no objectors and so realistically is going to change to "accepted" very soon).

  2. The second concern was the algorithms. Here's an excerpt of the NEP I'd like to draw attention to:

    In addition, we will add implementations for the comparison operators as well as an add loop that accepts two string arrays, multiply loops that accept string and integer arrays, an isnan loop, and implementations for the str_len, isalpha, isdecimal, isdigit, isnumeric, isspace, find, rfind, count, strip, lstrip, rstrip, and replace string ufuncs [universal functions] that will be newly available in NumPy 2.0.

    So, NEP55 not only provides a NumPy StringDType, but also efficient string algorithms (a brief usage sketch follows after this list).

    There's a pandas fork implementing this in pandas, which Nathan has been keeping up-to-date. Once the NumPy StringDType is merged into NumPy main (likely next week), it'll be much easier for pandas devs to test it out. Note: some parts of the fork don't yet use the ufuncs, but they will soon; it's just a matter of updating things.

    For any ufunc that's missing, Nathan's said that now that the string ufuncs framework exists in NumPy, it's relatively straightforward to add new ones (e.g. for .str.partition). There is real funding behind this work, so it's likely to keep moving quite fast.

  3. Nathan's said he doesn't have timings to hand for this comparison, and is about to go on holiday 🌴 He'll be able to provide timings in 1-2 weeks' time though.

  4. Personally, I'd be fine with requiring NumPy 2.0 as the minimum NumPy version for pandas, if it means efficient string handling by default without the need for PyArrow. Also, Nathan Goldbaum's fork already implements this for pandas. So, no need to wait 2 years, it should just be a matter of months.
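For illustration, here is a minimal sketch of what the NEP 55 dtype looks like in use, assuming NumPy 2.0 ships np.dtypes.StringDType and the np.strings ufunc namespace as described in the NEP and release notes:

import numpy as np

# Variable-length UTF-8 strings stored natively in NumPy (NEP 55);
# requires numpy >= 2.0.
arr = np.array(["pandas", "pyarrow", "numpy"], dtype=np.dtypes.StringDType())

print(arr.dtype)                # StringDType()
print(np.strings.str_len(arr))  # [6 7 5] -- a ufunc, no Python-object round trip
print(np.strings.upper(arr))    # ['PANDAS' 'PYARROW' 'NUMPY']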

Feedback

The feedback issue makes for an interesting read: #54466.
Complaints seem to come mostly (as far as I can tell) from other package maintainers who are considering moving away from pandas (e.g. fairlearn).

This one surprised me; I don't think anyone had considered it before? One could argue that it's VirusTotal's issue, but still, I just wanted to bring visibility to it.

Tradeoffs

In the PDEP-10 PR it was mentioned that PyArrow could help reduce some maintenance work (which, despite some funding, still seems to be mostly volunteer-driven). Has this been investigated further? Is it still likely to be the case?

Furthermore, not requiring PyArrow would mean not being able to infer list and struct dtypes by default (at least, not without significant further work).

"No is temporary, yes is forever"

I'm not saying "never require PyArrow". I'm just saying, at this point in time, I don't think the requirement is justified. Of the proposed benefits, the most salient one is strings, and now there's a realistic alternative which doesn't require taking on an extra massive dependency.

I acknowledge that lately I've been more focused on other projects, and so don't want to come across as "I'm telling pandas what to do because I know best!" (I certainly don't).

Circumstances have changed since the PDEP-10 PR and vote, and personally I regret voting the way I did. Does anyone else feel the same?

@MarcoGorelli added the "Needs Discussion" label (Requires discussion from core team before further action) on Jan 25, 2024
@mroeschke
Member

TLDR: I am +1 not making pyarrow a required dependency in pandas 3.0. I am -1 on making NumPy 2.0 the min version and numpy StringDtypes the default in pandas 3.0. Keep the status quo in 3.0.

A few thoughts:

  1. numpy StringDtype will still be net new in 2.0. While I expect the new type to be robust and more performant than object, I think that, as with any new feature, it should be opt-in first before being made the default, as the scope of edge-case incompatibility is unknown. pyarrow strings have been around since 1.3, and it was not decided until recently to make them the default (I understand it's a different type system too).

  2. I have a biased belief that the pyarrow type system, with its nullability and support for more types, would be a net benefit for users, but I understand that the current numpy type system is "sufficient". It would be cool to allow users to use pyarrow types everywhere in pandas by default, but making that opt-in is, I think, a likely end state for pyarrow + pandas.

@WillAyd
Member

WillAyd commented Jan 25, 2024

I think we should still stick with PDEP-10 as is; even if user benefit 1 wasn't as drastic as envisioned, I still think benefits 2 and 3 help immensely.

Generally, the story around the pandas type system is very confusing; I am hopeful that moving towards the Arrow type system solves that over time.

@jorisvandenbossche
Member

jorisvandenbossche commented Jan 25, 2024

Personally I am in favor of keeping pyarrow optional (although I voted for the PDEP, because I find it more important to have a proper string dtype). But I also agree with Matt that it seems too fast to require numpy >= 2 for pandas (not only because the string dtype is very new, but also because requiring such a new version of numpy, which many other packages will not yet be compatible with, will be annoying for the ecosystem).

If we want a simple alternative to keep pyarrow optional, I don't think we need to use numpy's new string dtype, though. We already have an object-dtype based StringDtype that can be the fallback when pyarrow is not installed. Users still get the benefit of a new default, proper string dtype in 3.0 in all cases, but if they also want the performance improvements of the new string dtype, they need to have pyarrow installed. Then it's up to users to make that trade-off (and we can find ways to strongly encourage users to use pyarrow).
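As a minimal sketch of that fallback, using the storage spellings that have shipped since pandas 1.3:

import pandas as pd

# Object-backed StringDtype: a proper string dtype without pyarrow.
s_obj = pd.Series(["a", "b", None], dtype="string[python]")
print(s_obj.dtype)        # string
print(s_obj.str.upper())  # NA-aware string methods, backed by Python objects

# With pyarrow installed, the faster variant exposes the same interface:
# s_pa = pd.Series(["a", "b", None], dtype="string[pyarrow]")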

I would also like to suggest another potential idea to consider: we could adopt Arrow's type (memory model) for strings, but without requiring pyarrow the package. Building on @WillAyd's work in #54506 using nanoarrow to use bitmaps in pandas, we could implement a basic StringArray in a similar way, and implement the basic required features in pandas itself (things like getitem, take, isna, unique/factorize), and then for the string-specific methods either use pyarrow if installed, or fallback to Python's string methods otherwise (or if we could vendor some code for this, progressively implement some string methods ourselves).
This of course requires a decent chunk of work in pandas itself. But with the advantages that this keeps compatibility with the Arrow type system (and zero-copy conversion to/from Arrow), and also already gives some advantages for the case pyarrow is not installed (improved memory usage, performance improvements for a subset of methods).

@lithomas1
Member

TLDR: I am +1 not making pyarrow a required dependency in pandas 3.0. I am -1 on making NumPy 2.0 the min version and numpy StringDtypes the default in pandas 3.0. Keep the status quo in 3.0.

+1 on this as well. IMO, it's too early to require numpy 2.0 (since it's pretty hard to adapt to the changes).

cc @pandas-dev/pandas-core

@datapythonista
Member

+1 on not requiring numpy 2 for pandas 3.

I'm fine to continue as planned with the PDEP. If we consider the option of another Arrow implementation replacing PyArrow, it feels like using Arrow-rs is a better option than nanoarrow to me (or at least an option also worth considering). Last time this was discussed it wasn't clear what would happen with the two Rust implementations, but now everybody (except Polars for now) has settled on Arrow-rs and Arrow2 is discontinued. So, things are stable.

If there is interest, I can research further and work on a prototype.

@Dr-Irv
Contributor

Dr-Irv commented Jan 30, 2024

I think we should wait for more feedback in #54466. pandas 2.2 was released only 11 days ago. I say we give it a month, or maybe until the end of February, and then make a decision. The whole point of gathering feedback was to give us a chance to revisit the decision to make pyarrow a required dependency. It seems like our options at this point with pandas 3.0 are:

  1. Require pyarrow as planned from PDEP-10
  2. Require numpy 2.0 and use numpy implementation for strings.
  3. Postpone to a later date any requirement for pyarrow - make it optional but allow people to get better string performance by opting in.

Given the feedback so far and the arguments that @MarcoGorelli gives above and the other comments, I'm leaning towards (3), but I'd like to see more feedback from the community at large.

@lithomas1
Member

IMO, I think we should make a decision by the next dev call (Feb. 7th, I think?).

I'm probably going to release 2.2.1 at most 2 weeks after numpy releases the 2.0rc (so probably around Feb. 14, assuming numpy 2.0 releases on schedule on Feb 1), and I think we should decide whether to roll back the warning for 2.2.1, to avoid confusion.

@datapythonista
Member

I did a quick test of how big a binary using Arrow-rs (Rust) would be. In general, only static linking is used in Rust, so just one .so with no dependencies would be needed. A sample library using Arrow-rs with the default components (arrow-json, arrow-ipc...) compiles to a file of around 500kb. In that sense, the Arrow-rs approach would solve the installation and size issues. Of course this is not an option for pandas 3.0, and it requires a non-trivial amount of work.

Something that could make this happen quicker and with less effort is implementing the same PyArrow API on top of Arrow-rs for the parts we need. In theory, that would allow us to simply replace PyArrow with the new package and update the imports.
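As a sketch of that drop-in idea from the pandas side (arrow_rs_compat is a purely hypothetical name for an Arrow-rs-backed package exposing a pyarrow-compatible subset):

# Hypothetical import shim: prefer pyarrow, else fall back to an
# Arrow-rs-backed package mirroring the pyarrow API subset pandas needs.
try:
    import pyarrow as pa
except ImportError:
    import arrow_rs_compat as pa  # hypothetical drop-in replacement

arr = pa.array(["a", "b", None])  # the same call works against either backend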

If there is interest in giving this a try, I'd personally change my vote here from requiring PyArrow in pandas 3, to keep the status quo for now.

@simonjayhawkins
Member

IMO, I think we should make a decision by the next dev call (Feb. 7th, I think?).

I assume that the decision would be whether we plan to revise the PDEP and then go through the PDEP process again for the revised PDEP?

The PDEP process was created not only so that decisions have sufficient discussion and visibility, but also so that, once agreed, people could work towards the agreed changes/improvements without being vetoed by individual maintainers.

In this case, however, it may be that several maintainers would vote differently now.

Does our process allow us to re-vote on the existing PDEP? (Given that the PDEP did include the provision to collect feedback from the community.)

Does the outcome of any discussions/decisions on this affect whether the next pandas version is 3.0 or 2.3?

@attack68
Contributor

attack68 commented Feb 3, 2024

Agree with Simon; this concern was discussed as part of the original PDEP (#52711 (comment)), with some timelines discussed, and the vote still passed. I somewhat expected some of the pushback from developers of web apps, so I am supportive of both this new proposal and my original vote, but it needs to fit in with the established governance, and we should possibly also be cautious of any development that has taken place in H2 '23 in anticipation of the PDEP's implementation. I would expect the approved PDEP to continue to steer development until formally agreed otherwise. I don't see a reason why a new PDEP could not be proposed to alter/amend the previous one, particularly if there already seems to be enough support to warrant one.

@WillAyd
Member

WillAyd commented Feb 3, 2024

I would also like to suggest another potential idea to consider: we could adopt Arrow's type (memory model) for strings, but without requiring pyarrow the package. Building on @WillAyd's work in #54506 using nanoarrow to use bitmaps in pandas, we could implement a basic StringArray in a similar way, and implement the basic required features in pandas itself (things like getitem, take, isna, unique/factorize), and then for the string-specific methods either use pyarrow if installed, or fallback to Python's string methods otherwise (or if we could vendor some code for this, progressively implement some string methods ourselves).

Following @jorisvandenbossche's idea, I wanted to try to implement an ExtensionArray-compatible StringArray using nanoarrow. Some Python idioms like negative indexing aren't yet implemented, and there was a limitation around classmethods I haven't worked around, but otherwise I did get this implemented here:

https://github.com/WillAyd/nanopandas/tree/7e333e25b1b4027e49b9d6ad2465591abf0c9b27

I also implemented some of the optional interface items like unique, fillna and dropna, alongside a few str accessor methods.

Of course benchmarking this would take some effort, but I think most of the algorithms we would need are pretty simple.

@simonjayhawkins
Member

Personally I am in favor of keeping pyarrow optional (although I voted for the PDEP, because I find it more important to have a proper string dtype). But I also agree with Matt that it seems too fast to require numpy >= 2 for pandas (not only because the string dtype is very new, but also because requiring such a new version of numpy, which many other packages will not yet be compatible with, will be annoying for the ecosystem).

I too was keen to keep pyarrow optional but voted for the PDEP for the benefits for other dtypes.

From the PDEP... "Starting in pandas 3.0, the default type inferred for string data will be ArrowDtype with pyarrow.string instead of object. Additionally, we will infer all dtypes that are listed below as well instead of storing as object."

If we want a simple alternative to keep pyarrow optional, I don't think we need to use numpy's new string dtype, though. We already have an object-dtype based StringDtype that can be the fallback when pyarrow is not installed. Users still get the benefit of a new default, proper string dtype in 3.0 in all cases, but if they also want the performance improvements of the new string dtype, they need to have pyarrow installed. Then it's up to users to make that trade-off (and we can find ways to strongly encourage users to use pyarrow).

IIRC I also made this point in the original discussion, but there was pushback against having the object-backed StringDtype as the default if pyarrow is not installed, which included not only concerns about performance but also about different behavior depending on whether a dependency was installed. (The timelines for NumPy's StringDType precluded that as an option to address the performance concerns.)

However, I did not push this point once the proposal was expanded to dtypes other than strings.

I would also like to suggest another potential idea to consider: we could adopt Arrow's type (memory model) for strings, but without requiring pyarrow the package. Building on @WillAyd's work in #54506 using nanoarrow to use bitmaps in pandas, we could implement a basic StringArray in a similar way, and implement the basic required features in pandas itself (things like getitem, take, isna, unique/factorize), and then for the string-specific methods either use pyarrow if installed, or fallback to Python's string methods otherwise (or if we could vendor some code for this, progressively implement some string methods ourselves).

Didn't we also discuss using, say, nanoarrow? (Or am I mixing this up with the discussion on requiring pyarrow for the I/O interface?)

If this wasn't discussed, then a new/further discussion around this option would add value (#57073 (comment)), especially since @WillAyd is actively working on this.

@WillAyd
Member

WillAyd commented Feb 5, 2024

Another advantage of building on top of nanoarrow is we would have the ability to implement our own algorithms to fit the needs of pandas. Here is a quick benchmark of the nanopandas isna() implementation versus pandas:

In [3]: import nanopandas as nanopd

In [4]: import pandas as pd

In [5]: arr = nanopd.StringArray([None, "foo"] * 1_000_000)

In [6]: ser = pd.Series([None, "foo"] * 1_000_000, dtype="string[pyarrow]")

In [7]: arr.isna().to_pylist() == list(ser.isna())
Out[7]: True

In [8]: %timeit arr.isna()
10.7 µs ± 45 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

In [9]: %timeit ser.isna()
2 ms ± 43.8 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

That's about a 200x speedup. Of course it's not a fair comparison, because the pandas arrow extension implementation calls to_numpy(), but in theory we would have more flexibility to avoid that copy to numpy if we take on more management of the lower level.

@jreback
Contributor

jreback commented Feb 5, 2024

going down this path is a tremendous amount of work - replicating pyarrow effectively

this should not be taken lightly - the purpose of having pyarrow as a required dependency is to reduce overall project requirements

@WillAyd
Member

WillAyd commented Feb 5, 2024

The added workload is a very valid concern, though it's definitely not on the scale of replicating pyarrow. We should just be using the nanoarrow API and not even managing memory, since the nanoarrow C++ helpers can do that for you.

@datapythonista
Member

While I'm surely +1 on replacing PyArrow by a better implementation, I guess the proposal is not to implement the string algorithms in nanoarrow and make this the default for pandas 3.0, right?

So, I think in some months we could have pandas strings based on:

  • NumPy 2
  • nanoarrow
  • Arrow-rs

Besides the existing ones based on NumPy objects and PyArrow.

To narrow the discussion, I think we need to decide somehow independently:

  • What do we do for pandas 3? I think the only reasonable options are require PyArrow and have pyarrow strings by default, or keep the status quo
  • What do we do long term? Do we commit to any of those implementations (pyarrow, numpy2, nanoarrow...)? Would it make sense to implement those as extensions / separate packages and make the default depend on what's installed? Like having a pandas metapackage that depends on pandas-core and pandas-pyarrow, so we would have users by default use pyarrow string types and users on wasm and lambda and anyone else who cares can install pandas-core and cherrypick dependencies.

@simonjayhawkins
Member

Yes, I too think that full implementations may be too ambitious and may not even be necessary (performance-wise). I would think that these implementations would only be needed as fallbacks if we were to u-turn on the pyarrow requirement so that we could move forward with defaulting to arrow memory backed arrays for the dtypes listed in the PDEP for pandas 3.0.

The feedback from users is not unexpected and was discussed (other than the noise regarding warnings)

As @attack68 said, "I would expect the approved PDEP to continue to steer the development until formally agreed otherwise.", i.e. through the PDEP revision procedure.

@jorisvandenbossche
Member

Good idea to distinguish the shorter and longer term. But on the short term options:

What do we do for pandas 3? I think the only reasonable options are require PyArrow and have pyarrow strings by default, or keep the status quo

No, as I mentioned above, IMO we already have an obvious fallback for pyarrow, which is the object-dtype backed StringDtype. So I think making that the default (if pyarrow is not installed) is another very reasonable option for pandas 3.0.

(Personally, for me the most important thing is that I want to see string in df.dtypes for columns with strings, so that we can stop explaining to users "if you see object, that's probably a column with strings". How exactly it's implemented under the hood is more of a detail, although I certainly want a more performant implementation as well.)
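Concretely, the user-visible change being argued for is just this (a sketch; the string dtype is opt-in today):

import pandas as pd

df = pd.DataFrame({"name": ["Alice", "Bob"]})
print(df.dtypes)  # name    object   <- today's default for string data

df = df.astype({"name": "string"})
print(df.dtypes)  # name    string   <- what a default string dtype would show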

Simon brought up the point of having two implementations with slight differences in behaviour:

If we want a simple alternative to keep pyarrow optional, I don't think we need to use numpy's new string dtype, though. We already have an object-dtype based StringDtype that can be the fallback when pyarrow is not installed. ...

IIRC I also made this point in the original discussion, but there was pushback against having the object-backed StringDtype as the default if pyarrow is not installed, which included not only concerns about performance but also about different behavior depending on whether a dependency was installed. (The timelines for NumPy's StringDType precluded that as an option to address the performance concerns.)

And that's certainly a valid downside of this (I seem to remember we have had issues with this in the past with different behaviour when numexpr or bottleneck was installed and "silently" used by default).
I do wonder, however, whether we are aware of many behaviour differences from testing and user reports of the arrow-backed StringDtype over the last years (I know of one reported to pyarrow about a different upper-casing of ß in apache/arrow#34599). I don't know if we have some skips/special cases in our tests because of behaviour differences.

This might also be unavoidable in general, for other data types as well. It seems likely that also for numeric data we will have a numpy-based and pyarrow-based implementation side by side for some time, and also there there will likely be slight differences in behaviour.

@simonjayhawkins
Member

  • What do we do for pandas 3? I think the only reasonable options are require PyArrow and have pyarrow strings by default, or keep the status quo

Yes, it is all too easy for the discussion to go off on tangents, and this issue was opened with the suggestion of requiring NumPy 2.0+.

It appears there is no support for this at all?

The other question raised was whether anyone would vote differently now. For this, however, it does appear that several maintainers would. For those that would, it would be interesting to know explicitly what changes to the PDEP would be expected.

Or to keep the status quo, we would somehow need a re-vote on the current PDEP revision.

To be clear, without any changes to the PDEP, I would not change my vote. I do not regret the decision, since my decision was based on better data types beyond just strings, and discussions around a better string data type do not fully address this.

@jorisvandenbossche
Member

this issue was opened with the suggestion of requiring NumPy 2.0+. It appears there is no support for this at all?

Unless we want to change the timeline for 3.0 (or delay the introduction of the string dtype to a later pandas release), I think it's not very realistic. To start, this change hasn't even landed yet on numpy main. I think it will also be annoying for pandas to strictly require numpy >= 2.0 that soon (given numpy 2.0 itself is also a breaking release). Further, numpy only implements a subset of string kernels (for now), so even when using numpy for the memory layout, we will still need a fallback to Python objects for quite a few of our string methods.
Given the last item, we would also want to keep the option to use PyArrow for strings as well, resulting in this double implementation anyway (with the possible behaviour differences). At that point, I think the easier option is to use the object-dtype fallback instead of a new numpy-2-based fallback.

@datapythonista
Member

Sorry, I missed that option @jorisvandenbossche. I personally don't like using string[object] by default; it doesn't add value in functionality or performance, and makes users have to learn more cumbersome things. But it's an option, so for pandas 3 we have:

  1. Continue with the agreed PDEP and require PyArrow
  2. "Cancel" the PDED and continue with the object type
  3. Use the string dtype backed by NumPy objects

For long term we have:

  • NumPy 2 (I don't know much about this, but I guess we don't need an extra dependency, though we lose the nice things about Arrow, like interoperability)
  • Nanoarrow
  • Arrow-rs

While I don't think we need a full rewrite of PyArrow, I think we need the following things in any Arrow implementation we use for it to be functional (string operations alone don't seem useful to me, as we'd still require PyArrow anyway to have a dataframe with Arrow columns):

  • Convert NumPy columns to Arrow
  • Build Arrow arrays from Python lists
  • Arrow CSV, Parquet... loaders

I think for the nanoarrow approach we need all this, which does feel like almost rewriting PyArrow from scratch. Also, do we have a reason to think the new implementation will be smaller than Arrow? What do you think @WillAyd? Maybe I'm missing something here.

While Arrow-rs doesn't have Python bindings, DataFusion does. It seems to provide all or most of what we need to fully replace PyArrow. The Python package doesn't have dependencies and takes 43 MB. Quite big, but less than half of PyArrow. The build should be just standard building; I think that was another issue with PyArrow. I think it's an option worth exploring.

@MarcoGorelli changed the title from "DISC: Don't require PyArrow in pandas 3.0, require NumPy 2.0+, use NumPy's StringDType" to "DISC: Consider to requiring PyArrow in 3.0" on Feb 5, 2024
@jorisvandenbossche
Member

Sorry, I missed that option @jorisvandenbossche. I personally don't like using string[object] by default; it doesn't add value in functionality or performance, and makes users have to learn more cumbersome things.

Don't worry ;) But can you clarify which cumbersome things users would have to learn? For the average user, whether we use a pyarrow string array or a numpy object array under the hood, that's exactly the same and you shouldn't notice that (except for performance differences, of course).
While it indeed doesn't give any performance benefits, IMO it gives a big functionality advantage in simply having a string dtype, compared to the current catch-all object dtype (that's one of the main reasons we added this object-dtype based StringDtype already in pandas 1.0, https://pandas.pydata.org/docs/dev/whatsnew/v1.0.0.html#dedicated-string-data-type). Functionality-wise, there is actually basically no difference between the object-based or pyarrow-based StringArray (with the exception of a few corner cases where pyarrow doesn't have an implementation and the pyarrow-backed array still falls back to Python).

@datapythonista
Member

I was rereading your original comment, and I realize now that your initial proposal is to make the PyArrow string type the default, except when PyArrow is not installed, right? Your last comment sounded like you wanted to always default to the string object type; that's what I find complex to learn (considering what users already know about object...).

String PyArrow type as default and string object as fallback seems like a reasonable trade off to me.

@simonjayhawkins
Member

String PyArrow type as default and string object as fallback seems like a reasonable trade off to me.

Yes. We had this discussion in the original PDEP starting around #52711 (comment), following on from a discussion around the NumPy string type, and yet we still voted (as a team) to make PyArrow a required dependency.

What I see as new to the discussion is considering using nanoarrow to instead provide some sort of fallback option, not the default.

I see this could potentially address some of the concerns around data types other than strings eg. #52711 (comment)

@MarcoGorelli changed the title from "DISC: Consider to requiring PyArrow in 3.0" to "DISC: Consider not requiring PyArrow in 3.0" on Feb 5, 2024
@WillAyd
Member

WillAyd commented Feb 5, 2024

To be clear, at no point was I suggesting we rewrite pyarrow; what I showed is simply an extension array that uses Arrow-native storage. That is a much smaller scope than what some of the discussions here have veered towards.

I don't think any of the arguments raised in this discussion are a surprise, and I still vote to stick with the PDEP. If that got abandoned but we still wanted Arrow strings without pyarrow, then the extension array I showcased above may be a reasonable fallback, and may even be easier to integrate than a "string[object]" fallback, because at the raw storage level it fits the same mold as a pyarrow string array.

@simonjayhawkins
Member

I don't think any of the arguments raised in this discussion are a surprise, and I still vote to stick with the PDEP. If that got abandoned but we still wanted Arrow strings without pyarrow, then the extension array I showcased above may be a reasonable fallback, and may even be easier to integrate than a "string[object]" fallback, because at the raw storage level it fits the same mold as a pyarrow string array.

Thanks @WillAyd for elaborating.

I think if the PDEP were revised to include something like this (not requiring pyarrow, but defaulting to an Arrow-memory-backed array on construction when pyarrow is not installed, with limited functionality and advice to users to install pyarrow), I would perhaps vote differently now.

So I agree that, at this point in time, these alternatives perhaps only need discussion if enough people are strongly against requiring pyarrow as planned.

@jorisvandenbossche
Member

I'm pretty lukewarm on a fallback that uses Python strings; that is functionally a huge step back from Arrow strings

Lukewarm is warm enough for me if that allows us to move forward ;)
(although to be honest, as a non-native speaker I might not get the exact subtlety of how to interpret it)

To note, compared to 2.0 there is no step back, though: for the many users who have pyarrow, pandas 3.0 will be a big step forward for string handling; for users without pyarrow, it is still a nice step forward in having a proper, strict, dedicated string dtype (I know you are comparing it to the case of requiring Arrow, but I think the comparison from where users are right now is more important).

Myself and @lithomas1 are currently working on finishing a pandas string DType using the new numpy 2.0 variable length string dtype, hopefully in time for pandas 3.0

That will be interesting to see! As I mentioned above, I am very interested in further exploring alternatives in the longer term, and we certainly should consider the numpy string dtype there as well. But for the purpose of this decision right now for 3.0, IMO we can't take it that much into account (I personally think it is not that likely it will be fully ready for 3.0, but even if it is ready on time, we cannot use it as the sole alternative given its numpy version requirement; so if we want to make pyarrow only a soft dependency for the string dtype, we still need the numpy object-dtype based alternative anyway short term).
(BTW it might be useful to open a separate issue to discuss if and how we want to integrate with the numpy string dtype, where we can go into more details of that approach and the current state of it)

@WillAyd
Member

WillAyd commented Apr 28, 2024

Ah sorry - I should have been more clear that I am -0.5 on yet another string type. I really think our string type system is starting to get confusing...

@lithomas1
Member

lithomas1 commented Apr 30, 2024

Ah sorry - I should have been more clear that I am -0.5 on yet another string type. I really think our string type system is starting to get confusing...

To be clear, this will probably be the eventual replacement for the object-dtype based numpy string array, since the new numpy string ufuncs should match the semantics of the Python string methods.

So, we'll still end up with 2 string dtypes (eventually).

As long as numpy is a hard dependency, we will probably want some sort of native numpy string dtype (since it wouldn't be ideal to copy numpy strings supplied by a user to object dtype or an Arrow dtype).

@simonjayhawkins
Member

Ah sorry - I should have been more clear that I am -0.5 on yet another string type. I really think our string type system is starting to get confusing...

I agree.

There seemed to be hesitation/concern about deviating from (revoking parts of) the agreed PDEP when it came to revoking the warning (#57424 (comment)), and yet IMHO there has been a total deviation from the agreed PDEP with the implementation of a new string dtype with NumPy semantics (#54792).

I think this change was worthy of a PDEP in itself. Surely the use of PyArrow and our extension dtypes was to improve on (deviate away from) NumPy NaN semantics towards a more consistent missing value indicator. I fail to see how this new string dtype (and the new fallback) is a long-term benefit, or how it aligns with one of the original goals of PDEP-10, which was claimed to reduce the maintenance burden.

@jorisvandenbossche
Member

yet IMHO there has been a total deviation from the agreed PDEP with the implementation of a new string dtype with NumPy semantics (#54792).

It is unfortunate that this wasn't properly discussed at the time of the actual PDEP (I don't remember any discussion about it (but I can certainly be misremembering, didn't check), and the PDEP text itself also just says to use "string[pyarrow]" / "the more efficient pyarrow string type" that has been available since 1.2, without mentioning anything about the consequences of this choice).

I understand that others might find this a stretch, but given our complete lack of even mentioning null semantics at the time, personally I am interpreting the PDEP as using "a string dtype backed by pyarrow". Otherwise this means a silent and implicit change of the default null semantics for our users, while that is a change that would definitely warrant its own PDEP (and which is coming). For lack of specifying this, preserving the default null semantics seems the better default position to me.

I think the discussion in #54792 already gives some context around why we think it is important to preserve the default dtypes and null semantics for now (and specifically in #54792 (comment), where I try to explain that this is meant to make the change less confusing for users)

@simonjayhawkins
Member

and the PDEP text itself also just says to use "string[pyarrow]" / "the more efficient pyarrow string type" that has been available since 1.2

That was also my understanding: that we would use the dtype that had been tried and tested, had been available to users for a while, and conformed to what I like to call the pandas string API (distinguishing the python-backed and pyarrow-backed extension arrays from the default object dtype), which included the pandas missing value indicator and returned pandas nullable dtypes where possible.

without mentioning anything about the consequences of this choice

Yes, I can see that this omission is maybe another reason why we may need to revise (re-vote on) PDEP-10, if people think they would have voted differently with this being more explicitly outlined.

For lack of specifying this, preserving the default null semantics seems the better default position to me.

Maintaining the status quo is probably a good position when there isn't consensus on how to proceed. However, if we are not requiring PyArrow for 3.0, then maybe we are not yet ready to change the default string dtype.

Requiring PyArrow would also have given better support for other dtypes. I think that PDEP-13 maybe expects PyArrow to be required?

@jorisvandenbossche
Member

maybe we are not yet ready to change the default string dtype

Can you explain why you think this is the case?

@simonjayhawkins
Member

Well, where on the roadmap do we transition to the pandas string API, or is this new string dtype considered a long-term solution? Also, why do we need a fallback (#58451)? I assume this is only needed if PyArrow is not a required dependency, and if that is the case, PDEP-10 is voided, including the change in string dtype, as we cannot deliver the other dtype improvements either.

@jorisvandenbossche
Member

The current StringDtype enabled by pd.options.future.infer_string = True follows the "pandas string API" if you understand this as all the string-specific functionality (mainly the .str accessor) we have and document: https://pandas.pydata.org/docs/dev/user_guide/text.html
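For reference, a small sketch of that opt-in flag as it behaves in released pandas 2.x with pyarrow installed:

import pandas as pd

pd.options.future.infer_string = True

s = pd.Series(["a", "b", None])
print(s.dtype)        # string[pyarrow_numpy] -- inferred, no longer object
print(s.str.upper())  # the documented .str accessor works as usual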

@simonjayhawkins
Member

So this new string dtype (with NumPy semantics) is now the long term solution for pyarrow backed arrays in pandas?

@lithomas1
Member

I would personally prefer that the numpy string semantics be (eventually) left to the numpy StringDType and that Arrow strings should have Arrow semantics.

@jorisvandenbossche
Member

jorisvandenbossche commented May 2, 2024

So this new string dtype (with NumPy semantics) is now the long term solution for pyarrow backed arrays in pandas?

No, it is the current, potentially temporary solution for pyarrow-backed arrays which are enabled by default (so right now this is only the pyarrow-backed string dtype). And the "potentially temporary" part is dependent on a separate discussion about how to move forward to nullable dtypes in general (not specific to strings).

I would personally prefer that the numpy string semantics be (eventually) left to the numpy StringDType and that Arrow strings should have Arrow semantics.

To be clear this is not about "string semantics". The string semantics for the different variants of our StringDtype are all the same. It is about the missing value sentinel, which is not string specific. And I will argue that long term we want to have the same missing value semantics regardless of whether a dtype is backed by numpy or pyarrow. But again that is for a separate discussion (beyond 3.0). It is unfortunate that we haven't resolved the missing values discussion before doing the string dtype, causing those additional dtypes and not-completely-orthogonal discussions that are hard to separate, but that is the reality we have to live with.
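A minimal sketch of the sentinel difference under discussion, using the dtype spellings available in released pandas 2.x (behaviour as of 2.2; the naming may change):

import pandas as pd

na_arr = pd.array(["a", None], dtype="string[pyarrow]")         # pd.NA sentinel
nan_arr = pd.array(["a", None], dtype="string[pyarrow_numpy]")  # np.nan sentinel

print(na_arr[1], nan_arr[1])   # <NA> nan
print((na_arr == "a").dtype)   # boolean  (nullable result, NA semantics)
print((nan_arr == "a").dtype)  # bool     (plain numpy result, NaN semantics)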

@lithomas1
Member

lithomas1 commented May 2, 2024

To be clear this is not about "string semantics". The string semantics for the different variants of our StringDtype are all the same. It is about the missing value sentinel, which is not string specific. And I will argue that long term we want to have the same missing value semantics regardless of whether a dtype is backed by numpy or pyarrow. But again that is for a separate discussion (beyond 3.0). It is unfortunate that we haven't resolved the missing values discussion before doing the string dtype, causing those additional dtypes and not-completely-orthogonal discussions that are hard to separate, but that is the reality we have to live with.

Ah ok.

So the plan is still to move towards a StringDtype backed by pd.NA if I'm understanding you correctly?

Is part of the hesitation to go directly to a StringDtype with pd.NA as the missing value because we still support doing string operations (e.g. using the .str accessor) on object columns?

If I'm following correctly, right now we have the following in terms of string arrays/dtypes:

Existing

StringArray (uses object dtype - essentially "python"-backed, uses pd.NA, returns nullable boolean/integer arrays in some operations)
pyarrow_numpy StringArray (pyarrow-backed, uses np.nan and returns numpy boolean/integer arrays in some operations)

Proposed

python_numpy StringArray (uses object dtype - essentially "python"-backed, uses np.nan and returns numpy boolean/integer arrays in some operations)

where pyarrow_numpy and python_numpy are temporary "compatibility" dtypes, just so that users can upgrade to 3.0, without requiring code changes specifically for strings.

If the plan long term is to move to pd.NA, we would then deprecate these pyarrow_numpy and python_numpy dtypes for 4.0?
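In dtype-string form, the variants above roughly correspond to the following (the python_numpy spelling is only proposed and does not exist in released pandas; the other two do):

import pandas as pd

pd.array(["a", None], dtype="string[python]")         # object-backed, pd.NA
pd.array(["a", None], dtype="string[pyarrow_numpy]")  # pyarrow-backed, np.nan
# Proposed: an object-backed variant with np.nan semantics,
# e.g. a hypothetical "string[python_numpy]".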

@simonjayhawkins
Member

So this new string dtype (with NumPy semantics) is now the long term solution for pyarrow backed arrays in pandas?

No, it is the current, potentially temporary solution for pyarrow-backed arrays which are enabled by default (so right now this is only the pyarrow-backed string dtype). And the "potentially temporary" part is dependent on a separate discussion about how to move forward to nullable dtypes in general (not specific to strings).

Thanks for clarifying.

This is why I think that maybe we are not yet ready. We make a breaking change that is considered temporary. I'm not really in favor of this, but I could have lived with it as a transition if we were making PyArrow a required dependency in 3.0. (IIRC it was already introduced without a PDEP and without cc'ing the core team on the discussions until after the work had been done and some parts were already merged.)

I can also see that without PyArrow as a required dependency, continuing with this new string data type gives some significant performance improvements that are equivalent to those that were initially used to partially justify having PyArrow as a required dependency.

I'm not sure if others would agree whether a retrospective PDEP for this dtype change would be beneficial, so that the change has more eyes on it. Without a PDEP, and without having PyArrow as a required dependency, maybe the new string type could remain behind the future flag for now?

@jorisvandenbossche
Member

jorisvandenbossche commented May 2, 2024

where pyarrow_numpy and python_numpy are temporary "compatibility" dtypes, just so that users can upgrade to 3.0, without requiring code changes specifically for strings.

Yes (users might still have to do some code changes just because it is no longer object dtype but a string dtype, but users shouldn't need to make code changes related to missing values because of the new string dtype)

If the plan long term is to move to pd.NA, we would then deprecate these pyarrow_numpy and python_numpy dtypes for 4.0?

If we get agreement on that long term plan for pd.NA, then yes (but the same applies to all our default dtypes right now that use NaN/NaT as missing value sentinel, also all those other dtypes will need a transition, so this is not specific to those new string dtypes)

This is why I think that maybe we are not yet ready. We make a breaking change that is considered temporary.

The "breaking" aspect (the fact that it will be a "string" dtype and no longer object dtype) is not a temporary change, though. The aspect about NaN might be temporary, but that is exactly done to make this part non-breaking at this moment (current columns with string values as object dtype also use NaN as missing value sentinel)

We are not ready to make a string dtype + NA the default, but IMO we are perfectly ready to just make a string dtype the default (without missing value change).

@simonjayhawkins
Member

We are not ready to make a string dtype + NA the default, but IMO we are perfectly ready to just make a string dtype the default (without missing value change).

This seems reasonable, on the proviso that the fallback (#58451) behaves exactly the same (and this PR is not yet ready; in your own words, "There are a bunch more of such changes needed ... to properly handle the case without pyarrow installed."); otherwise we don't address @jbrockmendel's original concerns about different behaviors.

We should also probably remove the pyarrow_numpy naming, which I know you were uncomfortable with. This very naming suggests some sort of Frankenstein dtype. As @WillAyd states in #54792 (comment), the additional dtype could create more user confusion. So users should probably only see "string", and we keep the pyarrow implementation as a hidden detail.

I guess this would make the current breaking change to a pandas-native string dtype, and a future breaking change to using pd.NA (either in conjunction with or separate from returning pandas nullable types from operations such as str.len), more palatable.

Either way, I really think a PDEP was needed, and it may not be too late to do this? There are significant design decisions that have to date only been discussed among a small group, and the more eyes, the more likely a better (or the best) solution will be delivered.

@simonjayhawkins
Member

Myself and @lithomas1 are currently working on finishing a pandas string DType using the new numpy 2.0 variable length string dtype, hopefully in time for pandas 3.0. This would have to be gated behind numpy runtime version check, but also a possible option for users who have numpy 2.0 installed.

The new string dtype (PyArrow-backed with NumPy semantics) that is being proposed for 3.0 (#57073 (comment)) was originally incorporated into pandas when it was assumed that PyArrow was going to become a required dependency. The logic for this was that it was felt the consequences of having it use the pandas missing value indicator and returning other pandas nullable types had not been discussed properly in the PDEP (#57073 (comment)).

The idea of a fallback, so that PyArrow was not needed to be a required dependency, had some pushback in the original PDEP discussion, due to both behavioral differences and performance concerns. "Offering it to users is doing them a disservice." was one comment and "Personally, I don't see a point for the string[python] implementation anymore. Non expert users end up with this dtype more often than not and wonder why it's slow." was another.

This issue was originally opened with the title "DISC: Don't require PyArrow in pandas 3.0, require NumPy 2.0+, use NumPy's StringDType". Having NumPy 2.0 as a required dependency was dismissed, but at the time the fallback option was not on the table, so this option was rejected.

Time has moved on and we are now again proposing a fallback option.

As @jorisvandenbossche mentions, we need an agreement on the long term plan for pd.NA for all data types so that the default pandas types are the pandas nullable types.

There are some comments here and elsewhere regarding the confusion of mixing the NumPy and PyArrow semantics. (The term PyArrow is to some extent interchangeable with pandas nullable types, and NumPy is to some extent interchangeable with the current pandas type system.) If we cannot get away from having a fallback option, then a native string type for pandas using the current type system (NumPy semantics) could maybe be best achieved using the new NumPy string dtype.

Finally, #57073 (comment) does not address the other dtype improvements that we planned to deliver. So personally, I disagree that the proposal honors the gist of the PDEP.

@simonjayhawkins
Member

Assume that the fallback option could be removed once PyArrow becomes a required dependency (if using the proposed new string dtype), or once the minimum required version of NumPy is 2.0 (if we used the new NumPy string dtype for a default pandas-native dtype).

In general, what do people expect these timeframes to be? If we introduce a temporary transitional dtype, does it risk remaining the default for the longer term?

Also, once we have agreed on moving forward with the pandas nullable types as the defaults, do we expect the current pandas type system to remain for backwards compatibility? Do we then deprecate the pyarrow_numpy string dtype, keep it, or replace it with the new NumPy-native string dtype?

@lithomas1
Member

lithomas1 commented May 3, 2024

I guess the next steps here would be to:

A) Formally reject/accept PDEP-10.

I will open a PR to change the status of PDEP-10 to rejected, and we should vote on that. If that vote fails, I guess we'll just have to go with accepting pyarrow as a dependency.

B) Specifically open a PDEP about untangling the string dtypes

What we should cover in this PDEP:

  1. Fix the naming scheme on e.g. pyarrow_numpy

  2. Propose a way to be able to infer string columns as a string dtype instead of object

    • This is where Joris's proposed temporary fallback, the python_numpy string dtype (to be renamed following 1), comes in
  3. Deprecate string methods/operations on object dtype

  4. We should also cover in this PDEP what to do in both cases: where nullable dtypes (pd.NA) become the default, and where they do not.

    • I think we should cover how to deprecate/remove the python_numpy string dtype, and potentially the pyarrow_numpy dtype here
    • The numpy based StringArray (with pd.NA as its missing value) is also a candidate for removal here
  5. Figure out whether or not to integrate numpy's StringDType

C) Push nullable dtypes to be default in some future version of pandas. This might be a topic fit for PDEP-13.

Did I summarize the conversation here accurately?

@jorisvandenbossche
Member

Either way, I really think a PDEP was needed, and it may not be too late to do this?

If my attempt to find consensus for a compromise about how to move forward on the short term for 3.0 (#57073 (comment)) doesn't work out, then yes a PDEP will help and I am happy to write one, specifically for what I proposed above.

Personally I would only put in the effort of writing it if 1) the scope is limited to just what to do for a default string dtype in pandas 3.0, and 2) we can actually move forward with the implementation in parallel, so that if the PDEP gets approved we are better prepared to do a 3.0 release (we have plenty of prior cases where we have had features behind an option flag before it was officially approved to make it the default, and we actually already have this in released pandas for the pyarrow-backed version of the dtype).

@jorisvandenbossche
Member

@lithomas1 sorry, our posts crossed. I am personally fine with your top-level items, but as I mentioned in my previous post, I would personally keep a PDEP on the string dtype more limited (deprecating string methods for object dtype can be considered later, what to do when NA becomes the default can be left for the PDEP about NA, and I don't think we need to make a final decision on the usage of np.StringDType for pandas 3.0 (as IMO that also doesn't influence what we should do for pandas 3.0)).

@jorisvandenbossche
Member

My attempt at writing a PDEP for this -> #58551
This tries to summarize the (long) history and context, and essentially formalizes my proposal above from a few days ago (#57073 (comment)) about how to move forward on this topic on the short-term for pandas 3.0.

@simonjayhawkins
Member

simonjayhawkins commented May 4, 2024

Personally I would only put in the effort of writing it if 1) the scope is limited to just what to do for a default string dtype in pandas 3.0, and 2) we can actually move forward with the implementation in parallel, so that if the PDEP gets approved we are better prepared to do a 3.0 release

Sure, that's a great idea. We do not need to agree on the bigger picture, just give it due consideration, and we can also start reviewing the fallback PR and getting parts merged without any blocking, as the final decision is gated behind the PDEP.

Thanks @jorisvandenbossche for putting the PDEP together. I am much more comfortable now, especially since PDEP-1 explicitly mentions: "Adding a new data type has impact on a variety of places that need to handle the data type. Such wide-ranging impact would require a PDEP."

I did feel that PDEP-10 had not made provision for deviating from the existing string types, and also that if we do not make PyArrow a required dependency we would need to revoke PDEP-10, and so would not have any formal agreement for implementing a native string type in 3.0.
