BUG: new string dtype fails with >2 GB of data in a single column #56259
Comments
Another occurrence of the same underlying issue is when a string column gets created from some operation, for example starting from a column that fits in a single pyarrow StringArray, but becomes too big through the operation. Starting from:

```python
import string

import numpy as np
import pandas as pd

data = ["".join(np.random.choice(list(string.ascii_letters), 10)) for _ in range(1_000_000)]

pd.options.future.infer_string = True
ser = pd.Series(data * 100)  # <-- a bit more than 1 GB, fits in a single chunk
```

an operation that grows the data then fails, because pyarrow currently doesn't have any smart logic for splitting the result when needed (it's an element-wise kernel, and so it preserves the input chunking).
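To make that failure mode concrete, here is a hedged sketch (not the snippet from the original comment, whose failing operation and traceback are not shown above) of the kind of element-wise string operation that can push a single-chunk result past the 2 GB offset limit; `str.repeat` is just one arbitrary choice:

```python
# Hypothetical continuation of the setup above: an element-wise string kernel
# whose output (~3 GB of characters, still in a single chunk because the kernel
# preserves the input chunking) exceeds the int32 offset limit of pyarrow's
# `string` type and raises pyarrow.lib.ArrowCapacityError.
result = ser.str.repeat(3)
```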
Hiya, I notice that #56220 addressed the issue @jorisvandenbossche described, but I've encountered another occurrence of it. I have a ~6.9 GB .parquet file which, while rather unorthodox, includes an index column produced by an earlier processing step. Reading this file, I encounter the same ArrowCapacityError again. Info on my system and env:

Edit: added stacktrace below
Can you provide a reproducer? We won't be able to help otherwise. It's possible that your DataFrame has Arrow-backed dtypes and that this uses `pd.ArrowDtype(pa.string())` instead of large string, but this is outside of our control if your parquet file is saved like this.
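As a side note (a minimal sketch, not advice given in the thread; the file name and column name are placeholders), this is one way to check whether a parquet file comes back with `pd.ArrowDtype(pa.string())` columns and to cast such a column to the 64-bit-offset `large_string` variant:

```python
import pandas as pd
import pyarrow as pa

# Hypothetical file and column names, purely for illustration.
df = pd.read_parquet("data.parquet", dtype_backend="pyarrow")
print(df.dtypes)  # Arrow-backed string columns show up as string[pyarrow]

# Cast to the large_string variant, which uses int64 offsets and avoids the
# ~2 GB-per-array limit that trips up take-based operations.
df["some_column"] = df["some_column"].astype(pd.ArrowDtype(pa.large_string()))
```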
@sqr00t also, with which version of pandas and pyarrow are you reading this file? And did you enable certain pandas options? (Because if you are not reading this with the latest pandas with the infer_string option set, it's probably not related to this issue, but an issue with pyarrow itself, given that the error comes from reading the parquet file itself with pyarrow before pandas is involved.)
@phofl @jorisvandenbossche Thanks for both your replies and apologies for any confusion. I'm on pandas 2.1.4 (latest stable), with the infer_string option enabled. Appreciate the help!
We don't patch bugs in 2.0.x anymore, so the only thing you can do is upgrade.
(Patrick already opened PR #56220 with a change that addresses this, but posting it here anyway to have an example reproducer and some context for one of the reasons to switch to large string)
With the new String dtype enabled, and creating a Series with string data that holds more than 2 GB of data, the `take` function fails. This is because of a current limitation of the `take` compute function in Arrow for ChunkedArrays that cannot be combined into a single Array (apache/arrow#25822, apache/arrow#33049). And this issue specifically comes up with the Arrow `string` type, which uses int32 offsets and thus has a max of 2147483647 (2**31 - 1) characters for all values in a single array, which is around 2 GB of data.

Arrow has a `large_string` type which uses int64 offsets instead, which essentially removes the upper limit (at least for pandas' use cases; it's in the petabytes).

And given that this happens in the `take` operation, this has a large consequence on various pandas operations (alignment, reindexing, joining, ...) that will fail with the current string dtype if you have a larger dataset.

xref #54792
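For illustration (a small sketch not taken from the issue itself; the values are tiny and only meant to show the offset-width difference between the two Arrow types):

```python
import pyarrow as pa

# `string` stores value offsets as int32, capping a single array at
# 2**31 - 1 total characters (~2 GB); `large_string` stores them as int64.
arr = pa.array(["spam", "ham", "eggs"], type=pa.string())
large = arr.cast(pa.large_string())

print(arr.type, arr.nbytes)      # string       -> 4-byte offsets
print(large.type, large.nbytes)  # large_string -> 8-byte offsets

# Once the total character count of a `string` array would exceed that cap
# (e.g. when `take` has to concatenate the chunks of a large ChunkedArray),
# pyarrow raises ArrowCapacityError; the same data as `large_string` is fine.
```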