BUG: inferring the future string dtype doesn't work for pyarrow's "large_string" type #54798

Open
Tracked by #54792
jorisvandenbossche opened this issue Aug 28, 2023 · 1 comment
Labels: Arrow (pyarrow functionality), Bug, Strings (String extension data type and string data)

Comments


jorisvandenbossche commented Aug 28, 2023

Create a Parquet file with both string and large string type (note both are the same in Parquet, but pyarrow will restore the large_string type when reading back):

import pyarrow as pa
import pyarrow.parquet as pq

# col1 uses the regular string type, col2 the large_string type
table = pa.table(
    {
        "col1": pa.array(["a", "b", None], pa.string()),
        "col2": pa.array([None, "b", "c"], pa.large_string()),
    }
)
pq.write_table(table, "test_large_string.parquet")

Reading this file with pd.read_parquet and the future string dtype enabled (#54792):

In [27]: pd.options.future.infer_string = True

In [28]: pd.read_parquet("test_large_string.parquet").dtypes
Out[28]: 
col1    string[pyarrow_numpy]
col2                   object
dtype: object

Only the first column gets the proper dtype; the column that uses pa.large_string() on the pyarrow side still falls back to object dtype.

For example, polars users can easily end up with a large string type in Arrow, since polars uses large_string under the hood for its string dtype.

Longer term, there is a question of whether we want to enable zero-copy support for large strings as well (e.g. right now you can already use ArrowDtype(pa.large_string()) for that). But short term (for the infer_strings option and for 3.0), I think we will want to convert large strings to normal strings and correctly use StringDtype("pyarrow_numpy"). That's not fully zero-copy (the offsets have to be cast from int64 to int32), but the current default fallback to object dtype is much worse in that respect.

One potential issue is when the large string pyarrow array cannot be cast to string because it exceeds the size that can be stored in the string type. A simple cast will raise an error in that case. A workaround is splitting the array into a chunked array with multiple chunks; that's not an issue for our StringDtype, because we store the data as a chunked array anyway. But I am not sure there is an automated way to do this with pyarrow APIs.

jorisvandenbossche added the Bug, Strings (String extension data type and string data), and Arrow (pyarrow functionality) labels on Aug 28, 2023

phofl commented Sep 2, 2023

We are now correctly inferring the dtype, but we aren't yet handling cases where the string dtype can't hold the array.
