BUG: inferring the future string dtype doesn't work for pyarrow's "large_string" type #54798

Open
Tracked by #54792
jorisvandenbossche opened this issue Aug 28, 2023 · 1 comment
Labels: Arrow (pyarrow functionality), Bug, Strings (String extension data type and string data)

Comments


jorisvandenbossche commented Aug 28, 2023

Create a Parquet file with both string and large string type (note both are the same in Parquet, but pyarrow will restore the large_string type when reading back):

import pyarrow as pa
import pyarrow.parquet as pq

# col1 uses the regular string type, col2 the large_string type
table = pa.table(
    {
        "col1": pa.array(["a", "b", None], pa.string()),
        "col2": pa.array([None, "b", "c"], pa.large_string()),
    }
)
pq.write_table(table, "test_large_string.parquet")

Reading this file with pd.read_parquet and the future string dtype enabled (#54792):

In [27]: pd.options.future.infer_string = True

In [28]: pd.read_parquet("test_large_string.parquet").dtypes
Out[28]: 
col1    string[pyarrow_numpy]
col2                   object
dtype: object

Only the first column gets the proper dtype; the column that uses pa.large_string() on the pyarrow side still falls back to object dtype.

For example, polars users can easily end up with a large string type in Arrow, since polars uses large_string under the hood for its string dtype.

Longer term, there is a question of whether we want to enable zero-copy support for large strings as well (e.g. right now you can already use ArrowDtype(pa.large_string()) for that). But short term (for the infer_strings option and for 3.0), I think we will want to convert large strings to normal strings and correctly use StringDtype("pyarrow_numpy"). That's not fully zero-copy (the offsets have to be cast from int64 to int32), but the current default fallback to object dtype is much worse in that respect.

One potential issue is when the large string pyarrow array cannot be cast to string because it exceeds the size that can be stored in the string type. A simple cast will raise an error in that case. A workaround is splitting the array into a chunked array with multiple chunks; that's not an issue for our StringDtype, because we store the data as a chunked array anyway. But I am not sure there is an automated way to do this with pyarrow APIs.

jorisvandenbossche added the Bug, Strings (String extension data type and string data), and Arrow (pyarrow functionality) labels on Aug 28, 2023

phofl commented Sep 2, 2023

We are now correctly inferring the dtype, but we aren't yet handling cases where the string dtype can't hold the array.
