Create a Parquet file with both a string and a large string column (note both are stored the same way in Parquet, but pyarrow will restore the large_string type when reading back):
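The snippet that produced the file isn't reproduced here, but a minimal version could look like this (column names and file name taken from the output below):

import pyarrow as pa
import pyarrow.parquet as pq

# col1 uses the regular string type, col2 the large_string type;
# pyarrow stores the Arrow schema in the Parquet metadata and restores
# large_string on read
table = pa.table({
    "col1": pa.array(["a", "b", "c"], type=pa.string()),
    "col2": pa.array(["a", "b", "c"], type=pa.large_string()),
})
pq.write_table(table, "test_large_string.parquet")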
Reading this file with pd.read_parquet and the future string dtype enabled (#54792):
In [27]: pd.options.future.infer_string = True
In [28]: pd.read_parquet("test_large_string.parquet").dtypes
Out[28]:
col1 string[pyarrow_numpy]
col2 object
dtype: object
Only the first column gets the proper dtype; the column that uses pa.large_string() on the pyarrow side still falls back to object dtype.
As an example, polars users might easily end up with a large string type in Arrow, since polars uses large strings under the hood for its string dtype.
Longer term, there is a question of whether we want to enable zero-copy support for large strings as well (e.g. right now you could already use ArrowDtype(pa.large_string()) for that). But short term (for the infer_strings option and for 3.0), I think we will want to convert large strings to normal strings and correctly use StringDtype("pyarrow_numpy"). That's not fully zero-copy (the offsets have to be cast from int64 to int32), but the current default fallback to object dtype is much worse in that regard.
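To make the two paths concrete, a small in-memory sketch (this is not the conversion code inside read_parquet itself, just an illustration):

import pandas as pd
import pyarrow as pa

large = pa.array(["foo", "bar", None], type=pa.large_string())

# Already possible today: keep the large_string data behind ArrowDtype
ser_arrow = pd.Series(large, dtype=pd.ArrowDtype(pa.large_string()))

# Proposed short-term behaviour: cast to string() (int64 -> int32 offsets),
# which is what a StringDtype("pyarrow_numpy") column would then store
small = large.cast(pa.string())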
One potential issue is when the large string pyarrow array cannot be cast to string because its data exceeds the size that can be stored in the string type. A simple cast will raise an error in that case. A workaround is splitting the array into a chunked array with multiple chunks. That's not an issue for our StringDtype, because we store the data as a chunked array anyway. But I am not sure there is an automated way to get this with pyarrow APIs.
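There is no single pyarrow call I know of that does this automatically, but a manual sketch of the chunk-splitting workaround could look like the following (hypothetical helper; splitting on a fixed row count only to keep the sketch short):

import pyarrow as pa

def large_string_to_string_chunked(arr, rows_per_chunk=1_000_000):
    # Cast a large_string array to a chunked string array by casting
    # row-based slices one at a time. A robust version would pick the
    # split points from the int64 offsets so that each chunk's character
    # data stays below the int32 limit instead of using a fixed row count.
    chunks = [
        arr.slice(start, rows_per_chunk).cast(pa.string())
        for start in range(0, len(arr), rows_per_chunk)
    ]
    return pa.chunked_array(chunks, type=pa.string())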