API: interaction between pd.options.future.infer_string and default string storage (pd.options.mode.string_storage) #54793

jorisvandenbossche · 2023-08-28T10:15:19Z

The future default string dtype (#54792) can be enabled with pd.options.future.infer_string = True, and then pandas will use the StringDtype(storage="pyarrow_numpy") dtype in constructors and IO methods.

However, we also have an option to set the default storage for this StringDtype (pd.options.mode.string_storage), which isn't changed by setting the future option, and thus still uses its default value of "python". As a result, when someone specifies the generic "string" dtype (without explicit parametrization), we still default to this python-based string dtype.

Some examples:

>>> pd.options.future.infer_string = True
# this is still its default of "python"
>>> pd.options.mode.string_storage
'python'

# the default inference (not specifying a dtype) gives the new pyarrow based dtype
>>> ser = pd.Series(["a", "b", None])
>>> ser
0      a
1      b
2    NaN
dtype: string
>>> ser.dtype
string[pyarrow_numpy]

# but when generically requesting a "string" dtype, we still get the python-based dtype
>>> ser = pd.Series(["a", "b", None], dtype="string")
>>> ser
0       a
1       b
2    <NA>
dtype: string
>>> ser.dtype
string[python]

The same applies when using the generic pd.StringDtype() constructor instead of the "string" alias, and in other places where you can specify the data type (e.g. .astype("string")).

When opting in to the future default string dtype, IMO the ideal (and expected) behaviour is that for things like dtype="string", the user also gets the pyarrow-based string dtype, without having to manually set two options (i.e. also set pd.options.mode.string_storage = "pyarrow_numpy", in addition to infer_string).

One "easy" way to change this would be to let pd.options.future.infer_string = True have the side effect of also changing the option value for string_storage. However, that might give unexpected results when, for example, using this option in a context manager (because I don't think we can reliably restore the string_storage option to its original value when setting infer_string back to False).
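To make the context-manager concern concrete, here is a minimal toy sketch. It does not use pandas: the `options` dict, `set_option` and `option_context` below are hypothetical stand-ins for the option machinery, and the side effect is the naive version of the idea described above, not how pandas is implemented.

```python
from contextlib import contextmanager

# Toy stand-in for an option registry (an assumption, not pandas internals).
options = {"future.infer_string": False, "mode.string_storage": "python"}


def set_option(key, value):
    options[key] = value
    # Hypothetical side effect: enabling infer_string also flips string_storage.
    if key == "future.infer_string" and value:
        options["mode.string_storage"] = "pyarrow_numpy"


@contextmanager
def option_context(key, value):
    old = options[key]
    set_option(key, value)
    try:
        yield
    finally:
        # Only the option that was explicitly set gets restored.
        set_option(key, old)


with option_context("future.infer_string", True):
    inside = options["mode.string_storage"]

after = options["mode.string_storage"]
print(inside, after)
```

In this toy model, infer_string is restored on exit, but string_storage keeps the value set by the side effect, which is the kind of leak the context manager is supposed to prevent.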


phofl · 2023-08-28T10:34:32Z

I think I misunderstood you when we discussed this initially. This makes sense.

The safest way of doing this is when inferring the string storage in the StringDtype constructor.

One open question:

What do we want to do if the user sets pd.options.mode.string_storage explicitly? Should this take precedence?

jorisvandenbossche · 2023-08-28T11:29:26Z

> The safest way of doing this is when inferring the string storage in the StringDtype constructor.

Ah, yes, just checking both options there, that sounds good

> What do we want to do if the user sets pd.options.mode.string_storage explicitly? Should this take precedence?

I think the most useful is to let infer_string take precedence. I don't see a good use case for wanting another default string_storage when using infer_string. It might only be a bit confusing for users who do that and don't see the expected effect (we could let it warn, but that means we would need to change the actual default for string_storage to None, to know whether it was set by the user or not, which might be a bit complicated just for this).
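As a rough sketch of the precedence discussed here, the storage could be resolved at dtype construction time by checking both options. This is a plain-Python illustration under assumptions: the `options` dict and the `resolve_storage` helper are hypothetical names, and the actual change in the linked PR may look different.

```python
# Toy stand-in for the two options (an assumption, not pandas internals).
options = {"future.infer_string": False, "mode.string_storage": "python"}


def resolve_storage(storage=None):
    """Pick the storage for a generic "string" dtype request."""
    # An explicitly passed storage is always honoured.
    if storage is not None:
        return storage
    # infer_string takes precedence over the string_storage option.
    if options["future.infer_string"]:
        return "pyarrow_numpy"
    # Otherwise fall back to the configured default storage.
    return options["mode.string_storage"]


default_storage = resolve_storage()

options["future.infer_string"] = True
options["mode.string_storage"] = "pyarrow"  # user-set value, ignored here
inferred_storage = resolve_storage()
explicit_storage = resolve_storage("python")
print(default_storage, inferred_storage, explicit_storage)
```

Because nothing is written back to string_storage, turning infer_string off again (including on exit from a context manager) leaves the other option untouched.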

jorisvandenbossche added the Strings label Aug 28, 2023

This was referenced Aug 28, 2023

TRACKER: new default String dtype (pyarrow-backed, numpy NaN semantics) #54792

Open

Implement Arrow String Array that is compatible with NumPy semantics #54533

Merged

phofl mentioned this issue Aug 28, 2023

Infer string storage based on infer_string option #54794

Merged

4 tasks

phofl closed this as completed in #54794 Aug 29, 2023
