You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
The future default string dtype (#54792) can be enabled with pd.options.future.infer_string = True, and then pandas will use the StringDtype(storage="pyarrow_numpy") dtype in constructors and IO methods.
However, we also have an option to set the default storage for this StringDtype (pd.options.mode.string_storage), which isn't changed by setting the future option, and thus still uses its default value of "python". As a result, when someone specifies the generic "string" dtype (without explicit parametrization), we still default to this python-based string dtype.
Some examples:
>>>pd.options.future.infer_string=True# this is still its default of "python">>>pd.options.mode.string_storage'python'# the default inference (not specifying a dtype) gives the new pyarrow based dtype>>>ser=pd.Series(["a", "b", None])
>>>ser0a1b2NaNdtype: string>>>ser.dtypestring[pyarrow_numpy]
# but when specifying generically to want a "string" dtype, we still use the python based dtype>>>ser=pd.Series(["a", "b", None], dtype="string")
>>>ser0a1b2<NA>dtype: string>>>ser.dtypestring[python]
The same applies to use the pd.StringDtype() generic dtype constructor instead of the "string" string, and in other places where you can specify the data type (eg .astype("string")).
When opting in to the future default string dtype, IMO the ideal (and expected) behaviour is that for things like dtype="string", the user also gets the pyarrow-based string dtype, without having to manually set two options (i.e. also set pd.options.mode.string_storage = "pyarrow_numpy", in addition to infer_strings).
One "easy" way to change this would be to let pd.options.future.infer_strings = True have a side effect of also changing the option value for string_storage. However, that might give unexpected results when for example using this option in a context manager (because I don't think we can reliably also restore the string_storage option to its original value, when setting infer_strings back to False).
The text was updated successfully, but these errors were encountered:
The safest way of doing this is when inferring the string storage in the StringDtype constructor.
Ah, yes, just checking both options there, that sounds good
What do we want to do if the users sets pd.options.mode.string_storage explicitly? Should this take precedence?
I think the most useful is to let infer_strings take precedence. I don't see a good use case for wanting another default string_storage when using infer_strings. It might only be a bit confusing for users that do that and don't see the expected effect (we could let it warn, but that means we would need to change the actual default for string_storage to None, to know if it was set by the user or not, which might be a bit complicated just for this)
From #54533 (comment)
The future default string dtype (#54792) can be enabled with
pd.options.future.infer_string = True
, and then pandas will use theStringDtype(storage="pyarrow_numpy")
dtype in constructors and IO methods.However, we also have an option to set the default storage for this StringDtype (
pd.options.mode.string_storage
), which isn't changed by setting the future option, and thus still uses its default value of "python". As a result, when someone specifies the generic "string" dtype (without explicit parametrization), we still default to this python-based string dtype.Some examples:
The same applies to use the
pd.StringDtype()
generic dtype constructor instead of the"string"
string, and in other places where you can specify the data type (eg.astype("string")
).When opting in to the future default string dtype, IMO the ideal (and expected) behaviour is that for things like
dtype="string"
, the user also gets the pyarrow-based string dtype, without having to manually set two options (i.e. also setpd.options.mode.string_storage = "pyarrow_numpy"
, in addition toinfer_strings
).One "easy" way to change this would be to let
pd.options.future.infer_strings = True
have a side effect of also changing the option value forstring_storage
. However, that might give unexpected results when for example using this option in a context manager (because I don't think we can reliably also restore thestring_storage
option to its original value, when settinginfer_strings
back to False).The text was updated successfully, but these errors were encountered: