String dtype: enable in SQL IO + resolve all xfails #60255
Conversation
```diff
-    arr = maybe_infer_to_datetimelike(arr)
-    if dtype_backend != "numpy" and arr.dtype == np.dtype("O"):
+    convert_to_nullable_dtype = dtype_backend != "numpy"
+    arr = maybe_infer_to_datetimelike(arr, convert_to_nullable_dtype)
```
I was surprised by the fact that `maybe_infer_to_datetimelike` would change an object dtype to a string dtype; this might be a cause of unexpected behavior elsewhere this function is used.
Yeah that's a very confusing name for what it does (by accident?) in practice nowadays ..
(not necessarily for now, but long term I think we should directly use `maybe_convert_objects` here instead of `maybe_infer_to_datetimelike`, which is just a small wrapper around it)
Sounds good - I opened #60258 to keep track of this
```diff
@@ -1422,6 +1425,10 @@ def _get_dtype(self, sqltype):
             return date
         elif isinstance(sqltype, Boolean):
             return bool
+        elif isinstance(sqltype, String):
+            if using_string_dtype():
+                return StringDtype(na_value=np.nan)
```
I hard-coded the nan value here for backwards compatibility, but this might become a code smell in the future. Have we been trying to avoid this pattern @jorisvandenbossche, or is it expected for now?
Yes, this is the way I also did it elsewhere in code paths that need to construct the default string dtype (hardcode the nan value; the storage gets determined automatically based on whether pyarrow is installed / based on the pd.options)
But, is this change required? It doesn't seem that this is actually used in the one place where `_get_dtype` gets used?
It is required - I somehow did not commit the part that uses it in the caller. Stay tuned...
Force-pushed from 26e42d2 to a884333
Thanks for looking into this!
```diff
@@ -2205,7 +2210,7 @@ def read_table(
     elif using_string_dtype():
         from pandas.io._util import arrow_string_types_mapper

-        arrow_string_types_mapper()
+        mapping = arrow_string_types_mapper()
```
whoops :)
Force-pushed from 990bdc7 to 1abf86c
I think the issue with the non-pyarrow future string build has to do with the tests running in parallel. @mroeschke maybe I am misunderstanding, but isn't each module supposed to run all of its tests on one worker? I wonder why this build in particular is the only one that seems to fail.
I changed the execution method to worksteal a few months ago (tests ran much faster that way). I suspect that possibly not all the fixtures correctly create and clean up the tables they create, and since the ADBC and sqlalchemy connections reference the same table names on the same dbs, the tests can race. (In an ideal world, each test should work with a unique table name which is cleaned up appropriately upon failure/success)
Ah OK interesting...I wonder how that hasn't affected the main CI jobs to date...
They do, as long as they only operate within a single thread, but with multiple threads there is nothing that prevents the tests from getting a race condition, since a lot of the test table names are hard-coded (although we probably could change them to a uuid). I'm guessing the fix for now would be to add a pytest marker for single_cpu to the test_sql.py module (?)
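A unique-name helper along those lines could look like the sketch below (the function name and prefix are hypothetical, not from the pandas test suite): a short random suffix keeps parallel workers from ever colliding on a hard-coded table name.

```python
import uuid


def unique_table_name(prefix: str = "test_frame") -> str:
    """Generate a collision-free table name for one test invocation.

    Appending a random hex suffix means parallel xdist workers can never
    race on the same hard-coded table name.
    """
    return f"{prefix}_{uuid.uuid4().hex[:8]}"
```

A pytest fixture could yield this name and drop the table in its teardown, so cleanup happens on failure as well as success.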
Force-pushed from 1f38f1d to 41af90a
Looks good!
```python
elif (
    using_string_dtype()
    and is_string_dtype(col_type)
    and is_object_dtype(self.frame[col_name])
):
```
Is the check for object dtype of the column needed? (or the string dtype might already have been inferred when constructing the DataFrame?)
And for example a column of all missing values, that might get inferred as float by the DataFrame constructor? In that case maybe rather `not is_string_dtype` instead of `is_object_dtype`? (Or just pass a `copy=False` to astype)
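To illustrate the all-missing case with public pandas API (a sketch, separate from the code under review): a column of only NaNs is inferred as float64, so an object-dtype guard would skip it, while `astype(..., copy=False)` makes an unconditional cast cheap when the dtype already matches.

```python
import numpy as np
import pandas as pd

# An all-missing column comes out of the constructor as float64, not object
s = pd.Series([np.nan, np.nan])
print(s.dtype)  # float64

# Casting to the dtype the column already has, with copy=False,
# avoids allocating a new array
s2 = s.astype("float64", copy=False)
```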
It is needed, although admittedly the patterns used here are not very clear. Without the `is_object_dtype` check, it is possible that a series that is constructed as a StringDtype using pd.NA as the na_value sentinel gets changed to using the np.nan sentinel, which fails a good deal of tests.

As an alternative, we might be able to infer the `na_value` to use from the executing environment rather than hard-coding np.nan on L1436?
Maybe it would be clearer then to use `if dtype_backend == "numpy"` in the if clause (instead of `is_object_dtype`)? I.e. only do this cast to StringDtype in default backend mode. That's also what the cases just above and below actually use .. (so maybe those could be combined within a single `if dtype_backend == "numpy"` block)

(this also makes me wonder if we should call `_harmonize_columns` at all in case of a pyarrow dtype_backend, because at that point the `self.frame` already has all ArrowDtype columns. But I suppose this method handles some additional things like the `parse_dates` keyword?)
I think we need to revert this to using `is_object_dtype` instead of just checking the `dtype_backend`. The problem is showing up in the latest CI failure because of the NUMERIC type in postgres, which is akin to the decimal data type in Python. The current SQL code treats that as an `object` dtype, so this branch ends up getting activated and converts that data into an `object` type array, rather than keeping it as float.
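The NUMERIC/decimal behavior is easy to reproduce with plain pandas (a sketch, independent of the SQL reader): Python `Decimal` values land in an object-dtype column rather than being coerced to float.

```python
from decimal import Decimal

import pandas as pd

# A driver returns postgres NUMERIC values as decimal.Decimal objects;
# pandas keeps them in an object-dtype column instead of coercing to float
s = pd.Series([Decimal("1.5"), Decimal("2.5")])
print(s.dtype)  # object
```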
How does this branch get activated in that case of decimals if there is an `is_string_dtype(col_type)` check in there?
Because `col_type` is `object`:

```python
>>> from pandas.core.dtypes.common import is_string_dtype
>>> is_string_dtype(object)
True
```

is that not expected?
Ah, yes, I constantly forget that `is_string_dtype` is a bit of a misnomer .. Ideally we have a check that just passes for actual string dtypes, but let's leave that for another time.
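A stricter check that passes only for actual pandas string dtypes could be sketched like this (the helper name is hypothetical; as shown above, `is_string_dtype` itself also returns True for plain `object`):

```python
import pandas as pd
from pandas.api.types import pandas_dtype


def is_real_string_dtype(dtype) -> bool:
    # Only true pandas string dtypes qualify; plain object falls through
    # because pandas_dtype(object) is a numpy dtype, not a StringDtype
    try:
        return isinstance(pandas_dtype(dtype), pd.StringDtype)
    except TypeError:
        return False
```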
Hmm, one more failing test now with
Force-pushed from 1bead7f to 9eb9232
Thanks @WillAyd!
Owee, I'm MrMeeseeks, Look at me. There seems to be a conflict, please backport manually. Here are approximate instructions:
And apply the correct labels and milestones. Congratulations — you did some good work! Hopefully your backport PR will be tested by the continuous integration and merged soon! Remember to remove the label once the backport is merged. If these instructions are inaccurate, feel free to suggest an improvement.
(cherry picked from commit ba4d1cf)
Manual backport -> #60315 |
…60255) (#60315) (cherry picked from commit ba4d1cf) Co-authored-by: Will Ayd <[email protected]>
xref #54792