Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sanitize timestamps in arrow tables #3076

Merged
merged 6 commits into from
Jun 7, 2023

Conversation

jonmmease
Copy link
Contributor

@jonmmease jonmmease commented Jun 2, 2023

Closes #3050 by converting timestamps columns in Arrow tables to ISO-8601 strings before JSON serialization.

I added a test, though it doesn't look like we had any existing test coverage for the DataFrame interchange protocol support.

cc @djouallah

@mattijn
Copy link
Contributor

mattijn commented Jun 2, 2023

Thanks for pushing this @jonmmease! Would be interesting to see if we slowly can phase-in the dataframe interchange protocol for all dataframe-a-likes (and simultaneously phase-out the dependency on pandas).

Can you move the _sanitize_arrow_table to core.py below this function https://github.com/altair-viz/altair/blob/master/altair/utils/core.py#L287?

Would it also be possible to include a test including a NaT equivalent (None I think).

By the way, on windows the test is not passing. It gives the following error:

ArrowInvalid: Cannot locate timezone 'UTC': Timezone database not found at D:\Users\Hoek\Downloads\tzdata

See also apache/arrow#31472 and apache/arrow#35637 (comment). That probably require some thought how we approach this.

@jonmmease
Copy link
Contributor Author

Can you move the _sanitize_arrow_table to core.py below this function https://github.com/altair-viz/altair/blob/master/altair/utils/core.py#L287?

Done in 58f11ba

Would it also be possible to include a test including a NaT equivalent (None I think).

Done in 9c6ea83

Would be interesting to see if we slowly can phase-in the dataframe interchange protocol for all dataframe-a-likes (and simultaneously phase-out the dependency on pandas).

Yeah, that's an interesting possibility. I'm a bit leery of adding a required dependency on pyarrow since it's not available everywhere pandas is (e.g. pyodide). But the pandas project itself is currently debating whether to take on pyarrow as a required dependency anyway, so that might be the future direction the ecosystem goes: pandas-dev/pandas#52509, pandas-dev/pandas#52711

@mattijn
Copy link
Contributor

mattijn commented Jun 2, 2023

I agree. I think it is unfortunate that we, for now, have to rely on pyarrow to get support for the current implementation of the dataframe interchange protocol.

If there is a WASM friendly option to support the dataframe interchange protocol, I would favor that one.

@jonmmease
Copy link
Contributor Author

By the way, on windows the test is not passing

Given that this is a known upstream pyarrow issue, are you comfortable with this being merged?

@mattijn
Copy link
Contributor

mattijn commented Jun 4, 2023

Yes not a blocker from my side. Also noticed that wasm support for pyarrow is something that is actively worked on, see: pyodide/pyodide#2933, that is positive.

Haven't had the chance to test this PR, which I would like to do, please give me a few more days.

@jonmmease
Copy link
Contributor Author

Haven't had the change to test this PR, which I would like to do, please give me a few more days.

Sure thing. Thanks for taking a look. I was thinking that if the test doesn't work on windows, maybe we should skip it when the current platform is windows.

@mattijn
Copy link
Contributor

mattijn commented Jun 7, 2023

Getting this tz_data folder on the right place was a bit strange for my machine, see apache/arrow#35600 (comment).
Skipping this test on windows seems fine to me, see apache/arrow#35735 how they've done this recently as well within pyarrow.

Can you also have a look to this comment, which states:

Note: in case of duration, the error should be raised.

Is this something we have to consider for here? Docs for duration: https://arrow.apache.org/docs/python/generated/pyarrow.duration.html.

Also related, Timedelta:

import altair as alt
import pandas as pd
import pyarrow as pa
from altair.utils.data import to_values

td = pd.timedelta_range(0,periods=3,freq='h')
df = pd.DataFrame(td).reset_index()
df.columns = ['id', 'timedelta']

pa_table = pa.table(df)
values = to_values(pa_table)
values
{'values': [{'id': 0, 'timedelta': Timedelta('0 days 00:00:00')},
  {'id': 1, 'timedelta': Timedelta('0 days 01:00:00')},
  {'id': 2, 'timedelta': Timedelta('0 days 02:00:00')}]}
alt.Chart(pa_table).mark_bar().encode(
    x='timedelta:O',
    y='id:O'
)
TypeError: Object of type Timedelta is not JSON serializable

See also #974

@jonmmease
Copy link
Contributor Author

Thank for taking a look!

For timedelta / duration, there isn't an equivalent Vega type, so I think the best we can do is raise a more informative error. I'll update the sanitize arrow function to do this.

Skipping this test on windows seems fine to me

Will do, thanks for the references

@jonmmease
Copy link
Contributor Author

Changes made. @mattijn could you try out the windows test skipping logic on your machine?

@mattijn
Copy link
Contributor

mattijn commented Jun 7, 2023

Quick @jonmmease! Yes, it is working as intended. Nice! Once tests on GA pass as well we can merge this in👍.

@jonmmease
Copy link
Contributor Author

jonmmease commented Jun 7, 2023

Thanks for the review @mattijn, test are passing. Merging!

@jonmmease jonmmease merged commit 8bee779 into master Jun 7, 2023
sacundim pushed a commit to sacundim/covid-19-puerto-rico that referenced this pull request Jul 26, 2023
…ndas and into Polars. As of now this requires an unreleased change to Vega-Altair (which we pull from Git): vega/altair#3076

The `EncounterLag` chart is still on Pandas because it uses its `wide_to_long` function that doesn't exist in Polars. Also `MolecularCurrentDeltas` is looking funny, fix later
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

pyarrow datetime is not supported using altair 5.0
2 participants