Error calling chart.save when using __dataframe__ protocol #3109

Closed
jcrist opened this issue Jul 17, 2023 · 6 comments

jcrist commented Jul 17, 2023

I'm trying out altair's support for the new __dataframe__ protocol, and am running into a few issues.

Given the following pyarrow & altair code (taken and modified from the getting started guide):

import altair as alt
import pyarrow as pa

data = pa.Table.from_pydict(
    {'x': ['A', 'B', 'C', 'D', 'E'],
     'y': [5, 3, 6, 7, 2]}
)

chart = alt.Chart(data).mark_bar().encode(
    x='x',
    y='y',
)
chart.save("out.html")

This outputs:

Traceback (most recent call last):
  File "/home/jcristharif/Code/altair/test.py", line 13, in <module>
    chart.save("out.html")
  File "/home/jcristharif/Code/altair/altair/vegalite/v5/api.py", line 1066, in save
    result = save(**kwds)
             ^^^^^^^^^^^^
  File "/home/jcristharif/Code/altair/altair/utils/save.py", line 189, in save
    perform_save()
  File "/home/jcristharif/Code/altair/altair/utils/save.py", line 127, in perform_save
    spec = chart.to_dict(context={"pre_transform": False})
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jcristharif/Code/altair/altair/vegalite/v5/api.py", line 2677, in to_dict
    return super().to_dict(
           ^^^^^^^^^^^^^^^^
  File "/home/jcristharif/Code/altair/altair/vegalite/v5/api.py", line 903, in to_dict
    vegalite_spec = super(TopLevelMixin, copy).to_dict(  # type: ignore[misc]
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jcristharif/Code/altair/altair/utils/schemapi.py", line 807, in to_dict
    result = _todict(
             ^^^^^^^^
  File "/home/jcristharif/Code/altair/altair/utils/schemapi.py", line 340, in _todict
    return {k: _todict(v, context) for k, v in obj.items() if v is not Undefined}
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jcristharif/Code/altair/altair/utils/schemapi.py", line 340, in <dictcomp>
    return {k: _todict(v, context) for k, v in obj.items() if v is not Undefined}
               ^^^^^^^^^^^^^^^^^^^
  File "/home/jcristharif/Code/altair/altair/utils/schemapi.py", line 336, in _todict
    return obj.to_dict(validate=False, context=context)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jcristharif/Code/altair/altair/utils/schemapi.py", line 807, in to_dict
    result = _todict(
             ^^^^^^^^
  File "/home/jcristharif/Code/altair/altair/utils/schemapi.py", line 340, in _todict
    return {k: _todict(v, context) for k, v in obj.items() if v is not Undefined}
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jcristharif/Code/altair/altair/utils/schemapi.py", line 340, in <dictcomp>
    return {k: _todict(v, context) for k, v in obj.items() if v is not Undefined}
               ^^^^^^^^^^^^^^^^^^^
  File "/home/jcristharif/Code/altair/altair/utils/schemapi.py", line 336, in _todict
    return obj.to_dict(validate=False, context=context)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jcristharif/Code/altair/altair/vegalite/v5/schema/channels.py", line 51, in to_dict
    raise ValueError("{} encoding field is specified without a type; "
ValueError: x encoding field is specified without a type; the type cannot be automatically inferred because the data is not specified as a pandas.DataFrame.

Versions:

  • python 3.11
  • altair dev branch
  • pyarrow 12.0.1
jonmmease (Contributor) commented:

Thanks for giving the DataFrame interchange protocol a try. It looks like what's happening here is that when the data comes in through the interchange protocol, you need to be explicit about the column encoding types.

This fixes it:

import altair as alt
import pyarrow as pa

data = pa.Table.from_pydict(
    {'x': ['A', 'B', 'C', 'D', 'E'],
     'y': [5, 3, 6, 7, 2]}
)

chart = alt.Chart(data).mark_bar().encode(
    x='x:O',  # <-- Change Here
    y='y:Q',  # <-- Change Here
)
chart.save("out.html")

The :O and :Q suffixes are documented in https://altair-viz.github.io/user_guide/encodings/index.html#encoding-data-types.

Altair has special logic for inferring these from pandas dtypes, which is why they aren't always required. But it looks like that logic isn't applied when the input data comes from a __dataframe__ object.
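
For comparison, here is a minimal sketch of the pandas equivalent, where that inference kicks in and no explicit encoding types are needed (it mirrors the pyarrow example above):

import altair as alt
import pandas as pd

# Same data as the pyarrow example, but as a pandas DataFrame
data = pd.DataFrame(
    {'x': ['A', 'B', 'C', 'D', 'E'],
     'y': [5, 3, 6, 7, 2]}
)

chart = alt.Chart(data).mark_bar().encode(
    x='x',  # inferred as nominal from the object dtype
    y='y',  # inferred as quantitative from the integer dtype
)
chart.save("out.html")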

@binste @mattijn Do you have thoughts on how hard it would be to support encoding type inference for the DataFrame interchange protocol objects?

jcrist (Author) commented Jul 17, 2023

Thanks for the explanation! While that fixes it, it'd be nice if users could write the same code as the pandas version.

If we're saving the output to .html (and thus, I think, have to load the data anyway to write it to the output), could we instead convert the input to a pandas DataFrame using pd.api.interchange.from_dataframe and fall back to the pandas code path here (sketch below)? Are there downsides to handling it this way?
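
A minimal sketch of that fallback, using a hypothetical helper (not part of Altair's API):

import pandas as pd

def _ensure_pandas(data):
    # Hypothetical helper: convert any object implementing the
    # __dataframe__ protocol into a pandas DataFrame so the existing
    # pandas type-inference path can be reused unchanged.
    if isinstance(data, pd.DataFrame):
        return data
    if hasattr(data, "__dataframe__"):
        return pd.api.interchange.from_dataframe(data)
    return data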

Alternatively, the inference logic could perhaps infer the encoding type from the DtypeKind exposed by the __dataframe__ protocol (see https://data-apis.org/dataframe-protocol/latest/API.html)?
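
A rough sketch of what such inference could look like using the protocol's dtype tuple (the kind values follow the spec's DtypeKind enum; the mapping to Vega-Lite types is my assumption, not Altair's implementation):

def infer_encoding_type(data, name):
    # `data` is any object implementing __dataframe__; `dtype` is a tuple
    # of (kind, bit_width, format_str, endianness) per the interchange spec.
    kind = data.__dataframe__().get_column_by_name(name).dtype[0]
    if kind in (0, 1, 2):   # INT, UINT, FLOAT
        return "quantitative"
    if kind in (20, 21):    # BOOL, STRING
        return "nominal"
    if kind == 22:          # DATETIME
        return "temporal"
    if kind == 23:          # CATEGORICAL
        return "ordinal"
    raise NotImplementedError(f"Unsupported dtype kind: {kind}")

For the pyarrow table above, infer_encoding_type(data, 'x') would return "nominal" and infer_encoding_type(data, 'y') would return "quantitative".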

jonmmease (Contributor) commented:

it'd be nice if users could write the same code as the pandas version.

Totally agree

could we instead convert the input to a pandas dataframe using pd.api.interchange.from_dataframe then fallback to the pandas code path here? Are there downsides to handling it this way?

When VegaFusion is active, it intercepts the __dataframe__ objects and works with them in Arrow format, so we don't want to convert to pandas before that point. I don't know off-hand where the encoding type logic runs relative to the data transformer logic that VegaFusion relies on.

In general, I like the idea of getting the schema from the __dataframe__ protocol itself if that's possible. This would be a step in the direction of removing the hard dependency on pandas that @mattijn has talked about.

joelostblom (Contributor) commented:

In addition to what @jonmmease mentioned, there is some more discussion about how to support other dataframe libraries starting from this comment onwards, #2868 (comment), with similar suggestions about why it might be a good idea not to rely on pandas for other protocols.

mattijn (Contributor) commented Jul 17, 2023

Support for the DataFrame Interchange Protocol is still experimental, and it currently lacks features such as type inference.

Another related quote from the same issue @joelostblom refers to: #2868 (comment)

I wouldn't want Polars users to get used to not having to provide a data type, and then have that suddenly break in the future. So my first thought was to provide less support initially (it seems easier to add more support later).

Once pyarrow is wasm-friendly (pyodide/pyodide#2933) and/or pandas has pyarrow as a hard dependency (pandas-dev/pandas#52711), we can phase in the dataframe interchange protocol for all dataframe-likes (and simultaneously phase out the dependency on pandas).

The question remains whether the current type inference for pandas dataframes can be used for all dataframes that are passed through the dataframe interchange protocol, or whether this needs to be done differently.

Somewhat related: #3076, the first PR that works on serialization of data passed through the dataframe interchange protocol.

Additional improvements are welcome 🤗

mattijn (Contributor) commented Jul 17, 2023

Closing this issue in favor of #3112. Thanks for raising it, @jcrist!
