Consider how to deal with the proliferation of decoder options on open_dataset #939

Closed
shoyer opened this issue Aug 4, 2016 · 8 comments

Comments

@shoyer
Member

shoyer commented Aug 4, 2016

There are already lots of keyword arguments, and users want even more! (#843)

Maybe we should use some sort of object to encapsulate desired options?

@mcgibbon
Contributor

mcgibbon commented Aug 4, 2016

We already have the dictionary. Users can make a decode_options dictionary, and then call what they want to with **decode_options.
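As a minimal sketch of the pattern described above: the options live in a plain dictionary and are unpacked with `**` at each call. `open_dataset_stub` is a hypothetical stand-in used only for illustration; the keyword names come from this thread, but the body is not xarray's implementation.

```python
# Hypothetical stand-in for open_dataset: it only records the options it
# received, so the example runs without xarray or a real file.
def open_dataset_stub(filename, decode_cf=True, mask_and_scale=True,
                      decode_times=True, decode_coords=True,
                      drop_variables=None):
    return {
        "filename": filename,
        "decode_cf": decode_cf,
        "mask_and_scale": mask_and_scale,
        "decode_times": decode_times,
        "decode_coords": decode_coords,
        "drop_variables": drop_variables,
    }


# Build the options once ...
decode_options = {"decode_times": False, "mask_and_scale": False}

# ... and reuse them for every call via ** unpacking.
result_a = open_dataset_stub("a.nc", **decode_options)
result_b = open_dataset_stub("b.nc", **decode_options)
```

Options that are not in the dictionary keep the function's defaults, so the dictionary only needs to list what differs.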

@shoyer
Member Author

shoyer commented May 10, 2017

One advantage of creating classes to encapsulate these options is that error checking on field names and values is easier than with a dictionary, though a dictionary is nice and lightweight.

The simplest thing to do would be to move these options out of open_dataset into a single other argument, e.g., open_dataset(filename, **kwargs) -> open_dataset(filename, decode_options=kwargs).

@mcgibbon
Contributor

I would disagree with the form open_dataset(filename, decode_options=kwargs) over open_dataset(filename, **kwargs), because the former breaks normal Python style. It would make the documentation for the arguments somewhat awkward ("decode_options is a dictionary which can have any of the following keys [...]"). It also forces the user to use a dictionary instead of having the option to use a dictionary or the regular style of entering kwargs.

What do you mean when you say it's easier to do error checking on field names and values? The xarray implementation can still use fields instead of a dictionary, with the user saying open_dataset(filename, **kwargs) if they feel like it. I think I'm not understanding something here.
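A small sketch of the point about field names: when the function declares its options as explicit keyword parameters, Python itself rejects a misspelled key in an unpacked dictionary. `open_dataset_stub` is again a hypothetical stand-in, not xarray's function.

```python
# Hypothetical stand-in with explicit keyword-only options.
def open_dataset_stub(filename, *, decode_cf=True, mask_and_scale=True,
                      decode_times=True):
    return (filename, decode_cf, mask_and_scale, decode_times)


good_options = {"decode_times": False}
typo_options = {"decode_time": False}  # misspelled on purpose

# A correctly spelled key is accepted.
open_dataset_stub("a.nc", **good_options)

# A misspelled key raises TypeError at call time.
error_message = None
try:
    open_dataset_stub("a.nc", **typo_options)
except TypeError as err:
    error_message = str(err)
```

This checks names only, and only when the call is made. It does not validate values (for example, passing a string where a bool is expected), which is where a dedicated options class could add something.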

@shoyer
Member Author

shoyer commented May 10, 2017

My concern is that open_dataset has 1 required and 12 optional arguments. This is too many to easily understand, and is generally considered poor software design. The standard solution is to group related options into objects, e.g., DecoderOptions or just Decoder (if we want to bundle related methods on the object).

pandas.read_csv is a more extreme example of the same issue.
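One way the grouping could look, sketched with a standard-library dataclass. The class name `DecoderOptions` comes from the comment above, but its fields, the validation, and the `decode_options` parameter on the stand-in function are assumptions for illustration, not an existing xarray API.

```python
from dataclasses import asdict, dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class DecoderOptions:
    """Hypothetical bundle of decoding options (illustrative only)."""
    mask_and_scale: bool = True
    decode_times: bool = True
    concat_characters: bool = True
    decode_coords: bool = True
    drop_variables: Optional[Tuple[str, ...]] = None

    def __post_init__(self):
        # Value checking happens once, when the options object is built.
        for name in ("mask_and_scale", "decode_times",
                     "concat_characters", "decode_coords"):
            value = getattr(self, name)
            if not isinstance(value, bool):
                raise TypeError(f"{name} must be a bool, got {value!r}")


# Hypothetical stand-in: one options argument instead of five keywords.
def open_dataset_stub(filename, decode_options=None):
    options = decode_options if decode_options is not None else DecoderOptions()
    return {"filename": filename, **asdict(options)}


options = DecoderOptions(decode_times=False)
result = open_dataset_stub("a.nc", decode_options=options)
```

Both a misspelled field name and a non-bool value raise `TypeError` when the object is constructed, before any file is opened.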

@mcgibbon
Contributor

It is considered poor software design to have 13 arguments in Java and other languages which do not have optional arguments. The same isn't necessarily true of Python, but I haven't seen much discussion or writing on this.

I'd much rather have pandas.read_csv the way it is right now than have a ReadOptions object that would need to contain exactly the same documentation and be just as hard to understand as read_csv. That object would serve only to separate the documentation of the settings for read_csv from the docstring for read_csv. If you really want to cut down on arguments, open_dataset should be separated into multiple functions. I wouldn't necessarily encourage these, but some possibilities are:

  • Have a function which takes in an undecoded dataset and returns a CF-decoded dataset, instead of a decode_cf kwarg
  • Have a function which takes in an unmasked/unscaled dataset and returns a masked/scaled dataset, instead of mask_and_scale
  • Have a function which takes in a dataset with undecoded times and returns a decoded dataset, instead of decode_times
  • Similarly for decode_coords, chunks, and drop_variables. Should chunks and drop_variables even exist as kwargs, given that functions that do these things to a dataset already exist?
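The "separate functions" idea above can be sketched on a toy dataset made of plain dictionaries. Everything here is hypothetical and deliberately simplified: the function names, the dictionary layout, and the handling of `scale_factor`, `add_offset`, and `_FillValue` stand in for real decoding and are not xarray code.

```python
import math


def mask_and_scale(variable):
    """Return a decoded copy: fill values become NaN, then scale and offset apply."""
    attrs = dict(variable["attrs"])
    fill = attrs.pop("_FillValue", None)
    scale = attrs.pop("scale_factor", 1.0)
    offset = attrs.pop("add_offset", 0.0)
    data = [math.nan if value == fill else value * scale + offset
            for value in variable["data"]]
    return {"data": data, "attrs": attrs}


def drop_variables(dataset, names):
    """Return a copy of the dataset without the named variables."""
    return {name: var for name, var in dataset.items() if name not in names}


def decode_dataset(dataset):
    """Apply mask_and_scale to every variable in the dataset."""
    return {name: mask_and_scale(var) for name, var in dataset.items()}


raw = {
    "temperature": {
        "data": [100, 200, -999],
        "attrs": {"scale_factor": 0.5, "add_offset": 10.0, "_FillValue": -999},
    },
    "scratch": {"data": [0], "attrs": {}},
}

# Each step is an explicit call rather than a keyword on the opener.
decoded = decode_dataset(drop_variables(raw, {"scratch"}))
```

The trade-off is visible even at this scale: the opener needs no decoding keywords, but the caller has to know which steps to apply and in what order.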

All of that aside, the DecoderOptions object already exists if that's what you want - it's the dict.

@shoyer
Member Author

shoyer commented May 11, 2017

> I'd much rather have pandas.read_csv the way it is right now than to have a ReadOptions object that would need to contain exactly the same documentation and be just as hard to understand as read_csv.

Certainly I agree here. The alternative would be separating out related functionality into related groups, e.g., NameOptions, ParserOptions, MissingValueOptions, DatetimeOptions, etc., basically exactly the groupings you see in the pandas docs.

@dopplershift
Contributor

I agree that having too many keyword arguments is poor design; it's representative of either failing to abstract anything away or having the object/function just do too much. For a specific example, this jumps out to me as a problem:

        ds = conventions.decode_cf(
            store, mask_and_scale=mask_and_scale, decode_times=decode_times,
            concat_characters=concat_characters, decode_coords=decode_coords,
            drop_variables=drop_variables)

Already open_dataset takes 5 parameters just to pass them directly on to another function. This means that to add a 6th to decode_cf, you have to update the code and docstring there, and then make those same changes to open_dataset. Now, you could argue that they're used again elsewhere within open_dataset:

            token = tokenize(file_arg, group, decode_cf, mask_and_scale,
                             decode_times, concat_characters, decode_coords,
                             engine, chunks, drop_variables)

but again you're using all of these parameters together. If all of these variable values are needed to define the state, you already have an implicit object in your code; you're just not using the language syntax to help you by encapsulating it.

I'd be in favor of having lightweight classes (essentially mutable named tuples) vs. dictionaries. The former makes the interface more discoverable (e.g. tab completion in IPython) and gives better up-front error checking (you could use __slots__ to permit only certain attributes). My experience with assembling dictionaries for options is a world of typo-prone pain; trying to prevent that is especially important when teaching new users. You could still give this class the right hooks (e.g. __iter__, asdict) to allow it to be passed as **kwargs to decode_cf.
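A rough sketch of such a lightweight class, assuming the option names shown in the decode_cf call above. `DecodeOptions` and `decode_cf_stub` are hypothetical names for illustration. Python's `**` unpacking needs `keys()` and `__getitem__` on the object, so the sketch defines those two hooks rather than `__iter__`.

```python
class DecodeOptions:
    """Hypothetical mutable options holder; __slots__ limits the attributes."""
    __slots__ = ("mask_and_scale", "decode_times", "concat_characters",
                 "decode_coords", "drop_variables")

    def __init__(self, mask_and_scale=True, decode_times=True,
                 concat_characters=True, decode_coords=True,
                 drop_variables=None):
        self.mask_and_scale = mask_and_scale
        self.decode_times = decode_times
        self.concat_characters = concat_characters
        self.decode_coords = decode_coords
        self.drop_variables = drop_variables

    # keys() and __getitem__ are what ** unpacking relies on.
    def keys(self):
        return list(self.__slots__)

    def __getitem__(self, name):
        if name not in self.__slots__:
            raise KeyError(name)
        return getattr(self, name)


# Hypothetical stand-in for decode_cf; it only echoes what it received.
def decode_cf_stub(store, mask_and_scale=True, decode_times=True,
                   concat_characters=True, decode_coords=True,
                   drop_variables=None):
    return {"store": store, "mask_and_scale": mask_and_scale,
            "decode_times": decode_times}


options = DecodeOptions()
options.decode_times = False  # a valid attribute can be changed freely

# A misspelled attribute is rejected because it is not listed in __slots__.
try:
    options.decode_time = False
except AttributeError:
    typo_rejected = True
else:
    typo_rejected = False

decoded = decode_cf_stub("store", **options)
```

Unlike a dictionary, the typo fails at the assignment itself rather than later at the call, and the attribute names are available for tab completion.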

@dcherian
Contributor

dcherian commented Oct 6, 2020

See #4490 for a concrete API proposal. I think we can move the discussion there.

@dcherian dcherian closed this as completed Oct 6, 2020