Consider how to deal with the proliferation of decoder options on open_dataset #939

shoyer · 2016-08-04T01:57:26Z

There are already lots of keyword arguments, and users want even more! (#843)

Maybe we should use some sort of object to encapsulate desired options?

mcgibbon · 2016-08-04T19:55:10Z

We already have the dictionary. Users can make a decode_options dictionary, and then call what they want to with **decode_options.
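A minimal sketch of the dictionary pattern described above. `open_dataset_stub` is a hypothetical stand-in (not part of xarray) that only echoes the options it receives, so the example runs without any data files; the keyword names are borrowed from this thread.

```python
def open_dataset_stub(filename, decode_cf=True, mask_and_scale=True,
                      decode_times=True):
    """Hypothetical stand-in for open_dataset: returns the options it was given."""
    return {
        "filename": filename,
        "decode_cf": decode_cf,
        "mask_and_scale": mask_and_scale,
        "decode_times": decode_times,
    }


# Build the options once, then reuse them for any call that accepts them.
decode_options = {"mask_and_scale": False, "decode_times": False}
result = open_dataset_stub("example.nc", **decode_options)

# A misspelled key is only caught when the call is made, because the
# function declares its keyword arguments explicitly.
try:
    open_dataset_stub("example.nc", **{"decode_time": False})
except TypeError as err:
    typo_error = str(err)
```

Note that the typo is caught only because the stand-in lists its keywords explicitly; a function that accepted arbitrary `**kwargs` and forwarded them would surface the error later, or not at all.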

shoyer · 2017-05-10T21:53:31Z

One advantage of creating classes to encapsulate these options is that error checking on field names and values is easier than with a dictionary, though a dictionary is nice and lightweight.

The simplest thing to do would be to move these options out of open_dataset into a single other argument, e.g., open_dataset(filename, **kwargs) -> open_dataset(filename, decode_options=kwargs).

mcgibbon · 2017-05-10T23:26:57Z

I would disagree with the form open_dataset(filename, decode_options=kwargs) over open_dataset(filename, **kwargs), because the former breaks normal Python style. It would make the documentation for the arguments somewhat awkward ("decode_options is a dictionary which can have any of the following keys [...]"). It also forces the user to use a dictionary instead of having the option to use a dictionary or the regular style of entering kwargs.

What do you mean when you say it's easier to do error checking on field names and values? The xarray implementation can still use fields instead of a dictionary, with the user saying open_dataset(filename, **kwargs) if they feel like it. I think I'm not understanding something here.

shoyer · 2017-05-10T23:51:49Z

My concern is that open_dataset has 1 required and 12 optional arguments. This is too many to easily understand, and is generally considered poor software design. The standard solution is to group related options into objects, e.g., DecoderOptions or just Decoder (if we want to bundle related methods on the object).
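One way such an options object could look, as a sketch only. The class name `DecoderOptions` comes from the comment above, but the fields, defaults, and validation rule here are assumptions made for illustration, and `dataclasses` requires Python 3.7 or later.

```python
from dataclasses import dataclass, asdict


@dataclass
class DecoderOptions:
    """Hypothetical grouping of decoder flags; not an xarray class."""
    mask_and_scale: bool = True
    decode_times: bool = True
    concat_characters: bool = True
    decode_coords: bool = True

    def __post_init__(self):
        # Value checking happens once, when the object is created.
        for name, value in asdict(self).items():
            if not isinstance(value, bool):
                raise TypeError(
                    f"{name} must be a bool, got {type(value).__name__}")


options = DecoderOptions(decode_times=False)

# An unknown field name is rejected by the generated __init__.
try:
    DecoderOptions(decode_time=False)
except TypeError:
    bad_name_rejected = True

# A wrong value type is rejected by __post_init__.
try:
    DecoderOptions(decode_times="no")
except TypeError:
    bad_value_rejected = True
```

Both the field names and the values are checked at construction time, before any file is opened, which is the kind of up-front error checking a plain dictionary does not give by itself.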

pandas.read_csv is a more extreme example of the same issue.

mcgibbon · 2017-05-11T00:16:34Z

It is considered poor software design to have 13 arguments in Java and other languages which do not have optional arguments. The same isn't necessarily true of Python, but I haven't seen much discussion or writing on this.

I'd much rather have pandas.read_csv the way it is right now than to have a ReadOptions object that would need to contain exactly the same documentation and be just as hard to understand as read_csv. That object would serve only to separate the documentation of the settings for read_csv from the docstring for read_csv. If you really want to cut down on arguments, open_dataset should be separated into multiple functions. I wouldn't necessarily encourage these, but some possibilities are:

- Have a function which takes in an undecoded dataset and returns a CF-decoded dataset, instead of a decode_cf kwarg
- Have a function which takes in an unmasked/unscaled dataset and returns a masked/scaled dataset, instead of mask_and_scale
- Have a function which takes in a dataset with undecoded times and returns a decoded dataset, instead of decode_times

Similarly for decode_coords, chunks, and drop_variables. Should chunks and drop_variables even exist as kwargs, given that the functions to do these to a dataset already exist?
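The "separate functions" idea can be sketched as a small pipeline. Everything here is a toy: datasets are plain dictionaries, and `mask_and_scale` and `drop_variables` are hypothetical helpers written for this example rather than xarray functions. The attribute names `_FillValue`, `scale_factor`, and `add_offset` follow the CF conventions.

```python
import math


def mask_and_scale(variable):
    """Return a new variable with fill values masked and scaling applied."""
    attrs = variable["attrs"]
    fill = attrs.get("_FillValue")
    scale = attrs.get("scale_factor", 1.0)
    offset = attrs.get("add_offset", 0.0)
    data = [math.nan if value == fill else value * scale + offset
            for value in variable["data"]]
    return {"data": data, "attrs": {}}


def drop_variables(dataset, names):
    """Return a new dataset without the named variables."""
    return {name: var for name, var in dataset.items() if name not in names}


raw = {
    "temp": {"data": [10, -999, 30],
             "attrs": {"_FillValue": -999, "scale_factor": 0.5,
                       "add_offset": 1.0}},
    "unused": {"data": [0], "attrs": {}},
}

# Each step is an ordinary function call instead of a keyword argument.
trimmed = drop_variables(raw, {"unused"})
decoded = {name: mask_and_scale(var) for name, var in trimmed.items()}
```

Each step takes a dataset (or variable) and returns a new one, so the caller chooses which steps to apply by calling them, and the opening function needs no flag for any of them.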

All of that aside, the DecoderOptions object already exists if that's what you want - it's the dict.

shoyer · 2017-05-11T00:41:21Z

> I'd much rather have pandas.read_csv the way it is right now than to have a ReadOptions object that would need to contain exactly the same documentation and be just as hard to understand as read_csv.

Certainly I agree here. The alternative would be separating out related functionality into related groups, e.g., NameOptions, ParserOptions, MissingValueOptions, DatetimeOptions, etc., basically exactly the groupings you see in the pandas docs.
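A rough sketch of what grouping by related functionality might look like. The group names `ParserOptions` and `MissingValueOptions` are taken from the comment above; their fields and the `read_rows` function are invented for this example and do not correspond to any pandas API.

```python
from typing import NamedTuple, Tuple


class ParserOptions(NamedTuple):
    """Hypothetical group: how to split the text into cells."""
    sep: str = ","
    header: bool = True


class MissingValueOptions(NamedTuple):
    """Hypothetical group: which cell strings count as missing."""
    na_values: Tuple[str, ...] = ("", "NA")


def read_rows(text, parser=ParserOptions(), missing=MissingValueOptions()):
    """Toy reader taking two option groups instead of many flat keywords."""
    lines = text.strip().splitlines()
    if parser.header:
        lines = lines[1:]  # skip the header row
    rows = []
    for line in lines:
        cells = line.split(parser.sep)
        rows.append([None if cell in missing.na_values else cell
                     for cell in cells])
    return rows


rows = read_rows("a;b\n1;NA\n;4", parser=ParserOptions(sep=";"))
```

The reader's signature stays at three parameters however many options each group grows, and each group can be documented separately.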

dopplershift · 2017-05-11T16:08:19Z

I agree that having too many keyword arguments is poor design; it's representative of either failing to abstract anything away or having the object/function just do too much. For a specific example, this jumps out to me as a problem:

        ds = conventions.decode_cf(
            store, mask_and_scale=mask_and_scale, decode_times=decode_times,
            concat_characters=concat_characters, decode_coords=decode_coords,
            drop_variables=drop_variables)

Already open_dataset takes 5 parameters just to pass on directly to another function. This means that to add a 6th to decode_cf, you have to update the code and docstring there, and then make those same changes to open_dataset. Now, you could argue that they're used again elsewhere within open_dataset:

            token = tokenize(file_arg, group, decode_cf, mask_and_scale,
                             decode_times, concat_characters, decode_coords,
                             engine, chunks, drop_variables)

but again you're using all of these parameters together. If all of these variable values are needed to define the state, you already have an implicit object in your code; you're just not using the language syntax to help you by encapsulating it.

I'd be in favor of having lightweight classes (essentially mutable named tuples) vs. dictionaries. The former allows more discoverability to the interface (i.e. tab completion in IPython) as well as better up-front error checking (you could use __slots__ to permit only certain attributes). My experience with assembling dictionaries for options is a world of typo-prone pain; trying to prevent that is especially important when teaching new users. You could still give this class the right hooks (e.g. __iter__, asdict) to allow it to be passed as **kwargs to decode_cf.
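A sketch of such a lightweight class, under stated assumptions: `DecodeOptions` and `decode_cf_stub` are hypothetical names, and the stub only echoes its arguments. One detail worth noting is that Python's `**` unpacking relies on a `keys()` method together with `__getitem__`, so those are the hooks implemented here.

```python
class DecodeOptions:
    """Hypothetical mutable options holder restricted to known attributes."""
    __slots__ = ("mask_and_scale", "decode_times", "concat_characters")

    def __init__(self, mask_and_scale=True, decode_times=True,
                 concat_characters=True):
        self.mask_and_scale = mask_and_scale
        self.decode_times = decode_times
        self.concat_characters = concat_characters

    # keys() and __getitem__ together make `**options` work.
    def keys(self):
        return self.__slots__

    def __getitem__(self, name):
        if name not in self.__slots__:
            raise KeyError(name)
        return getattr(self, name)


def decode_cf_stub(store, mask_and_scale=True, decode_times=True,
                   concat_characters=True):
    """Hypothetical stand-in for decode_cf: returns what it received."""
    return (store, mask_and_scale, decode_times, concat_characters)


options = DecodeOptions()
options.decode_times = False          # attribute access, tab-completable
result = decode_cf_stub("store", **options)

# Because of __slots__, a misspelled attribute fails immediately.
try:
    options.decode_time = False
except AttributeError:
    typo_rejected = True
else:
    typo_rejected = False
```

The misspelling is rejected at the point of assignment rather than at the later call, which is the up-front checking that `__slots__` buys over a dictionary.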

dcherian · 2020-10-06T15:39:11Z

See #4490 for a concrete API proposal. I think we can move the discussion there.

ocefpaf mentioned this issue Aug 4, 2016

Don't convert time data to timedelta by default #940

Closed

dcherian added the API design label Jan 22, 2019

dcherian closed this as completed Oct 6, 2020
