Rethinking `fix_file` #2129

schlunma · 2023-07-11T16:02:52Z

Currently, fx_file has the call signature (filename) -> (filename). Thus, files need to be copied if they need to be modified (overwriting input files is a very bad idea), which is very expensive. Usually, the netCDF4 library is used to perform these kind of modifications.

A possible alternative that has often come up is xarray. At the moment, there is a method DataArray.to_iris() which allows converting a DataArray into a Cube (no idea how efficient this is, though). In addition, the iris devs are working on an improved xarray-iris bridge.

Thus, it would be very nice to allow an efficient usage of xarray in fix_file. My question is now: how would that look like in practice? Possible solutions I could think of are:

Allow fx_file to return xarray objects (currently only DataArray?) and load to read these kind of objects.
Allow fx_file to return cubes and load to read cubes.
Rewrite fx_file in such a way that it is a callback in load (with one of the call signatures above).

The reason I am asking this now is that I want to get started with reading the ERA5 GRIB files that are available on Levante (see #1991). However, iris-grib cannot open them. On the other hand, xarray (in combination with ECMWF's cfgrib) works perfectly fine. Thus, an option to include xarray into our pipeline would be super helpful for me.

@ESMValGroup/technical-lead-development-team opinions?

The text was updated successfully, but these errors were encountered:

bouweandela · 2023-08-01T12:58:23Z

A possible alternative that has often come up is xarray. At the moment, there is a method DataArray.to_iris()which allows converting a DataArray into a Cube (no idea how efficient this is, though).

The issue with the current implementation is of xarray.DataArray.to_iris is that it loses information, if I remember correctly some of the coordinates got lost last time I tried to use it.

bouweandela · 2023-08-01T13:05:09Z

Rewrite fx_file in such a way that it is a callback in load (with one of the call signatures above).

This seems the cleanest option, but I doubt whether the callback argument is flexible enough for all use cases. A limiting factor of the callback function is that you only have access to a single cube at a time.

bouweandela · 2023-08-01T13:20:40Z

If we want to work with data that is not stored in files at some point in the future, it would be useful to either get rid of fix_file altogether, or make it return whatever input it got if it is not a file.

We could consider making load call the to_iris method of its first argument (file) if it has such a method and fall back to calling iris.load_raw if it doesn't have it? It would be really good to take into account the potential use of intake-esm and STAC to load data from the cloud at some point in the future. Does anyone have experience trying to do this through iris, without using xarray? Maybe by using fsspec?

valeriupredoi · 2023-08-01T13:29:01Z

AFAIK iris can't load file objects eg something stored on an S3 storage so our main problem is not fix_file but the I/O itself - if we get the I/O to work with object stores then we can modify fix_file to work on such a thing I reckon

valeriupredoi · 2023-08-01T13:31:20Z

I have experience with netCDF files on various types of storages, including object stores, so does @zklaus (am fairly sure of that) - iris is the type of thing you don't want in that case, you want a basic loader into a file format like Zarr or overloaded Zarr - beware of xarray since that transfers the data!

schlunma · 2023-08-01T13:36:08Z

The issue with the current implementation is of xarray.DataArray.to_iris is that it loses information, if I remember correctly some of the coordinates got lost last time I tried to use it.

I tested this a lot lately for the ERA5 grib files, in which case I didn't find any problems with this function. The lazyness of the data and all relevant metadata is properly preserved.

This seems the cleanest option, but I doubt whether the callback argument is flexible enough for all use cases. A limiting factor of the callback function is that you only have access to a single cube at a time.

Conceptually I agree, but our current callback argument (which is simply passed to iris.load_raw is not suited for that. It would rather be something like a load_func option that defaults to iris.load_raw. However, to ensure backwards-compatibility, we need to allow fix_file to return paths...this will not really make it cleaner.

If we want to work with data that is not stored in files at some point in the future, it would be useful to either get rid of fix_file altogether, or make it return whatever input it got if it is not a file.

I don't think we should get rid of fix_file entirely (we cannot expected Iris to handle every single format natively). fix_file already returns what it gets if no special fix is needed. I added some type hints for that in #2160, but these can be changed to Any.

We could consider making load call the to_iris method of its first argument (file) if it has such a method and fall back to calling iris.load_raw if it doesn't have it? It would be really good to take into account the potential use of intake-esm and STAC to load data from the cloud at some point in the future. Does anyone have experience trying to do this through iris, without using xarray? Maybe by using fsspec?

In #2160, I allowed load to handle CubeList objects (they are simply returned), this is more flexible than relying on xarray's to_iris in my opinion (we are not sure if this works in every case), and the magic xarray -> iris can happen in fix_file. Support for other formats can be implemented to load in subsequent PRs.

schlunma · 2024-11-11T11:41:07Z

Last week I found that reading nc files with lots of variables (like we have for many native models like ICON, EMAC, etc.) can be very slow with iris (it can take hours depending on the setup). See more details here: SciTools/iris#6223. I found that ncdata can really help in reducing this time (~40% by just using xarray->iris instead of iris directly, much more if we extract individual variables with xarray).

Thus, I'd really like to implement the solution "Allow fx_file to return cubes and load to read cubes." mentioned above. I saw that this is already done in a draft PR here. @bouweandela what are your plans regarding this PR? If that's okay with you, I could extract the general parts from that PR that affect all datasets and open a new PR based on that.

With the more flexible fix_file we can then put code that extracts variables into fix_file, which, in my ICON example, will speedup data loading by more than a factor of 10.

bouweandela · 2024-11-13T21:20:53Z

@bouweandela what are your plans regarding #2454?

While working on it, I realized that the separation between fix_file/fix_metadata/fix_data does no longer have a purpose now that we're working with ncdata and dask arrays. It would much simpler to have a single fix function that takes care of all three operations, that's why the pull request is still in draft. Also, loading the data with xarray or ncdata leads to very different Dask chunks, so I was planning to investigate if that adversely affects performance.

schlunma · 2024-11-14T08:48:11Z

It would much simpler to have a single fix function that takes care of all three operations

I really like this idea!

However, I don't think this can be easily implemented in practice. Currently, the existing fix functions are not called one after each other. The order is rather

fix_file --> load --> fix_metadata --> concatenate --> cmor_check_metadata --> clip_timerange --> fix_data --> cmor_check_data

How and where in this chain would you implement this fix in practice? We could of course re-order these operations, but are we sure that this won't mess up (some of) the existing fixes? If clip_timerange is at the end of the pipeline, is Dask "smart" enough to discard the unused chunks in earlier computations? How do we make this backwards-compatible?

schlunma · 2024-12-11T13:50:39Z

@bouweandela I just opened a PR that implements the option to use xarray and ncdata objects in fix_file, just as you implemented in #2454. This is necessary for further changes which would dramatically improve the evaluation of ICON and EMAC output (see #2579). I think this would be a great first step in making our fixes more flexible.

What do you think?

schlunma added the enhancement New feature or request label Jul 11, 2023

schlunma added this to the v2.10.0 milestone Jul 11, 2023

schlunma self-assigned this Jul 11, 2023

This was referenced Jul 11, 2023

Do we still need fix_file? #463

Closed

Adding DKRZ drs for native6 to enable usage of local vast collection of ERA5 data (hourly, grb) #1991

Closed

schlunma mentioned this issue Aug 1, 2023

Updated doc about fixes and added type hints to fix functions #2160

Merged

10 tasks

schlunma removed this from the v2.10.0 milestone Aug 28, 2023

bouweandela mentioned this issue Jun 10, 2024

Avoid copying input data in CMIP6 CESM2 air pressure level fixes #2454

Draft

10 tasks

This was referenced Nov 12, 2024

Loading a netCDF file with multiple variables is very slow SciTools/iris#6223

Open

Allow fix_file to return Cube and CubeList objects #2579

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Rethinking `fix_file` #2129

Rethinking `fix_file` #2129

schlunma commented Jul 11, 2023

bouweandela commented Aug 1, 2023 •

edited

Loading

bouweandela commented Aug 1, 2023 •

edited

Loading

bouweandela commented Aug 1, 2023

valeriupredoi commented Aug 1, 2023

valeriupredoi commented Aug 1, 2023

schlunma commented Aug 1, 2023

schlunma commented Nov 11, 2024

bouweandela commented Nov 13, 2024

schlunma commented Nov 14, 2024

schlunma commented Dec 11, 2024

Rethinking fix_file #2129

Rethinking fix_file #2129

Comments

schlunma commented Jul 11, 2023

bouweandela commented Aug 1, 2023 • edited Loading

bouweandela commented Aug 1, 2023 • edited Loading

bouweandela commented Aug 1, 2023

valeriupredoi commented Aug 1, 2023

valeriupredoi commented Aug 1, 2023

schlunma commented Aug 1, 2023

schlunma commented Nov 11, 2024

bouweandela commented Nov 13, 2024

schlunma commented Nov 14, 2024

schlunma commented Dec 11, 2024

Rethinking `fix_file` #2129

Rethinking `fix_file` #2129

bouweandela commented Aug 1, 2023 •

edited

Loading

bouweandela commented Aug 1, 2023 •

edited

Loading