Store and fast load weather data #590
-
Hi! There are plenty of ways you can use Blosc2 to address your needs, but I am afraid we would need some more info to help you better. Do you plan to use Python, or just plain C? How much data are we talking about? MBs? GBs? TBs? Do you need to mostly retrieve columns or rows of data? Providing this sort of context is always useful before directing you towards one direction or another.
-
The normal thing to do is to build a super-chunk and append chunks to it one at a time (for example via blosc2_schunk_append_buffer()). You have a complete example here: https://github.com/Blosc/c-blosc2/blob/1dd1e55cb329d01c210da77ceb53027853c35b72/bench/trunc_prec_schunk.c As for providing multidimensional metadata, this is achieved by attaching the corresponding metalayer to the super-chunk (this is what the b2nd layer does for NDim arrays). Note that, although most of our teaching material is Python-oriented, it is still worth reading it to understand how the whole process works. Once you get familiar with the Python way, you will better grasp the C API for Blosc2.
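In case it helps, here is a minimal sketch in C of how such an append loop could look; the file name, chunk size, number of chunks and nthreads values are arbitrary choices for illustration, and the synthetic buffer stands in for your real observations:

```c
#include <blosc2.h>
#include <stdio.h>
#include <stdlib.h>

#define CHUNK_NITEMS (1000 * 1000)
#define NCHUNKS 10

int main(void) {
  blosc2_init();

  // Compression/decompression parameters; nthreads enables multithreading.
  blosc2_cparams cparams = BLOSC2_CPARAMS_DEFAULTS;
  cparams.typesize = sizeof(float);
  cparams.nthreads = 4;
  blosc2_dparams dparams = BLOSC2_DPARAMS_DEFAULTS;
  dparams.nthreads = 4;

  // Persistent super-chunk on disk (set urlpath to NULL for in-memory).
  blosc2_storage storage = BLOSC2_STORAGE_DEFAULTS;
  storage.cparams = &cparams;
  storage.dparams = &dparams;
  storage.contiguous = true;
  storage.urlpath = "weather.b2frame";   // assumed file name
  blosc2_remove_urlpath(storage.urlpath);  // start from a clean file

  blosc2_schunk *schunk = blosc2_schunk_new(&storage);
  if (schunk == NULL) {
    fprintf(stderr, "Cannot create the super-chunk\n");
    return 1;
  }

  float *buffer = malloc(CHUNK_NITEMS * sizeof(float));
  for (int nchunk = 0; nchunk < NCHUNKS; nchunk++) {
    for (int i = 0; i < CHUNK_NITEMS; i++) {
      buffer[i] = (float)(i + nchunk * CHUNK_NITEMS);  // stand-in for real data
    }
    // Compress the buffer and append it as a new chunk in the super-chunk.
    int64_t nchunks = blosc2_schunk_append_buffer(schunk, buffer,
                                                  CHUNK_NITEMS * sizeof(float));
    if (nchunks < 0) {
      fprintf(stderr, "Error appending chunk %d\n", nchunk);
      return 1;
    }
  }

  free(buffer);
  blosc2_schunk_free(schunk);
  blosc2_destroy();
  return 0;
}
```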
Yeah, that's the rule of thumb, but in general this requires some experimentation and knowledge about your data, and most especially, about your data retrieval patterns (which are the part you are trying to optimize). A good starting point would be to have a look at these two blogs: https://www.blosc.org/posts/blosc2-ndim-intro/ (hints for optimizing NDim data). [The latter is about using Blosc2 in combination with HDF5, but provided that you are using tabular data, it might interest you.] Providing more hints for your specific case and CPU is out of scope here, but we do offer consulting services to help you on that matter. If interested, please write to [email protected].
Not sure what you are referring to with a 'dictionary' here, but again, experimentation is king. FWIW, you might want to check our article on how we handled a highly sparse grid of 7.3 TB (representing stars in the Milky Way) together with some complementary metadata in tabular form: https://conference.scipy.org/proceedings/scipy2023/pdfs/Francesc_Alted.pdf (article). Those experiments could be useful for your use case.
Multithreading is configurable for compressing (via the nthreads field in cparams) and decompressing (via nthreads in dparams).
-
Oops, I provided the wrong answer before (I have edited and corrected my previous reply). The repeated-value detection mechanism only works for bytes, not for general values. So, runs of zeros can be detected because all the bytes in 0 values are the same (0x0), and this is why the paper talks about a zero detection mechanism. This also works for other values whose bytes are all equal, e.g. int16 values expressed as 0x0101, since at the byte level they form a run of a single repeated byte.
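If you want to see this effect in practice, a quick way is to compress a buffer filled with such a value and look at the resulting size. A minimal sketch follows; the value 0x0101, the buffer size and the compression settings are just for illustration:

```c
#include <blosc2.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
  blosc2_init();

  const int nitems = 1000 * 1000;
  const int32_t nbytes = nitems * (int32_t)sizeof(int16_t);
  int16_t *src = malloc(nbytes);
  // Every int16 value is 0x0101, so the byte stream is one long run of 0x01
  // and the repeated-byte detection can encode the chunk very compactly.
  for (int i = 0; i < nitems; i++) {
    src[i] = 0x0101;
  }

  uint8_t *dest = malloc(nbytes + BLOSC2_MAX_OVERHEAD);
  int csize = blosc2_compress(5, BLOSC_SHUFFLE, sizeof(int16_t),
                              src, nbytes, dest, nbytes + BLOSC2_MAX_OVERHEAD);
  if (csize < 0) {
    fprintf(stderr, "Compression error: %d\n", csize);
    return 1;
  }
  printf("Compressed %d bytes into %d bytes (ratio: %.1fx)\n",
         nbytes, csize, (double)nbytes / csize);

  free(src);
  free(dest);
  blosc2_destroy();
  return 0;
}
```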
Yes, you can b2nd_open() a super-chunk on-disk, and for loading chunks from it you can use either blosc2_schunk_get_chunk() or blosc2_schunk_get_lazychunk(). The former reads the whole chunk, while the latter reads just the metainfo of the chunk, not the data. Currently, a lazy chunk can only be used by blosc2_decompress_ctx() and blosc2_getitem_ctx().
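To make that workflow more concrete, here is a rough sketch at the plain super-chunk level (using blosc2_schunk_open() rather than b2nd_open(); the "weather.b2frame" file name, the chunk index and the item count are made-up values matching the earlier sketch):

```c
#include <blosc2.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
  blosc2_init();

  // Open the on-disk super-chunk created earlier (path is an assumption).
  blosc2_schunk *schunk = blosc2_schunk_open("weather.b2frame");
  if (schunk == NULL) {
    fprintf(stderr, "Cannot open the super-chunk\n");
    return 1;
  }

  // Fetch a lazy chunk: only the chunk metainfo is read from disk here.
  uint8_t *lazychunk;
  bool needs_free;
  int cbytes = blosc2_schunk_get_lazychunk(schunk, 0, &lazychunk, &needs_free);
  if (cbytes < 0) {
    fprintf(stderr, "Cannot get the lazy chunk\n");
    return 1;
  }

  // Lazy chunks can be consumed by blosc2_getitem_ctx / blosc2_decompress_ctx.
  // Here we read just the first 10 float items out of the chunk.
  blosc2_dparams dparams = BLOSC2_DPARAMS_DEFAULTS;
  dparams.nthreads = 4;
  blosc2_context *dctx = blosc2_create_dctx(dparams);

  float items[10];
  int nbytes = blosc2_getitem_ctx(dctx, lazychunk, cbytes, 0, 10,
                                  items, (int32_t)sizeof(items));
  if (nbytes < 0) {
    fprintf(stderr, "Cannot get items from the lazy chunk\n");
    return 1;
  }
  printf("First item: %g\n", items[0]);

  blosc2_free_ctx(dctx);
  if (needs_free) {
    free(lazychunk);
  }
  blosc2_schunk_free(schunk);
  blosc2_destroy();
  return 0;
}
```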
Exactly, Blosc2 is in charge of multithreading for compressing, decompressing and/or I/O (in case of reading and decompressing full data chunks from disk). And yes, the plots we normally produce for our docs use the nthreads parameter in the cparams/dparams, not user-level threads.
No problem!
-
We store large amounts of weather data and are considering Blosc2 to help us achieve faster loading times. The data usually comes in some sort of time series:
(open data)
There can be multiple parameters per observation, and the observations are of course repeated for each weather station.
So basically what we want to try is to write a dump file that can be read as fast as possible on system restart and contains all the (necessary) time series information. Since I'm new to Blosc2, where do you suggest I start?