Store and fast load weather data #590
-
Hi! There are plenty of ways you can use Blosc2 to address your needs, but I am afraid we would need some more info to help you better. Do you plan to use Python, or just plain C? How much data are we talking about? MBs? GBs? TBs? Do you need to mostly retrieve columns or rows of data? Providing this sort of context is always useful before directing you towards one direction or another.
-
The normal thing to do is to build a super-chunk and append chunks to it one at a time (for example via blosc2_schunk_append_buffer()). You have a complete example here: https://github.com/Blosc/c-blosc2/blob/1dd1e55cb329d01c210da77ceb53027853c35b72/bench/trunc_prec_schunk.c As for providing multidimensional metadata, this is achieved by attaching the corresponding metalayer to the super-chunk (this is what the b2nd layer does for NDim arrays). Note that, although most of our teaching material is Python-oriented, it is still worth reading it to understand how the whole process works. Once you get familiar with the Python way, you will better grasp the C API for Blosc2.
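In case it helps, here is a minimal sketch in C of how such an append loop could look; the file name, chunk size, number of chunks and nthreads values are arbitrary choices for illustration, and the synthetic buffer stands in for your real observations:

```c
#include <blosc2.h>
#include <stdio.h>
#include <stdlib.h>

#define CHUNK_NITEMS (1000 * 1000)
#define NCHUNKS 10

int main(void) {
  blosc2_init();

  // Compression/decompression parameters; nthreads enables multithreading.
  blosc2_cparams cparams = BLOSC2_CPARAMS_DEFAULTS;
  cparams.typesize = sizeof(float);
  cparams.nthreads = 4;
  blosc2_dparams dparams = BLOSC2_DPARAMS_DEFAULTS;
  dparams.nthreads = 4;

  // Persistent super-chunk on disk (set urlpath to NULL for in-memory).
  blosc2_storage storage = BLOSC2_STORAGE_DEFAULTS;
  storage.cparams = &cparams;
  storage.dparams = &dparams;
  storage.contiguous = true;
  storage.urlpath = "weather.b2frame";   // assumed file name
  blosc2_remove_urlpath(storage.urlpath);  // start from a clean file

  blosc2_schunk *schunk = blosc2_schunk_new(&storage);
  if (schunk == NULL) {
    fprintf(stderr, "Cannot create the super-chunk\n");
    return 1;
  }

  float *buffer = malloc(CHUNK_NITEMS * sizeof(float));
  for (int nchunk = 0; nchunk < NCHUNKS; nchunk++) {
    for (int i = 0; i < CHUNK_NITEMS; i++) {
      buffer[i] = (float)(i + nchunk * CHUNK_NITEMS);  // stand-in for real data
    }
    // Compress the buffer and append it as a new chunk in the super-chunk.
    int64_t nchunks = blosc2_schunk_append_buffer(schunk, buffer,
                                                  CHUNK_NITEMS * sizeof(float));
    if (nchunks < 0) {
      fprintf(stderr, "Error appending chunk %d\n", nchunk);
      return 1;
    }
  }

  free(buffer);
  blosc2_schunk_free(schunk);
  blosc2_destroy();
  return 0;
}
```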
Yeah, that's the rule of thumb, but in general this requires some experimentation and knowledge about your data, and most especially, about your data retrieval patterns (which are the part you are trying to optimize). A good starting point would be to have a look at these two blogs: https://www.blosc.org/posts/blosc2-ndim-intro/ (hints for optimizing NDim data). [The latter is about using Blosc2 in combination with HDF5, but provided that you are using tabular data, it might interest you.] Providing more hints for your specific case and CPU is out of scope here, but we do offer consulting services to help you on that matter. If interested, please write to [email protected].
Not sure what you are referring to with a 'dictionary' here, but again, experimentation is king. FWIW, you might want to check our article on how we handled a highly sparse grid of 7.3 TB (representing stars in the Milky Way) together with some complementary metadata in tabular form: https://conference.scipy.org/proceedings/scipy2023/pdfs/Francesc_Alted.pdf (article). Those experiments could be useful for your use case.
Multithreading is configurable for compressing (via the nthreads field in cparams) and decompressing (via nthreads in dparams).
-
Oops, I provided the wrong answer before (I have edited and corrected my previous reply). The repeated-value detection mechanism only works for bytes, not for general values. So, runs of zeros can be detected because all the bytes in 0 values are the same (0x0), and this is why the paper talks about a zero detection mechanism. This also works for other values whose bytes are all equal, e.g. int16 values expressed as 0x0101, since at the byte level they form a run of a single repeated byte.
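If you want to see this effect in practice, a quick way is to compress a buffer filled with such a value and look at the resulting size. A minimal sketch follows; the value 0x0101, the buffer size and the compression settings are just for illustration:

```c
#include <blosc2.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
  blosc2_init();

  const int nitems = 1000 * 1000;
  const int32_t nbytes = nitems * (int32_t)sizeof(int16_t);
  int16_t *src = malloc(nbytes);
  // Every int16 value is 0x0101, so the byte stream is one long run of 0x01
  // and the repeated-byte detection can encode the chunk very compactly.
  for (int i = 0; i < nitems; i++) {
    src[i] = 0x0101;
  }

  uint8_t *dest = malloc(nbytes + BLOSC2_MAX_OVERHEAD);
  int csize = blosc2_compress(5, BLOSC_SHUFFLE, sizeof(int16_t),
                              src, nbytes, dest, nbytes + BLOSC2_MAX_OVERHEAD);
  if (csize < 0) {
    fprintf(stderr, "Compression error: %d\n", csize);
    return 1;
  }
  printf("Compressed %d bytes into %d bytes (ratio: %.1fx)\n",
         nbytes, csize, (double)nbytes / csize);

  free(src);
  free(dest);
  blosc2_destroy();
  return 0;
}
```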
Yes, you can b2nd_open() a super-chunk on-disk, and for loading chunks from it you can use either blosc2_schunk_get_chunk() or blosc2_schunk_get_lazychunk(). The former reads the whole chunk, while the latter reads just the metainfo of the chunk, not the data. Currently, a lazy chunk can only be used by blosc2_decompress_ctx() and blosc2_getitem_ctx().
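To make that workflow more concrete, here is a rough sketch at the plain super-chunk level (using blosc2_schunk_open() rather than b2nd_open(); the "weather.b2frame" file name, the chunk index and the item count are made-up values matching the earlier sketch):

```c
#include <blosc2.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
  blosc2_init();

  // Open the on-disk super-chunk created earlier (path is an assumption).
  blosc2_schunk *schunk = blosc2_schunk_open("weather.b2frame");
  if (schunk == NULL) {
    fprintf(stderr, "Cannot open the super-chunk\n");
    return 1;
  }

  // Fetch a lazy chunk: only the chunk metainfo is read from disk here.
  uint8_t *lazychunk;
  bool needs_free;
  int cbytes = blosc2_schunk_get_lazychunk(schunk, 0, &lazychunk, &needs_free);
  if (cbytes < 0) {
    fprintf(stderr, "Cannot get the lazy chunk\n");
    return 1;
  }

  // Lazy chunks can be consumed by blosc2_getitem_ctx / blosc2_decompress_ctx.
  // Here we read just the first 10 float items out of the chunk.
  blosc2_dparams dparams = BLOSC2_DPARAMS_DEFAULTS;
  dparams.nthreads = 4;
  blosc2_context *dctx = blosc2_create_dctx(dparams);

  float items[10];
  int nbytes = blosc2_getitem_ctx(dctx, lazychunk, cbytes, 0, 10,
                                  items, (int32_t)sizeof(items));
  if (nbytes < 0) {
    fprintf(stderr, "Cannot get items from the lazy chunk\n");
    return 1;
  }
  printf("First item: %g\n", items[0]);

  blosc2_free_ctx(dctx);
  if (needs_free) {
    free(lazychunk);
  }
  blosc2_schunk_free(schunk);
  blosc2_destroy();
  return 0;
}
```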
Exactly, Blosc2 is in charge of multithreading for compressing, decompressing and/or I/O (in case of reading and decompressing full data chunks from disk). And yes, the plots we normally produce for our docs use the nthreads parameter in the cparams/dparams, not user-level threads.
No problem!
-
We store large amounts of weather data and are considering Blosc2 to help us achieve faster loading times. The data usually comes in some sort of time series:
(open data)
There can be multiple parameters per observation, and the observations are of course repeated for each weather station.
So basically what we want to try is to write a dump file that can be read as fast as possible on system restart and contains all the (necessary) time series information. Since I'm new to Blosc2, where do you suggest I start?