cuSZp is an ultra-fast and user-friendly GPU error-bounded lossy compressor for floating-point data array (both single- and double-precision). In short, cuSZp has several key features:
- Fusing entire compression/decompression phase into one CUDA kernel function.
- Efficient latency control and memory access -- targeting ultra-fast end-to-end throughput.
- Two encoding modes (plain or outlier modes) supported, high compression ratio for different data patterns. In general, if your dataset is sparse (consisting lots of 0s) -- plain mode will be a good choice; if your dataset exhibits non-sparse and high smoothness -- outlier mode will be a good choice.
- Executable binary, C/C++ API, Python API are provided.
- Linux OS with NVIDIA GPUs
- Git >= 2.15
- CMake >= 3.21
- Cuda Toolkit >= 11.0
- GCC >= 7.3.0
You can compile and install cuSZp with following commands.
$ git clone https://github.com/szcompressor/cuSZp.git
$ cd cuSZp
$ mkdir build && cd build
$ cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=../install/ ..
$ make -j
$ make install
After installation, you will see executable binary generated in cuSZp/install/bin/
and shared/static library generated in cuSZp/install/lib/
.
You can also try ctest -V
in cuSZp/build/
folder (Optional).
After installation, you will see cuSZp
binary generated in cuSZp/install/bin/
. The detailed usage of this binary are explained in the code block below.
Usage:
./cuSZp -i [oriFilePath] -t [dataType]
-m [encodingMode] -eb [errorBoundMode] [errorBound]
-x [cmpFilePath] -o [decFilePath]
Options:
-i [oriFilePath] Path to the input file containing the original data
-t [dataType] Data type of the input data. Options:
f32 : Single precision (float)
f64 : Double precision (double)
-m [encodingMode] Encoding mode to use. Options:
plain : Plain fixed-length encoding mode
outlier : Outlier fixed-length encoding mode
-eb [errorBoundMode] [errorBound]
errorBoundMode can only be:
abs : Absolute error bound
rel : Relative error bound
errorBound is a floating-point number representing the error bound, e.g. 1E-4, 0.03.
-x [cmpFilePath] Path to the compressed output file (optional)
-o [decFilePath] Path to the decompressed output file (optional)
Some example commands can be found here:
./cuSZp -i pressure_3000 -t f32 -m plain -eb abs 1E-4 -x pressure_3000.cuszp.cmp -o pressure_3000.cuszp.dec
./cuSZp -i ccd-tst.bin.d64 -t f64 -m outlier -eb abs 0.01
./cuSZp -i velocity_x.f32 -t f32 -m outlier -eb rel 0.01 -x velocity_x.f32.cuszp.cmp
./cuSZp -i xx.f32 -m outlier -eb rel 1e-4 -t f32
Some results measured on one NVIDIA A100 GPU can be shown as below:
$ ./cuSZp -i pressure_2000 -t f32 -m plain -eb rel 1e-3
cuSZp finished!
cuSZp compression end-to-end speed: 410.416846 GB/s
cuSZp decompression end-to-end speed: 627.771706 GB/s
cuSZp compression ratio: 22.328032
Pass error check!
$ ./cuSZp -i xx.f32 -t f32 -m outlier -eb rel 1e-4
cuSZp finished!
cuSZp compression end-to-end speed: 341.494199 GB/s
cuSZp decompression end-to-end speed: 412.994599 GB/s
cuSZp compression ratio: 6.277573
Pass error check!
$ ./cuSZp -i acd-tst.bin.d64 -t f64 -m outlier -eb rel 0.0001
cuSZp finished!
cuSZp compression end-to-end speed: 552.982586 GB/s
cuSZp decompression end-to-end speed: 550.156870 GB/s
cuSZp compression ratio: 13.737053
Pass error check!
You will also see two binaries generated this folder, named cuSZp_test_f32
and cuSZp_test_f64
. They are used for functionality test and can be executed by ./cuSZp_test_f32
and ./cuSZp_test_f64
.
If you want to use cuSZp as a C/C++ interal API, there are two ways.
-
Use the cuSZp generic API.
// Other headers. #include <cuSZp.h> int main (int argc, char* argv[]) { // Other code. // For measuring the end-to-end throughput. TimingGPU timer_GPU; // cuSZp compression for device pointer. timer_GPU.StartCounter(); // set timer cuSZp_compress(d_oriData, d_cmpBytes, nbEle, &cmpSize, errorBound, dataType, encodingMode, stream); float cmpTime = timer_GPU.GetCounter(); // cuSZp decompression for device pointer. timer_GPU.StartCounter(); // set timer cuSZp_decompress(d_decData, d_cmpBytes, nbEle, cmpSize, errorBound, dataType, encodingMode, stream); float decTime = timer_GPU.GetCounter(); // Other code. return 0; }
Here,
d_oriData
,d_cmpBytes
, andd_decData
are device pointers (array on GPU), representing original data, compressed byte, and reconstructed data, respectively.dataType
andencodingMode
can be defined as below:cuszp_type_t dataType = CUSZP_TYPE_FLOAT; // or CUSZP_TYPE_DOUBLE cuszp_mode_t encodingMode = CUSZP_MODE_PLAIN; // or CUSZP_MODE_OUTLIER
A detailed example can be seen in
cuSZp/examples/cuSZp.cpp
. -
Use a specific encoding mode and floating-point data type (f32 or f64).
#include <cuSZp.h> // Still the only header you need. // Compression and decompression for float type data array using plain mode. cuSZp_compress_plain_f32(d_oriData, d_cmpBytes, nbEle, &cmpSize, errorBound, stream); cuSZp_decompress_plain_f32(d_decData, d_cmpBytes, nbEle, cmpSize, errorBound, stream); // In this case, d_oriData and d_Decdata are float* array on GPU.
Other modes and data type usages
#include <cuSZp.h> // Still the only header you need. // Compression and decompression for float type data array using outlier mode. cuSZp_compress_outlier_f32(d_oriData, d_cmpBytes, nbEle, &cmpSize, errorBound, stream); cuSZp_decompress_outlier_f32(d_decData, d_cmpBytes, nbEle, cmpSize, errorBound, stream); // In this case, d_oriData and d_Decdata are float* array on GPU. // Compression and decompression for double type data array using plain mode. cuSZp_compress_plain_f64(d_oriData, d_cmpBytes, nbEle, &cmpSize, errorBound, stream); cuSZp_decompress_plain_f64(d_decData, d_cmpBytes, nbEle, cmpSize, errorBound, stream); // In this case, d_oriData and d_Decdata are double* array on GPU. // Compression and decompression for double type data array using outlier mode. cuSZp_compress_outlier_f64(d_oriData, d_cmpBytes, nbEle, &cmpSize, errorBound, stream); cuSZp_decompress_outlier_f64(d_decData, d_cmpBytes, nbEle, cmpSize, errorBound, stream); // In this case, d_oriData and d_Decdata are double* array on GPU.
In this case, you do not need to set
cuszp_type_t
andcuszp_mode_t
. More detaild examples can be found incuSZp/examples/cuSZp_test_f32.cpp
andcuSZp/examples/cuSZp_test_f64.cpp
.
cuSZp also supports Python bindings for fast compression on GPU array.
Examples can be found in cuSZp/python/
.
The required Python packages include ctypes, numpy, pycuda
. pytorch
is optional unless you want to use cuSZp compress a torch
tensor.
We provide two examples for showing how cuSZp can be used to compress/decompress a HPC field in numpy format (see python/example-hpc.py
) and a torch tensor (see python/example-torch.py
).
To execute them:
# This shows cuSZp compresses a HPC field with f32 format.
python example-hpc.py ./pressure_3000 f32
# This shows cuSZp compresses a 4 GB f32 torch tensor.
python example-torch.py
cuSZp Python API also preseves very high throughput. Taking compressing and decompressing 4 GB torch tensor on NVIDIA A100 GPU as an example. Similarly, we measure throughput in an end-to-end manner (e.g. following code block shows compression measurement).
compressor = cuSZp()
# cuSZp compression.
start_time = time.time() # set cuSZp timer start
compressed_size = compressor.compress(
ctypes.c_void_p(data.data_ptr()), # Input data pointer on GPU
ctypes.c_void_p(int(d_cmpBytes)), # Output buffer on GPU
data.numel(), # Number of elements
1E-2, # Set 1E-2 as error bound.
data_type=0, # float 32, 1 for float64 (i.e. double)
mode=0 # Plain mode, 1 for outlier mode
)
compression_time = time.time() - start_time # set cuSZp timer end
This throughput measurement can be shown as below:
$ python example-torch.py
Original data size: 4294967296 bytes
Compressed data size: 971519248 bytes
Compression Ratio: 4.42
Compression Throughput: 214.07 GB/s
Decompression Throughput: 345.08 GB/s
Decompressed data matches original within error bound: True
A more detailed documentation with some intrinsic usage descriptions will be updated soon.
cuSZp was developed and contributed by following authors.
- Developers: Yafan Huang (kernels, entries, and examples), Sheng Di (utility).
- Contributors: Franck Cappello, Xiaodong Yu, Robert Underwood, Guanpeng Li.
If you find cuSZp is useful, following two papers can be considered for citing.
- [SC'23] cuSZp: An Ultra-fast GPU Error-bounded Lossy Compression Framework with Optimized End-to-End Performance
@inproceedings{huang2023cuszp, title={cuSZp: An Ultra-fast GPU Error-bounded Lossy Compression Framework with Optimized End-to-End Performance}, author={Huang, Yafan and Di, Sheng and Yu, Xiaodong and Li, Guanpeng and Cappello, Franck}, booktitle={Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis}, pages={1--13}, year={2023} }
- [SC'24] cuSZp2: A GPU Lossy Compressor with Extreme Throughput and Optimized Compression Ratio
@inproceedings{huang2024cuszp2, title={cuSZp2: A GPU Lossy Compressor with Extreme Throughput and Optimized Compression Ratio}, author={Huang, Yafan and Di, Sheng and Li, Guanpeng and Cappello, Franck}, booktitle={Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis}, pages={1--18}, year={2024} }
The [SC'23] paper proposes the cuSZp compression framework with kernel fusion (a.k.a. cuSZp1, code see Release or Commit), while the [SC'24] paper includes new lossless encoding modes and several performance optimization (a.k.a. cuSZp2, code see Release or Commit).
(C) 2023 by Argonne National Laboratory and University of Iowa. For more details see COPYRIGHT.
This research was supported by the Exascale Computing Project (ECP), Project Number: 17-SC-20-SC, a collaborative effort of two DOE organizations – the Office of Science and the National Nuclear Security Administration, responsible for the planning and preparation of a capable exascale ecosystem, including software, applications, hardware, advanced system engineering, and early testbed platforms, to support the nation’s exascale computing imperative. The material was supported by the U.S. Department of Energy, Office of Science, Advanced Scientific Computing Research (ASCR), under contract DE-AC02-06CH11357, and supported by the National Science Foundation under Grant OAC-2003709, OAC-2104023, and OAC-2311875. We acknowledge the computing resources provided on Bebop (operated by Laboratory Computing Resource Center at Argonne) and on Theta and JLSE (operated by Argonne Leadership Computing Facility). We acknowledge the support of ARAMCO.