Picklable Interface to save state #24
Hello @mohammadhzp, right now the xxhash object isn't picklable. I agree with you that it's necessary to provide save/load APIs for the xxhash state.
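For context, here is a minimal sketch of the current behavior when trying to pickle a hasher (assuming the PyPI xxhash package; the exact exception message may differ between Python and xxhash versions):

import pickle
import xxhash

h = xxhash.xxh64()
h.update(b'partial data')
try:
    pickle.dumps(h)
except TypeError as ex:
    # C extension types without pickle support (__reduce__ / __getstate__) cannot be pickled
    print('not picklable yet:', ex)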
I would also really like this feature. My use case is a partial hash of files: I hash the first N bytes of each file, then continue hashing the full contents only for the files whose partial hashes collide. Using xxhash in parallel processes is much faster when the files live on an HDD. Benchmark:

from os.path import join, basename, exists
import ubelt as ub
import random
import string


def _demodata_files(dpath=None, num_files=10, pool_size=3, size_pool=None):
    def _random_data(rng, num):
        return ''.join([rng.choice(string.hexdigits) for _ in range(num)])

    def _write_random_file(dpath, part_pool, size_pool, rng):
        namesize = 16
        # Choose 1, 4, or 16 parts of data
        num_parts = rng.choice(size_pool)
        chunks = [rng.choice(part_pool) for _ in range(num_parts)]
        contents = ''.join(chunks)
        fname_noext = _random_data(rng, namesize)
        ext = ub.hash_data(contents)[0:4]
        fname = '{}.{}'.format(fname_noext, ext)
        fpath = join(dpath, fname)
        with open(fpath, 'w') as file:
            file.write(contents)
        return fpath

    if size_pool is None:
        size_pool = [1, 4, 16]
    dpath = ub.ensure_app_cache_dir('pfile/random')
    rng = random.Random(0)
    # Create a pool of random chunks of data
    chunksize = 65536
    part_pool = [_random_data(rng, chunksize) for _ in range(pool_size)]
    # Write num_files random files that have a reasonable collision probability
    fpaths = [_write_random_file(dpath, part_pool, size_pool, rng)
              for _ in ub.ProgIter(range(num_files), desc='write files')]
    for fpath in fpaths:
        assert exists(fpath)
    return fpaths


def benchmark():
    import timerit
    import ubelt as ub
    from kwcoco.util.util_futures import JobPool  # NOQA
    ti = timerit.Timerit(3, bestof=1, verbose=2)
    max_workers = 4
    # Choose a path to an HDD
    dpath = ub.ensuredir('/raid/data/tmp')
    fpath_demodata = _demodata_files(dpath=dpath, num_files=1000,
                                     size_pool=[10, 20, 50], pool_size=8)

    for timer in ti.reset('hash_file(hasher=xx64)'):
        with timer:
            for fpath in fpath_demodata:
                ub.hash_file(fpath, hasher='xx64')

    for timer in ti.reset('hash_file(hasher=xxhash) - serial'):
        # jobs = JobPool(mode='thread', max_workers=2)
        jobs = JobPool(mode='serial', max_workers=max_workers)
        with timer:
            for fpath in fpath_demodata:
                jobs.submit(ub.hash_file, fpath, hasher='xxhash')
            results = [job.result() for job in jobs.jobs]

    for timer in ti.reset('hash_file(hasher=xxhash) - thread'):
        # jobs = JobPool(mode='thread', max_workers=2)
        jobs = JobPool(mode='thread', max_workers=max_workers)
        with timer:
            for fpath in fpath_demodata:
                jobs.submit(ub.hash_file, fpath, hasher='xx64')
            results = [job.result() for job in jobs.jobs]

    for timer in ti.reset('hash_file(hasher=xxhash) - process'):
        # jobs = JobPool(mode='thread', max_workers=2)
        jobs = JobPool(mode='process', max_workers=max_workers)
        with timer:
            for fpath in fpath_demodata:
                jobs.submit(ub.hash_file, fpath, hasher='xx64')
            results = [job.result() for job in jobs.jobs]


if __name__ == '__main__':
    benchmark()

Results are:
Of course this does not account for the overhead that would be required to pickle an existing hasher, but I think that should be minimal compared to the cost of reading files in a single loop.

If an instance of a hasher object was able to return some sort of pickleable "state" object, and a new instance of a hasher object could then be created at that state, that would be sufficient to achieve pickleability. Looking at the code, I think this would involve exposing the internal xxhash state.

For my reference I'm copying the info about the xxhash state that I found. It looks like it's somewhat complicated. I'm not sure how straightforward it is to expose a pickleable version in Python and also provide a binding to access / create a new hasher from a state. But it is just a collection of 32-bit integers, so it should be possible.

typedef uint32_t XXH32_hash_t;
struct XXH32_state_s {
   XXH32_hash_t total_len_32; /*!< Total length hashed, modulo 2^32 */
   XXH32_hash_t large_len;    /*!< Whether the hash is >= 16 (handles @ref total_len_32 overflow) */
   XXH32_hash_t v1;           /*!< First accumulator lane */
   XXH32_hash_t v2;           /*!< Second accumulator lane */
   XXH32_hash_t v3;           /*!< Third accumulator lane */
   XXH32_hash_t v4;           /*!< Fourth accumulator lane */
   XXH32_hash_t mem32[4];     /*!< Internal buffer for partial reads. Treated as unsigned char[16]. */
   XXH32_hash_t memsize;      /*!< Amount of data in @ref mem32 */
   XXH32_hash_t reserved;     /*!< Reserved field. Do not read or write to it, it may be removed. */
};  /* typedef'd to XXH32_state_t */
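To illustrate the "just a collection of 32-bit integers" point, here is a rough Python sketch of how such a state could be packed into, and restored from, a bytes blob. The field order follows the struct above; the helper names (pack_xxh32_state, unpack_xxh32_state) and the assumption that the bindings would hand these integers to Python are hypothetical, not an existing API:

import struct

# 12 little-endian uint32 values, in struct order:
# total_len_32, large_len, v1..v4, mem32[0..3], memsize, reserved  -> 48 bytes
_XXH32_STATE = struct.Struct('<12I')

def pack_xxh32_state(fields):
    """Serialize the 12 uint32 fields of XXH32_state_s into 48 bytes (hypothetical helper)."""
    return _XXH32_STATE.pack(*fields)

def unpack_xxh32_state(blob):
    """Recover the 12 uint32 fields from a 48-byte blob (hypothetical helper)."""
    return _XXH32_STATE.unpack(blob)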
OK, I'll try to add this feature in my spare time.
If you want help, I can also donate some time to this. If you lay out the general structure, I can help with filling things out and testing.
@Erotemic glad you can help, welcome! On the Python side, we need to implement the pickle protocol on the hasher objects.
Hmm, so I think this is short-term doable without a public API provided by xxHash, but that means any xxHash implementation change might break the bindings. It might be worth submitting an issue there so they can provide us with a public API that is guaranteed not to change. Assuming we have either a public API or direct struct access that returns the packed state, then to implement pickling we would expose that state to Python and add a way to construct a new hasher from it. Does this seem reasonable? Am I missing any C-specific details?
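As a concrete sketch of what the end result could look like from Python, a pickle round-trip would let the partial-hash workflow pause and resume. This assumes the feature being discussed here; none of it works with the current xxhash release:

import pickle
import xxhash

# Hash the first part of a file, then resume later (possibly in another process).
h = xxhash.xxh32()
h.update(b'first N bytes of the file')

blob = pickle.dumps(h)        # would rely on the proposed pickle support
resumed = pickle.loads(blob)  # reconstructs a hasher at the same internal state
resumed.update(b'rest of the file')
print(resumed.hexdigest())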
It seems reasonable to me.
I remember that xxHash changed its state from a concrete struct to an opaque struct a long time ago, so I don't think there will be a public API. The packed state would have to be read from the struct by the bindings themselves.
Just chiming in that a pickleable xxhash object would be very useful. Thank you for working on this important feature.
Hello, is it possible to make xxhash picklable? How?
It would really be great if it were picklable; that way we could save the hash state and resume hashing later on.
I spent a few hours on this but had no luck.