Load tiles in parallel on workers and add options to TissueDetectionHE
#336
base: dev
Conversation
This looks awesome - glad that you've been able to make headway on figuring out the performance issues with dask!! My only Q is that right now we have the
There's no issue with using `dask.delayed`:

```python
>>> import dask
>>> delayed_value = dask.delayed(1)
>>> delayed_value.compute()
1
```

I've dug a bit more into the memory leaks and there are still issues with this PR. The only reliable way I've found to clear out memory on workers is to restart the client, as suggested in Aggressively Clearing Data. This will at least prevent memory leaks from one slide affecting the processing of the next slide, but a large slide may still run into memory issues while it is being processed. I left a TODO where I think changes should happen to free memory as tiles are processed, along with some of the attempts that did not work:

```python
# TODO: Free memory used for tile
# All of these still leave unmanaged memory on each worker.
# Each in-memory future holding a Tile shows a size of 48 bytes on the Dask dashboard,
# which clearly does not include image data.
# Could it be that loaded image data is somehow not being garbage collected with Tiles?
# future.release()
# future.cancel()
# del result
# del future
```

This PR still provides improvements from loading tiles in parallel, but these changes are not yet enough to address the memory leaks identified in #211 and #299.
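For illustration, a minimal sketch of the client-restart workaround described above (the `process_slide` helper and the slide list are hypothetical, not from this PR):

```python
from distributed import Client

def process_slide(slide_path, client):
    ...  # hypothetical per-slide work (run the pipeline, write the h5path)

client = Client()  # local Dask cluster, for illustration
slide_paths = ["slide_1.svs", "slide_2.svs"]  # placeholder inputs

for slide_path in slide_paths:
    process_slide(slide_path, client)
    # Restarting the client clears all worker memory between slides, per the
    # "Aggressively Clearing Data" guidance in the Dask docs. It does not help
    # with memory growth while a single large slide is being processed.
    client.restart()
```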
Now that we merged "merge-master" into dev (#334), should we change this PR to merge into dev?
Just deleted the merged branch, which automatically switches these PRs to merge into dev.
This contains two separate improvements:

- Add `drop_empty_tiles` and `keep_mask` options to the `TissueDetectionHE` transform to bypass saving tiles with no detected H&E tissue and to bypass saving masks
- Use `dask.delayed` to avoid loading images on the main thread

The first part is both for convenience and performance. It's possible to generate all tiles and then filter out the empty tiles and remove the masks before writing the h5path to disk, but that requires that all the tiles be added to the `Tiles`, which takes IO time. If these tiles and masks are never saved, even to in-memory objects, processing can finish faster.
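For example, a hedged sketch of how the new options might be used (the surrounding pipeline wiring and filenames are assumptions for illustration, not verified against the final API):

```python
from pathml.core import HESlide
from pathml.preprocessing import Pipeline, TissueDetectionHE

# drop_empty_tiles / keep_mask are the options added in this PR;
# the other arguments and the defaults shown are assumptions.
pipeline = Pipeline([
    TissueDetectionHE(
        mask_name="tissue",
        drop_empty_tiles=True,  # don't save tiles with no detected tissue
        keep_mask=False,        # don't save the tissue mask with each tile
    )
])

wsi = HESlide("example_slide.svs")  # placeholder filename
wsi.run(pipeline)  # empty tiles and masks are never materialized
```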
The second part addresses a core performance issue with distributed processing; I believe it's relevant to #211 and #299. When processing tiles, I've found that loading time >> processing time, and currently tile image data is loaded on the main thread, which then scatters the loaded tile to workers. This prevents any parallelism, as all but one worker are always waiting for the main thread to load data and send them a tile.

Additionally, because all tiles have to be loaded on the main thread, the block that generates the futures has to load all tiles and send them all to workers before ANY tile can be added to the `Tiles` and the memory can be freed in the next block, causing the dramatic memory leaks seen in #211.
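To make the bottleneck concrete, here is a simplified sketch of that eager pattern (not the exact PathML code; `load_region` is a stand-in for the backend's read call):

```python
import numpy as np
from distributed import Client

def load_region(slide_path, coords):
    # Stand-in for the backend read (e.g. an OpenSlide region read);
    # returns the tile's image data.
    return np.zeros((256, 256, 3), dtype=np.uint8)

client = Client()
slide_path = "example_slide.svs"
coords_list = [(0, 0), (0, 256), (256, 0)]  # assumed tile coordinates

# Eager pattern: the main thread performs ALL the I/O before any tile
# reaches a worker, so workers sit idle and every loaded image stays
# resident in main-thread memory until it is scattered.
futures = [client.scatter(load_region(slide_path, xy)) for xy in coords_list]
```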
I've used `dask.delayed` to prevent reading from the input file until the image is accessed on the worker. The code that accesses the file and loads the image can now be run by each worker in parallel. To preserve the parallelism, we have to take care not to access and load `tile.image` on the main thread before loading it on the worker, or to at least wrap accesses in `dask.delayed` as in `SlideData.generate_tiles`.
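A minimal sketch of the delayed pattern, continuing the previous sketch (simplified relative to the actual `SlideData.generate_tiles` implementation):

```python
import dask

# Build lazy tasks on the main thread; no file I/O happens here.
# `load_region`, `slide_path`, `coords_list`, and `client` are as above.
delayed_images = [dask.delayed(load_region)(slide_path, xy) for xy in coords_list]

# Each worker executes load_region itself, reading regions in parallel
# instead of waiting on the main thread to feed it data.
futures = client.compute(delayed_images)
```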
I had some issues with the backends not being picklable. The `Backend` has to be sent to each worker so it has access to the code that interfaces with the filesystem. I changed `Backend` file-like attributes to be lazily evaluated with the `@property` decorator.
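A sketch of that lazy-attribute pattern (an illustrative class, not the actual PathML backend; assumes openslide-python for the file handle):

```python
import openslide

class LazyBackend:
    def __init__(self, filename):
        self.filename = filename
        self._slide = None  # no file handle is created at construction time

    @property
    def slide(self):
        # The unpicklable OpenSlide handle is opened on first access,
        # i.e. on the worker after unpickling, not on the main thread.
        if self._slide is None:
            self._slide = openslide.OpenSlide(self.filename)
        return self._slide

    def __getstate__(self):
        # Drop any open handle when pickling so the backend can be
        # shipped to workers; it is reopened lazily on next access.
        state = self.__dict__.copy()
        state["_slide"] = None
        return state
```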