
Options to read & decompress data in parallel #340

Draft
wants to merge 9 commits into master
Conversation

takluyver (Member) commented:

This adds read_procs and decomp_threads parameters to KeyData.ndarray() and .xarray(). They control the number of processes used to read data from HDF5 files, and the number of threads used to decompress data stored in the specific pattern we use for gain/mask datasets in 2D data. Both default to 1, i.e. the status quo, and we avoid launching separate processes/threads when they're 1.
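As a rough usage sketch (the run path and source/key names below are made up for illustration; RunDirectory and KeyData.ndarray() are existing extra_data API, and read_procs / decomp_threads are the new parameters from this PR):

from extra_data import RunDirectory

run = RunDirectory("/gpfs/exfel/exp/SPB/202201/p001234/proc/r0001")  # hypothetical run path

# One JUNGFRAU module; the source/key names are illustrative only
kd = run["SPB_IRDA_JF4M/DET/JNGFR01:daqOutput", "data.adc"]

# 10 reader processes; decomp_threads=-1 means one decompression thread per core
data = kd.ndarray(read_procs=10, decomp_threads=-1)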

Testing with ~55 GB of JUNGFRAU data, I got a better-than-2x speedup reading uncompressed data with 10 processes (~1 minute -> ~24 seconds), and something like a 10x speedup reading compressed data with decomp_threads=-1, i.e. 1 thread per core, on a 72-core node (~1 min 40 s -> 10 s). The timings are pretty variable - AFAICT, filesystem access always is.

The read_procs option is kind of incompatible with passing in an out array, because the array needs to be in shared memory. I'm not sure how to deal with that in the API - we could reject using out and read_procs together, but you could also pass in an array in shared memory, and I don't know of any way to check for that.
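For illustration, here is a minimal sketch of how a caller could allocate an out array backed by shared memory, using multiprocessing.shared_memory from the standard library rather than the pasha-based helper in this PR (the function name here is made up):

import numpy as np
from multiprocessing import shared_memory

def zeros_shared_example(shape, dtype):
    # Zero-filled ndarray backed by a shared memory block, so that
    # worker processes can write into it directly.
    nbytes = int(np.prod(shape)) * np.dtype(dtype).itemsize
    shm = shared_memory.SharedMemory(create=True, size=nbytes)
    arr = np.ndarray(shape, dtype=dtype, buffer=shm.buf)
    arr[:] = 0
    return arr, shm  # keep `shm` referenced (and unlink it when done) while `arr` is in use

out_arr, shm = zeros_shared_example(kd.shape, kd.dtype)  # `kd` as in the sketch above
data = kd.ndarray(out=out_arr, read_procs=10)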

Future work:

Closes #49.

@takluyver added the enhancement label on Sep 8, 2022
JamesWrigley (Member) left a comment:

Some benchmarks 🐎 I selected AGIPD runs with ~200 cells and loaded the first 1000 trains of a single module into memory.

This is loading a compressed uint16 AGIPD module from p3025 with 200 cells:
[benchmark plot, plus the same plot with a linear scale]

And loading an uncompressed float32 AGIPD module from p3046 with 202 cells:
[benchmark plot, plus the same plot with a linear scale]

Amusingly, it's faster to load the uncompressed data despite it being ~80.5 GB on disk, compared to ~2.34 GB of compressed data 🙃 (both proc/ directories are still on GPFS).

But still a huge improvement 🎉



# Based on _alloc function in pasha
def zeros_shared(shape, dtype):

Could we expose this? Or add something like KeyData.allocate_out(shared=True)?
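If something like that were added, usage might look roughly like this (allocate_out is only the name suggested above, not an existing method; kd stands for a KeyData object):

# Hypothetical API: allocate an array sized/typed for this key, backed by shared memory
out = kd.allocate_out(shared=True)
# Worker processes would then write directly into `out`
data = kd.ndarray(out=out, read_procs=10)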


return out

def ndarray(self, roi=(), out=None, read_procs=1, decomp_threads=1):

decomp_threads failed for me when loading an AGIPD module with shape (200000, 512, 128):
[screenshot of the error]

Labels: enhancement (New feature or request)
Successfully merging this pull request may close these issues:
Load data from multiple files in parallel?