Options to read & decompress data in parallel #340
base: master
Conversation
Some benchmarks 🐎 I selected AGIPD runs with ~200 cells and loaded the first 1000 trains of a single module into memory.
This is loading a compressed uint16 AGIPD module from p3025 with 200 cells:
(same but with a linear scale)
And loading an uncompressed float32 AGIPD module from p3046 with 202 cells:
(same but with a linear scale)
Amusingly, it's faster to load the uncompressed data despite it being ~80.5GB on disk compared to ~2.34GB of compressed data 🙃 (both proc/ directories still being on GPFS).
But still a huge improvement 🎉
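For context, a rough sketch of how a timing comparison like this could be run with the new parameters. The run number and source name below are placeholders, not the exact datasets used above.

```python
# Rough benchmark sketch: time loading the first 1000 trains of one AGIPD
# module with different reader process counts. Run number and source name
# are placeholders.
import time
from extra_data import open_run, by_index

run = open_run(proposal=3025, run=1, data='proc').select_trains(by_index[:1000])
kd = run['SPB_DET_AGIPD1M-1/DET/4CH0:xtdf', 'image.data']

for read_procs in (1, 2, 4, 8, 16):
    t0 = time.perf_counter()
    arr = kd.ndarray(read_procs=read_procs, decomp_threads=-1)
    print(f"read_procs={read_procs}: {time.perf_counter() - t0:.1f} s "
          f"for {arr.nbytes / 1e9:.1f} GB")
```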
# Based on _alloc function in pasha
def zeros_shared(shape, dtype):
Could we expose this? Or add something like KeyData.allocate_out(shared=True)?
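For illustration, one way a helper like this can work - a minimal sketch backed by an anonymous mmap, not necessarily what pasha's _alloc or this PR actually does:

```python
# Minimal sketch of a shared zero-filled allocation: back a NumPy array with
# an anonymous mmap, so forked reader processes see the same pages.
# Not necessarily the implementation used in this PR.
import mmap
import numpy as np

def zeros_shared(shape, dtype):
    dtype = np.dtype(dtype)
    nbytes = int(np.prod(shape)) * dtype.itemsize
    buf = mmap.mmap(-1, nbytes)  # anonymous mapping, shared with forked children
    # Anonymous mmap pages start out zeroed, so no explicit fill is needed.
    return np.frombuffer(buf, dtype=dtype).reshape(shape)
```

An allocate_out(shared=True) helper as suggested above could presumably just wrap something like this.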
return out
def ndarray(self, roi=(), out=None, read_procs=1, decomp_threads=1):
This adds read_procs and decomp_threads parameters to KeyData.ndarray() and .xarray(). They control the number of processes used to read data from HDF5 files, and the number of threads used to decompress data in the specific pattern we use for gain/mask datasets in 2D data. They both default to 1, i.e. the status quo, and we avoid launching separate processes/threads when they're 1.

Testing with ~55 GB of JUNGFRAU data, I got a better-than-2x speedup reading uncompressed data with 10 processes (~1 minute -> ~24 seconds), and something like a 10x speedup reading compressed data with decomp_threads=-1, i.e. 1 thread per core, on a 72-core node (~1 min 40 s -> ~10 s). The timings are pretty variable - AFAICT, filesystem access always is.
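To make that concrete, usage would look something like this. The proposal/run and source/key names are made up for illustration:

```python
# Illustrative only: proposal/run and source/key names are placeholders.
from extra_data import open_run

run = open_run(proposal=2566, run=10, data='proc')
jf = run['SPB_IRDA_JF4M/DET/JNGFR01:daqOutput', 'data.adc']

# Uncompressed data: read with 10 worker processes.
data = jf.ndarray(read_procs=10)

# Compressed gain/mask-style data: one decompression thread per core.
gain = run['SPB_IRDA_JF4M/DET/JNGFR01:daqOutput', 'data.gain'].xarray(decomp_threads=-1)
```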
The read_procs option is kind of incompatible with passing in an out array, because the array needs to be in shared memory. I'm not sure how to deal with that in the API - we could reject using out and read_procs together, but you could also pass in an array in shared memory, and I don't know of any way to check for that.

Future work:
Closes #49.