Newer dask/distributed versions take very long to start computation. #2
Comments
I just ran the example again and the cluster started showing activity fairly quickly, so I am not entirely sure the above example is the best to expose this behavior, but perhaps you still have an idea what could be causing this? |
Thanks for the nice example snippet @jbusecke!
Nothing immediately pops out to me, but there has been a lot of recent work on […]. I tried the example locally on my laptop with two changes (using […]). A couple of things immediately come to mind:
To test whether or not transmitting the graph to the scheduler is a large issue, could you try turning off low-level task fusion? Instead of

ds.to_zarr(mapper)

do

with dask.config.set({"optimization.fuse.active": False}):
    ds.to_zarr(mapper)

This should hopefully result in a much smaller graph getting sent over the wire to the scheduler. Additionally, I see there have been recent releases of […]. |
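For reference, a minimal end-to-end sketch of the suggestion above (the synthetic dataset and the bucket path are placeholders, not the setup from the original example):

import dask
import dask.array as da
import fsspec
import xarray as xr

# Small synthetic dataset standing in for the model output
ds = xr.Dataset(
    {"temp": (("time", "y", "x"),
              da.random.random((100, 500, 500), chunks=(10, 500, 500)))}
)

# Placeholder target; the original example wrote to the Pangeo scratch bucket
mapper = fsspec.get_mapper("gs://<your-bucket>/fusion-test.zarr")

# Turn off low-level task fusion so a much smaller graph is sent to the scheduler
with dask.config.set({"optimization.fuse.active": False}):
    ds.to_zarr(mapper, mode="w")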
Stepping back a bit, I suspect there will be times in the future when you encounter issues while running on Pangeo resources, and it will be useful for others to try to reproduce them. Does Pangeo have any publicly accessible resources we could use to try to reproduce the issues you run into? I know there's Pangeo Cloud and Pangeo's BinderHub, but I don't have a good sense of whether these are appropriate for this use case. |
Also cc @ian-r-rose |
I did run these on the Pangeo Cloud. It only requires a sign-up. Would this be a good place for all of us to have the same playing field? |
Thank you very much for the suggestions. Will try them now. |
Great! I'll sign up now. Time to dust off my old ORCID... |
Using

with dask.config.set({"optimization.fuse.active": False}):
    ds.to_zarr(mapper)

indeed cut the wait time down from ~4 min to less than 1 min! I'll try that in my full-blown workflow to see if it has a similar effect. |
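One rough way to see where the startup time goes (a sketch reusing the hypothetical ds and mapper from the snippet above) is to time graph construction separately from submission and execution:

import time
import dask

with dask.config.set({"optimization.fuse.active": False}):
    t0 = time.perf_counter()
    delayed_store = ds.to_zarr(mapper, mode="w", compute=False)  # build the graph only
    t1 = time.perf_counter()
    dask.compute(delayed_store)  # submit to the scheduler and execute
    t2 = time.perf_counter()

print(f"build graph: {t1 - t0:.1f}s  submit + execute: {t2 - t1:.1f}s")

The delay before anything shows up in the task stream includes graph construction, optimization, and transmission to the scheduler, so splitting the timing like this helps narrow down which part dominates.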
I'm curious about the tradeoffs of bypassing optimizations. They might make the computation start faster...but will it run slower? |
I didn't run them to completion, but will now 😁 |
This is a great question to ask! In general things will be slower. Specifically, here are all the array optimizations that are skipped (https://github.com/dask/dask/blob/8663c6b7813fbdcaaa85d4fdde04ff42b1bb6ed0/dask/array/optimization.py#L53-L76) when "optimization.fuse.active" is turned off. Exactly how much slower things are depends on the particular computation -- though I suspect the last optimization, optimize_slices, is particularly useful for common Xarray workloads.
Either moving these optimizations to be at the HighLevelGraph level (similar to the cull optimization here: https://github.com/dask/dask/blob/8663c6b7813fbdcaaa85d4fdde04ff42b1bb6ed0/dask/highlevelgraph.py#L801), or removing the need for a particular optimization altogether with improvements in the distributed scheduler, are part of the ongoing scheduler performance improvement effort (Matt gave a recent talk on this topic, https://www.youtube.com/watch?v=vZ3R1DxTwbA&t, and here's a blog post, https://blog.dask.org/2020/07/21/faster-scheduling, which outlines the main parts of these efforts). Ultimately we want to remove the need for the "optimization.fuse.active" config option, but we're not there yet.
I was mostly interested in turning off "optimization.fuse.active" to get a sense for how much of a bottleneck graph transmission from the client to the scheduler is or isn't. |
The biggest thing that you'll miss from losing fusion is probably slicing fusion. Dask/Xarray co-evolved a lot of logic to allow slicing on HDF5/NetCDF files to only read in what was necessary. If you're doing full-volume or at least full-chunk data processing then I don't think that you're likely to miss much.
|
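To make the slicing-fusion point concrete, here is a sketch of the kind of access pattern it helps with (the file name, variable, and index ranges are hypothetical):

import xarray as xr

# Open a NetCDF file lazily with dask chunks
ds = xr.open_dataset("model_output.nc", chunks={"time": 10})

# With slicing fusion, the selection below gets fused into the read tasks,
# so only the chunks that overlap this small window are read from disk
window = ds["temp"].isel(time=slice(0, 5), x=slice(100, 200), y=slice(100, 200))
print(window.mean().compute())

# Full-volume work (e.g. writing the whole dataset back out) touches every
# chunk anyway, so disabling fusion costs little in that case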
I'm curious, does this problem still persist when turning off fusion? If there is something else going on here then I'd like to get to the bottom of it. If not then I would encourage this group to start operating without fusion (I think that you'll be ok) and we can work towards making that the default on our end. |
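For anyone who wants to try operating without fusion by default rather than wrapping every call, a sketch using dask's standard configuration mechanisms (the YAML and environment-variable forms below follow dask's usual config conventions):

import dask

# Set once at the top of a script or notebook; outside a context manager this
# applies for the rest of the session, so every later compute/persist/to_zarr
# call skips low-level task fusion
dask.config.set({"optimization.fuse.active": False})

# The same setting can live in a dask config YAML file (e.g. ~/.config/dask/)
# or be exported as the environment variable DASK_OPTIMIZATION__FUSE__ACTIVE=False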
I haven't had time to get back to those test cases yet. In other workflows I have not really noticed this anymore, but I'll try to confirm soonish (backed up by paper revisions this/next week). |
Hi everyone,
here is a relatively recent issue that puzzles me (and prevents me from upgrading to the latest dask/distributed versions).
For large computations, it can take very long until any computation is "started", as judged from nothing happening in the task stream/ProgressBar (for the threaded scheduler).
This example (which mimics part of my typical workload with high-resolution ocean model output) has been showing nothing for several minutes now (about 8-10 minutes at the time of writing).
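A synthetic dataset along these lines might look roughly like the following (dimension sizes, variable names, and chunking are illustrative, not the original values):

import dask.array as da
import xarray as xr

# Illustrative stand-in for high-resolution ocean model output:
# daily 2D fields on a large horizontal grid, one time step per chunk
shape = (365, 2000, 3000)   # (time, y, x)
chunks = (1, 2000, 3000)

ds = xr.Dataset(
    {
        "sst": (("time", "y", "x"), da.random.random(shape, chunks=chunks)),
        "ssh": (("time", "y", "x"), da.random.random(shape, chunks=chunks)),
    }
)
print(ds.nbytes / 1e9, "GB")  # rough size check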
Then I set up an adaptive dask gateway cluster.
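A typical adaptive dask-gateway setup looks roughly like this (the worker bounds are placeholders, not the values from the original post):

from dask_gateway import Gateway

gateway = Gateway()                   # use the deployment's default gateway
cluster = gateway.new_cluster()
cluster.adapt(minimum=1, maximum=20)  # placeholder adaptive bounds
client = cluster.get_client()         # attach a client so work runs on the cluster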
I am then trying to write this to the Pangeo scratch bucket.
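The write itself would look roughly like this (the bucket path is a placeholder; the exact scratch location depends on the deployment), with ds and the cluster as set up above:

import fsspec

# Placeholder path for the Pangeo scratch bucket
mapper = fsspec.get_mapper("gs://pangeo-scratch/<username>/startup-test.zarr")

# This is the step that showed no activity for many minutes
ds.to_zarr(mapper, mode="w")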
I am running this on the Pangeo Google deployment with a "Large" server.
My versions are: dask 2021.01.1, distributed 2021.01.1.
I realize that these datasets are quite large, but they are by no means unrealistic for modern climate/earth system models.
I originally noticed this behavior in one of my research projects when I upgraded from the 2020.12.0 version to the latest release (I believe 2021.02.x), and it led me to manually downgrade to get my workflow running, since nothing would happen even after I waited for 30+ minutes.