Add support for dropped connections #64
Comments
@wjcunningham7 I think the issue we were discussing yesterday is either related to or is this one. The plan is: (1) two new constructor inputs for (2) Add reconnect attempts in the polling loop. Looking at lines 511-535 in

    async def _poll_slurm(self, job_id: int, conn: asyncssh.SSHClientConnection) -> None:
        """Poll a Slurm job until completion.

        Args:
            job_id: Slurm job ID.
            conn: SSH connection object.

        Returns:
            None
        """
        # Poll status every `poll_freq` seconds
        status = await self.get_status({"job_id": str(job_id)}, conn)
        while (
            "PENDING" in status
            or "RUNNING" in status
            or "COMPLETING" in status
            or "CONFIGURING" in status
        ):
            await asyncio.sleep(self.poll_freq)
            status = await self.get_status({"job_id": str(job_id)}, conn)
        if "COMPLETED" not in status:
            raise RuntimeError("Job failed with status:\n", status)

I assume there is something we can take from `async def _client_connect(self) -> asyncssh.SSHClientConnection:`
Further to the above, the thing to track is the output of |
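For discussion, here is a rough sketch of what "reconnect attempts in the polling loop" might look like. It is not the plugin's code: `poll_with_reconnect`, `max_reconnect_attempts`, `reconnect_wait`, and `retry_on` are hypothetical names made up for illustration, and the status and connect callables stand in for `get_status` and `_client_connect`. The default of retrying on `OSError` is also an assumption; a real version would probably need to include whatever exception the SSH library raises when a connection is lost.

```python
import asyncio
from typing import Any, Awaitable, Callable, Tuple, Type

# States that mean the job is still in flight (same list as in _poll_slurm).
ACTIVE_STATES = ("PENDING", "RUNNING", "COMPLETING", "CONFIGURING")


async def poll_with_reconnect(
    get_status: Callable[[Any], Awaitable[str]],  # stand-in for self.get_status
    connect: Callable[[], Awaitable[Any]],        # stand-in for self._client_connect
    conn: Any,
    poll_freq: float = 30.0,
    max_reconnect_attempts: int = 5,   # hypothetical constructor input
    reconnect_wait: float = 10.0,      # hypothetical constructor input
    retry_on: Tuple[Type[BaseException], ...] = (OSError,),  # assumed error types
) -> str:
    """Poll until the job leaves the active states, reconnecting on failure.

    Returns the final status string; the caller decides whether it means
    success (e.g. by checking for "COMPLETED").
    """
    failures = 0
    while True:
        try:
            status = await get_status(conn)
            failures = 0  # a successful query resets the retry budget
        except retry_on:
            failures += 1
            if failures > max_reconnect_attempts:
                raise  # give up and surface the last connection error
            await asyncio.sleep(reconnect_wait)
            try:
                conn = await connect()
            except retry_on:
                # Reconnect failed; keep the old handle so the next status
                # query fails again and is counted against the budget.
                pass
            continue
        if not any(state in status for state in ACTIVE_STATES):
            return status
        await asyncio.sleep(poll_freq)
```

With something like this, a dropped connection would fail the job with an error after a bounded number of retries instead of leaving the workflow "running" forever.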
What should we add?
If the server connection is halted, in-progress workflows remain "running" indefinitely. A nice feature would be to add some sort of support for dropped connections or server restarts.
From Will:
Tagging @utf who had the question/suggestion in the first place.
Describe alternatives you've considered.
You can redispatch if needed.