Slurm: WORKDIR files overwritten on multistep-stage specs #311
Comments
Hello, if I may add, this is quite time-sensitive for the SCAILFIN project, as it is tied to the scalability tests that we promised as a deliverable for the NSF grant. That grant ends at the end of this summer, so we were hoping to do the tests this spring/early summer. @tiborsimko |
Hey there 👋🏻 With the aim of speeding things up a bit (and given that I got no response from Lukas), I created a minimal example workflow to debug the described problem. Check it out at Scailfin/reana-slurm-test. Within the repo, you can find instructions on how to run the workflow on Kubernetes and Slurm. Once you do, you will discover that Slurm runs always crash with |
Hi! I am commenting on this issue for two reasons:
I have tried to submit the madminer-workflow with Kubernetes as backend and Here I post screenshots of the failing |
Hey @irinaespejo I no longer work for CERN/REANA, so I won't be able to provide much help, but I think that using |
Hi @roksys, I am unsure whether that alone would solve the problem. Replacing If I am not mistaken, it would be as running |
This issue describes an undesirable behaviour found within the SlurmJobManagerCERN class, discovered by Carl Evans (NYU HPC) and myself (NYU CDS).

Context
We are currently trying to run a complex workflow (see madminer-workflow for reference) on REANA 0.7.3, using SLURM as the computational backend. The workflow specification is written in Yadage, and it is fully functional on REANA 0.7.1 when using Kubernetes as the computational backend.

Problem
The problem occurs with any Yadage spec that uses the multistep-stage scheduler_type value (where multiple "step-jobs" are run in parallel), when those "step-jobs" depend on scattered files to perform their computations.

In those scenarios, the SlurmJobManagerCERN._download_dir function, in addition to being somewhat inefficient (it crawls through every file and directory in the SLURM workdir, making each step scan everything that all previous steps created), overwrites the whole workflow WORKDIR at the start of each "step-job".

We recently raised concerns about this behaviour on the REANA Mattermost channel (precisely here), where we thought the problem was due to the publisher_type within the Yadage specification. It turns out that was not the case; instead, it is due to the scheduler_type multistep-stage value.

Testing
We did some preliminary testing to properly identify the scope of the issue.
We are fairly sure the issue is located within the SlurmJobManagerCERN._download_dir function, as we have performed some testing on a custom reana-job-controller Docker image (where we have tuned this function and hardcoded some paths to our needs), and we were able to run the complete workflow successfully ✅

Possible solution
We believe a good patch would involve reducing the scope of the SlurmJobManagerCERN._download_dir function's WORKDIR copying procedure from the "workflow" level to the "step-job" level. That way, there will not be any overwriting problems among parallel "step-jobs" within the same workflow stage.

Additional clarifications
This issue has not been detected in any of the workflows you guys use for testing because none of them use multistep-stage scheduler_type values involving files. See:

@lukasheinrich offered to create a dummy workflow to test this behaviour, but no progress has been made so far (message).
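
To make the idea under "Possible solution" a bit more concrete, here is a minimal sketch that models the two copy scopes with plain local directories. It does not use or reproduce the actual REANA code: the function names, the step_0/step_1 directory layout, and the "stale" file scenario are all assumptions made up for illustration, and the real SlurmJobManagerCERN._download_dir transfers files from a remote cluster rather than between local folders. It needs Python 3.8+ (for dirs_exist_ok).

```python
import shutil
import tempfile
from pathlib import Path


def download_workflow_level(remote_root: Path, local_root: Path) -> None:
    """Hypothetical 'workflow level' sync: copy everything, overwriting local files."""
    shutil.copytree(remote_root, local_root, dirs_exist_ok=True)


def download_step_level(remote_root: Path, local_root: Path, step_dir: str) -> None:
    """Hypothetical 'step-job level' sync: copy only one step's sub-directory."""
    shutil.copytree(remote_root / step_dir, local_root / step_dir, dirs_exist_ok=True)


def make_tree(root: Path, files: dict) -> None:
    """Create the given {relative_path: content} files under root."""
    for rel, text in files.items():
        path = root / rel
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(text)


def simulate(step_scoped: bool) -> str:
    """Return the content of step_0's output after step_1 syncs its files."""
    with tempfile.TemporaryDirectory() as tmp:
        remote, local = Path(tmp) / "remote", Path(tmp) / "local"
        # Assumed scenario: the remote side holds an outdated copy of
        # step_0's output next to step_1's own result.
        make_tree(remote, {"step_0/out.txt": "stale",
                           "step_1/out.txt": "step_1 result"})
        # The shared workspace already contains step_0's up-to-date result.
        make_tree(local, {"step_0/out.txt": "step_0 result"})
        if step_scoped:
            download_step_level(remote, local, "step_1")
        else:
            download_workflow_level(remote, local)
        return (local / "step_0/out.txt").read_text()
```

In this toy model, simulate(step_scoped=False) returns "stale" because the workflow-level copy replaces step_0's file, while simulate(step_scoped=True) returns "step_0 result" because only step_1's directory is touched. Whether the real fix can rely on one directory per step-job depends on how Yadage lays out the workdir, which this sketch simply assumes.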