Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

consumer: remove pods for deleted workflows #445

Draft
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

VMois
Copy link

@VMois VMois commented Apr 13, 2022

closes #437

@codecov-commenter
Copy link

Codecov Report

Merging #445 (38dafac) into master (ae32eb8) will decrease coverage by 0.66%.
The diff coverage is 6.66%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #445      +/-   ##
==========================================
- Coverage   68.22%   67.56%   -0.67%     
==========================================
  Files          15       15              
  Lines        1328     1341      +13     
==========================================
  Hits          906      906              
- Misses        422      435      +13     
Impacted Files Coverage Δ
reana_workflow_controller/consumer.py 35.75% <6.66%> (-2.81%) ⬇️

@VMois VMois force-pushed the handle-deleted-workflows branch from 38dafac to 7e8674f Compare April 19, 2022 08:40
f"Could not clean up not alive workflow {workflow.id_} batch pod for workflow."
f" Error: {exception}"
)
_delete_workflow_jobs(workflow)
Copy link
Author

@VMois VMois Apr 19, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a problem in this line with deciding whether to delete workflow jobs or not for two reasons:

(1) Theoretically, it is possible that when the workflow is deleted but there are a lot of messages in the jobs-status queue, its deletion will be delayed. And during this delay, the workflow pod can start executing jobs. But, I have a hard time coming up with a test scenario (unit or manual).

(2) In addition, I am not sure if this is the right place to delete jobs. It would make more sense to delete them on a shutdown of the job-controller pod. But it will require investigation on how to implement it in job-controller

WDYT?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you are only addressing the NotReady situation for the workflow engine pod, then I was not seeing many situations in production where we needed to kill also workflow step jobs... So I think we should be mostly okay here already?

The situation of surviving workflow step pods happened in the past in other kind of troubles that were not related to NotReady. I recall one recent example where workflow step pods survived in the Completed status but there was no workflow batch pod anymore. That is, the workflow batch pod apparently terminated well, but some of the workflow step pods still remained in Completed status. Have you tried to look at this situation too?

FWIW I tend to agree with you about the danger of deleting things for all not ALIVE_STATUSES because that could include ones we don't want perhaps?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point about not deleting not ALIVE_STATUSES. I should only delete workflows that have deleted status

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have you tried to look at this situation too?

Do we have an issue describing in detail this situation? I couldn't find it.

I recall one recent example where workflow step pods survived in the Completed status but there was no workflow batch pod anymore.

Step jobs are removed in job-controller when they are finished. It probably means something failed in job-controller. There is a possibility to automate the step job deletion process in Kubernetes - https://kubernetes.io/docs/concepts/workloads/controllers/ttlafterfinished/#ttl-after-finished-controller . But this is a discussion for a new issue.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we have an issue describing in detail this situation? I couldn't find it.

E.g. the same as reanahub/reana-job-controller#326. That issue is closed, but I saw the same problem also recently.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you are only addressing the NotReady situation for the workflow engine pod, then I was not seeing many situations in production where we needed to kill also workflow step jobs... So I think we should be mostly okay here already?

This is a tricky question. First of all, the issue with hanging step jobs is unlikely. So we are talking about an edge case here.

In the situation above, If you wait long enough the only thing that would be left hanging is a workflow engine pod. In reality, before that, all step jobs would run, finish, and get deleted. So it wastes the resources of the cluster.

@@ -281,7 +291,7 @@ def _update_job_cache(msg):
Session.add(cached_job)


def _delete_workflow_job(workflow: Workflow) -> None:
def _delete_workflow_batch_pod(workflow: Workflow) -> None:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please beware of the terminology when renaming. We are deleting Kuberentes "job" here in this function, not only "pod", when we are doing delete_namespaced_job(). So this rename wouldn't be fully justified.

When a workflow starts, it creates a Kubernetes job named such as 'reana-run-batch-0bf3cf48-247a-4cbf-bd5f-727cf41e5edbwhich will create a Kubernetes pod named such asreana-run-batch-0bf3cf48-247a-4cbf-bd5f-727cf41e5edb-hz46p` where the workflow engine runs.

It is the latter that can get to NotReady status situations. When this happens, the manual solution is to delete the former (=job), which will then automatically delete also the latter (=pod). (If we would have deleted only the batch pod, then the kubernetes job would start a new batch pod.)

Perhaps the terminology is confusing since we have both Kubernetes job and pod, and workflow engine job and pod, and that workflow engine pod can generate new workflow step jobs and pods... Needs some renaming care....

Copy link
Author

@VMois VMois Apr 19, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah. It is not a final naming here. _delete_workflow_batch_job would be a better name or even _delete_workflow_engine_batch_job.

And, I would agree terminology is confusing.

@VMois
Copy link
Author

VMois commented Apr 20, 2022

I managed to reproduce a scenario where step jobs might get left hanging even if you remove the workflow engine pod (when the workflow is deleted).

With this PR in check, you will need to:

  1. Add sleep, at the end of on_message function, inside the try/catch:
import time
time.sleep(5)
  1. For example, in the root6-roofit example, add sleep at the beginning of the gendata step:
sleep 60 && ...
  1. Start one workflow to delay a job-status queue:
reana-client run -w w-1
  1. After 4-5 seconds, start a second workflow, for this workflow step jobs will hang:
reana-client run -w w-2
  1. After the second workflow starts, wait 5-10 seconds (or until you see that the first job for this workflow started); delete workflow:
reana-client delete -w w-2
  1. Monitor kubectl get pods, you will have one job left hanging out.

Why does this happen?

Imagine that we only delete workflow engine Jobs if we detect deleted workflow in the consumer. And that the job-status queue is busy with a lot of messages.

The user starts a workflow, but it got stuck in "pending" mode. After some time, the user deletes a workflow. Now it has a "deleted" value in DB. But, in the background, the workflow engine Job is not deleted.

When. ready, the workflow engine Job will start and send its first message to the "jobs-status" queue. The consumer will not process this message anytime soon because the queue is busy. It also means that we cannot delete the workflow engine pod soon. Because of that, the workflow engine pod will start step jobs related to the workflow. Let's assume those steps will run for a long time.

Later, when the first message arrives in the consumer, it will delete the workflow engine Job. But, step job(s) are still running and no one will delete them.

This is a textual explanation for now.

@VMois
Copy link
Author

VMois commented Apr 20, 2022

I have just realized that deleting step jobs in the consumer is a bad idea. Reason: job-controller supports more compute_backends than just Kubernetes like HTCondor and Slurm. If we want to delete step jobs in the consumer, I assume, we will need to handle different backends too. It creates a mess. WDYT? cc: @tiborsimko

A better idea could be to make the job-controller responsible for cleaning its job on shutdown.

@VMois VMois force-pushed the handle-deleted-workflows branch from 7e8674f to 295f522 Compare April 21, 2022 09:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

job-status-consumer: improve handling of "not alive" workflows
3 participants