-
Notifications
You must be signed in to change notification settings - Fork 39
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
consumer: remove pods for deleted workflows #445
base: master
Are you sure you want to change the base?
Conversation
Codecov Report
@@ Coverage Diff @@
## master #445 +/- ##
==========================================
- Coverage 68.22% 67.56% -0.67%
==========================================
Files 15 15
Lines 1328 1341 +13
==========================================
Hits 906 906
- Misses 422 435 +13
|
38dafac
to
7e8674f
Compare
f"Could not clean up not alive workflow {workflow.id_} batch pod for workflow." | ||
f" Error: {exception}" | ||
) | ||
_delete_workflow_jobs(workflow) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have a problem in this line with deciding whether to delete workflow jobs or not for two reasons:
(1) Theoretically, it is possible that when the workflow is deleted but there are a lot of messages in the jobs-status
queue, its deletion will be delayed. And during this delay, the workflow pod can start executing jobs. But, I have a hard time coming up with a test scenario (unit or manual).
(2) In addition, I am not sure if this is the right place to delete jobs. It would make more sense to delete them on a shutdown of the job-controller
pod. But it will require investigation on how to implement it in job-controller
WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you are only addressing the NotReady
situation for the workflow engine pod, then I was not seeing many situations in production where we needed to kill also workflow step jobs... So I think we should be mostly okay here already?
The situation of surviving workflow step pods happened in the past in other kind of troubles that were not related to NotReady
. I recall one recent example where workflow step pods survived in the Completed
status but there was no workflow batch pod anymore. That is, the workflow batch pod apparently terminated well, but some of the workflow step pods still remained in Completed
status. Have you tried to look at this situation too?
FWIW I tend to agree with you about the danger of deleting things for all not ALIVE_STATUSES
because that could include ones we don't want perhaps?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point about not deleting not ALIVE_STATUSES
. I should only delete workflows that have deleted
status
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Have you tried to look at this situation too?
Do we have an issue describing in detail this situation? I couldn't find it.
I recall one recent example where workflow step pods survived in the Completed status but there was no workflow batch pod anymore.
Step jobs are removed in job-controller
when they are finished. It probably means something failed in job-controller
. There is a possibility to automate the step job deletion process in Kubernetes - https://kubernetes.io/docs/concepts/workloads/controllers/ttlafterfinished/#ttl-after-finished-controller . But this is a discussion for a new issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we have an issue describing in detail this situation? I couldn't find it.
E.g. the same as reanahub/reana-job-controller#326. That issue is closed, but I saw the same problem also recently.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you are only addressing the NotReady situation for the workflow engine pod, then I was not seeing many situations in production where we needed to kill also workflow step jobs... So I think we should be mostly okay here already?
This is a tricky question. First of all, the issue with hanging step jobs is unlikely. So we are talking about an edge case here.
In the situation above, If you wait long enough the only thing that would be left hanging is a workflow engine pod. In reality, before that, all step jobs would run, finish, and get deleted. So it wastes the resources of the cluster.
@@ -281,7 +291,7 @@ def _update_job_cache(msg): | |||
Session.add(cached_job) | |||
|
|||
|
|||
def _delete_workflow_job(workflow: Workflow) -> None: | |||
def _delete_workflow_batch_pod(workflow: Workflow) -> None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please beware of the terminology when renaming. We are deleting Kuberentes "job" here in this function, not only "pod", when we are doing delete_namespaced_job()
. So this rename wouldn't be fully justified.
When a workflow starts, it creates a Kubernetes job named such as 'reana-run-batch-0bf3cf48-247a-4cbf-bd5f-727cf41e5edbwhich will create a Kubernetes pod named such as
reana-run-batch-0bf3cf48-247a-4cbf-bd5f-727cf41e5edb-hz46p` where the workflow engine runs.
It is the latter that can get to NotReady
status situations. When this happens, the manual solution is to delete the former (=job), which will then automatically delete also the latter (=pod). (If we would have deleted only the batch pod, then the kubernetes job would start a new batch pod.)
Perhaps the terminology is confusing since we have both Kubernetes job and pod, and workflow engine job and pod, and that workflow engine pod can generate new workflow step jobs and pods... Needs some renaming care....
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah. It is not a final naming here. _delete_workflow_batch_job
would be a better name or even _delete_workflow_engine_batch_job
.
And, I would agree terminology is confusing.
I managed to reproduce a scenario where step jobs might get left hanging even if you remove the workflow engine pod (when the workflow is deleted). With this PR in check, you will need to:
import time
time.sleep(5)
sleep 60 && ...
Why does this happen? Imagine that we only delete workflow engine Jobs if we detect The user starts a workflow, but it got stuck in "pending" mode. After some time, the user deletes a workflow. Now it has a "deleted" value in DB. But, in the background, the workflow engine Job is not deleted. When. ready, the workflow engine Job will start and send its first message to the "jobs-status" queue. The consumer will not process this message anytime soon because the queue is busy. It also means that we cannot delete the workflow engine pod soon. Because of that, the workflow engine pod will start step jobs related to the workflow. Let's assume those steps will run for a long time. Later, when the first message arrives in the consumer, it will delete the workflow engine Job. But, step job(s) are still running and no one will delete them. This is a textual explanation for now. |
I have just realized that deleting step jobs in the consumer is a bad idea. Reason: A better idea could be to make the |
7e8674f
to
295f522
Compare
closes #437