Healthchecks and autorestarts for computes #1074

Omrigan · 2024-09-20T14:22:18Z

Problem description / Motivation

At the moment, our only signal for compute unavailability is the one k8s provides, specifically container process monitoring.

We would like to have an end-to-end healthcheck, which would allow us to detect problems, such as:

We have a healthcheck mechanism that lets us detect compute issues in under 30s and take appropriate action, such as restarting the compute.

We should have a piece of code inside the VM that responds to healthchecks.
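
A rough sketch of what this could look like, not a proposal for the actual implementation. It assumes an HTTP responder inside the VM and a prober outside it that triggers a restart after several consecutive failures. The `/healthz` path, port 8080, the 5s probe interval, the 2s timeout, and the threshold of 3 are all made-up values, chosen only so that detection stays under the 30s target (roughly 3 x 5s plus one timeout).

```python
import http.client
import urllib.request
from dataclasses import dataclass
from http.server import BaseHTTPRequestHandler, HTTPServer


class HealthHandler(BaseHTTPRequestHandler):
    """Would run inside the VM; answers GET /healthz (hypothetical path)."""

    def do_GET(self):
        if self.path == "/healthz":
            body = b"ok"
            self.send_response(200)
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

    def log_message(self, format, *args):
        pass  # keep the sketch quiet


def probe(url: str, timeout_s: float = 2.0) -> bool:
    """Return True if the endpoint answers 200 within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status == 200
    except (OSError, http.client.HTTPException):
        # Connection refused, timeout, non-2xx response, or a broken reply.
        return False


@dataclass
class RestartPolicy:
    """Decides when consecutive probe failures should trigger a restart."""

    failure_threshold: int = 3  # assumed value, not from this issue
    consecutive_failures: int = 0

    def record(self, ok: bool) -> bool:
        """Record one probe result; return True if a restart is due."""
        if ok:
            self.consecutive_failures = 0
            return False
        self.consecutive_failures += 1
        return self.consecutive_failures >= self.failure_threshold


if __name__ == "__main__":
    # Inside the VM: serve healthchecks on an assumed port.
    HTTPServer(("0.0.0.0", 8080), HealthHandler).serve_forever()
```

On the prober side, the loop would call `probe(...)` every interval and feed the result into `RestartPolicy.record(...)`, restarting the compute when it returns `True`. Requiring several consecutive failures is meant to avoid restarting on a single dropped probe.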

stradig · 2024-09-23T15:25:50Z

Not sure whether we will need this or whether Kubernetes is good enough. Putting it in the backlog for now.