Generic Executor health checker that uses signals independent from deployment environment #1212

Merged
eabatalov merged 1 commit into main from executor-health-and-startup-probes on Feb 6, 2025

eabatalov
Contributor

Current Executor health check policy and reasoning:

  • A Function Executor health check failure is a strong signal that something is wrong with either:
    • The Function Code (a critical software bug).
    • The Executor machine/container/VM (a software bug or malfunctioning local hardware).
  • Users tend to fix critical Function Code bugs eventually. What does not get fixed are rare but recurring local Executor issues, such as hardware errors and software bugs in middleware like drivers.
  • Such issues are typically mitigated by automatically recreating the Executor machine/VM/container.
  • So we fail the whole Executor health check if a Function Executor health check has ever failed. This hints to users that the Executor machine/VM/container probably needs to be recreated (unless the cause is a bug in Function Code that users can investigate themselves).
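
The policy above can be illustrated with a minimal sketch. This is not the code from this PR: the names `HealthCheckResult`, `StickyHealthChecker`, `report_function_executor_check` and `check` are hypothetical and chosen only for illustration. The sketch assumes each Function Executor health check result is reported to a single checker object, and that a failure is never cleared once recorded.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class HealthCheckResult:
    """Result of an Executor-level health check (hypothetical type)."""
    is_success: bool
    status_message: str


class StickyHealthChecker:
    """Hypothetical sketch of the policy: the Executor is unhealthy if any
    Function Executor health check has ever failed. It relies only on
    Function Executor signals, not on the deployment environment."""

    def __init__(self) -> None:
        # ID of the first Function Executor that failed a health check.
        # It is never reset, because the policy says "ever failed".
        self._first_failed_id: Optional[str] = None

    def report_function_executor_check(self, function_executor_id: str, is_healthy: bool) -> None:
        # Record only the first failure; later successes do not clear it.
        if not is_healthy and self._first_failed_id is None:
            self._first_failed_id = function_executor_id

    def check(self) -> HealthCheckResult:
        if self._first_failed_id is None:
            return HealthCheckResult(True, "No Function Executor health check has failed")
        return HealthCheckResult(
            False,
            f"Function Executor {self._first_failed_id} failed a health check; "
            "the Executor machine/VM/container may need to be recreated",
        )
```

In this sketch the failure is deliberately sticky: a later successful Function Executor health check does not make the Executor healthy again, so the failing state stays visible until the Executor is recreated.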

Also implemented tests for both startup and health probes.

@eabatalov eabatalov marked this pull request as ready for review February 6, 2025 17:52
@eabatalov eabatalov merged commit 9c3527b into main Feb 6, 2025
9 checks passed
@eabatalov eabatalov deleted the executor-health-and-startup-probes branch February 6, 2025 17:53