Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Clean up self hosted runaways #93

Merged
merged 12 commits into from
Sep 15, 2024

Conversation

dscho
Copy link
Member

@dscho dscho commented Sep 14, 2024

Resurrect the part dropped from #60, where we considered to add a scheduled workflow to clean up self-hosted runners that were not cleaned up as intended.

This PR adds a couple of commits on top, most notably querying GitHub whether the runner in question is still busy with something.

[EDITED] I verified in a manual test that the default scope of $GITHUB_TOKEN is insufficient to run that query. Therefore, we now authenticate the GitHub CLI as a GitHub App to be able to use that query.

[EDITED] Also: I changed the workflow to use notices and warnings to make it easier to spot what the individual workflow runs were doing.

@dennisameling would you kindly have a look?

dennisameling and others added 4 commits November 22, 2023 10:23
…ariable

The image SKU was hidden a bit in the azure-arm-template.json. Let's
make it more visible by turning it into a parameter that the CI pipeline
can provide.

Signed-off-by: Dennis Ameling <[email protected]>
Version 23h2 was released on October 31st, 2023. Let's update to it,
so that the runners benefit from the latest improvements to Windows 11
ARM64.

Signed-off-by: Dennis Ameling <[email protected]>
Sometimes, VMs don't get deleted properly. Let's set up a script to
clean up VMs that were created more than 3 hours ago. This script
runs every 6 hours (cron) and we can also invoke it manually if
needed.

Signed-off-by: Dennis Ameling <[email protected]>
This resurrects a GitHub workflow I had originally asked to keep out of
git-for-windows#60,
when I was concerned that we might delete runner VMs while they are
still busy running a workflow.

However, in the meantime it happened already twice that I had missed a
run-away VM that had not been cleaned up for over a week, which did put
a worrisome dent into my Azure credits (and I cannot risk running out of
those because a lot of stuff depends on it, e.g. GitGitGadget).

So let's merge the commit as-is, in preparation for adjusting it a
little bit further in accordance with the experience we gained in the
meantime.

Signed-off-by: Johannes Schindelin <[email protected]>
@dscho dscho requested a review from dennisameling September 14, 2024 09:28
@dscho dscho self-assigned this Sep 14, 2024
Copy link
Contributor

@dennisameling dennisameling left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice changes to the script!

.github/workflows/cleanup-self-hosted-runners.yml Outdated Show resolved Hide resolved
- name: Discover VMs to delete
uses: azure/CLI@v1
with:
azcliversion: 2.54.0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We'll probably want to use the latest 2.64.0, now that it's a few months later :)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we do the same with the create-azure-self-hosted-runners and delete-self-hosted-runner GitHub workflows, too? They're at v2.43.0...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch! Updated those here: #94

Another thing I just noticed that this here PR uses azure/CLI@v1, while we're using azure/CLI@v2 in the other pipelines. Another thing to update :)

@dscho dscho force-pushed the clean-up-self-hosted-runaways branch from 72204b8 to d847f0e Compare September 14, 2024 10:56
@dscho
Copy link
Member Author

dscho commented Sep 14, 2024

I did not test this and suspect that the default scope of $GITHUB_TOKEN is insufficient to run that query.

I did test this and confirm that the scope is too narrow. I've added a new script to allow authenticating as a GitHub App with gh, and use that instead now.

@dscho dscho force-pushed the clean-up-self-hosted-runaways branch from e56780f to ded928e Compare September 14, 2024 11:03
@dscho dscho marked this pull request as ready for review September 14, 2024 11:08
@dscho
Copy link
Member Author

dscho commented Sep 14, 2024

@dennisameling I think this is now ready to be tested "in production" (meaning: trigger create-azure-self-hosted-runners and then temporarily switch this here workflow to on: push and lower the threshold of 1h, or maybe even change that threshold permanently, 40 minutes should be plenty enough, no? but I'll have to run a couple of errands and may not be able to come back to this today or even tomorrow).

@dennisameling
Copy link
Contributor

I'll have to run a couple of errands and may not be able to come back to this today or even tomorrow).

I'll probably have some time to test this tomorrow. In any case, let's post a quick comment in this thread when either of us starts testing, so we don't interfere with each other.

@dennisameling
Copy link
Contributor

I'm starting the test now, by creating a self-hosted runner. I'll let it sit for a bit and then trigger the workflow in this PR through the workflow_dispatch flow (which should be possible using GitHub's REST API). Otherwise I'll update the workflow to on: push for testing, as you suggested.

@dennisameling
Copy link
Contributor

The workflow failed because it can't access this repo's secrets (because your PR was created from your fork instead of this repo directly).

Let me push this branch to this repo as well so we can test more easily.

@dennisameling
Copy link
Contributor

Looks like there was a typo in your new elif statement:

/home/runner/work/_temp/AZ_CLI_GITHUB_ACTION_1726402015891.sh: line 29: unexpected EOF while looking for matching `''

Just pushed 577630b to fix, as well as some follow-up commits to fix date handling.

Looks like it's working! 🎉

We'll need to clean up this PR and also add some of the commits I added to the other branch (sorry for the mess there).

I'll be afk for a couple hours now, but if you happen to have some time, please feel free to start the wrap-up of this PR. Otherwise I can probably wrap things up here tomorrow.

dennisameling and others added 8 commits September 15, 2024 16:47
This also quotes the string properly.

While at it, also use a newer Azure CLI version

Signed-off-by: Dennis Ameling <[email protected]>
Signed-off-by: Johannes Schindelin <[email protected]>
…o large

In general, if..else..fi constructs are more readable when the first
block is shorter than the second, i.e. to handle the "easy" things
first; According to Cognitive Load Theory, this uses less mental
resources when wrapping our head around the code.

In this instance, there is an even better reason: The next commit will
add _another_ condition, and having it in this order will avoid adding
another nesting level:

	if too young
		skip
	else if busy
		skip
	else
		force delete VM
	fi

Signed-off-by: Johannes Schindelin <[email protected]>
In Git for Windows' automation, we do a lot with Javascript, but
sometimes it is also convenient to use the GitHub CLI (`gh`) in a shell
script.

Sadly, the scope of the `GITHUB_TOKEN` provided in GitHub workflows is
often too limited, and we'd like to authenticate as a GitHub App
instead. Even more sadly, authenticating with `gh` this way is quite
complicated and fraught with problems because the token _needs_ to be
masked in the GitHub workflow logs.

Rejoice! This patch brings a shell script that hides all that nasty
complexity. All it needs are the environment variables:

- GH_APP_ID

- GH_APP_PRIVATE_KEY

- GITHUB_REPOSITORY

and of course `gh` on the `PATH`. That's it!

Signed-off-by: Johannes Schindelin <[email protected]>
So far, we avoid deleting VMs when they are too young.

But we can do better than that: since the VMs are intended to host an
ephemeral runner, and that runner is registered with the current
repository, we can have a look whether GitHub says that the runner is
busy. If it is, well, let's leave it a-running!

Since this query requires a larger scope than the standard GitHub
workflows' `GITHUB_TOKEN` provides, we need to authenticate as the
GitForWindowsHelper GitHub App to do that. This is too complicated,
unfortunately, in a GitHub workflow (at least without the risk of using
non-official GitHub Actions), therefore we use the newly-added shell
script to authenticate the GitHub CLI as a GitHub App (which of course
now requires us to check out the repository).

Signed-off-by: Johannes Schindelin <[email protected]>
The original idea was to wait at least 3 hours before deleting a VM, to
avoid interfering with long-running workflows.

Now that we have a way to query whether the runner _is_ busy, we do not
need to wait that long. Basically, we only need to ensure that the
runner had a chance to register itself and to then pick up the job.
Typically, this takes around 6-7 minutes, so waiting for an hour is
probably even a tad conservative.

Signed-off-by: Johannes Schindelin <[email protected]>
When logging output that we're deleting a VM, it should be marked a bit
more prominently. Let's mark it as a warning:
https://docs.github.com/en/actions/writing-workflows/choosing-what-your-workflow-does/workflow-commands-for-github-actions#setting-a-warning-message

This marks it in red in the output.

Conversely, we want to mention when we skipped the deletion, and to make
those messages stick out a bit (but less than warnings), let's use:
https://docs.github.com/en/actions/writing-workflows/choosing-what-your-workflow-does/workflow-commands-for-github-actions#setting-a-notice-message

Signed-off-by: Johannes Schindelin <[email protected]>
@dscho dscho force-pushed the clean-up-self-hosted-runaways branch from 332a05d to e71129b Compare September 15, 2024 14:55
@dscho
Copy link
Member Author

dscho commented Sep 15, 2024

Excellent, @dennisameling! I cleaned up the commits and force-pushed, and will now go ahead and merge!

@dscho dscho merged commit 2099f57 into git-for-windows:main Sep 15, 2024
@dscho dscho deleted the clean-up-self-hosted-runaways branch September 15, 2024 14:55
@dscho
Copy link
Member Author

dscho commented Sep 15, 2024

While sleeping over this, I realized that there is a chance that this workflow runs while a delete-self-hosted-runner workflow run is still active, and in that case one of the two workflow runs would "lose": They would both try to delete the same resources, and the one who's on second place would then get a nasty error.

Iff the need arises, we could add another elif that looks whether any competing delete-self-hosted-runner was queued or is running by executing the following query:

gh api repos/$GITHUB_REPOSITORY/actions/runs --jq '.workflow_runs[] | select(
    .path == ".github/workflows/delete-self-hosted-runner.yml"
    and .name == "Delete $vm_name"
    and (.status == "in_progress" or .status == "queued")
) | .id'

If we do this, we need to also add GH_TOKEN: ${{ github.token }} since unauthenticated REST API calls are notorious on GitHub Actions for being throttled by virtue of running on the same IP address as so many other unauthenticated REST API calls.

@dscho
Copy link
Member Author

dscho commented Sep 15, 2024

Hrm. I looked something up in my Azure Portal and noticed some Windows/ARM64 being present. And it was the same that had been mentioned in the cleanup run earlier today!

So I re-ran that cleanup workflow run, and it said the same thing: that the same VM was too old and needed to be deleted. But again, it wasn't!

Only when I triggered a manual delete-self-hosted-runner workflow run did the VM get deleted.

Any ideas @dennisameling?

@dscho
Copy link
Member Author

dscho commented Sep 15, 2024

Oh, and I noticed something else entirely: neither node nor gh are found!

I guess we need to fiddle with this workflow a bit more, after all. How about switching away from azure/CLI to a regular run: step? Azure CLI is installed in the ubuntu-latest runners, after all.

@dennisameling
Copy link
Contributor

How about switching away from azure/CLI to a regular run: step? Azure CLI is installed in the ubuntu-latest runners, after all.

Ah, good catch! Will file a PR to fix soon.

@dennisameling
Copy link
Contributor

Here's the follow-up PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants