-
Notifications
You must be signed in to change notification settings - Fork 10
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Clean up self hosted runaways #93
Clean up self hosted runaways #93
Conversation
…ariable The image SKU was hidden a bit in the azure-arm-template.json. Let's make it more visible by turning it into a parameter that the CI pipeline can provide. Signed-off-by: Dennis Ameling <[email protected]>
Version 23h2 was released on October 31st, 2023. Let's update to it, so that the runners benefit from the latest improvements to Windows 11 ARM64. Signed-off-by: Dennis Ameling <[email protected]>
Sometimes, VMs don't get deleted properly. Let's set up a script to clean up VMs that were created more than 3 hours ago. This script runs every 6 hours (cron) and we can also invoke it manually if needed. Signed-off-by: Dennis Ameling <[email protected]>
This resurrects a GitHub workflow I had originally asked to keep out of git-for-windows#60, when I was concerned that we might delete runner VMs while they are still busy running a workflow. However, in the meantime it happened already twice that I had missed a run-away VM that had not been cleaned up for over a week, which did put a worrisome dent into my Azure credits (and I cannot risk running out of those because a lot of stuff depends on it, e.g. GitGitGadget). So let's merge the commit as-is, in preparation for adjusting it a little bit further in accordance with the experience we gained in the meantime. Signed-off-by: Johannes Schindelin <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice changes to the script!
- name: Discover VMs to delete | ||
uses: azure/CLI@v1 | ||
with: | ||
azcliversion: 2.54.0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We'll probably want to use the latest 2.64.0, now that it's a few months later :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we do the same with the create-azure-self-hosted-runners
and delete-self-hosted-runner
GitHub workflows, too? They're at v2.43.0...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch! Updated those here: #94
Another thing I just noticed that this here PR uses azure/CLI@v1
, while we're using azure/CLI@v2
in the other pipelines. Another thing to update :)
72204b8
to
d847f0e
Compare
I did test this and confirm that the scope is too narrow. I've added a new script to allow authenticating as a GitHub App with |
e56780f
to
ded928e
Compare
@dennisameling I think this is now ready to be tested "in production" (meaning: trigger |
I'll probably have some time to test this tomorrow. In any case, let's post a quick comment in this thread when either of us starts testing, so we don't interfere with each other. |
I'm starting the test now, by creating a self-hosted runner. I'll let it sit for a bit and then trigger the workflow in this PR through the |
The workflow failed because it can't access this repo's secrets (because your PR was created from your fork instead of this repo directly). Let me push this branch to this repo as well so we can test more easily. |
Looks like there was a typo in your new
Just pushed 577630b to fix, as well as some follow-up commits to fix date handling. Looks like it's working! 🎉 We'll need to clean up this PR and also add some of the commits I added to the other branch (sorry for the mess there). I'll be afk for a couple hours now, but if you happen to have some time, please feel free to start the wrap-up of this PR. Otherwise I can probably wrap things up here tomorrow. |
Signed-off-by: Dennis Ameling <[email protected]> Signed-off-by: Johannes Schindelin <[email protected]>
This also quotes the string properly. While at it, also use a newer Azure CLI version Signed-off-by: Dennis Ameling <[email protected]> Signed-off-by: Johannes Schindelin <[email protected]>
…o large In general, if..else..fi constructs are more readable when the first block is shorter than the second, i.e. to handle the "easy" things first; According to Cognitive Load Theory, this uses less mental resources when wrapping our head around the code. In this instance, there is an even better reason: The next commit will add _another_ condition, and having it in this order will avoid adding another nesting level: if too young skip else if busy skip else force delete VM fi Signed-off-by: Johannes Schindelin <[email protected]>
In Git for Windows' automation, we do a lot with Javascript, but sometimes it is also convenient to use the GitHub CLI (`gh`) in a shell script. Sadly, the scope of the `GITHUB_TOKEN` provided in GitHub workflows is often too limited, and we'd like to authenticate as a GitHub App instead. Even more sadly, authenticating with `gh` this way is quite complicated and fraught with problems because the token _needs_ to be masked in the GitHub workflow logs. Rejoice! This patch brings a shell script that hides all that nasty complexity. All it needs are the environment variables: - GH_APP_ID - GH_APP_PRIVATE_KEY - GITHUB_REPOSITORY and of course `gh` on the `PATH`. That's it! Signed-off-by: Johannes Schindelin <[email protected]>
So far, we avoid deleting VMs when they are too young. But we can do better than that: since the VMs are intended to host an ephemeral runner, and that runner is registered with the current repository, we can have a look whether GitHub says that the runner is busy. If it is, well, let's leave it a-running! Since this query requires a larger scope than the standard GitHub workflows' `GITHUB_TOKEN` provides, we need to authenticate as the GitForWindowsHelper GitHub App to do that. This is too complicated, unfortunately, in a GitHub workflow (at least without the risk of using non-official GitHub Actions), therefore we use the newly-added shell script to authenticate the GitHub CLI as a GitHub App (which of course now requires us to check out the repository). Signed-off-by: Johannes Schindelin <[email protected]>
The original idea was to wait at least 3 hours before deleting a VM, to avoid interfering with long-running workflows. Now that we have a way to query whether the runner _is_ busy, we do not need to wait that long. Basically, we only need to ensure that the runner had a chance to register itself and to then pick up the job. Typically, this takes around 6-7 minutes, so waiting for an hour is probably even a tad conservative. Signed-off-by: Johannes Schindelin <[email protected]>
Signed-off-by: Johannes Schindelin <[email protected]>
When logging output that we're deleting a VM, it should be marked a bit more prominently. Let's mark it as a warning: https://docs.github.com/en/actions/writing-workflows/choosing-what-your-workflow-does/workflow-commands-for-github-actions#setting-a-warning-message This marks it in red in the output. Conversely, we want to mention when we skipped the deletion, and to make those messages stick out a bit (but less than warnings), let's use: https://docs.github.com/en/actions/writing-workflows/choosing-what-your-workflow-does/workflow-commands-for-github-actions#setting-a-notice-message Signed-off-by: Johannes Schindelin <[email protected]>
332a05d
to
e71129b
Compare
Excellent, @dennisameling! I cleaned up the commits and force-pushed, and will now go ahead and merge! |
While sleeping over this, I realized that there is a chance that this workflow runs while a Iff the need arises, we could add another
If we do this, we need to also add |
Hrm. I looked something up in my Azure Portal and noticed some Windows/ARM64 being present. And it was the same that had been mentioned in the cleanup run earlier today! So I re-ran that cleanup workflow run, and it said the same thing: that the same VM was too old and needed to be deleted. But again, it wasn't! Only when I triggered a manual Any ideas @dennisameling? |
Oh, and I noticed something else entirely: neither I guess we need to fiddle with this workflow a bit more, after all. How about switching away from |
Ah, good catch! Will file a PR to fix soon. |
Resurrect the part dropped from #60, where we considered to add a scheduled workflow to clean up self-hosted runners that were not cleaned up as intended.
This PR adds a couple of commits on top, most notably querying GitHub whether the runner in question is still busy with something.
[EDITED] I verified in a manual test that the default scope of
$GITHUB_TOKEN
is insufficient to run that query. Therefore, we now authenticate the GitHub CLI as a GitHub App to be able to use that query.[EDITED] Also: I changed the workflow to use notices and warnings to make it easier to spot what the individual workflow runs were doing.
@dennisameling would you kindly have a look?