fix: prevent EC2 termination when GitHub runner de-registration fails#5061
fix: prevent EC2 termination when GitHub runner de-registration fails#5061shivdesh wants to merge 1 commit intogithub-aws-runners:mainfrom
Conversation
|
@Brend-Smits do you have time to check this PR, since you digged in similar PRs |
8b5e8ef to
a5b332f
Compare
|
Hi @shivdesh, Thanks for the contribution! I recently added something similar, but for the scale-up lambda. Please see: https://github.com/github-aws-runners/terraform-aws-github-runner/pull/4990/changes |
a5b332f to
e40d2b7
Compare
|
Hi @Brend-Smits, Thanks for the feedback! I reviewed PR #4990 and you're right - the I've updated this PR to:
Thanks for pointing me to your work! |
e40d2b7 to
763938c
Compare
There was a problem hiding this comment.
Pull request overview
This PR improves the scale-down Lambda’s failure handling to prevent terminating an EC2 runner instance when GitHub runner de-registration fails, reducing the risk of leaving stale self-hosted runner entries behind.
Changes:
- Extracts a
deleteGitHubRunner()helper to de-register runners with per-runner error handling. - Updates
removeRunner()to terminate the EC2 instance only when all GitHub de-registrations succeed. - Adds a test ensuring the instance is not terminated when the de-registration call throws.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| lambdas/functions/control-plane/src/scale-runners/scale-down.ts | Adds per-runner de-registration helper and gates EC2 termination on full de-registration success. |
| lambdas/functions/control-plane/src/scale-runners/scale-down.test.ts | Adds regression test for thrown de-registration errors preventing termination. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
lambdas/functions/control-plane/src/scale-runners/scale-down.ts
Outdated
Show resolved
Hide resolved
lambdas/functions/control-plane/src/scale-runners/scale-down.ts
Outdated
Show resolved
Hide resolved
6bab611 to
aa2e75b
Compare
When the scale-down Lambda fails to de-register a runner from GitHub (even after automatic retries via @octokit/plugin-retry), the EC2 instance should NOT be terminated. This prevents stale runner entries in GitHub org settings. This change complements PR github-aws-runners#4990 which added @octokit/plugin-retry for automatic retries. While that handles transient failures, this ensures that if de-registration ultimately fails, we don't leave orphaned GitHub runner entries by terminating the EC2 instance prematurely. Key changes: - Extract deleteGitHubRunner() helper that catches errors per-runner - Only terminate EC2 instance if ALL GitHub de-registrations succeed - If any de-registration fails, leave instance running for next cycle - Rename githubAppClient to githubInstallationClient for clarity - Refactor to split owner/repo once instead of multiple times - Fix error logging to handle non-Error objects properly The @octokit/plugin-retry (added in github-aws-runners#4990) handles automatic retries at the client level, so no custom retry logic is needed here. Tests: - Add test verifying EC2 is NOT terminated when de-registration fails
aa2e75b to
1f66402
Compare
Summary
This PR complements #4990 by ensuring that when GitHub runner de-registration fails (even after automatic retries), the EC2 instance is not terminated. This prevents stale runner entries in GitHub org settings.
Problem
When the scale-down Lambda fails to de-register a runner from GitHub (e.g., due to persistent API errors), the current code still terminates the EC2 instance. This leaves stale/offline runner entries in the GitHub organization settings.
We encountered this issue in production where transient 502 errors during scale-down left 117 stale runner entries in our GitHub organization.
Relationship to #4990
PR #4990 added
@octokit/plugin-retrywhich provides automatic retries at the Octokit client level. This is great for handling transient failures. However, if de-registration ultimately fails after all retries, we still need to handle that gracefully by NOT terminating the EC2 instance.Solution
deleteGitHubRunner()helper that catches errors per-runnerChanges
lambdas/functions/control-plane/src/scale-runners/scale-down.ts:deleteGitHubRunner()helper functionremoveRunner()to only terminate EC2 if all de-registrations succeedTesting
Why not custom retry logic?
The
@octokit/plugin-retry(added in #4990) already handles automatic retries at the client level, so no custom retry logic is needed. This PR focuses solely on the failure handling aspect - what to do when de-registration fails after all retries.