fix: prevent EC2 termination when GitHub runner de-registration fails by shivdesh · Pull Request #5064 · github-aws-runners/terraform-aws-github-runner

shivdesh · 2026-03-11T19:33:57Z

Summary

This PR complements #4990 by ensuring that when GitHub runner de-registration fails (even after automatic retries), the EC2 instance is not terminated. This prevents stale runner entries in GitHub org settings.

Problem

When the scale-down Lambda fails to de-register a runner from GitHub (e.g., due to persistent API errors), the current code still terminates the EC2 instance. This leaves stale/offline runner entries in the GitHub organization settings.

We encountered this issue in production where transient 502 errors during scale-down left 117 stale runner entries in our GitHub organization.

Relationship to #4990

PR #4990 added @octokit/plugin-retry which provides automatic retries at the Octokit client level. This is great for handling transient failures. However, if de-registration ultimately fails after all retries, we still need to handle that gracefully by NOT terminating the EC2 instance.

Solution

- Extract deleteGitHubRunner() helper that catches errors per-runner
- Only terminate EC2 instance if all GitHub de-registrations succeed
- If any de-registration fails, leave the instance running so the next scale-down cycle can retry
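
The behaviour described in this section can be sketched roughly as follows. This is a minimal illustration, not the actual code in scale-down.ts: the `GitHubClient` and `Ec2Client` interfaces, the `RemoveResult` type, and the function signatures are assumptions made for the example, standing in for the Octokit client and the AWS SDK calls used by the real module.

```typescript
// Hypothetical minimal interfaces (assumptions for this sketch); the real
// module talks to GitHub through Octokit and to EC2 through the AWS SDK.
interface GitHubClient {
  deleteRunner(runnerId: number): Promise<void>;
}

interface Ec2Client {
  terminate(instanceId: string): Promise<void>;
}

// Hypothetical result shape, used here so callers can see what failed.
interface RemoveResult {
  terminated: boolean;
  failedRunnerIds: number[];
}

// De-register a single runner. Errors are caught per runner so that one
// failure does not prevent the remaining runners from being attempted.
async function deleteGitHubRunner(github: GitHubClient, runnerId: number): Promise<boolean> {
  try {
    await github.deleteRunner(runnerId);
    return true;
  } catch {
    return false;
  }
}

// Terminate the EC2 instance only if every de-registration succeeded.
// Otherwise leave it running so a later scale-down cycle can retry.
async function removeRunner(
  github: GitHubClient,
  ec2: Ec2Client,
  instanceId: string,
  runnerIds: number[],
): Promise<RemoveResult> {
  const outcomes = await Promise.all(
    runnerIds.map(async (id) => ({ id, ok: await deleteGitHubRunner(github, id) })),
  );
  const failedRunnerIds = outcomes.filter((o) => !o.ok).map((o) => o.id);

  if (failedRunnerIds.length > 0) {
    // At least one runner is still registered in GitHub: do not terminate.
    return { terminated: false, failedRunnerIds };
  }

  await ec2.terminate(instanceId);
  return { terminated: true, failedRunnerIds: [] };
}
```

With a client whose `deleteRunner` throws for one runner, `removeRunner` returns `terminated: false` and never calls `terminate`. The instance and its GitHub registration stay in sync until a later cycle succeeds.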

Changes

lambdas/functions/control-plane/src/scale-runners/scale-down.ts:
- Added deleteGitHubRunner() helper function
- Modified removeRunner() to only terminate EC2 if all de-registrations succeed

Testing

- Added test verifying EC2 is NOT terminated when de-registration throws an error
- All 124 scale-down tests pass

Why not custom retry logic?

The @octokit/plugin-retry (added in #4990) already handles automatic retries at the client level, so no custom retry logic is needed. This PR focuses solely on failure handling: what to do when de-registration still fails after all retries.

When the scale-down Lambda fails to de-register a runner from GitHub (even after automatic retries via @octokit/plugin-retry), the EC2 instance should NOT be terminated. This prevents stale runner entries in GitHub org settings.

This change complements PR github-aws-runners#4990 which added @octokit/plugin-retry for automatic retries. While that handles transient failures, this ensures that if de-registration ultimately fails, we don't leave orphaned GitHub runner entries by terminating the EC2 instance prematurely.

Key changes:
- Extract deleteGitHubRunner() helper that catches errors per-runner
- Only terminate EC2 instance if ALL GitHub de-registrations succeed
- If any de-registration fails, leave instance running for next cycle

The @octokit/plugin-retry (added in github-aws-runners#4990) handles automatic retries at the client level, so no custom retry logic is needed here.

Tests:
- Add test verifying EC2 is NOT terminated when de-registration fails

shivdesh · 2026-03-11T20:47:52Z

Closing in favor of updating PR #5061 with the same simplified approach.

shivdesh closed this Mar 11, 2026

shivdesh wants to merge 1 commit into github-aws-runners:main from shivdesh:fix/scale-down-retry-logic-v2
