feat(termination-watcher): deregister runners from GitHub on EC2 termination #5055

Open
jensenbox wants to merge 4 commits into github-aws-runners:main from closient:deregister-runner-on-termination

@jensenbox

Summary

Extends the existing termination-watcher Lambda to deregister GitHub Actions runners from GitHub when their EC2 instances terminate. This prevents stale "offline" runner entries from accumulating in the organization/repository — a long-standing issue (#804, #1006, #2939) affecting all users of the module.

How it works

  1. When an EC2 instance terminates, the Lambda reads the ghr:Owner and ghr:Type tags from the instance
  2. Authenticates to GitHub using the module's existing App credentials (SSM parameters)
  3. Finds the runner by instance ID in the runner name, then calls the delete API
  4. Errors are logged but never fail the Lambda — metrics collection continues unaffected
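
The steps above can be sketched roughly as follows. This is an illustration rather than the PR's actual code: `Runner`, `RunnerApi`, `findRunnerByInstanceId`, and `deregisterRunner` are hypothetical names, and a real implementation would back `RunnerApi` with authenticated Octokit calls. It assumes the runner name contains the EC2 instance ID.

```typescript
// Hypothetical minimal shapes; the real Lambda talks to GitHub through Octokit.
interface Runner {
  id: number;
  name: string;
}

interface RunnerApi {
  listRunners(): Promise<Runner[]>;
  deleteRunner(id: number): Promise<void>;
}

// Assumption: the runner name embeds the EC2 instance ID.
function findRunnerByInstanceId(runners: Runner[], instanceId: string): Runner | undefined {
  return runners.find((r) => r.name.includes(instanceId));
}

// Resolves to true if a runner was deleted, false otherwise.
// It never rejects, so the caller (metrics collection) is unaffected by failures.
async function deregisterRunner(api: RunnerApi, instanceId: string): Promise<boolean> {
  try {
    const runner = findRunnerByInstanceId(await api.listRunners(), instanceId);
    if (!runner) {
      console.info(`No runner found for ${instanceId}, nothing to deregister`);
      return false;
    }
    await api.deleteRunner(runner.id);
    return true;
  } catch (err) {
    console.error(`Failed to deregister runner for ${instanceId}`, err);
    return false;
  }
}
```

Injecting the API as an interface keeps the lookup logic testable without network access or GitHub App credentials.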

What's included

Lambda changes:

  • deregister.ts — GitHub API deregistration logic reusing the module's existing auth pattern (createAppAuth → installation token)
  • Wired into both termination.ts (BidEvictedEvent) and termination-warning.ts (Spot Interruption Warning)
  • ConfigResolver.ts — adds enableRunnerDeregistration and ghesApiUrl config from env vars
  • 295-line test suite covering org/repo runners, not-found cases, disabled feature, and error handling

Terraform changes:

  • Passes GitHub App SSM parameter ARNs through the module chain to the termination-watcher
  • Adds SSM GetParameter IAM policy when deregistration is enabled
  • Adds PARAMETER_GITHUB_APP_ID_NAME, PARAMETER_GITHUB_APP_KEY_BASE64_NAME, ENABLE_RUNNER_DEREGISTRATION, and GHES_URL environment variables to both Lambda functions
  • Adds an EC2 Instance State-change Notification EventBridge rule (state: shutting-down) that catches all termination types — not just spot-specific events. This covers scale-down, manual termination, ASG termination, and spot reclamation.
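
For illustration, the state-change rule's event pattern and the instance ID lookup could be expressed as below. In the module the pattern lives in Terraform; it is written here as a TypeScript object only to keep one language. `Ec2Event`, `matchesStateChange`, and `instanceIdOf` are hypothetical names, and `matchesStateChange` is a local stand-in for EventBridge matching limited to these three fields.

```typescript
// Event pattern for the state-change rule, written as a plain object.
const stateChangePattern = {
  source: ["aws.ec2"],
  "detail-type": ["EC2 Instance State-change Notification"],
  detail: { state: ["shutting-down"] },
};

// Hypothetical subset of the EventBridge event the Lambda receives.
interface Ec2Event {
  source: string;
  "detail-type": string;
  detail: { "instance-id"?: string; state?: string };
}

// Local approximation of pattern matching, for reasoning about which events fire the rule.
function matchesStateChange(event: Ec2Event): boolean {
  return (
    stateChangePattern.source.includes(event.source) &&
    stateChangePattern["detail-type"].includes(event["detail-type"]) &&
    event.detail.state !== undefined &&
    stateChangePattern.detail.state.includes(event.detail.state)
  );
}

// Spot events and state-change events both carry detail['instance-id'].
function instanceIdOf(event: Ec2Event): string | undefined {
  return event.detail["instance-id"];
}
```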

New variables on instance_termination_watcher:

  • enable_runner_deregistration (bool, default false)

Design decisions

  • Opt-in: Disabled by default to avoid breaking existing deployments. Enable with enable_runner_deregistration = true.
  • Reuses existing auth pattern: Same @octokit/auth-app + SSM approach used by the control-plane Lambda.
  • Reuses existing Lambda: The state-change EventBridge rule targets the same notification Lambda rather than creating a new one, since both event types provide detail['instance-id'].
  • Graceful failure: All deregistration errors are caught and logged. If the runner is already removed, it logs and returns. The Lambda never fails due to deregistration issues.
  • Supports Org and Repo runners: Reads ghr:Type tag to determine the correct API endpoint.
  • GHES compatible: Passes through the ghes_url variable for GitHub Enterprise Server deployments.
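
As a rough sketch of the Org/Repo and GHES decisions, endpoint selection might look like this. `apiBaseUrl` and `deleteRunnerUrl` are hypothetical helpers, not functions from the PR. The sketch assumes the `ghr:Type` tag holds `Org` or `Repo`, that `ghr:Owner` holds `owner/repo` for repository runners, and that GHES exposes its REST API under `/api/v3`.

```typescript
// Assumed values of the ghr:Type tag.
type RunnerType = "Org" | "Repo";

// Assumption: GHES serves its REST API under /api/v3; github.com is the default.
function apiBaseUrl(ghesUrl?: string): string {
  return ghesUrl ? `${ghesUrl.replace(/\/+$/, "")}/api/v3` : "https://api.github.com";
}

// Builds the URL of the DELETE call for a self-hosted runner.
function deleteRunnerUrl(type: RunnerType, owner: string, runnerId: number, ghesUrl?: string): string {
  const base = apiBaseUrl(ghesUrl);
  if (type === "Org") {
    return `${base}/orgs/${owner}/actions/runners/${runnerId}`;
  }
  // For Repo runners, owner is expected to be "owner/repo".
  return `${base}/repos/${owner}/actions/runners/${runnerId}`;
}
```

In practice the Octokit client would build these URLs itself once it is given a base URL; the helper only makes the branching explicit.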

Testing

  • 44 unit tests pass (7 test files), including the new deregister.test.ts
  • Tested in production: manually terminated a runner instance → Lambda triggered within seconds → runner successfully deregistered from GitHub org

Fixes #804

@jensenbox jensenbox requested review from a team as code owners March 6, 2026 07:48
@jensenbox jensenbox force-pushed the deregister-runner-on-termination branch 2 times, most recently from f731868 to a9ca792 March 6, 2026 07:55
@Brend-Smits Brend-Smits left a comment

Hey @jensenbox

This is a great addition, thanks a lot for your contribution.

After testing this together with @stuartp44, I ran into a problem when the termination watcher tried to deregister a runner. The error was as follows:

{
    "level": "ERROR",
    "message": "Failed to deregister runner from GitHub",
    "timestamp": "2026-03-06T10:07:02.489Z",
    "service": "spot-termination-notification",
    "sampling_rate": 0,
    "xray_trace_id": "1-69aaa741-3e6ab9e024a6cc5567e5f339",
    "region": "eu-west-1",
    "environment": "framework-dev",
    "module": "deregister",
    "aws-request-id": "87f61dc0-1c03-456a-9bf9-e5542558eac3",
    "function-name": "framework-dev-spot-termination-notification",
    "instanceId": "i-0c86dff9c4dfb59fc",
    "owner": "test-runners/multi-runner",
    "error": {
        "name": "HttpError",
        "location": "file:///var/task/index.js:95395",
        "message": "Bad request - Runner ubuntu-2404-x64_i-0c86dff9c4dfb59fc is currently running a job and cannot be deleted. - https://docs.github.com/rest/actions/self-hosted-runners#delete-a-self-hosted-runner-from-a-repository",
        "stack": "HttpError: Bad request - Runner ubuntu-2404-x64_i-0c86dff9c4dfb59fc is currently running a job and cannot be deleted. - https://docs.github.com/rest/actions/self-hosted-runners#delete-a-self-hosted-runner-from-a-repository\n    at fetchWrapper (file:///var/task/index.js:95395:11)\n    at process.processTicksAndRejections (node:internal/process/task_queues:103:5)\n    at async Job.doExecute (file:///var/task/index.js:83521:18)",
        "status": 422,
        "request": {
            "method": "DELETE",
            "url": "https://api.github.com/repos/test-runners/multi-runner/actions/runners/50",
            "headers": {
                "accept": "application/vnd.github.v3+json",
                "user-agent": "github-aws-runners-termination-watcher octokit-rest.js/22.0.1 octokit-core.js/7.0.6 Node.js/24",
                "authorization": "token [REDACTED]"
            },
            "request": {}
        },
        "response": {
            "url": "https://api.github.com/repos/test-runners/multi-runner/actions/runners/50",
            "status": 422,
            "headers": {
                "access-control-allow-origin": "*",
                "access-control-expose-headers": "ETag, Link, Location, Retry-After, X-GitHub-OTP, X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Used, X-RateLimit-Resource, X-RateLimit-Reset, X-OAuth-Scopes, X-Accepted-OAuth-Scopes, X-Poll-Interval, X-GitHub-Media-Type, X-GitHub-SSO, X-GitHub-Request-Id, Deprecation, Sunset",
                "content-length": "260",
                "content-security-policy": "default-src 'none'",
                "content-type": "application/json; charset=utf-8",
                "date": "Fri, 06 Mar 2026 10:07:02 GMT",
                "referrer-policy": "origin-when-cross-origin, strict-origin-when-cross-origin",
                "server": "github.com",
                "strict-transport-security": "max-age=31536000; includeSubdomains; preload",
                "vary": "Accept-Encoding, Accept, X-Requested-With",
                "x-accepted-github-permissions": "administration=write",
                "x-content-type-options": "nosniff",
                "x-frame-options": "deny",
                "x-github-api-version-selected": "2022-11-28",
                "x-github-media-type": "github.v3; format=json",
                "x-github-request-id": "E8C2:1597F:2CF110:372C21:69AAA746",
                "x-ratelimit-limit": "15000",
                "x-ratelimit-remaining": "14994",
                "x-ratelimit-reset": "1772795053",
                "x-ratelimit-resource": "core",
                "x-ratelimit-used": "6",
                "x-xss-protection": "0"
            },
            "data": {
                "message": "Bad request - Runner ubuntu-2404-x64_i-0c86dff9c4dfb59fc is currently running a job and cannot be deleted.",
                "documentation_url": "https://docs.github.com/rest/actions/self-hosted-runners#delete-a-self-hosted-runner-from-a-repository",
                "status": "422"
            }
        }
    }
}

I would suggest adding a retry mechanism with exponential backoff (possibly configurable).
On another note, I also see "Received spot notification for undefined" in the logs. Are you also seeing undefined in your logs?
The rest looks great, looking forward to testing this again 🚀

@jensenbox

Hey @Brend-Smits, thanks for testing and the detailed report!

I've pushed a fix (653fd67) that addresses both issues:

1. Runner busy 422 — retry with exponential backoff

Added deleteRunnerWithRetry() that catches the specific 422 "currently running a job" error and retries up to 5 times with exponential backoff (1s → 2s → 4s → 8s → 16s). Non-422 errors are not retried and still fail gracefully. Each retry attempt is logged at WARN level so you can observe the behavior:

WARN: Runner is currently running a job, retrying after delay
  { instanceId, runnerId, runnerName, owner, attempt: 1, maxRetries: 5, delayMs: 1000 }
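
A minimal sketch of this retry behaviour is shown below. Only the name `deleteRunnerWithRetry`, the 422 condition, the retry count, and the delay schedule come from the description above; the signature, the injected `sleep`, and the helpers `isRunnerBusy` and `backoffDelayMs` are assumptions made for illustration.

```typescript
// Hypothetical error shape; Octokit's HttpError carries a numeric status and a message.
interface HttpLikeError {
  status?: number;
  message?: string;
}

// Only the "runner is busy" 422 is treated as retryable.
function isRunnerBusy(err: unknown): boolean {
  const e = err as HttpLikeError | null | undefined;
  return e?.status === 422 && (e.message ?? "").includes("currently running a job");
}

// 1s, 2s, 4s, 8s, 16s for retry attempts 1..5 with a 1000 ms base.
function backoffDelayMs(attempt: number, baseMs = 1000): number {
  return baseMs * 2 ** (attempt - 1);
}

async function deleteRunnerWithRetry(
  deleteRunner: () => Promise<void>,
  maxRetries = 5,
  sleep: (ms: number) => Promise<void> = (ms) => new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<void> {
  for (let attempt = 1; ; attempt++) {
    try {
      await deleteRunner();
      return;
    } catch (err) {
      // Non-busy errors and exhausted retries are rethrown for the caller to log.
      if (!isRunnerBusy(err) || attempt > maxRetries) throw err;
      const delayMs = backoffDelayMs(attempt);
      console.warn("Runner is currently running a job, retrying after delay", { attempt, maxRetries, delayMs });
      await sleep(delayMs);
    }
  }
}
```

With the defaults this waits up to 31 seconds in total across the five retries, so the Lambda timeout needs to allow for that.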

2. "Received spot notification for undefined"

Yes, we were seeing this too! This happens when metrics are disabled (ENABLE_METRICS_SPOT_WARNING=false / ENABLE_METRICS_SPOT_TERMINATION=false) — the metricName is passed as undefined and gets interpolated into the log string. Fixed so the log now reads "Received spot notification" when no metric name is set.
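
The fix amounts to guarding the interpolation, roughly as in this sketch. `notificationMessage` is a hypothetical helper name; the actual change may simply inline the condition.

```typescript
// Builds the log line without interpolating an undefined metric name.
function notificationMessage(metricName?: string): string {
  return metricName ? `Received spot notification for ${metricName}` : "Received spot notification";
}
```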

We've been running this feature in our production environment (closient) and confirmed both issues in our CloudWatch logs. All 47 tests pass including 3 new tests for the retry logic.

Let us know how retesting goes!

jensenbox and others added 3 commits March 19, 2026 23:02
When EC2 instances running GitHub Actions runners terminate (spot
interruption, scale-down), the runner stays registered as "offline"
in GitHub. This extends the termination-watcher Lambda to deregister
runners via the GitHub API, catching all termination causes.

Lambda changes:
- New deregister.ts with GitHub App auth, runner lookup, and deletion
- ConfigResolver adds enableRunnerDeregistration and ghesApiUrl
- Both termination.ts and termination-warning.ts call deregister
- Dependencies: @octokit/auth-app, @octokit/rest, @aws-github-runner/aws-ssm-util

Terraform changes:
- termination-watcher module: new env vars, conditional SSM IAM policy
- multi-runner module: wire github_app_parameters through, add
  enable_runner_deregistration variable (defaults to true)

Feature-flagged via ENABLE_RUNNER_DEREGISTRATION env var (default false
at module level, true in multi-runner). Deregistration failures are
caught and logged without breaking existing metric functionality.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

The root (single-runner) module also uses termination-watcher but wasn't
wiring github_app_parameters through. Add enable_runner_deregistration,
github_app_parameters, and ghes_url to the root module's termination
watcher config, matching the multi-runner changes.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Include pre-built Lambda zip for use when referencing this fork branch
as a Terraform module source (no GitHub release available for the
download-lambda module to pull from).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

@jensenbox jensenbox force-pushed the deregister-runner-on-termination branch from 83eccbd to 03ce697 March 20, 2026 06:07

The existing spot-specific rules (BidEvictedEvent, Spot Interruption Warning)
only fire on AWS spot reclamations. Scale-down terminations and manual
terminations — the most common causes of stale runners — were not covered.

Add an EC2 Instance State-change Notification rule (state: shutting-down) that
catches ALL termination types. Reuses the same notification Lambda since both
event types have detail['instance-id']. Gated behind enable_runner_deregistration.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

@jensenbox jensenbox force-pushed the deregister-runner-on-termination branch from 03ce697 to db6a268 March 20, 2026 06:15

Development

Successfully merging this pull request may close these issues.

Deregister Runner Application when Spot Interruption signal is received

2 participants