feat(termination-watcher): deregister runners from GitHub on EC2 termination #5055
jensenbox wants to merge 4 commits into github-aws-runners:main
f731868 to a9ca792
Brend-Smits left a comment
Hey @jensenbox
This is a great addition, thanks a lot for your contribution.
After testing this together with @stuartp44, I ran into a problem when the termination watcher tried to deregister a runner. The error was as follows:
{
  "level": "ERROR",
  "message": "Failed to deregister runner from GitHub",
  "timestamp": "2026-03-06T10:07:02.489Z",
  "service": "spot-termination-notification",
  "sampling_rate": 0,
  "xray_trace_id": "1-69aaa741-3e6ab9e024a6cc5567e5f339",
  "region": "eu-west-1",
  "environment": "framework-dev",
  "module": "deregister",
  "aws-request-id": "87f61dc0-1c03-456a-9bf9-e5542558eac3",
  "function-name": "framework-dev-spot-termination-notification",
  "instanceId": "i-0c86dff9c4dfb59fc",
  "owner": "test-runners/multi-runner",
  "error": {
    "name": "HttpError",
    "location": "file:///var/task/index.js:95395",
    "message": "Bad request - Runner ubuntu-2404-x64_i-0c86dff9c4dfb59fc is currently running a job and cannot be deleted. - https://docs.github.com/rest/actions/self-hosted-runners#delete-a-self-hosted-runner-from-a-repository",
    "stack": "HttpError: Bad request - Runner ubuntu-2404-x64_i-0c86dff9c4dfb59fc is currently running a job and cannot be deleted. - https://docs.github.com/rest/actions/self-hosted-runners#delete-a-self-hosted-runner-from-a-repository\n at fetchWrapper (file:///var/task/index.js:95395:11)\n at process.processTicksAndRejections (node:internal/process/task_queues:103:5)\n at async Job.doExecute (file:///var/task/index.js:83521:18)",
    "status": 422,
    "request": {
      "method": "DELETE",
      "url": "https://api.github.com/repos/test-runners/multi-runner/actions/runners/50",
      "headers": {
        "accept": "application/vnd.github.v3+json",
        "user-agent": "github-aws-runners-termination-watcher octokit-rest.js/22.0.1 octokit-core.js/7.0.6 Node.js/24",
        "authorization": "token [REDACTED]"
      },
      "request": {}
    },
    "response": {
      "url": "https://api.github.com/repos/test-runners/multi-runner/actions/runners/50",
      "status": 422,
      "headers": {
        "access-control-allow-origin": "*",
        "access-control-expose-headers": "ETag, Link, Location, Retry-After, X-GitHub-OTP, X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Used, X-RateLimit-Resource, X-RateLimit-Reset, X-OAuth-Scopes, X-Accepted-OAuth-Scopes, X-Poll-Interval, X-GitHub-Media-Type, X-GitHub-SSO, X-GitHub-Request-Id, Deprecation, Sunset",
        "content-length": "260",
        "content-security-policy": "default-src 'none'",
        "content-type": "application/json; charset=utf-8",
        "date": "Fri, 06 Mar 2026 10:07:02 GMT",
        "referrer-policy": "origin-when-cross-origin, strict-origin-when-cross-origin",
        "server": "github.com",
        "strict-transport-security": "max-age=31536000; includeSubdomains; preload",
        "vary": "Accept-Encoding, Accept, X-Requested-With",
        "x-accepted-github-permissions": "administration=write",
        "x-content-type-options": "nosniff",
        "x-frame-options": "deny",
        "x-github-api-version-selected": "2022-11-28",
        "x-github-media-type": "github.v3; format=json",
        "x-github-request-id": "E8C2:1597F:2CF110:372C21:69AAA746",
        "x-ratelimit-limit": "15000",
        "x-ratelimit-remaining": "14994",
        "x-ratelimit-reset": "1772795053",
        "x-ratelimit-resource": "core",
        "x-ratelimit-used": "6",
        "x-xss-protection": "0"
      },
      "data": {
        "message": "Bad request - Runner ubuntu-2404-x64_i-0c86dff9c4dfb59fc is currently running a job and cannot be deleted.",
        "documentation_url": "https://docs.github.com/rest/actions/self-hosted-runners#delete-a-self-hosted-runner-from-a-repository",
        "status": "422"
      }
    }
  }
}
I would suggest adding some sort of retry mechanism with exponential backoff (which may be configurable).
On another note, I also see "Received spot notification for undefined" in the logs. Are you also seeing undefined in your logs?
The rest looks great, looking forward to testing this again 🚀
Hey @Brend-Smits, thanks for testing and the detailed report! I've pushed a fix (653fd67) that addresses both issues:
1. Runner busy 422: retry with exponential backoff
Added
2. "Received spot notification for undefined"
Yes, we were seeing this too! This happens when metrics are disabled (
We've been running this feature in our production environment (closient) and confirmed both issues in our CloudWatch logs. All 47 tests pass including 3 new tests for the retry logic. Let us know how retesting goes!
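For readers following the thread, the retry idea discussed above could look roughly like the sketch below. It is not the code from 653fd67: the names `isRunnerBusyError`, `backoffDelayMs`, `deleteWithRetry`, and the default attempt count and delays are all assumptions made for illustration. The only details taken from the log are the 422 status and the "currently running a job" message.

```typescript
// Hypothetical sketch: names and defaults are illustrative, not taken from the PR.

/** Minimal shape of the error in the log above (status plus message). */
interface HttpLikeError {
  status?: number;
  message?: string;
}

/** True when GitHub refused the delete because the runner is still executing a job. */
function isRunnerBusyError(err: unknown): boolean {
  const e = err as HttpLikeError | null;
  return !!e && e.status === 422 && /currently running a job/i.test(e.message ?? "");
}

/** Delay before retry number `attempt` (0-based): baseMs * 2^attempt, capped at maxMs. */
function backoffDelayMs(attempt: number, baseMs = 1000, maxMs = 30000): number {
  return Math.min(maxMs, baseMs * 2 ** attempt);
}

interface RetryOptions {
  maxAttempts?: number;
  baseMs?: number;
  maxMs?: number;
  /** Injectable so tests do not have to wait for real timers. */
  sleep?: (ms: number) => Promise<void>;
}

/** Calls `deleteRunner`, retrying only on the "runner busy" 422 response. */
async function deleteWithRetry(
  deleteRunner: () => Promise<void>,
  opts: RetryOptions = {},
): Promise<void> {
  const { maxAttempts = 5, baseMs = 1000, maxMs = 30000 } = opts;
  const sleep =
    opts.sleep ?? ((ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms)));

  for (let attempt = 0; ; attempt++) {
    try {
      await deleteRunner();
      return;
    } catch (err) {
      const lastAttempt = attempt >= maxAttempts - 1;
      // Any other error, or an exhausted budget, is passed on to the caller.
      if (!isRunnerBusyError(err) || lastAttempt) throw err;
      await sleep(backoffDelayMs(attempt, baseMs, maxMs));
    }
  }
}
```

With these defaults the waits are 1 s, 2 s, 4 s, and 8 s. Whatever values are chosen, the total has to fit inside the Lambda timeout, and for spot interruptions inside the two-minute warning window.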
bec1fc0 to 83eccbd
When EC2 instances running GitHub Actions runners terminate (spot interruption, scale-down), the runner stays registered as "offline" in GitHub. This extends the termination-watcher Lambda to deregister runners via the GitHub API, catching all termination causes.

Lambda changes:
- New deregister.ts with GitHub App auth, runner lookup, and deletion
- ConfigResolver adds enableRunnerDeregistration and ghesApiUrl
- Both termination.ts and termination-warning.ts call deregister
- Dependencies: @octokit/auth-app, @octokit/rest, @aws-github-runner/aws-ssm-util

Terraform changes:
- termination-watcher module: new env vars, conditional SSM IAM policy
- multi-runner module: wire github_app_parameters through, add enable_runner_deregistration variable (defaults to true)

Feature-flagged via ENABLE_RUNNER_DEREGISTRATION env var (default false at module level, true in multi-runner). Deregistration failures are caught and logged without breaking existing metric functionality.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
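The feature flag and the "failures are caught and logged" behaviour described in this commit message might be wired up along these lines. This is a sketch under stated assumptions, not the PR's ConfigResolver: `resolveConfig`, `deregisterSafely`, and the `WatcherConfig` shape are hypothetical names, and deriving `ghesApiUrl` by appending `/api/v3` to `GHES_URL` is an assumption based on where GitHub Enterprise Server serves its REST API.

```typescript
// Illustrative sketch only: function and type names are assumptions, not the PR's code.

interface WatcherConfig {
  enableRunnerDeregistration: boolean;
  ghesApiUrl: string; // empty string means "use github.com"
}

/** Reads the two environment variables named in the commit message. */
function resolveConfig(env: Record<string, string | undefined>): WatcherConfig {
  const ghesUrl = (env.GHES_URL ?? "").replace(/\/+$/, ""); // drop trailing slashes
  return {
    // Off unless explicitly set to "true", matching the module-level default of false.
    enableRunnerDeregistration:
      (env.ENABLE_RUNNER_DEREGISTRATION ?? "false").toLowerCase() === "true",
    // Assumption: the GHES REST API lives under /api/v3.
    ghesApiUrl: ghesUrl ? `${ghesUrl}/api/v3` : "",
  };
}

/** Runs deregistration only when enabled and never lets a failure escape to the caller. */
async function deregisterSafely(
  config: WatcherConfig,
  instanceId: string,
  deregister: (instanceId: string) => Promise<void>,
  logError: (message: string, error: unknown) => void,
): Promise<"skipped" | "ok" | "failed"> {
  if (!config.enableRunnerDeregistration) return "skipped";
  try {
    await deregister(instanceId);
    return "ok";
  } catch (error) {
    // Log and continue so the existing metric handling still runs.
    logError("Failed to deregister runner from GitHub", error);
    return "failed";
  }
}
```

Returning a status instead of throwing keeps the existing handlers unchanged apart from the extra call.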
The root (single-runner) module also uses termination-watcher but wasn't wiring github_app_parameters through. Add enable_runner_deregistration, github_app_parameters, and ghes_url to the root module's termination watcher config, matching the multi-runner changes.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Include pre-built Lambda zip for use when referencing this fork branch as a Terraform module source (no GitHub release available for the download-lambda module to pull from).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
83eccbd to 03ce697
The existing spot-specific rules (BidEvictedEvent, Spot Interruption Warning) only fire on AWS spot reclamations. Scale-down terminations and manual terminations, the most common causes of stale runners, were not covered.

Add an EC2 Instance State-change Notification rule (state: shutting-down) that catches ALL termination types. Reuses the same notification Lambda since both event types have detail['instance-id']. Gated behind enable_runner_deregistration.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
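To make the new rule concrete, the event pattern and the shared instance-id lookup can be written out as below. The pattern uses the real EventBridge source and detail-type for EC2 state changes; the TypeScript names (`stateChangePattern`, `InstanceEvent`, `matchesStateChangeRule`, `instanceIdFrom`) are hypothetical, and the actual rule in the PR is defined in Terraform, not in code.

```typescript
// Sketch only: in the PR the rule lives in Terraform; this mirrors it as a plain object.

/** EventBridge pattern for "an EC2 instance entered shutting-down". */
const stateChangePattern = {
  source: ["aws.ec2"],
  "detail-type": ["EC2 Instance State-change Notification"],
  detail: { state: ["shutting-down"] },
};

/** The subset of event fields this sketch relies on. */
interface InstanceEvent {
  source: string;
  "detail-type": string;
  detail: { "instance-id"?: string; state?: string };
}

/** Simplified local matcher, useful for checking sample events against the pattern. */
function matchesStateChangeRule(event: InstanceEvent): boolean {
  return (
    stateChangePattern.source.indexOf(event.source) !== -1 &&
    stateChangePattern["detail-type"].indexOf(event["detail-type"]) !== -1 &&
    event.detail.state !== undefined &&
    stateChangePattern.detail.state.indexOf(event.detail.state) !== -1
  );
}

/** Both the state-change event and the spot interruption warning expose detail['instance-id']. */
function instanceIdFrom(event: InstanceEvent): string | undefined {
  return event.detail["instance-id"];
}
```

Returning `undefined` instead of throwing lets the handler log and skip events that carry no instance ID.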
03ce697 to db6a268
Summary
Extends the existing termination-watcher Lambda to deregister GitHub Actions runners from GitHub when their EC2 instances terminate. This prevents stale "offline" runner entries from accumulating in the organization/repository — a long-standing issue (#804, #1006, #2939) affecting all users of the module.
How it works
- ghr:Owner and ghr:Type tags from the instance
What's included
Lambda changes:
- deregister.ts: GitHub API deregistration logic reusing the module's existing auth pattern (createAppAuth → installation token)
- termination.ts (BidEvictedEvent) and termination-warning.ts (Spot Interruption Warning)
- ConfigResolver.ts: adds enableRunnerDeregistration and ghesApiUrl config from env vars
Terraform changes:
- GetParameter IAM policy when deregistration is enabled
- PARAMETER_GITHUB_APP_ID_NAME, PARAMETER_GITHUB_APP_KEY_BASE64_NAME, ENABLE_RUNNER_DEREGISTRATION, and GHES_URL environment variables to both Lambda functions
- EC2 Instance State-change Notification EventBridge rule (state: shutting-down) that catches all termination types, not just spot-specific events. This covers scale-down, manual termination, ASG termination, and spot reclamation.
New variables on instance_termination_watcher:
- enable_runner_deregistration (bool, default false)
Design decisions
- enable_runner_deregistration = true.
- @octokit/auth-app + SSM approach used by the control-plane Lambda.
- detail['instance-id'].
- ghr:Type tag to determine the correct API endpoint.
- ghes_url variable for GitHub Enterprise Server deployments.
Testing
- deregister.test.ts
Fixes #804
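As an illustration of how the ghr:Type and ghr:Owner tags could select the API endpoint, here is a minimal sketch. The two REST paths are GitHub's documented self-hosted runner endpoints. The tag values "Org" and "Repo", the helper names, and matching a runner by the instance ID embedded in its name (as in ubuntu-2404-x64_i-0c86dff9c4dfb59fc from the log earlier in this thread) are assumptions, not confirmed details of deregister.ts.

```typescript
// Hypothetical helpers: names and tag values are assumptions for illustration.

type RunnerType = "Org" | "Repo";

interface RunnerSummary {
  id: number;
  name: string;
}

/** Base path for listing runners; `owner` is "org" for Org and "owner/repo" for Repo. */
function runnersPath(type: RunnerType, owner: string): string {
  return type === "Org" ? `/orgs/${owner}/actions/runners` : `/repos/${owner}/actions/runners`;
}

/** Path for DELETE on a single runner. */
function deleteRunnerPath(type: RunnerType, owner: string, runnerId: number): string {
  return `${runnersPath(type, owner)}/${runnerId}`;
}

/** Finds the runner whose name contains the EC2 instance ID, if any. */
function findRunnerId(runners: RunnerSummary[], instanceId: string): number | undefined {
  for (const runner of runners) {
    if (runner.name.indexOf(instanceId) !== -1) return runner.id;
  }
  return undefined;
}
```

For the instance in the error report above, `deleteRunnerPath("Repo", "test-runners/multi-runner", 50)` yields the same path as the failing DELETE request in the log.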