feat: retry + independent cleanup in stop (Phase 5) #23
Merged
Closes #11. Part of plan #15.

## Scope

Add a small retry helper and wrap the two cleanup paths in `stop()` with it. Make `stop()` attempt both cleanups independently so a GitHub-side failure doesn't prevent EC2 termination and vice versa.

## New module: `src/retry.js`

`withRetry(step, fn, { attempts, baseMs, maxMs })`

- Default: 3 attempts, 2 s base delay, doubled per retry, capped at 10 s. Worst-case total wait: 2 s + 4 s + 8 s = 14 s before giving up.
- Logs each retry via `log.warn()` with the step name suffixed `_retry`, so the Actions run log shows the attempt sequence.
- On final failure, emits `log.error` with `exhausted: true` and re-throws the last error.

## Call sites

- `gh.removeRunner()`: the `DELETE /runners/{id}` request, wrapped in `withRetry('remove_runner', ...)`. Idempotent: deleting a runner that's already gone returns 404, which is a permanent failure (not transient), but the existing get-runner-first path filters that out before the DELETE.
- `aws.terminateEc2Instance()`: the `TerminateInstances` command, wrapped in `withRetry('terminate_instance', ...)`. Idempotent: terminating an already-terminated instance is a no-op per EC2.
- The `start()` path is deliberately not wrapped. `RunInstances` is not idempotent without a `ClientToken` (billing implication), and retrying the runner-version-specific AMI lookup makes no sense. If start fails, the whole action fails; the consumer's workflow fallback handles retry at the workflow level.

## stop() independence

Before: `await A; await B`. If A throws, B never runs. After: both are wrapped in try/catch, failures are accumulated into a list, and at the end, if any failed, one Error is thrown summarizing both. So the EC2 instance gets terminated even when GitHub's API is temporarily flaky, and vice versa.

## Tests

`tests/retry.test.js`, 4 cases:

- Resolves on first-try success.
- Retries on rejection and returns the eventual success value.
- Exhausts attempts and re-throws after emitting warn + error logs.
- Backoff caps at `maxMs` (verified by inspecting the delay values emitted in warn logs).

Total: 30 → 34 tests.

## Consumer impact

Transparent. Default behavior matches what the old code did for the happy path. The retry / independent-cleanup changes only affect transient-failure flows that previously surfaced as "stop-runner failed" jobs.

Signed-off-by: yuriyryabikov <22548029+kurok@users.noreply.github.com>
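For illustration, a minimal sketch of what `withRetry` could look like under the semantics described above; this is an assumption, not the merged `src/retry.js`. The description's 14 s worst case (2 + 4 + 8) matches three waits, so this sketch treats `attempts` as the retry count after the initial try, and `log` stands in for the Phase 7 structured logger:

```javascript
// Hypothetical sketch of withRetry() based on the PR description -- not the
// merged src/retry.js. `log` stands in for the structured logger from Phase 7.
const log = {
  warn: (step, fields) => console.warn(step, JSON.stringify(fields)),
  error: (step, fields) => console.error(step, JSON.stringify(fields)),
};

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function withRetry(step, fn, { attempts = 3, baseMs = 2000, maxMs = 10000 } = {}) {
  let lastErr;
  // Initial try plus `attempts` retries; with the defaults the waits are
  // 2 s, 4 s, 8 s (each capped at maxMs), i.e. the 14 s worst-case total.
  for (let retry = 0; retry <= attempts; retry += 1) {
    try {
      return await fn();
    } catch (err) {
      lastErr = err;
      if (retry === attempts) break; // out of retries
      const delayMs = Math.min(baseMs * 2 ** retry, maxMs);
      log.warn(`${step}_retry`, { attempt: retry + 1, delayMs });
      await sleep(delayMs);
    }
  }
  log.error(step, { exhausted: true });
  throw lastErr;
}
```

With `baseMs` lowered, the same helper is cheap to exercise in unit tests.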
kurok added a commit to namecheap/terraform-provider-namecheap that referenced this pull request on Apr 21, 2026:
…cleanup) (#186)

namecheap/ec2-github-runner#23 merged. Phase 5 adds exponential-backoff retry on removeRunner() + terminateEc2Instance(), and makes stop() attempt both cleanups independently (a GitHub-side failure no longer prevents EC2 termination). Default behavior on the happy path is unchanged. Rotation: b1b8d6d (Phase 7, logging) -> 46cf1d0 (Phase 5).

Signed-off-by: yuriyryabikov <22548029+kurok@users.noreply.github.com>
Closes #11. Part of plan #15. Builds on #22's structured logging.
## Scope

Small retry helper; wraps the two cleanup paths in `stop()` and makes `stop()` attempt both cleanups independently.

## New module: `src/retry.js`

Exponential backoff, worst-case total wait 14 s (2 + 4 + 8). Every retry emits `log.warn` with the step name suffixed `_retry`; final failure emits `log.error` with `exhausted: true`. Callers pass only idempotent operations (`DELETE /runners/{id}` and `TerminateInstances`).

## Call sites wrapped

- `gh.removeRunner()`: the DELETE request.
- `aws.terminateEc2Instance()`: the TerminateInstances command.

## Call sites deliberately not wrapped

- `aws.startEc2Instance()` / `RunInstances`: not idempotent without a ClientToken; retrying on its own has billing implications.
- `aws.waitUntilInstanceRunning`: already has `maxWaitTime: 300`.
- `gh.getRegistrationToken`: transient network failure here is rare, and retrying burns registration tokens.
- `gh.waitForRunnerRegistered`: already has a 5-minute timeout + 10-second polling loop.

## Independent cleanup in stop()

Before: `await A; await B`. If A throws, B never runs.

After: both wrapped in try/catch, failures accumulated, summary Error thrown at the end. The EC2 instance gets terminated even when GitHub's API is temporarily flaky, and vice versa. This is the most important bit for production: a GitHub outage can no longer leave orphan EC2 instances billing against us.
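A minimal sketch of that independent-cleanup pattern, with hypothetical stand-ins for `gh.removeRunner()` and `aws.terminateEc2Instance()` passed in as arguments (the real `stop()` may differ in details):

```javascript
// Sketch of the independent-cleanup pattern described above. The two cleanup
// functions are hypothetical stand-ins for gh.removeRunner() and
// aws.terminateEc2Instance().
async function stopWithIndependentCleanup(removeRunner, terminateInstance) {
  const failures = [];

  try {
    await removeRunner();
  } catch (err) {
    failures.push(`remove_runner: ${err.message}`);
  }

  // Runs even if the GitHub-side cleanup above threw.
  try {
    await terminateInstance();
  } catch (err) {
    failures.push(`terminate_instance: ${err.message}`);
  }

  // One summary Error at the end if anything failed.
  if (failures.length > 0) {
    throw new Error(`stop() cleanup failed: ${failures.join('; ')}`);
  }
}
```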
## Tests
4 new cases in `tests/retry.test.js`:

- Resolves on first-try success.
- Retries on rejection and returns the eventual success value.
- Exhausts attempts and re-throws after emitting warn + error logs.
- Backoff caps at `maxMs` (inspected via delay values in warn payloads).

Total: 30 → 34 tests.
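The `maxMs`-cap case can be illustrated without a test framework by capturing the delay values the helper would log. The inline `withRetry` here is a self-contained stand-in re-sketched from the description above, not the real module:

```javascript
// Illustration of the maxMs-cap test idea: record each computed delay (the
// value the real code would put in the warn payload) and check the cap.
// This withRetry is a hypothetical stand-in, not the merged src/retry.js.
const warnDelays = [];
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function withRetry(step, fn, { attempts = 3, baseMs = 2000, maxMs = 10000 } = {}) {
  let lastErr;
  for (let retry = 0; retry <= attempts; retry += 1) {
    try {
      return await fn();
    } catch (err) {
      lastErr = err;
      if (retry === attempts) break;
      const delayMs = Math.min(baseMs * 2 ** retry, maxMs);
      warnDelays.push(delayMs); // stand-in for log.warn(`${step}_retry`, { delayMs })
      await sleep(delayMs);
    }
  }
  throw lastErr;
}

// Always fail so every retry fires; tiny delays keep the check fast.
const alwaysFail = async () => { throw new Error('boom'); };

withRetry('cap_check', alwaysFail, { attempts: 3, baseMs: 1, maxMs: 2 })
  .catch(() => {
    // Uncapped the sequence would be 1, 2, 4; the cap clamps the last to 2.
    console.log(warnDelays.join(',')); // prints "1,2,2"
  });
```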
## Consumer impact
Default behavior matches pre-change for the happy path. Retry + independence only affect transient-failure flows. Log traffic on a successful run is unchanged from Phase 7.
## Dogfood
Low risk: changes only error-recovery paths. Provider rotation after this merges, same pattern as #22 (which just went green on machulav#185).