Phase 5: lifecycle & cleanup reliability — retries, timeouts, always-cleanup #11

@kurok

Description

Part of plan #15. Phase 5 — Lifecycle & Cleanup Reliability.

Problem

Current stop-runner path is best-effort:

  • removeRunner() calls the GitHub API once. If it 500s or times out, the runner stays registered in GitHub (visible in Settings → Actions → Runners indefinitely).
  • terminateEc2Instance() calls EC2 TerminateInstances once. If the AWS call times out, the instance keeps running (billing).
  • Neither waitForInstanceRunning nor waitForRunnerRegistered has an explicit timeout; a stuck call could hang the job.

Phase 4's --ephemeral mitigates the "stale runner" case via GitHub-side auto-deregistration, but that covers only one of the two cleanup paths: it does nothing for an EC2 instance that fails to terminate. Explicit retries on both paths are still the defense-in-depth answer.

Target

  • Retry removeRunner() with exponential backoff (3 attempts, base 2s, max 10s).
  • Retry terminateEc2Instance() with same policy.
  • Bounded timeout on waitForRunnerRegistered (default 5 min; input-overridable).
  • Bounded timeout on waitForInstanceRunning (default 5 min).
  • On mode: stop, attempt both cleanups even if one throws — do not let a GitHub API failure prevent EC2 termination, or vice versa.
  • Structured log line on every attempt so the Actions run summary shows what was tried.
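
The two bounded-timeout bullets above could be met with a polling loop that carries a hard deadline. This is a minimal sketch, not the action's actual code: `pollUntil`, `raceDeadline`, `TimeoutError`, and the `label` / `timeoutMs` / `intervalMs` options are hypothetical names introduced for illustration, and the 10-second poll interval is an assumed value.

```javascript
// Hypothetical error type so callers can tell a timeout from other failures.
class TimeoutError extends Error {
  constructor(message) {
    super(message);
    this.name = 'TimeoutError';
  }
}

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Reject if `promise` has not settled within `ms`, so a stuck call cannot hang the job.
function raceDeadline(promise, ms, message) {
  let timer;
  const expired = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new TimeoutError(message)), ms);
  });
  return Promise.race([promise, expired]).finally(() => clearTimeout(timer));
}

// Poll `check` until it returns a truthy value or the overall deadline passes.
// The 5-minute default matches the Target list; the interval is an assumption.
async function pollUntil(check, { label = 'wait', timeoutMs = 5 * 60 * 1000, intervalMs = 10000 } = {}) {
  const deadline = Date.now() + timeoutMs;
  const message = `${label} timed out after ${timeoutMs} ms`;
  for (;;) {
    const remaining = deadline - Date.now();
    if (remaining <= 0) throw new TimeoutError(message);
    // Each individual check is also bounded by the time left before the deadline.
    const result = await raceDeadline(Promise.resolve(check()), remaining, message);
    if (result) return result;
    await sleep(Math.min(intervalMs, Math.max(0, deadline - Date.now())));
  }
}
```

Under these assumptions, waitForRunnerRegistered would wrap its registration check in `pollUntil` and take `timeoutMs` from the proposed github-timeout-seconds input; waitForInstanceRunning would do the same with aws-timeout-seconds.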

Pseudocode shape

async function stop() {
  // Collect failures instead of throwing, so one failed cleanup never skips the other.
  const errors = [];
  try { await withRetry(() => gh.removeRunner(), { attempts: 3, backoff: 2000 }); }
  catch (e) { errors.push(['gh.removeRunner', e]); }

  try { await withRetry(() => aws.terminateEc2Instance(), { attempts: 3, backoff: 2000 }); }
  catch (e) { errors.push(['aws.terminateEc2Instance', e]); }

  // Report every failure, then fail the step once both cleanups have been attempted.
  if (errors.length) {
    for (const [where, err] of errors) core.error(`${where}: ${err?.message ?? err}`);
    core.setFailed(`stop mode completed with ${errors.length} cleanup failure(s)`);
  }
}
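
The pseudocode calls a `withRetry` helper that is not defined in this issue. One possible shape is sketched below, assuming the policy from the Target list (3 attempts, base 2s, max 10s) and one JSON log line per attempt. The `maxBackoff`, `label`, `log`, and `wait` options are hypothetical additions for illustration; only `attempts` and `backoff` appear in the pseudocode.

```javascript
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Retry `fn` with exponential backoff: backoff, 2*backoff, 4*backoff, ... capped at maxBackoff.
// `log` and `wait` are injectable so tests can capture output and skip real delays.
async function withRetry(fn, {
  attempts = 3,
  backoff = 2000,
  maxBackoff = 10000,
  label = 'operation',
  log = console.log,
  wait = sleep,
} = {}) {
  let lastError;
  for (let attempt = 1; attempt <= attempts; attempt++) {
    try {
      const result = await fn();
      log(JSON.stringify({ op: label, attempt, attempts, outcome: 'ok' }));
      return result;
    } catch (err) {
      lastError = err;
      const isLast = attempt === attempts;
      const retryInMs = isLast ? 0 : Math.min(backoff * 2 ** (attempt - 1), maxBackoff);
      // One structured line per failed attempt, including when the next try will happen.
      log(JSON.stringify({
        op: label,
        attempt,
        attempts,
        outcome: 'error',
        error: String((err && err.message) || err),
        retryInMs,
      }));
      if (!isLast) await wait(retryInMs);
    }
  }
  // All attempts failed: surface the last error so stop() can record it.
  throw lastError;
}
```

With 3 attempts and a 2s base, the waits are 2s and 4s, so the 10s cap only takes effect if the attempt count is raised. In the action itself, `log` would presumably be `core.info` from @actions/core rather than `console.log`.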

Compatibility with consumers

Fully transparent improvement. Consumers today already guard stop-runner with if: always() && ... so the step runs on acceptance-test failure; the retry + bounded timeout makes that guard more reliable.

Acceptance criteria

  • stop() attempts both cleanups independently; neither short-circuits the other.
  • 3-attempt exponential backoff on both AWS and GitHub calls.
  • New inputs aws-timeout-seconds and github-timeout-seconds (optional, with sane defaults).
  • Structured log lines for every attempt, visible in the Actions run summary.
  • Unit test: inject a failing first attempt; verify the second succeeds.
