Skip to content

Feat/al2023 support#28

Draft
dm1tr1yvovk wants to merge 26 commits into
mainfrom
feat/al2023-support
Draft

Feat/al2023 support#28
dm1tr1yvovk wants to merge 26 commits into
mainfrom
feat/al2023-support

Conversation

@dm1tr1yvovk
Copy link
Copy Markdown

No description provided.

Vitaliy Sidorenko and others added 26 commits November 28, 2024 17:27
…or AL2023

Signed-off-by: Vitaliy Sidorenko <vitaliy.sidorenko@namecheap.com>
Signed-off-by: Vitaliy Sidorenko <vitaliy.sidorenko@namecheap.com>
Signed-off-by: Vitaliy Sidorenko <vitaliy.sidorenko@namecheap.com>
Signed-off-by: Vitaliy Sidorenko <vitaliy.sidorenko@namecheap.com>
Signed-off-by: Vitaliy Sidorenko <vitaliy.sidorenko@namecheap.com>
The user-data bootstrap hardcoded v2.321.0, whose externals/ directory
ships only node16 and node20. Any consumer trying to use a JS action
that declares 'runs: using: node24' (actions/checkout@v5+, many other
updated actions) fails on runners started via this action.

v2.333.1 is the current stable GA runner release (2026-03-27) and
bundles externals/node20 + externals/node24. No AMI change is needed
because the runner brings its own Node binaries.

Rebuilt dist/index.js with NODE_OPTIONS=--openssl-legacy-provider
npm run package (the pinned @vercel/ncc 0.25.1 doesn't work with
modern OpenSSL without that flag). The only diff inside dist/ is the
two URL/filename lines that mirror the src change.

Signed-off-by: yuriyryabikov <22548029+kurok@users.noreply.github.com>
The existing pr.yml only ran 'npm run lint' and only on PRs targeting
main, so PRs against feature branches had zero CI coverage.

Two new jobs:

- verify-dist: runs 'npm run package' and fails if the committed
  dist/index.js disagrees with a fresh rebuild from src/. Catches
  the 'forgot to rebuild dist' class of bug permanently so future
  runner-version bumps can't ship with a stale bundle.

- verify-runner-url: extracts the pinned actions/runner version
  from src/aws.js and HEADs the corresponding release asset. Catches
  typos in the version string and release-asset naming drift before
  a bad AMI start wastes an EC2 lifecycle.

Also drops the 'branches: [main]' filter so PRs against any branch
(feat/al2023-support in particular) get linted and verified. The
existing lint-code job is left untouched to keep the diff minimal.

Signed-off-by: yuriyryabikov <22548029+kurok@users.noreply.github.com>
…code

GitHub started auto-failing workflows using actions/cache@v2 in
December 2025 (deprecation notice). actions/checkout@v2 is in the
same bucket. With this PR removing the 'branches: [main]' filter on
pr.yml, the lint-code job now runs on every PR — so the red-flag
deprecation caught by the runner causes every new PR to fail until
these two pins are rotated forward.

Minimal change: v2 → v4 on both uses: lines; everything else in the
job stays the same.

Signed-off-by: yuriyryabikov <22548029+kurok@users.noreply.github.com>
feat: bump actions/runner to v2.333.1 for node24 support
The action.yml was still declaring 'using: node12', the oldest Node
runtime GitHub ever supported. GitHub has been auto-shimming forward
to node16 / node20 via ACTIONS_ALLOW_USE_UNSECURE_NODE_VERSION for
years; the current runner (v2.333.1, per #3) surfaces the warning:

  Node.js 20 actions are deprecated. Actions will be forced to run
  with Node.js 24 by default starting June 2nd, 2026.

Declaring node24 explicitly makes the runtime match what the runner
already ships (externals/node24) and removes the deprecation warning
from every consumer's CI page.

Compatibility:
- dist/index.js reviewed for Node-22+ removals (util.isArray,
  createCipher, etc.) — none present.
- Two 'new Buffer(...)' occurrences: one inside a lodash docstring
  comment (not executed), one in tunnel-agent's proxy-auth path
  (we never pass proxyAuth). 'new Buffer' emits DEP0005 on node24
  but still works; no runtime break.
- Verified dist is reproducible: NODE_OPTIONS=--openssl-legacy-provider
  npm run package produces byte-identical dist/index.js.

Consumers pinning this action by SHA (e.g. namecheap/terraform-
provider-namecheap) will rotate to the new commit in a follow-up.

Signed-off-by: yuriyryabikov <22548029+kurok@users.noreply.github.com>
feat: declare action runtime as node24
The bundled @actions/core v1.2.6 still emits the deprecated
'::set-output name=X::Y' workflow command, which GitHub has warned
about since 2022-10 and has announced will be disabled outright:

  The `set-output` command is deprecated and will be disabled soon.
  Please upgrade to using Environment Files.

Two viable fixes:
- Bump @actions/core to >= 1.10.0 (the version that switched to
  GITHUB_OUTPUT internally). Tried; transitive deps pull modern JS
  (private class fields) that @vercel/ncc 0.25.1 can't parse, which
  would force a separate ncc major bump with much larger dist churn.
- Bypass core.setOutput entirely and write directly to the
  GITHUB_OUTPUT file. Modern runners always set that env var.

Going with the second option. 11-line change in src/index.js with
an equivalent 11-line diff in the rebuilt dist/index.js. No
dependency updates; ncc rebuild is reproducible.

Fallback path through core.setOutput() retained for the unlikely
case that GITHUB_OUTPUT isn't set (pre-2022 self-hosted runner or
local dry-run). Output values (label, ec2InstanceId) never contain
newlines, so the plain 'key=value\n' format is safe.

Signed-off-by: yuriyryabikov <22548029+kurok@users.noreply.github.com>
fix: write outputs to GITHUB_OUTPUT file instead of ::set-output
The bundled aws-sdk v2 (^2.809.0 resolved to 2.814.0) uses url.parse()
internally for endpoint URL parsing. On node24 that triggers DEP0169
on every action invocation, surfacing as a warning on every consumer's
CI page:

  (node:PID) [DEP0169] DeprecationWarning: `url.parse()` behavior
  is not standardized and prone to errors that have security
  implications. Use the WHATWG URL API instead.

Three call sites verified in the bundled dist/: AWS.CloudFront's
getRtmpUrl, AWS.CloudFront.Signer.getSignedUrl, and AWS.util.urlParse.
Only the last runs at runtime (every EC2 API call parses its
endpoint). Upgrading aws-sdk to latest v2.1693 does not fix this —
the v2 SDK is in maintenance mode.

Migrating to aws-sdk v3 would rewrite src/aws.js entirely, far
beyond the scope of silencing a benign deprecation for trusted
EC2 endpoint URLs we control.

Instead, wrap process.emitWarning once at the top of src/index.js
and drop only the DEP0169 code. Every other warning — including
future new deprecations, user-installed 'warning' listeners, and
Node's default stderr formatter — is preserved by delegating to
the original emitWarning. Handles both call shapes:

  process.emitWarning(msg, type, code, ctor)    // positional
  process.emitWarning(msg, { code, type })      // options object

Smoke-tested locally with three call patterns:
- DEP0169 (filtered, silent)
- DEP0123 options form (passed through, Node's default format)
- Plain 2-arg Warning (passed through)

Signed-off-by: yuriyryabikov <22548029+kurok@users.noreply.github.com>
fix: silence DEP0169 url.parse deprecation from bundled aws-sdk v2
Scope note: this PR covers Phase 8 (issue #14) only for src/utils.js
and src/config.js. Tests for src/aws.js, src/gh.js, and src/index.js
are deferred to Phase 8.b, to be filed as a follow-up issue after
Phases 1, 4, and 5 rewrite those files — testing them now against
the aws-sdk v2 shape would be thrown away mid-rewrite.

What landed:

- tests/utils.test.js — sortByCreationDate covering sort order,
  in-place mutation, single-image, empty-array, and equal-date
  cases (5 tests).

- tests/config.test.js — Config constructor validation and outputs.
  The module-scope singleton pattern means every test case calls
  loadConfig(inputs) which jest.resetModules() + mocks
  @actions/core / @actions/github + re-requires. Validation
  failures don't throw — the try/catch around 'new Config()'
  routes them to core.setFailed(), so tests assert against that
  mock's .calls array via expectValidationFailure() helper.
  Covers start-mode happy path, tag specifications logic,
  stop-mode happy path, six validation-failure modes, and
  generateUniqueLabel output shape (16 tests).

Tooling:

- jest 29.7.0 added as devDependency (pinned exact).
- 'npm test' script wired in package.json.
- 'unit-tests' job added to .github/workflows/pr.yml alongside the
  existing lint-code / verify-dist / verify-runner-url jobs.

Verified locally: 'npm test' reports 'Test Suites: 2 passed,
Tests: 21 passed' in ~0.3 s.

Signed-off-by: yuriyryabikov <22548029+kurok@users.noreply.github.com>
* feat: migrate aws-sdk v2 to @aws-sdk/client-ec2 v3 (Phase 1)

Completes Phase 1 (issue #7) of the modernization plan.

## Dependency changes

- Remove 'aws-sdk' ^2.809.0 (in maintenance mode since late 2024;
  end-of-support announced for Nov 2025; emits DEP0169 on url.parse()
  in modern Node).
- Add '@aws-sdk/client-ec2' 3.1033.0 pinned exact. Per-service package
  is dramatically smaller than v2's monolithic bundle.
- Bump '@vercel/ncc' 0.25.1 -> 0.38.4. The old ncc couldn't parse
  modern JS (private class fields in v3's transitive deps); 0.38 is
  webpack-5-based and handles current syntax.

## Code changes (src/aws.js)

Rewrite on the EC2Client + Command pattern:

- EC2Client({}) replaces 'new AWS.EC2()'. Reads region + creds from
  env (same behavior, same source — configure-aws-credentials or
  instance profile).
- client.send(new DescribeImagesCommand(params)) replaces
  ec2.describeImages(params).promise().
- client.send(new RunInstancesCommand(params)) replaces
  ec2.runInstances(params).promise().
- client.send(new TerminateInstancesCommand(params)) replaces
  ec2.terminateInstances(params).promise().
- client.send(new AssociateAddressCommand(params)) replaces
  ec2.associateAddress(params).promise().
- waitUntilInstanceRunning({client, maxWaitTime: 300}, {InstanceIds})
  replaces ec2.waitFor('instanceRunning', ...).promise().

External action contract (inputs + outputs) is unchanged. Consumer
workflows (notably terraform-provider-namecheap) do not need any
change beyond rotating their SHA pin.

## Bundle + CI

- 'npm run package' no longer needs NODE_OPTIONS=--openssl-legacy-provider
  (ncc 0.38 + webpack 5 don't use the legacy module-hash MD4 path).
- dist/ now contains code-split chunk files (136.index.js, 360.index.js,
  etc.) alongside dist/index.js. All must be committed; the verify-dist
  CI check in pr.yml is broadened to diff the whole dist/ tree.
- Bundle size: 7.9 MB -> 3.4 MB main (+ ~3.3 MB in chunks). Net smaller
  than v2.

## Backward compatibility

The DEP0169 process.emitWarning filter added in #6 stays in place.
The v3 bundle doesn't emit DEP0169, so the filter is effectively inert
on the current tip — cleanup is a follow-up, not part of this PR's scope.

## Verification

- npm test: 21 tests pass across tests/utils.test.js + tests/config.test.js
  (no aws.js tests yet; those land as Phase 8.b after Phase 4/5 stop
  rewriting aws.js).
- npm run lint: clean.
- Bundle builds cleanly on Node 20 without OpenSSL legacy provider.

Signed-off-by: yuriyryabikov <22548029+kurok@users.noreply.github.com>

* chore: add .gitattributes to normalize line endings in dist/

ncc 0.38's output contains 451 source-embedded CR bytes inside string
literals (aws-sdk transitive deps). When dist/ is committed through
git's default line-ending normalization, those CRs are stripped into
the blob, but every subsequent 'npm run package' reproduces them —
creating a permanent, symmetric 451/451 diff that the verify-dist CI
gate correctly flagged as drift.

Mark the whole dist/ tree as binary via .gitattributes so git never
converts line endings in that path. What ncc writes is what git
stores; CI's rebuild produces byte-identical output.

Signed-off-by: yuriyryabikov <22548029+kurok@users.noreply.github.com>

---------

Signed-off-by: yuriyryabikov <22548029+kurok@users.noreply.github.com>
…sion (Phase 4) (#18)

Closes #10. Biggest compatibility risk in the modernization plan,
called out in the #15 tracker as needing a provider-repo dogfood
before landing.

## Bootstrap rewrite

The EC2 user-data now:

- set -euo pipefail throughout — a silent useradd / tar / sha256sum
  failure kills the bootstrap instead of proceeding to a broken
  ./run.sh.
- Creates a dedicated 'runner' user (idempotent — skipped if it
  already exists, so re-runs from a crash-loop don't explode).
- Drops to that user via 'sudo -u runner -H bash <<RUNNER_BOOTSTRAP'
  for every subsequent step. The old 'export RUNNER_ALLOW_RUNASROOT=1'
  escape hatch is gone.
- Fetches the runner tarball and SHA-256-verifies it against
  actions/runner's published '.sha256' sidecar before extraction.
  Same defense-in-depth pattern the provider repo uses for Go and
  Terraform downloads (namecheap/terraform-provider-namecheap#160).
- Passes '--ephemeral --unattended --disableupdate' to config.sh.
  GitHub auto-deregisters the runner after one job — the existing
  removeRunner() API call in src/gh.js becomes belt-and-braces rather
  than the primary deregister path. --disableupdate keeps the runner
  binary stable for the short-lived ephemeral session.

## New 'runner-version' input

Optional, defaults to '2.333.1' (the version this PR is tested
against). Consumers can override without waiting for a new action
release — useful when GitHub gates a JS action on a newer node
runtime and we need to move fast.

src/config.js reads it with a default fallback so old callers that
don't set it continue to work.

## CI adjustment

The existing verify-runner-url job greps the literal version string
out of the source to HEAD-check the release asset. With the version
now parameterized, the literal lives in action.yml's 'default:',
so the extractor is rewritten to read it from there.

## Tests

tests/config.test.js adds two cases:
- defaults to 2.333.1 when runner-version input is unset
- honors an explicit override

Full suite: 23 tests pass across utils + config.

## Consumer impact (terraform-provider-namecheap acctest)

- make testacc is 'go test' — no root required.
- All setup steps (curl Go / Terraform, extract tarballs, write
  go-env.sh) write to $GITHUB_WORKSPACE which is writable by any
  runner user, not just root.
- actions/checkout@v6 writes to the workspace, no root.
- The workspace directory structure is unchanged beyond its absolute
  path (/home/runner/actions-runner/_work/... instead of
  /actions-runner/_work/...). GITHUB_WORKSPACE, HOME, and relative
  paths all resolve the same way.

The dogfood SHA-pin rotation will be opened on the provider repo
after this merges, mirroring the pattern from machulav#158machulav#159.

Signed-off-by: yuriyryabikov <22548029+kurok@users.noreply.github.com>
Phase 4 (#18) landed three independent hardenings in one PR:
- New configurable runner-version input (no runtime impact)
- Ephemeral + checksum + set -euo pipefail (additive safety)
- Root to non-root runner user via sudo-heredoc (behavioral change)

The dogfood rotation on terraform-provider-namecheap#182 failed —
'Start self-hosted EC2 runner' timed out at 6m15s waiting for runner
registration. EC2 instance booted fine, but whatever the user-data
did inside the instance, it didn't end at './run.sh' polling GitHub.

We can't post-mortem directly because the instance is ephemeral and
already terminated. Fix-forward strategy: revert ONLY the non-root
transition (highest-probability culprit among the three axes), keep
everything else from Phase 4.

If the Phase 4 dogfood rotation passes after this revert, the
root-to-runner sudo-heredoc is the breaker and can be investigated
as an isolated follow-up (likely candidates: sudoers config on the
hardened AMI, SELinux context, config.sh writing outside its own
directory, or my heredoc quoting). Landing the safer pieces now
unblocks Phases 5/6/7.

Kept:
- runner-version input (Phase 4's main feature)
- set -euo pipefail
- --ephemeral + --unattended + --disableupdate on config.sh
- SHA-256 verification of the runner tarball
- Clearer bash syntax ('case "$(uname -m)"', double-quoted vars)

Reverted:
- useradd + sudo -u runner -H bash <<'RUNNER_BOOTSTRAP' heredoc
- RUNNER_ALLOW_RUNASROOT=1 restored (runner executes as root again)

The non-root goal isn't lost — a follow-up issue will propose it
again, this time with better instrumentation so we can see what
failed inside the instance.

Signed-off-by: yuriyryabikov <22548029+kurok@users.noreply.github.com>
* revert: full rollback of Phase 4 bootstrap changes

Phase 4 attempts #18 (with non-root) and #19 (without non-root but
keeping --ephemeral + checksum + set -euo pipefail + runner-version
input) BOTH failed the provider dogfood with the same 6m15s runner
registration timeout (terraform-provider-namecheap#182 and machulav#183).

The fix-forward in #19 narrowed the suspect set from 'all Phase 4
changes' to 'one of: set -euo pipefail, --ephemeral flag, --disableupdate
flag, checksum verify, parameterized bash vars'. Still not isolated.

Full rollback here restores the known-good Phase 1 bootstrap exactly.
Everything else from Phase 1 is preserved (aws-sdk v3, ncc 0.38,
jest tests, .gitattributes).

Phase 4 work is NOT abandoned — it moves to follow-up issues where
each change lands on its own with its own dogfood, so the next
failure isolates itself to a single axis instead of requiring
bisection across five simultaneous changes.

Files reverted to match a1bd2f9 (Phase 1 tip):
- action.yml (drops runner-version input)
- src/aws.js (original 12-line bash array, yum install libicu make,
  RUNNER_ALLOW_RUNASROOT=1, no --ephemeral, no checksum verify)
- src/config.js (drops runnerVersion field)
- tests/config.test.js (drops runner-version test block, 23 -> 21 tests)

Dist rebuilt against the reverted src (verify-dist will confirm).

Signed-off-by: yuriyryabikov <22548029+kurok@users.noreply.github.com>

* ci: revert verify-runner-url extractor to grep src/aws.js

Paired with the full Phase 4 revert — now that action.yml no longer
has a runner-version default, the Phase 4 version of verify-runner-url
that reads action.yml can't find the version. Restore the original
extractor that greps the literal URL out of src/aws.js.

Signed-off-by: yuriyryabikov <22548029+kurok@users.noreply.github.com>

---------

Signed-off-by: yuriyryabikov <22548029+kurok@users.noreply.github.com>
Closes #13. Part of plan #15.

## Motivation

Phase 4's dogfood failures took three iterations and a full revert to
diagnose because the ephemeral EC2 instance terminates before anyone
can look at cloud-init logs. Structured action-side logs are the
cheap half of the diagnostic story (see #20 for the AWS-side upload-
to-S3 trap that completes it).

## New module: src/log.js

JSON-shaped events via core.info / core.warning / core.error:

    log.info('run_instance', {
      ami_id: 'ami-0x',
      instance_type: 't3.medium',
      subnet_id: 'subnet-aaa',
      sg_id: 'sg-bbb',
      label: 'abcde',
    });
    // -> ::notice ::{"step":"run_instance","mode":"start",...}

Features:
- Lazy require of ./config so validation-phase log calls don't crash
  when config's try/catch hasn't finished running new Config().
- sanitize() redacts values under a small secret-key allowlist before
  emitting (githubToken, github-token, token, aws-access-key-id,
  aws-secret-access-key, GPG_PRIVATE_KEY, password).
- log.debug(...) is a no-op unless config.input.debug === 'true'.

## New input: debug

action.yml declares 'debug: false' as default. Consumers opt in by
setting debug: 'true' in their workflow; the action then emits the
extra diagnostic traffic (full DescribeImages response, wait_for_runner
poll details, input echo) without changing default output.

src/config.js reads the input with a 'false' fallback so old callers
that don't set it continue to work.

## Call sites instrumented

src/aws.js:
- wait_for_instance (start + end + error)
- describe_images (request + selection + full-list under debug)
- run_instance (request + result + error)
- associate_address (attempt + warn on failure)
- terminate_instance (start + end + error)

src/gh.js:
- gh_registration_token (attempt + success + error)
- remove_runner (attempt + success + skip + error)
- wait_for_runner (final outcome + per-poll details under debug)

src/index.js:
- start/stop bracketed by core.startGroup/endGroup so the Actions
  log collapses cleanly.
- Top-level 'fatal' log on uncaught failure.
- start_inputs / stop_inputs logged under debug (sanitized).

## Tests

tests/log.test.js — 7 new cases:
- info / warn / error emit through the right core.* channels with
  step + mode + fields baked into the JSON payload.
- debug emits nothing when config.input.debug !== 'true'.
- debug emits when config.input.debug === 'true'.
- sanitize redacts recursively.
- Payload-level redaction works (info with a {githubToken: …} field).

tests/config.test.js — 2 new cases for the debug input (default +
explicit override).

Total: 23 -> 30 tests.

## Unblocks

Phase 4.b (non-root retry, #20) now has visibility into what the
bootstrap is doing. Combined with the S3-upload trap on the AWS side
(also scoped in #20), any future bootstrap regression is diagnosable
from outside the instance.

Signed-off-by: yuriyryabikov <22548029+kurok@users.noreply.github.com>
Closes #11. Part of plan #15.

## Scope

Add a small retry helper and wrap the two cleanup paths in stop()
with it. Make stop() attempt both cleanups independently so a
GitHub-side failure doesn't prevent EC2 termination and vice versa.

## New module: src/retry.js

  withRetry(step, fn, { attempts, baseMs, maxMs })

- Default: 3 attempts, 2s base, doubled per retry, capped at 10s.
  Worst-case total wait: 2s + 4s + 8s = 14s before giving up.
- Logs each retry via log.warn() with the step name suffixed
  '_retry' so the Actions run log shows attempt sequence.
- On final failure, emits log.error with exhausted: true and
  re-throws the last error.

## Call sites

- gh.removeRunner() — the DELETE /runners/{id} request wrapped in
  withRetry('remove_runner', ...). Idempotent: deleting a runner
  that's already gone returns 404 which is a permanent failure
  (not transient) but the existing get-runner-first path filters
  that out before the DELETE.

- aws.terminateEc2Instance() — the TerminateInstances command
  wrapped in withRetry('terminate_instance', ...). Idempotent:
  terminating an already-terminated instance is a no-op per EC2.

- start() path deliberately not wrapped. RunInstances is not
  idempotent without a ClientToken (billing implication) and the
  runner-version-specific AMI lookup retrying makes no sense.
  If start fails, the whole action fails; the consumer's workflow
  fallback handles retry at the workflow level.

## stop() independence

Before: 'await A; await B'. If A throws, B never runs.

After: both wrapped in try/catch, failures accumulated into a list,
and at the end — if any failed — one Error is thrown summarizing
both. So the EC2 instance gets terminated even when GitHub's API
is temporarily flaky, and vice versa.

## Tests

tests/retry.test.js — 4 cases:
- Resolves on first-try success.
- Retries on rejection and returns the eventual success value.
- Exhausts attempts and re-throws after emitting warn + error logs.
- Backoff caps at maxMs (verified by inspecting the delay values
  emitted in warn logs).

Total: 30 -> 34 tests.

## Consumer impact

Transparent. Default behavior matches what the old code did for
the happy path. The retry / independent-cleanup changes only affect
transient-failure flows that previously surfaced as 'stop-runner
failed' jobs.

Signed-off-by: yuriyryabikov <22548029+kurok@users.noreply.github.com>
Part of #12. Phase 6 split: this PR ships IMDSv2 enforcement only.
EBS encryption (also scoped in #12) becomes a Phase 6.b follow-up
that needs per-AMI root-device lookup to set BlockDeviceMappings
safely (wrong DeviceName creates an unused extra volume or clobbers
the root).

## Change

RunInstances params now include:

  MetadataOptions: {
    HttpTokens: config.input.httpTokens,
    HttpPutResponseHopLimit: 1,
    HttpEndpoint: 'enabled',
  }

- HttpTokens default 'required' — IMDSv2 session token mandatory on
  every metadata request. Mitigates SSRF to IAM-credential theft.
- HttpPutResponseHopLimit: 1 — token can't cross into a nested
  container, cap the blast radius of a containerized workload that
  tries to walk metadata.
- HttpEndpoint: 'enabled' — explicit; default but good to pin.

## New input

'http-tokens' in action.yml, default 'required', accepted values
'required' and 'optional'. Consumers who must keep IMDSv1
compatibility set 'optional' explicitly.

## Tests

tests/config.test.js — 2 new cases: default fallback + override.
Total 34 -> 36.

## Consumer impact

Transparent for code running on the runner that uses IMDS via a
modern SDK (aws-sdk v2/v3, SSM agent, cloud-init) — all support
IMDSv2 without config. Risk: an old tool or hand-rolled script
that hits 169.254.169.254/latest/meta-data without first PUTting
for a token. None expected on the hardened AL2023 AMI.

## Dogfood

Small single-knob change — should pass the provider acctest cleanly
once rotated. Phase 6.b (EBS encryption) will be scoped separately.

Signed-off-by: yuriyryabikov <22548029+kurok@users.noreply.github.com>
#25)

Closes #8 and #9. Documentation-only; no code change.

## Phase 2 (#8) — OIDC for AWS credentials

README's 'How to start' step 1 now leads with GitHub OIDC as the
preferred path and relegates static IAM access keys to a 'legacy'
Option B. Rationale: static keys don't rotate, live in GitHub
secrets indefinitely (permanent attack surface), and can't be
scoped to a specific repo/branch/environment. OIDC issues
short-lived STS tokens per run, scoped by repo/branch/environment.

Includes:
- A Terraform example for the trust-relationship IAM role with
  a repo-scoped 'sub' StringLike condition.
- The minimum permissions policy (unchanged — attaches to role
  or user).
- A workflow snippet showing 'permissions: id-token: write' and
  'role-to-assume' instead of access-key secrets.

No changes to the action's code — it already reads AWS creds from
env, which configure-aws-credentials@v6 populates identically under
both paths.

## Phase 3 (#9) — GitHub token type preferences

README's 'How to start' step 2 replaced. Three token options
ordered by preference:

- A (preferred): GitHub App installation token via
  actions/create-github-app-token. No human identity, short-lived,
  minimal permission (Repository Administration: read/write).
- B: fine-grained PAT scoped to specific repos with just Repository
  Administration. Better than classic PAT but still tied to a
  human.
- C (deprecated): classic PAT with 'repo' scope. Over-permissive
  and human-tied; kept in docs as a fallback for environments
  that don't allow Apps or fine-grained PATs.

No code change — the action accepts any token that has permission
to manage self-hosted runners on the target repo. Docs change
only.

## Not included in this PR

- Phase 2's optional 'role-to-assume' input on the action itself
  (so consumers don't need the separate configure-aws-credentials
  step). Deferred — the current dual-step pattern is standard and
  works fine. Convenience feature, not urgency.

Signed-off-by: yuriyryabikov <22548029+kurok@users.noreply.github.com>
…cksum table (#26)

Closes #20. Supersedes the reverted #18 / #19 / #21.

Implements the full Phase 4 bootstrap hardening from issue #10, with
the root-cause fix from #20 baked in. Key differences from the
earlier failed attempts:

## The fix for the actual failure

Previous attempts died at:

    curl -fsSL <tarball>.sha256 | awk '{print }'

with a 404 (actions/runner doesn't publish per-tarball sidecar files,
empirically confirmed via aws ec2 get-console-output on a probe
instance — see #20).

This PR replaces that with a hardcoded table of expected hashes in
src/runner-checksums.js, keyed by 'arch-version'. Two x86_64 / arm64
entries for the currently-pinned v2.333.1, sourced from the release
body at github.com/actions/runner/releases/tag/v2.333.1. CI enforces
table-vs-upstream consistency on every PR (see pr.yml).

## Everything else from Phase 4

- Non-root 'runner' user (useradd -m, sudo -u runner -H bash heredoc).
  RUNNER_ALLOW_RUNASROOT=1 escape hatch removed.

- New 'runner-version' input in action.yml (default '2.333.1'). To
  override, add matching x64+arm64 SHAs to runner-checksums.js in
  the same PR — verify-runner-url CI will reject the change if
  the hashes don't match upstream.

- --ephemeral --unattended --disableupdate on config.sh. GitHub
  auto-deregisters the runner after its job; disableupdate keeps
  the binary stable during the short ephemeral session.

- set -euo pipefail on both the outer and inner (runner-user) shells.
  The earlier fatal failure under set -e was the .sha256 404, which
  no longer exists.

- Paramaterized RUNNER_VERSION / TARBALL / BASE bash vars.

## Tests

tests/runner-checksums.test.js — 6 new cases covering the table
shape, hex format, x64+arm64 parity per version, lookup returns for
known/unknown keys.

tests/config.test.js — 2 new cases for the runner-version input
(default fallback + override).

Total: 36 -> 44 tests.

## CI: verify-runner-url overhaul

The job now parses the runner-version from action.yml, then:
1. HEADs the Linux x64 release asset (unchanged).
2. Fetches the release body via 'gh api'.
3. Greps the BEGIN SHA linux-x64 / linux-arm64 HTML comments.
4. Cross-checks against the values lookup() returns from
   src/runner-checksums.js.

Drift between the hardcoded table and upstream fails CI at code-
review time, not at runtime.

## Dogfood plan (MUCH more careful this time)

Provider SHA-pin rotation after merge, same pattern as prior phases.
This time I have full EC2 console-output diagnostic capability via
the recipe saved in my notes — any new bootstrap failure should be
trivially diagnosable rather than opaque.

Closing #20 on merge.

Signed-off-by: yuriyryabikov <22548029+kurok@users.noreply.github.com>
Rounds out Phase 6 (IMDSv2 landed in #24, EBS encryption deferred
until a per-AMI root-device lookup could be done safely).

## Change

New 'encrypt-ebs' input on action.yml, default 'false' (opt-in).
When 'true', the action:

1. Fetches the AMI's DescribeImages result (already needed to
   resolve image IDs when 'ec2-image-filters' is set).
2. Finds the BlockDeviceMapping matching the AMI's RootDeviceName.
3. Clones that mapping, drops SnapshotId (AWS uses the AMI's
   snapshot automatically), sets 'Encrypted: true'.
4. Passes the cloned mapping as RunInstances.BlockDeviceMappings.

Result: root volume launches with SSE-EBS, key 'alias/aws/ebs' in
the launch account. VolumeSize / VolumeType / IOPS / DeleteOnTermination
preserved from the AMI — only the encryption bit is new.

## Why opt-in

The launch account (not necessarily the AMI owner account) must
have either default EBS encryption enabled, or at minimum permission
to use the default AWS-managed KMS key. If the AMI's snapshot is
encrypted with a customer-managed key that doesn't have a cross-
account grant, RunInstances fails. Defaulting to 'true' would
regress every consumer whose IAM / KMS policy isn't set up for
this. Default 'false' lets each consumer opt in after verifying
their account can handle it.

## Why not account-level default encryption

AWS supports 'aws ec2 enable-ebs-encryption-by-default' at the
account level — and that's the preferred belt-and-suspenders. But
not every consumer runs in an AWS account they control (e.g.,
Namecheap's CI runs in a shared org account). Action-side opt-in
is the only portable control.

## Refactor alongside

resolveImageId -> resolveImage: now returns both the ID and the
full Image metadata. Callers that only need the ID use .id; the
EBS-encryption code path uses .image.BlockDeviceMappings to build
the encrypted clone.

## Tests

tests/ebs.test.js — 6 new cases for buildEncryptedRootMapping:
happy path with full EBS config + non-EBS sibling mapping, volume
type / size / iops preservation, and five null-return paths for
exotic AMI shapes (no RootDeviceName, no mappings, non-EBS root,
orphan RootDeviceName).

tests/config.test.js — 2 new cases for the encrypt-ebs input
(default fallback + override).

Total: 44 -> 52 tests.

## Consumer dogfood

Separate PR on terraform-provider-namecheap rotates the pin and
enables 'encrypt-ebs: true' on the CI job. If the dogfood fails
with a KMS / IAM permissions error, we know the account needs
policy work before enabling; action-side code is fine either way.

Signed-off-by: yuriyryabikov <22548029+kurok@users.noreply.github.com>
Comment thread .github/workflows/pr.yml
- name: Set up Node.js
uses: actions/setup-node@v4
with:
node-version: 20
Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

update to 24 here?

Copy link
Copy Markdown
Author

@dm1tr1yvovk dm1tr1yvovk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review: PR #28 feat/al2023-support

Strengths

  1. Clean AWS SDK v3 migration — Import-only-what-you-need command pattern (client.send(new XCommand(...))) is correct and reduces bundle size. (src/aws.js:1-8)

  2. Robust security hardening — IMDSv2 enforcement with HttpPutResponseHopLimit: 1 is the correct SSRF defense. Default-on with http-tokens: optional escape hatch is a good backward-compat trade-off. (src/aws.js:201-210)

  3. Non-root runner + ephemeral mode — Running as a dedicated runner user with --ephemeral --disableupdate is a significant security win. The heredoc-quoted <<'RUNNER_BOOTSTRAP' correctly prevents shell expansion of untrusted values. (src/aws.js:138-186)

  4. Checksum verification is enforced, not advisorysha256sum -c - runs under set -euo pipefail, killing bootstrap on mismatch. The CI job verify-runner-url cross-checks the hardcoded table against upstream releases. Two-layer defense, well designed. (src/aws.js:177, .github/workflows/pr.yml:92-125)

  5. Independent cleanup in stop mode — Catching errors from terminateEc2Instance() and removeRunner() independently prevents a GitHub API outage from blocking EC2 termination (and thus preventing billing runaway). (src/index.js:54-84)

  6. Structured logging with sanitization — The log.js module with key-based secret redaction and deferred require('./config') inside emit() correctly handles circular dependency with module load order. (src/log.js:28-41)

  7. Comprehensive test coverage — All 52 tests pass. Covers config validation, EBS mapping edge cases (6 scenarios), retry backoff, checksum integrity, and log sanitization.

  8. CI pipeline additionsverify-dist, unit-tests, and verify-runner-url jobs catch real defect classes at review time.


Issues

Critical (Must Fix)

C1. _.merge() mutates config.githubContext — stale runner_id on subsequent calls

  • File: src/gh.js:55
  • What's wrong: _.merge(config.githubContext, { runner_id: runner.id }) mutates its first argument in place. After removeRunner() runs, config.githubContext permanently gains a stale runner_id property, which will contaminate any subsequent logging or API calls using that object.
  • Fix: Use non-mutating merge: { ...config.githubContext, runner_id: runner.id } or _.merge({}, config.githubContext, { runner_id: runner.id })

C2. Missing ec2:DescribeImages IAM permission in documented policy

  • File: README.md:131-143
  • What's wrong: resolveImage() now always calls DescribeImagesCommand even when ec2-image-id is provided directly (to fetch BlockDeviceMappings for EBS encryption). But ec2:DescribeImages is not in the documented minimum IAM policy. Users upgrading will get AccessDenied errors.
  • Fix: Add ec2:DescribeImages to the README's IAM policy block.

Important (Should Fix)

I1. Stale DEP0169 suppression after SDK v3 migration

  • File: src/index.js:1-14
  • What's wrong: The url.parse() deprecation suppression was added for aws-sdk v2. The comment says "migrating to aws-sdk v3 would require rewriting the entire action" — but that migration is already done. Dead code with a misleading comment.
  • Fix: Remove lines 1-14 entirely.

I2. waitForRunnerRegistered race — both reject and resolve can fire

  • File: src/gh.js:82-93
  • What's wrong: When timeout fires, reject() on line 86 has no return, so execution falls through to line 89 where resolve() may also fire if the runner registered at the exact timeout boundary.
  • Fix: Add return; after reject(...) on line 86.

I3. New inputs not documented in README inputs table

  • File: README.md:269-282
  • What's wrong: Four new inputs (runner-version, encrypt-ebs, http-tokens, debug) are in action.yml but absent from the README's Inputs table. Users won't discover the new security features.

I4. README still references "Amazon Linux 2" instead of AL2023

  • File: README.md:275
  • What's wrong: Says "compatible with Amazon Linux 2 images" — contradicts the entire purpose of this PR.

I5. No validation of http-tokens input value

  • File: src/config.js:20
  • What's wrong: Passed directly to MetadataOptions.HttpTokens without validation. A typo like require instead of required produces a confusing AWS API error at instance launch rather than a clear validation error at action start.
  • Fix: if (!['required', 'optional'].includes(this.input.httpTokens)) throw new Error(...)

I6. ec2Client() creates a new client on every call despite "singleton" comment

  • File: src/aws.js:16-22
  • What's wrong: Comment says "A single shared client is fine" but the function creates a new EC2Client per invocation. Wastes credential resolution cycles. Comment and behavior are contradictory.
  • Fix: Either make it a true module-scope singleton or update the comment.

Minor (Nice to Have)

M1. src/gh.js:86reject() with a plain string instead of Error object. The error handler in index.js:90 accesses .name and .message, which will be undefined.

M2. package.json:21-22@actions/core at ^1.2.6 (2020) and @actions/github at ^4.0.0 (2021) are very outdated. Upgrading to @actions/core v1.10+ would eliminate the custom setOutput() function.

M3. README.md:319,324-325 — Example workflow still uses aws-actions/configure-aws-credentials@v1 and static IAM keys, contradicting the new OIDC-preferred guidance.

M4. src/aws.js:212encryptEbs === 'true' string comparison works but core.getBooleanInput() would be more conventional.

M5. src/aws.js:46-51 — Redundant DescribeImages call when ec2-image-id is provided and encrypt-ebs is false. Adds latency and forces the new permission even when not needed. Fixing this also reduces the impact of C2.


Recommendations

  1. Add a manual integration test workflow (workflow_dispatch) that creates/tears down a test runner to catch bootstrap failures, IAM permission gaps, and SDK issues that unit tests can't.

  2. Pin @aws-sdk/client-ec2 with caret range (^3.1033.0 instead of exact 3.1033.0) to receive patch/security fixes, or document the rationale for the exact pin.

  3. Document KMS permissions for encrypt-ebs — With the default AWS-managed key, the role needs kms:CreateGrant and kms:DescribeKey on alias/aws/ebs.

  4. Consider removing lodash — The only usages are _.filter and _.merge, both trivially replaceable with native JS. This would eliminate a transitive dependency.


Assessment

Ready to merge? No — with fixes.

Reasoning: The implementation is architecturally sound and delivers significant security and reliability improvements. The two critical issues must be addressed: C1 (_.merge mutation) is a correctness bug, and C2 (missing IAM permission in docs) is a breaking change for existing users. The stale DEP0169 suppression (I1) and the timeout race condition (I2) should also be fixed before merge. Remaining items can go in follow-ups.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants