fix: revert non-root runner bootstrap, keep the rest of Phase 4#19

Merged
kurok merged 1 commit into
feat/al2023-supportfrom
fix/4-revert-nonroot
Apr 21, 2026

Conversation

@kurok kurok commented Apr 21, 2026

Part of #10's fallout — see also provider-side dogfood failure at terraform-provider-namecheap#182.

What happened

#18 (Phase 4) landed three independent hardenings in one PR:

  1. New runner-version input (parametric, no runtime impact).
  2. Ephemeral + checksum + set -euo pipefail (additive safety).
  3. Root → non-root runner user via sudo -u runner -H bash <<'RUNNER_BOOTSTRAP' heredoc (behavioral change).

Dogfood rotation on the provider (machulav#182) then failed: Start self-hosted EC2 runner timed out at 6m15s waiting for runner registration. The instance booted, but the user-data never got to ./run.sh polling GitHub. Can't post-mortem — the instance is ephemeral and already terminated.

What this PR does

Revert only item (3), the sudo-to-runner wrapping, because it is the highest-probability culprit among the three changes. Keep (1) and (2).

| Kept | Reverted |
| --- | --- |
| runner-version input | useradd runner |
| set -euo pipefail | sudo -u runner -H bash <<'RUNNER_BOOTSTRAP' heredoc |
| --ephemeral --unattended --disableupdate | RUNNER_ALLOW_RUNASROOT=1 escape hatch restored |
| SHA-256 verification of runner tarball | |
| Parameterized RUNNER_VERSION / TARBALL / BASE bash vars | |
| Clearer case "$(uname -m)" syntax, double-quoted vars | |

Follow-up for the non-root goal

Not abandoning the hardening. A new issue will propose the non-root move with better instrumentation — e.g., output the user-data script to CloudWatch or a bucket before it executes, so the EC2 console log shows progress and we can see where it breaks. Likely candidates to investigate when we return to this:

  • sudoers config on the DEVOPS/hardened-amazon-linux2023 AMI (requiretty, default Defaults env_reset, etc.).
  • SELinux context for the runner user's home.
  • config.sh writing state files outside its own directory.
  • My heredoc quoting + bash's interaction with sudo's stdin reading.
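
One possible shape for the instrumentation mentioned above, as a minimal sketch rather than the planned implementation: wrap each bootstrap step so that its start, success, or failure is written somewhere that survives the instance. The names `PROGRESS_LOG`, `log_step`, and `run_step` are hypothetical, and the temp-file destination is only there to keep the sketch runnable. On a real instance the destination might be /dev/console, so the lines land in the EC2 console output.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical destination. A temp file keeps this sketch runnable anywhere;
# on an instance this could be /dev/console instead.
PROGRESS_LOG="${PROGRESS_LOG:-$(mktemp)}"

log_step() {
  printf '[bootstrap] %s\n' "$*" >>"$PROGRESS_LOG"
}

run_step() {
  # Usage: run_step <label> <command> [args...]
  local label="$1"
  shift
  log_step "BEGIN ${label}"
  if "$@"; then
    log_step "OK ${label}"
  else
    local rc=$?
    log_step "FAIL ${label} (exit ${rc})"
    return "$rc"
  fi
}

# Stand-ins for real bootstrap steps: one succeeds, one fails.
run_step "create workdir" true
run_step "register runner" false || echo "bootstrap stopped; see ${PROGRESS_LOG}"
```

With markers like these, a registration timeout could be traced to the last BEGIN line that has no matching OK.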

Verification

  • Dogfood rotation PR will be opened after this merges (namecheap/terraform-provider-namecheap ci.yml pin → new feat/al2023-support tip). If it goes green, (1)+(2) are vindicated and Phases 5/6/7 can stack on top. If it still fails, the regression is in items (1) or (2) and I'll investigate from there.

Phase 4 (#18) landed three independent hardenings in one PR:
- New configurable runner-version input (no runtime impact)
- Ephemeral + checksum + set -euo pipefail (additive safety)
- Root to non-root runner user via sudo-heredoc (behavioral change)

The dogfood rotation on terraform-provider-namecheap#182 failed —
'Start self-hosted EC2 runner' timed out at 6m15s waiting for runner
registration. EC2 instance booted fine, but whatever the user-data
did inside the instance, it didn't end at './run.sh' polling GitHub.

We can't post-mortem directly because the instance is ephemeral and
already terminated. Fix-forward strategy: revert ONLY the non-root
transition (highest-probability culprit among the three axes), keep
everything else from Phase 4.

If the Phase 4 dogfood rotation passes after this revert, the
root-to-runner sudo-heredoc is the breaker and can be investigated
as an isolated follow-up (likely candidates: sudoers config on the
hardened AMI, SELinux context, config.sh writing outside its own
directory, or my heredoc quoting). Landing the safer pieces now
unblocks Phases 5/6/7.

Kept:
- runner-version input (Phase 4's main feature)
- set -euo pipefail
- --ephemeral + --unattended + --disableupdate on config.sh
- SHA-256 verification of the runner tarball
- Clearer bash syntax ('case "$(uname -m)"', double-quoted vars)

Reverted:
- useradd + sudo -u runner -H bash <<'RUNNER_BOOTSTRAP' heredoc
- RUNNER_ALLOW_RUNASROOT=1 restored (runner executes as root again)

The non-root goal isn't lost — a follow-up issue will propose it
again, this time with better instrumentation so we can see what
failed inside the instance.

Signed-off-by: yuriyryabikov <22548029+kurok@users.noreply.github.com>
@kurok kurok merged commit 78f98d1 into feat/al2023-support Apr 21, 2026
4 checks passed
@kurok kurok deleted the fix/4-revert-nonroot branch April 21, 2026 08:40
kurok added a commit that referenced this pull request Apr 21, 2026
* revert: full rollback of Phase 4 bootstrap changes

Phase 4 attempts #18 (with non-root) and #19 (without non-root but
keeping --ephemeral + checksum + set -euo pipefail + runner-version
input) BOTH failed the provider dogfood with the same 6m15s runner
registration timeout (terraform-provider-namecheap#182 and machulav#183).

The fix-forward in #19 narrowed the suspect set from 'all Phase 4
changes' to 'one of: set -euo pipefail, --ephemeral flag, --disableupdate
flag, checksum verify, parameterized bash vars'. Still not isolated.

Full rollback here restores the known-good Phase 1 bootstrap exactly.
Everything else from Phase 1 is preserved (aws-sdk v3, ncc 0.38,
jest tests, .gitattributes).

Phase 4 work is NOT abandoned — it moves to follow-up issues where
each change lands on its own with its own dogfood, so the next
failure isolates itself to a single axis instead of requiring
bisection across five simultaneous changes.

Files reverted to match a1bd2f9 (Phase 1 tip):
- action.yml (drops runner-version input)
- src/aws.js (original 12-line bash array, yum install libicu make,
  RUNNER_ALLOW_RUNASROOT=1, no --ephemeral, no checksum verify)
- src/config.js (drops runnerVersion field)
- tests/config.test.js (drops runner-version test block, 23 -> 21 tests)

Dist rebuilt against the reverted src (verify-dist will confirm).

Signed-off-by: yuriyryabikov <22548029+kurok@users.noreply.github.com>

* ci: revert verify-runner-url extractor to grep src/aws.js

Paired with the full Phase 4 revert — now that action.yml no longer
has a runner-version default, the Phase 4 version of verify-runner-url
that reads action.yml can't find the version. Restore the original
extractor that greps the literal URL out of src/aws.js.

Signed-off-by: yuriyryabikov <22548029+kurok@users.noreply.github.com>

---------

Signed-off-by: yuriyryabikov <22548029+kurok@users.noreply.github.com>
kurok added a commit that referenced this pull request Apr 21, 2026
…cksum table (#26)

Closes #20. Supersedes the reverted #18 / #19 / #21.

Implements the full Phase 4 bootstrap hardening from issue #10, with
the root-cause fix from #20 baked in. Key differences from the
earlier failed attempts:

## The fix for the actual failure

Previous attempts died at:

    curl -fsSL <tarball>.sha256 | awk '{print $1}'

with a 404 (actions/runner doesn't publish per-tarball sidecar files,
empirically confirmed via aws ec2 get-console-output on a probe
instance — see #20).
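
To illustrate why that one line was fatal, here is a small offline sketch. `fetch_missing_sidecar` is a hypothetical stand-in for the curl call: with -f, curl exits with status 22 on an HTTP error such as a 404. The sketch compares the pipeline's exit status with and without pipefail.

```bash
#!/usr/bin/env bash

# Hypothetical stand-in for `curl -fsSL <tarball>.sha256` receiving a 404.
# curl -f reports HTTP errors with exit status 22 and prints no body.
fetch_missing_sidecar() { return 22; }

# With pipefail, the pipeline reports the failing stage's status.
with_pipefail=0
( set -o pipefail; fetch_missing_sidecar | awk '{print $1}' ) || with_pipefail=$?

# Without pipefail, only awk's status counts, and awk succeeds on empty input.
without_pipefail=0
( set +o pipefail; fetch_missing_sidecar | awk '{print $1}' ) || without_pipefail=$?

echo "pipefail: ${with_pipefail}, no pipefail: ${without_pipefail}"
```

Under set -euo pipefail the non-zero pipeline status ends the script at that line. That is consistent with the bootstrap never reaching ./run.sh.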

This PR replaces that with a hardcoded table of expected hashes in
src/runner-checksums.js, keyed by 'arch-version'. It has one x86_64 and
one arm64 entry for the currently pinned v2.333.1, sourced from the release
body at github.com/actions/runner/releases/tag/v2.333.1. CI enforces
table-vs-upstream consistency on every PR (see pr.yml).
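
For illustration only, a sketch of how a known-good hash can be checked on the instance once it is available to the bootstrap script. How the table value actually reaches the user-data is not shown in this PR text, so the `verify_tarball` helper and the throwaway demo file are assumptions.

```bash
#!/usr/bin/env bash
set -euo pipefail

verify_tarball() {
  # Usage: verify_tarball <expected-sha256-hex> <file>
  local expected="$1" file="$2"
  # sha256sum --check reads "<hash>  <path>" lines; --status suppresses output
  # and reports the result through the exit status only.
  printf '%s  %s\n' "$expected" "$file" | sha256sum --check --status -
}

# Demo with a throwaway file standing in for the runner tarball.
demo_file="$(mktemp)"
printf 'not a real tarball' >"$demo_file"
expected="$(sha256sum "$demo_file" | awk '{print $1}')"

if verify_tarball "$expected" "$demo_file"; then
  echo "checksum ok"
fi

# Any change to the file after the hash was recorded must be rejected.
printf 'tampered' >>"$demo_file"
if ! verify_tarball "$expected" "$demo_file"; then
  echo "checksum mismatch detected"
fi
```

Because the expected value is supplied locally, this needs no second network fetch. The earlier attempts failed on exactly such a fetch.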

## Everything else from Phase 4

- Non-root 'runner' user (useradd -m, sudo -u runner -H bash heredoc).
  RUNNER_ALLOW_RUNASROOT=1 escape hatch removed.

- New 'runner-version' input in action.yml (default '2.333.1'). To
  override, add matching x64+arm64 SHAs to runner-checksums.js in
  the same PR — verify-runner-url CI will reject the change if
  the hashes don't match upstream.

- --ephemeral --unattended --disableupdate on config.sh. GitHub
  auto-deregisters the runner after its job; disableupdate keeps
  the binary stable during the short ephemeral session.

- set -euo pipefail on both the outer and inner (runner-user) shells.
  The earlier fatal failure under set -e was the .sha256 404, which
  no longer exists.

- Parameterized RUNNER_VERSION / TARBALL / BASE bash vars.
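
A sketch of why strict mode has to be set in both shells, and why the bash vars have to be handed over explicitly. It uses plain `bash` where the bootstrap uses `sudo -u runner -H bash`, so it runs without sudo. Under sudo, whether environment variables pass through also depends on the sudoers configuration. The heredoc delimiter `INNER` and the way the variable is passed are assumptions for illustration, not the code in src/aws.js.

```bash
#!/usr/bin/env bash
set -euo pipefail

RUNNER_VERSION="2.333.1"   # default from this PR; deliberately not exported

# Quoted delimiter: the outer shell does not expand the body, and the inner
# shell inherits neither the unexported variable nor the outer shell options.
quoted="$(bash <<'INNER'
echo "version=${RUNNER_VERSION:-unset} pipefail=$(shopt -qo pipefail && echo on || echo off)"
INNER
)"

# Same body, but the variable is passed explicitly and the inner shell
# enables its own strict mode.
explicit="$(RUNNER_VERSION="$RUNNER_VERSION" bash <<'INNER'
set -euo pipefail
echo "version=${RUNNER_VERSION:-unset} pipefail=$(shopt -qo pipefail && echo on || echo off)"
INNER
)"

echo "$quoted"
echo "$explicit"
```

The first line should report the version as unset with pipefail off. The second should report 2.333.1 with pipefail on.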

## Tests

tests/runner-checksums.test.js — 6 new cases covering the table
shape, hex format, x64+arm64 parity per version, lookup returns for
known/unknown keys.

tests/config.test.js — 2 new cases for the runner-version input
(default fallback + override).

Total: 36 -> 44 tests.

## CI: verify-runner-url overhaul

The job now parses the runner-version from action.yml, then:
1. HEADs the Linux x64 release asset (unchanged).
2. Fetches the release body via 'gh api'.
3. Greps the BEGIN SHA linux-x64 / linux-arm64 HTML comments.
4. Cross-checks against the values lookup() returns from
   src/runner-checksums.js.

Drift between the hardcoded table and upstream fails CI at code-
review time, not at runtime.
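
A rough sketch of steps 3 and 4, runnable offline. It assumes each hash sits between a BEGIN SHA comment and a matching END SHA comment. Only the BEGIN marker is named above, so the END marker is an assumption. `extract_sha`, the sample body, and the dummy hash are hypothetical; in CI the body would come from 'gh api' and the table value from lookup().

```bash
#!/usr/bin/env bash
set -euo pipefail

extract_sha() {
  # Usage: extract_sha <platform> < release-body
  # Prints the 64-hex-digit hash found between the platform's markers, if any.
  local platform="$1"
  sed -n "s/.*<!-- BEGIN SHA ${platform} -->\([0-9a-f]\{64\}\)<!-- END SHA ${platform} -->.*/\1/p"
}

# Dummy value, not a real release hash.
dummy_x64="$(printf '%064d' 0 | tr 0 a)"

# Made-up release body containing one marked hash.
sample_body="- example-linux-x64.tar.gz <!-- BEGIN SHA linux-x64 -->${dummy_x64}<!-- END SHA linux-x64 -->"

upstream="$(printf '%s\n' "$sample_body" | extract_sha linux-x64)"
table_value="$dummy_x64"   # stand-in for the hardcoded table entry

if [ "$upstream" = "$table_value" ]; then
  echo "linux-x64: table matches upstream"
else
  echo "linux-x64: drift detected" >&2
  exit 1
fi
```

If the marker is missing or the hash is malformed, extract_sha prints nothing, so the comparison fails.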

## Dogfood plan (MUCH more careful this time)

Provider SHA-pin rotation after merge, same pattern as prior phases.
This time I have full EC2 console-output diagnostic capability via
the recipe saved in my notes — any new bootstrap failure should be
trivially diagnosable rather than opaque.

Closing #20 on merge.

Signed-off-by: yuriyryabikov <22548029+kurok@users.noreply.github.com>