Skip to content

feat(ci): migrate npm publishing to OIDC Trusted Publishers#1592

Open
alinarublea wants to merge 14 commits into
mainfrom
feat/oidc-trusted-publishers
Open

feat(ci): migrate npm publishing to OIDC Trusted Publishers#1592
alinarublea wants to merge 14 commits into
mainfrom
feat/oidc-trusted-publishers

Conversation

@alinarublea
Copy link
Copy Markdown
Contributor

@alinarublea alinarublea commented May 6, 2026

Summary

Migrates npm publishing for adobe/spacecat-shared from a long-lived ADOBE_BOT_NPM_TOKEN secret to npm OIDC Trusted Publishers.

  • Upgrades npm CLI from 10.9.411.13.0 (OIDC requires >= 11.5.1) in both main.yaml and the setup-node-npm composite action
  • Removes NPM_TOKEN: ${{ secrets.ADOBE_BOT_NPM_TOKEN }} from the Semantic Release and dry-run steps — publishing now uses OIDC token exchange
  • Scopes id-token: write to per-job permissions (test job needs it for AWS ECR OIDC + npm OIDC dry-run; release job for publishing) — workflow-level permissions are now {}
  • Adds NPM_CONFIG_PROVENANCE: 'true' to the release job — sigstore provenance attestations are emitted with every publish
  • Adds scripts/setup-npm-trusted-publishers.sh — one-time script to configure 22 published packages on npmjs.com (spacecat-shared-example template excluded)
  • Adds workflow filename + branch-protection warning comment in main.yaml

Resolves SITES-42702

Prerequisites before merging

The scripts/setup-npm-trusted-publishers.sh script must be run before this PR is merged by someone authenticated to npm as the adobe-bot account:

npm install -g npm@11.13.0
npm login                # authenticate as adobe-bot
./scripts/setup-npm-trusted-publishers.sh

The script enforces preflight checks (npm version, whoami == adobe-bot, registry, workflow file existence) and tracks per-package failures so partial registrations don't silently abort.

Rollback plan

ADOBE_BOT_NPM_TOKEN should remain in GitHub repo secrets for at least 2 successful release cycles after merge. If OIDC publishing fails on the first release:

  1. Revert this PR (git revert <merge-sha>) — re-adds NPM_TOKEN to the workflow env
  2. The retained secret resumes token-based publishing immediately
  3. npm trust bindings on npmjs.com are non-blocking for token-based publish, so no npm-side cleanup needed

After 2 successful releases, rotate the npm token (revoke its publish scope) and remove the GitHub secret.

Review process

This PR was reviewed by 4 specialized personas (SRE, Security, Senior Engineer, QA). All round-1 and round-2 MUST/SHOULD findings within scope have been implemented:

  • ✅ Per-job id-token: write scoping
  • ✅ NPM_CONFIG_PROVENANCE for sigstore attestations
  • ✅ Workflow rename + branch-protection warning
  • ✅ Setup script: preflight checks (version, whoami, registry, workflow file), per-package failure tracking, idempotency notes
  • ✅ Removed spacecat-shared-example from trust list
  • ✅ Stale step names cleaned up

Deferred to follow-up PRs (acknowledged out of scope):

  • Pin GitHub Actions to commit SHAs (actions/checkout@v6 etc.)
  • Replace ADOBE_BOT_GITHUB_TOKEN PAT with GitHub App installation token
  • Restrict on: [push] trigger to specific branches
  • Add npm publish --dry-run --provenance smoke-test step on PRs
  • Periodic trust-drift detection cron
  • Auto-derive package list from workspace metadata

Test plan

  • Run scripts/setup-npm-trusted-publishers.sh locally to configure all 22 packages as Trusted Publishers
  • Verify CI passes on this branch (test + dry-run steps run without NPM_TOKEN)
  • After merge, verify the first release to main succeeds and packages publish via OIDC with provenance attestations (npm view @adobe/<pkg> --json | jq '.dist.attestations')
  • After 2 successful releases, rotate and remove ADOBE_BOT_NPM_TOKEN from GitHub secrets

🤖 Generated with Claude Code

alinarublea and others added 4 commits May 6, 2026 17:04
Eliminates the long-lived ADOBE_BOT_NPM_TOKEN by switching to npm
OIDC Trusted Publishers (npm >= 11.5.1). Adds a one-time setup
script to configure all 23 packages on npmjs.com.

- Upgrade npm CLI from 10.9.4 to 11 in workflow and setup-node-npm action
- Remove NPM_TOKEN from Semantic Release and dry-run steps
- Add scripts/setup-npm-trusted-publishers.sh for one-time npm trust setup

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Scope id-token:write to per-job permissions (test + release) instead
  of workflow-level; each job documents exactly what it needs
- Add NPM_CONFIG_PROVENANCE=true to release job for sigstore attestations
- Add workflow-rename warning comment (trusted publisher binds to filename)
- Fix stale step name "Set up Node.js 22" -> "Set up Node.js"
- Fix misleading "Update NPM (only if cache miss)" -> "Update NPM" in action
- Remove @adobe/spacecat-shared-example from trusted publishers list
  (template package, not intended for automated publishing)
- Add npm version preflight check (>= 11.10.0) to setup script
- Add npm whoami preflight (must run as adobe-bot account)
- Replace set -e abort with per-package failure tracking; FAILED array
  collects all errors and script exits non-zero with full list
- Add note that npm trust is idempotent (safe to re-run after partial failure)
- Add comment explaining --file takes filename only, not full path
- Add note on new-package workflow in script header

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Fix package count in workflow comment: 23 -> 22 (example excluded)
- Add branch-protection note: trust binding security depends on it
- Add npm registry preflight (must be registry.npmjs.org, not a mirror)
- Add workflow file existence preflight (catches WORKFLOW constant drift)
- Capture npm trust output explicitly with rc check; print indented for clarity
  instead of relying on shell short-circuit semantics

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@alinarublea alinarublea requested a review from solaris007 May 6, 2026 15:21
@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 6, 2026

This PR will trigger a minor release when merged.

alinarublea and others added 3 commits May 6, 2026 18:07
…ured

@semantic-release/npm 13.x tries OIDC exchange during verifyConditions,
even in dry-run mode. Without trust bindings registered on npmjs.com
(setup-npm-trusted-publishers.sh not yet run), the exchange returns
404 "package not found", then falls back to NPM_TOKEN — which has
been removed. This breaks CI on all non-main branches.

Add SR_NO_NPM_AUTH env var to the dry-run workflow step and guard the
@semantic-release/npm plugin in all .releaserc.cjs files so dry-run
skips npm entirely. Release job on main is unaffected (no SR_NO_NPM_AUTH).

Once trusted publishers are configured and OIDC publishing confirmed
working, this guard can be removed and the dry-run can exercise the
full npm plugin path.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Copy link
Copy Markdown
Member

@solaris007 solaris007 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @alinarublea,

Thanks for the thorough setup script and rollback plan. The migration approach is sound and the security improvement is real, but two issues block this PR from working as designed: a typo in the npm CLI subcommand will cause the setup script to fail outright, and the workflow's documented branch-protection trust ceiling does not exist on the repo today.

Strengths

  • Workflow-level permissions: {} with per-job id-token: write scoping (.github/workflows/main.yaml:10, :18-22, :50-54) is the correct least-privilege posture. Previously workflow-level id-token: write made the token mintable from every job.
  • persist-credentials: 'false' on both checkout steps (main.yaml:33, :62) prevents GITHUB_TOKEN from being persisted into git config.
  • Workflow header comment (main.yaml:3-9) puts the rename hazard and branch-protection dependency at the point of risk where editors will see them.
  • Setup script preflight is rich (scripts/setup-npm-trusted-publishers.sh:30-67): npm version, registry URL, whoami == adobe-bot, and workflow-file existence all checked before any mutation.
  • Per-package failure tracking (setup-npm-trusted-publishers.sh:96-117) with set -uo pipefail (no -e) plus the FAILED array is the right shape for an idempotent batch operation.
  • NPM_CONFIG_PROVENANCE: 'true' (main.yaml:65) ties sigstore attestations to every publish - real supply-chain integrity gain for downstream consumers.
  • NPM_TOKEN fully removed from workflow env (verified across the diff: zero remaining references in .github/workflows/ or composite actions).
  • Dry-run is correctly gated by if: github.ref != 'refs/heads/main' AND SR_NO_NPM_AUTH=true - removes the original concern that PR branches could trigger token-bearing dry-runs.
  • Rollback path is concrete and reversible: keep ADOBE_BOT_NPM_TOKEN for 2 cycles, revert restores token-based publishing.

Issues

Critical (Must Fix)

1. scripts/setup-npm-trusted-publishers.sh:99 - wrong npm trust subcommand; script will fail at runtime.

The script invokes npm trust github-actions "${pkg}" --repository --file --yes. The actual npm CLI subcommand is npm trust github, not github-actions. Verified against npm/cli source at lib/commands/trust/index.js:

static subcommands = {
  github: require('./github.js'),
  gitlab: require('./gitlab.js'),
  circleci: require('./circleci.js'),
  list: require('./list.js'),
  revoke: require('./revoke.js'),
}

There is no github-actions alias. Running this script as-written will produce Unknown subcommand: github-actions for every package, populate FAILED with all 22, and exit non-zero before configuring any trust binding.

This is the entire purpose of the PR. The "prerequisites before merging" step in the PR body cannot be satisfied as written. If merged before someone notices, the first release after merge will fail with no way to recover except by reverting (NPM_TOKEN already gone from workflow).

Fix: change line 99 from npm trust github-actions "${pkg}" \ to npm trust github "${pkg}" \. Update the comment in the script header that references the same subcommand. Run the corrected script end-to-end against at least one package and capture the output before merging.

2. .github/workflows/main.yaml:7-9 - branch protection asserted but does not exist; trust ceiling is missing.

The workflow comment states Trust binding security relies on branch protection: do NOT weaken protection on main, or any contributor with push access could mint a publish-scoped npm token via OIDC. Verified empirically:

$ gh api repos/adobe/spacecat-shared/branches/main/protection
{"message":"Branch not protected","status":"404"}
$ gh api repos/adobe/spacecat-shared/rules/branches/main
[]

There is no branch protection AND no rulesets on main. Combined with the structure of npm trust bindings (verified in npm/cli lib/commands/trust/github.js - bindings store { repository, workflow_ref.file } only, no branch or ref constraint), this means:

  • Anyone with write access to adobe/spacecat-shared can push directly to main or merge a PR that modifies main.yaml to bypass the if: github.ref == 'refs/heads/main' gate. The OIDC token from any branch's run of this workflow will satisfy the trust binding's claims.
  • The workflow's if: github.ref == 'refs/heads/main' is the only ref constraint, and a single-line PR can remove it.

Mitigations, in order of strength:

  • (Strongest, code-fixable in this PR) Bind a GitHub Environment (e.g. npm-publish) to the release job, configure its deployment branch policy to allow only main, and add --environment npm-publish to the npm trust github invocations in the setup script. This makes the environment claim part of the trust binding so a forked workflow without it cannot mint publish tokens.
  • (Minimum, repo-admin) Configure branch protection or a ruleset on main: disallow direct push, require PR review, disallow force-push. The workflow's stated assumption then actually holds.

This should land before any trust bindings are configured. Right now, even after the typo fix, the security model the PR describes is not the security model the repo enforces.

Important (Should Fix)

1. packages/spacecat-shared-example is excluded from trust bindings but still has the SR_NO_NPM_AUTH shim - latent break on next release.

Three signals disagree:

  • setup-npm-trusted-publishers.sh:69-72 excludes @adobe/spacecat-shared-example from PACKAGES with the comment "template package, not intended for automated publishing".
  • packages/spacecat-shared-example/.releaserc.cjs:13 was updated with the SR_NO_NPM_AUTH shim, so it loads @semantic-release/npm whenever the env var is unset.
  • packages/spacecat-shared-example/package.json is not marked "private": true and currently carries a real version.

In the release job (where SR_NO_NPM_AUTH is NOT set), if conventional-commit history ever includes a feat: or fix: scoped to the example package, semantic-release will attempt to publish via OIDC, find no trust binding, and the release job will fail. Fix options (either is fine):

  • Set "private": true in packages/spacecat-shared-example/package.json and let @semantic-release/npm skip private packages.
  • Drop .releaserc.cjs from packages/spacecat-shared-example/ entirely (semantic-release-monorepo skips packages without a config).

Marking it private is the more idiomatic fix.

2. process.env.SR_NO_NPM_AUTH ? [] : [...] truthy check is a footgun across 24 files.

In JavaScript, the strings "false", "0", "no" are all truthy. If a future maintainer sets SR_NO_NPM_AUTH=false thinking it re-enables the npm plugin, npm publishing is still skipped. The workflow sets 'true' explicitly, so the current behavior is fine; the fragility is in the contract the 24 .releaserc.cjs files now embed.

Fix: change to process.env.SR_NO_NPM_AUTH === 'true' ? [] : [...] in all 24 files. One search-and-replace. Even better: a small node test that loads any one .releaserc.cjs under three env values (undefined, 'true', 'false') and asserts the plugin presence would lock the contract.

3. Dry-run never exercises the OIDC publish path - first main release is the first real test.

main.yaml:46-50 gates the dry-run on if: github.ref != 'refs/heads/main' AND sets SR_NO_NPM_AUTH=true, which removes @semantic-release/npm entirely. On the merge commit to main, only the actual release job runs. There is no PR build and no main build prior to the release step that validates OIDC publish works end-to-end. If trust bindings are misconfigured (wrong workflow filename, missing package, npm-side outage during setup), the failure mode is a broken release on main with revert as the only recovery.

The PR description acknowledges "publish --dry-run --provenance smoke-test on PRs" as deferred, but the cutover risk warrants at minimum: capture the output of running the corrected setup-npm-trusted-publishers.sh against all 22 packages (showing the npm trust list result for each) and attach to the PR or the linked SITES-42702 ticket. Better: a controlled canary publish on a release branch before the workflow change lands on main.

Minor (Nice to Have)

  1. No shellcheck or actionlint on the new artifacts. A 130-line bash script and a workflow modification with no static analysis. Add a CI step shellcheck scripts/*.sh and actionlint .github/workflows/*.yaml - both run in seconds and catch quoting bugs, missing pinned versions, deprecated action references.

  2. Workflow rename has no CI enforcement. The header comment warns about renames, but a renamer 6 months from now will not necessarily read it. A small step in the release job that fails early if .github/workflows/main.yaml is not at the expected path converts a silent publish failure into a loud CI failure.

  3. PACKAGES array drift. A new packages/spacecat-shared-newthing/ added without updating PACKAGES will fail OIDC silently on its first release. ~10 lines at the start of the setup script that diff PACKAGES against find packages -name package.json | xargs jq -r .name (excluding spacecat-shared-example) catches this before any mutation.

  4. SR_NO_NPM_AUTH lifecycle is not pinned. The shim is documented as removable "once OIDC publishing confirmed working" but there's no follow-up issue, no TODO marker, and no log line that would tell anyone whether the shim is still load-bearing. File a follow-up issue with a concrete trigger ("after 2 successful OIDC releases on main, remove SR_NO_NPM_AUTH from all 24 .releaserc.cjs and from main.yaml line 50") and reference it from the workflow comment.

  5. Onboarding documentation for the post-rotation world is missing. The script comment instructs operators to do a "token-based first publish" for new packages, but after the documented rotation removes ADOBE_BOT_NPM_TOKEN from secrets, the token won't exist. Update the comment to reflect the actual workflow: an interactive npm publish from a developer machine logged in as adobe-bot, OR pre-create the empty package on npmjs.com.

  6. Action SHA pinning deferred. actions/checkout@v6, actions/setup-node@v6, and actions/cache@v5 are floating tags. Acknowledged out-of-scope per PR description, but worth tracking explicitly: a tag-overwrite or compromised-action incident on any of these would inject code into the publish job (which has id-token: write and contents: write). The OIDC migration improves credential security but not action supply-chain security.

  7. No alerting on release failures. If the first release fails for any reason (OIDC mint failure, sigstore down, registry 5xx), the only signal is a red workflow run on the default branch. A if: failure() step in the release job that posts to a known Slack channel or opens a GitHub issue would convert an opt-in watcher dependency into an actionable signal.

  8. engines.npm floor doesn't match the OIDC requirement. Root package.json engines.npm permits >=10.9.0 <12.0.0, the workflow installs 11.13.0, the script requires >=11.10.0, OIDC requires >=11.5.1. Four different floors. Bump engines.npm to >=11.5.1 <12.0.0 so npm install enforces what the publish path requires.

Recommendations

  • Add --environment npm-publish to npm trust github invocations once the GitHub Environment is configured. Pairs with the branch-protection fix (Critical 2) and gives defense-in-depth that survives a momentary lapse in branch protection. The npm trust binding will require the environment claim on the OIDC token, which only main-policy environments can mint.
  • Add CODEOWNERS for .github/workflows/main.yaml and scripts/setup-npm-trusted-publishers.sh. Once branch protection requires code-owner review, this constrains who can land workflow-modifying PRs that affect publish trust.
  • Add a 2-second sleep between npm trust calls in the loop. Per the npm v11 bulk usage docs, a 2-second delay is recommended to avoid rate limiting; without it, registering 22 packages in quick succession could hit registry throttling.
  • Track the ADOBE_BOT_NPM_TOKEN rotation explicitly. Open a follow-up issue with a date target ("rotate by 2026-MM-DD if 2 OIDC releases succeed") owned by a person; without a forcing function, retained-but-unused secrets tend to live indefinitely.
  • Document the sigstore runtime dependency in the on-call runbook. NPM_CONFIG_PROVENANCE: 'true' makes sigstore reachability a release prerequisite. Pre-decide whether to disable provenance temporarily during a sigstore outage or to wait it out.
  • Add a node-based config-sync test for the 24 .releaserc.cjs files so future single-file edits cannot silently diverge from the rest. ~30 lines, prevents the most likely class of release-config drift.
  • Audit trail for the setup script. Have the script print a final summary line with git rev-parse HEAD, operator's npm whoami, and timestamp, instructing the operator to paste it into the SITES-42702 ticket. Cheap, makes the trust-establishment ceremony auditable.
  • Verify provenance attestations after the first publish. Already in the PR test plan (npm view @adobe/<pkg> dist.attestations); make it a checkbox item with a named reviewer rather than a generic "after merge" step.

Out of scope, worth tracking

  • Replacing ADOBE_BOT_GITHUB_TOKEN PAT with a GitHub App installation token (acknowledged in PR description).
  • Restricting on: [push] trigger to specific branches (acknowledged).
  • Periodic trust-drift detection cron on npmjs.com bindings (acknowledged).
  • Auto-derive package list from workspace metadata (acknowledged).
  • Pattern reuse across the ~10 other spacecat-* publishing repos belongs in a shared GitHub Action or aem-sites-architecture skill rather than in each repo.

Assessment

Ready to merge? No - two Critical fixes required before this PR functions as designed.

Reasoning: The npm trust github-actions typo (Critical 1) is a definite functional bug; the script fails to register any trust binding, and the merge would leave publishing broken. The branch-protection assertion (Critical 2) is a security model gap: the workflow comments describe a control that does not exist on the repo, leaving the trust binding unconstrained by branch or environment until either branch protection is configured AND/OR the binding includes an --environment claim. Once those two are addressed, this is a clear net-positive for credential security. The Important and Minor items are real but tractable as follow-ups or quick fixes alongside the Critical changes.

Next Steps

  1. Fix the npm trust github-actions -> npm trust github typo, run the script end-to-end, and capture output as evidence.
  2. Configure branch protection on main AND/OR add a GitHub Environment with main-only deployment policy + --environment to the script. Re-run the setup script with the environment binding before merge.
  3. Resolve the spacecat-shared-example inconsistency (mark private: true is simplest).
  4. Tighten the SR_NO_NPM_AUTH check to === 'true' across the 24 files.
  5. Capture pre-merge evidence that all 22 packages have OIDC trust bindings registered against the correct workflow path/environment.

Comment thread scripts/setup-npm-trusted-publishers.sh Outdated

for pkg in "${PACKAGES[@]}"; do
echo " Configuring: ${pkg}"
output=$(npm trust github-actions "${pkg}" \
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Critical: Wrong npm trust subcommand. The script will fail at runtime for every package.

Verified against npm/cli source (lib/commands/trust/index.js):

static subcommands = {
  github: require('./github.js'),
  gitlab: require('./gitlab.js'),
  circleci: require('./circleci.js'),
  list: require('./list.js'),
  revoke: require('./revoke.js'),
}

There is no github-actions alias. Running this as-written produces Unknown subcommand: github-actions for all 22 packages, populates FAILED, and exits non-zero before configuring any trust binding.

If merged with the typo, the first release after merge will fail because no bindings exist and NPM_TOKEN is already gone from the workflow.

Fix: change to npm trust github "${pkg}" \ (also update the script header comments referencing the same subcommand). Run end-to-end against at least one package and capture the output as evidence before merging.

Comment thread .github/workflows/main.yaml Outdated
# to this file (adobe/spacecat-shared / .github/workflows/main.yaml). Renaming or moving
# this file breaks OIDC publishing — re-run scripts/setup-npm-trusted-publishers.sh after.
#
# Trust binding security relies on branch protection: do NOT weaken protection on `main`,
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Critical: Branch protection is asserted in this comment but does not exist on the repo.

Verified empirically:

$ gh api repos/adobe/spacecat-shared/branches/main/protection
{"message":"Branch not protected","status":"404"}
$ gh api repos/adobe/spacecat-shared/rules/branches/main
[]

No protection AND no rulesets on main. Combined with the structure of npm trust bindings (verified in npm/cli lib/commands/trust/github.js - bindings store { repository, workflow_ref.file } only, no branch or ref constraint), this means:

  • Anyone with write access can push to main or merge a PR that modifies main.yaml to bypass the if: github.ref == 'refs/heads/main' gate. OIDC tokens from any branch satisfy the trust claims.
  • The if: check is the only ref constraint, and a single-line PR can remove it.

Mitigations, in order of strength:

  1. (Strongest, code-fixable here) Bind a GitHub Environment (e.g. npm-publish) to the release job with deployment_branch_policy: main only, and add --environment npm-publish to the npm trust github calls in the setup script. The OIDC token then carries the environment claim and only main-policy environments can mint it.
  2. (Minimum, repo-admin) Configure branch protection or a ruleset on main (disallow direct push, require PR review, disallow force-push). Then the workflow's stated assumption actually holds.

This should land before any trust bindings are configured. Right now, even after the typo fix, the security model this PR describes is not the security model the repo enforces.

"changelogFile": "CHANGELOG.md",
}],
"@semantic-release/npm",
...(process.env.SR_NO_NPM_AUTH ? [] : ["@semantic-release/npm"]),
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Important: example package is excluded from trust bindings but still loads @semantic-release/npm - latent break on next release.

Three signals disagree:

  • scripts/setup-npm-trusted-publishers.sh excludes @adobe/spacecat-shared-example from PACKAGES ("template package, not intended for automated publishing")
  • This file (packages/spacecat-shared-example/.releaserc.cjs:13) was updated with the SR_NO_NPM_AUTH shim, so it loads @semantic-release/npm when the env var is unset (i.e. in the release job)
  • packages/spacecat-shared-example/package.json is NOT marked "private": true and currently carries a real version

In the release job (where SR_NO_NPM_AUTH is NOT set), if conventional-commit history ever includes a feat: or fix: scoped to this package, semantic-release will attempt to publish via OIDC, find no trust binding on npmjs.com, and the entire release job will fail.

Fix (either is fine, private is more idiomatic):

  • Set "private": true in packages/spacecat-shared-example/package.json and let @semantic-release/npm skip it
  • Drop .releaserc.cjs from this directory entirely (semantic-release-monorepo skips packages without a config)

Comment thread .releaserc.cjs Outdated
"changelogFile": "CHANGELOG.md",
}],
"@semantic-release/npm",
...(process.env.SR_NO_NPM_AUTH ? [] : ["@semantic-release/npm"]),
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Important: SR_NO_NPM_AUTH truthy check is a footgun across all 24 .releaserc.cjs files.

In JavaScript, the strings "false", "0", "no" are all truthy. If a future maintainer sets SR_NO_NPM_AUTH=false thinking it re-enables the npm plugin, npm publishing is still skipped. The workflow sets 'true' explicitly so current behavior is fine, but the contract is fragile.

Fix: change to process.env.SR_NO_NPM_AUTH === 'true' ? [] : ["@semantic-release/npm"] in all 24 files. One search-and-replace.

Even better: a small node test that loads any one .releaserc.cjs under three env values (undefined, 'true', 'false') and asserts the plugin presence/absence. Locks the contract and detects any future inversion.

env:
GITHUB_TOKEN: ${{ secrets.ADOBE_BOT_GITHUB_TOKEN }}
NPM_TOKEN: ${{ secrets.ADOBE_BOT_NPM_TOKEN }}
SR_NO_NPM_AUTH: 'true'
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Important: Dry-run never exercises the OIDC publish path - first main release is the first real test.

This dry-run step sets SR_NO_NPM_AUTH: 'true', which removes @semantic-release/npm entirely from the plugin chain in all 24 .releaserc.cjs files. On the merge commit to main, only the actual release job runs. So no PR build and no main build prior to the release step validates OIDC publish works end-to-end.

If trust bindings are misconfigured (wrong workflow filename, missing package, npm-side outage during setup), the failure mode is a broken release on main with revert as the only recovery.

Mitigation in this PR (cheapest first):

  • Capture the output of running the corrected setup-npm-trusted-publishers.sh against all 22 packages (and npm trust list for each) and attach to the PR or SITES-42702. Pre-merge evidence that bindings exist where they should.
  • (Better) A controlled canary publish on a release branch before the workflow change lands on main.

The PR description's deferred "publish --dry-run --provenance smoke-test on PRs" item would close this gap structurally; until then the pre-merge evidence is the minimum acceptable substitute.

alinarublea and others added 2 commits May 8, 2026 16:45
- Fix `npm trust github-actions` typo → `npm trust github` (Critical: script would fail for all 22 packages)
- Add GitHub Environment `npm-publish` to release job and `--environment` flag to trust script, replacing the non-existent branch protection as the security boundary
- Mark `spacecat-shared-example/package.json` as `private: true` to prevent latent publish failure if a release commit touches the template package
- Tighten `SR_NO_NPM_AUTH` check from truthy to `=== 'true'` across all 24 `.releaserc.cjs` files to avoid false-skip when set to `'false'` or `'0'`

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Bump root engines.npm to >=11.5.1 to match OIDC publishing requirement
  (workflow installs 11.13.0, setup script requires >=11.10.0)
- Add PACKAGES drift check to setup-npm-trusted-publishers.sh — diffs
  PACKAGES array against publishable workspace packages so a newly added
  package surfaces in preflight instead of failing silently on first release
- Sleep 2s between npm trust calls to avoid registry rate-limit throttling
  on bulk binding (per npm v11 bulk usage guidance)
- Print audit-trail summary (git SHA, npm user, timestamp, env) on success
  to give the trust-establishment ceremony a paste-into-ticket artifact
- Document SR_NO_NPM_AUTH removal trigger inline in the workflow so the
  shim does not silently outlive its purpose

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@alinarublea alinarublea requested a review from solaris007 May 11, 2026 08:36
Copy link
Copy Markdown
Member

@solaris007 solaris007 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @alinarublea,

Strengths

The round-1 fixes landed substantively with quality work:

  • npm trust github subcommand correctly applied (scripts/setup-npm-trusted-publishers.sh:147).
  • SR_NO_NPM_AUTH === 'true' strict equality across all 24 .releaserc.cjs files (verified consistent via grep across the tree).
  • packages/spacecat-shared-example/package.json correctly marked "private": true.
  • PACKAGES drift detection via node-based workspace scan in the setup script.
  • 2-second sleep between trust calls; per-package FAILED tracking with non-zero exit on any failure.
  • Audit trail block at end of setup script (timestamp, git SHA, npm user/version, registry, repo, workflow, env, package count) is the right shape for SITES-42702.
  • engines.npm >=11.5.1 <12.0.0 aligned with OIDC floor.
  • Workflow-level permissions: {} with per-job id-token: write scoping; persist-credentials: 'false' on both checkouts.
  • NPM_CONFIG_PROVENANCE: 'true' on release job adds sigstore attestations on every publish.
  • NPM_TOKEN fully removed from workflow env (verified across diff).
  • SR_NO_NPM_AUTH lifecycle is documented inline in the workflow (main.yaml:59-61).
  • Workflow rename warning comment is highly visible and points operators at the setup script.
  • Prior Important issues (truthy check footgun, example package latent break) and Minor issues (PACKAGES drift, engines.npm floor, audit trail, pacing) all addressed cleanly.
  • Pushed-back dry-run smoke test: counter holds, the setup-script audit trail is an acceptable substitute for a one-shot ceremony.

Issues

Critical (Must Fix)

  1. The npm-publish GitHub Environment does not exist server-side, and main has no branch protection. The trust-binding security model the PR describes is currently decorative, not enforced.

    Prior round's Critical 2 was marked addressed via GitHub Environment. The workflow correctly declares environment: npm-publish (main.yaml:69) and the setup script correctly passes --environment npm-publish to npm trust github. But the GitHub-side configuration was never done. Verified empirically:

    $ gh api repos/adobe/spacecat-shared/environments
    {"total_count":0,"environments":[]}
    $ gh api repos/adobe/spacecat-shared/branches/main/protection
    {"message":"Branch not protected","status":"404"}
    $ gh api repos/adobe/spacecat-shared/rulesets
    []
    

    When the release job first runs environment: npm-publish, GitHub auto-creates the environment with no deployment_branch_policy and no required reviewers. The OIDC token will include the environment: npm-publish claim regardless of which branch ran the job, because the claim is set by the workflow declaration, not validated by environment policy. The trust binding's --environment filter therefore enforces nothing beyond "the workflow declared this string."

    Attack chain: anyone with write access edits main.yaml to remove the if: github.ref == 'refs/heads/main' guard, pushes to a feature branch. The auto-created npm-publish environment has no deployment_branch_policy, so the job runs. OIDC token mints with the right repo, workflow_ref, and environment claims. npm trust binding accepts. Publish succeeds with provenance attestation over the malicious build. End-to-end compromise from a single repo-write event.

    Required order of operations before merge:

    • Create npm-publish environment with a deployment_branch_policy restricting to main, and configure at least one required reviewer.
    • Enable branch protection on main: require PR review, dismiss stale approvals on push, disallow direct push, require status checks (the test job).
    • Verify both via the setup-script preflight check below.
    • Re-run the setup script (npm trust is idempotent, so this is safe even if bindings were already registered).

Important (Should Fix)

  1. Setup script must preflight the GitHub Environment policy and branch protection (scripts/setup-npm-trusted-publishers.sh:36-72).

    Today's preflight verifies npm version, registry URL, whoami == adobe-bot, and the workflow file's presence on disk - all things that would make the trust binding wrong. It does not verify the things that make the security model correct: that npm-publish exists on the repo with a main-only branch policy, and that main is branch-protected. Without these, the script will happily register 22 bindings whose security boundary doesn't exist (Critical 1 above).

    Add at the end of the preflight block:

    ENV_JSON=$(gh api "repos/${REPO}/environments/${ENVIRONMENT}" 2>/dev/null) || {
      echo "ERROR: GitHub Environment '${ENVIRONMENT}' does not exist on ${REPO}."
      echo "       Create it with a main-only deployment_branch_policy before running this script."
      exit 1
    }
    gh api "repos/${REPO}/branches/main/protection" >/dev/null 2>&1 || {
      echo "ERROR: branch protection on 'main' is not configured."
      echo "       Trust binding has no meaningful guard without it."
      exit 1
    }
  2. PACKAGES drift check silently passes if node fails (scripts/setup-npm-trusted-publishers.sh:106-128).

    WORKSPACE_PACKAGES=$(node -e '...' 2>/dev/null) captures node's stdout but swallows stderr. Combined with set -uo pipefail (no -e) and path.join(process.cwd(), "packages") instead of process.env.REPO_ROOT, any failure inside node (wrong cwd, syntax error, malformed package.json) yields an empty WORKSPACE_PACKAGES, an empty DRIFT array, and a silent pass that proceeds straight to trust registration. The drift check becomes a no-op precisely when the workspace state is unreadable - exactly when its safety net is needed.

    Fix: pass REPO_ROOT explicitly, capture exit code, and fail closed on empty output:

    if ! WORKSPACE_PACKAGES=$(REPO_ROOT="${REPO_ROOT}" node -e '
        const fs = require("fs"); const path = require("path");
        const dir = path.join(process.env.REPO_ROOT, "packages");
        ...'); then
      echo "ERROR: drift check failed to enumerate workspace packages"; exit 1
    fi
    [ -z "$WORKSPACE_PACKAGES" ] && { echo "ERROR: workspace package list empty"; exit 1; }
  3. SR_NO_NPM_AUTH cleanup has no forcing function (main.yaml:59-62 + 24 .releaserc.cjs files).

    The "remove after 2 successful OIDC releases" instruction lives only in the workflow header comment and (implicitly) on the linked Jira ticket. Lifecycle markers in comments reliably outlive memory. After the trigger date the shim becomes dead supply-chain-adjacent code: 24 release configs coupled to an env var that should no longer exist, plus a one-line escape hatch that re-enables token-based publishing if ADOBE_BOT_NPM_TOKEN is ever re-added.

    Cheapest fix: drop a marker file (e.g. .npm-oidc-bootstrap-complete containing the date of the second successful release) and add a 5-line CI step that fails if the marker exists AND any .releaserc.cjs still contains SR_NO_NPM_AUTH. Alternative: file a follow-up issue with a concrete removal date assigned to a person, referenced from the workflow comment.

  4. Release job lacks timeout-minutes and concurrency group (main.yaml:64).

    With OIDC token exchange, sigstore signing, and npm publish now in the critical path, the number of upstream hang points grew but the GitHub Actions default per-job timeout (6 hours) is unchanged. A hung release ties up a runner and pushes triage well past the failure. Separately, if two PRs land on main faster than the first release completes, two release jobs run in parallel - two simultaneous OIDC mints, two sigstore submissions, and semantic-release's tag-then-publish flow assumes serialized execution on main.

    Fix:

    release:
      timeout-minutes: 15
      concurrency:
        group: npm-publish-main
        cancel-in-progress: false
  5. No CI lint enforces the SR_NO_NPM_AUTH guard across the 24 .releaserc.cjs files (workspace-wide).

    All 24 files are consistent today; consistency is preserved only by reviewer attention. A new package added by copy-pasting an older .releaserc.cjs that omits the guard, or a manual edit removing it from one file, would not be caught. While SR_NO_NPM_AUTH=true is set in CI the regression is loud (dry-run fails). After the shim is removed (per the stated lifecycle) a missing guard becomes a silent no-op - the regression vector flips. Ten-line CI step removes the entire class of single-file drift:

    missing=$(grep -L "SR_NO_NPM_AUTH === 'true'" packages/*/.releaserc.cjs .releaserc.cjs)
    if [ -n "$missing" ]; then echo "FAIL: SR_NO_NPM_AUTH guard missing in: $missing"; exit 1; fi

    This pairs with the SR_NO_NPM_AUTH tripwire from finding 4 - both guard the same future regression.

Minor (Nice to Have)

  1. Drift check is one-directional (scripts/setup-npm-trusted-publishers.sh:106-128). PACKAGES -> workspace coverage is verified (catches new package missing from PACKAGES) but not the inverse (a name in PACKAGES that no longer exists in the workspace - deleted, renamed, or flipped to private: true). Result over time: orphaned npm trust bindings on npmjs.com. Same node enumeration, no extra deps.

  2. Audit trail is stdout-only (scripts/setup-npm-trusted-publishers.sh:175-192). Operators are asked to paste the trail into SITES-42702, but if the script is piped or redirected the trail is gone. One-liner fix: also write to a sibling file (npm-trust-audit-${TIMESTAMP}.log).

  3. Sigstore is now a release prerequisite, no break-glass documented (main.yaml:76 NPM_CONFIG_PROVENANCE: 'true'). status.sigstore.dev has had multi-hour outages. Add an inline comment naming the override (NPM_CONFIG_PROVENANCE: 'false' temporarily, loses attestation for that release, restore after sigstore recovers), or link to a runbook.

  4. Rollback path lives in PR description, not in repo. Post-merge an on-call responding to a release failure needs both failure-mode triage (sigstore vs OIDC mint vs trust binding vs registry 5xx) and the revert + secret-retention steps. Move to a short RELEASE-RUNBOOK.md referenced from the workflow header.

  5. Workflow header warns about file rename; environment rename hazard is missing (main.yaml:3-9). Renaming the npm-publish GitHub Environment has the same silent-break failure mode: trust bindings include the environment claim, the new environment name will not be claimed by the OIDC token. Add one line to the existing comment.

  6. Stale "npm OIDC dry-run auth" comment on test job permissions (main.yaml:26). SR_NO_NPM_AUTH: 'true' removes @semantic-release/npm from the plugin chain in dry-run, so the justification no longer holds. id-token: write remains correct for ECR OIDC; trim the comment to avoid implying the test job participates in publish auth.

Recommendations

  • Shared .releaserc.base.cjs at workspace root: 24 .releaserc.cjs files now differ only by their containing directory. Every future change (shim removal, plugin reorder, semantic-release upgrade) requires touching 24 files in lockstep. A module.exports = require('../../.releaserc.base.cjs') per package collapses this to one place and makes the eventual SR_NO_NPM_AUTH removal a one-line edit.
  • Encapsulate the OIDC pattern before it propagates: ~10 sibling spacecat-* / mysticat-* / helix-* repos publish to npm and will need this migration. A composite action (adobe/spacecat-shared/.github/actions/npm-publish-oidc) or an aem-sites-architecture skill carries the workflow snippet, setup-script template, and prerequisites cleanly. Per workspace docs, the skill is the higher-leverage carrier for org-level standardization.
  • Periodic trust-binding audit cron: weekly job that runs npm trust list per package and diffs against expected { repo, workflow, environment }. Detects drift on the registry side, not just in the repo. Pairs with the existing PACKAGES check.
  • Failure alerting on the release job: today the only detection signal is "consumer service install fails downstream", which is slow. Minimum bar: if: failure() step that posts to a Slack channel or opens a GitHub issue. Pairs with the timeout/concurrency findings.
  • Add a --dry-run mode to the setup script: preflight + drift check + audit trail without npm trust calls. Useful for CI smoke-checking the script itself and for operators verifying local state before the real run.
  • Sigstore consumer verification policy: attestations are emitted but no Adobe-side consumer enforces them. Define what services importing @adobe/spacecat-shared-* should do if attestation is missing/invalid.

Out of scope, worth tracking

  • Deferred npm publish --dry-run --provenance smoke test on PRs (pushed-back from prior round; audit trail substitutes).
  • Action SHA pinning (actions/checkout@v6, actions/setup-node@v6, actions/cache@v5) - acknowledged in PR description.
  • on: [push] trigger narrowing to specific branches - acknowledged.
  • ADOBE_BOT_GITHUB_TOKEN PAT to GitHub App installation token - acknowledged.
  • ADOBE_BOT_NPM_TOKEN explicit rotation deadline: PR mentions "after 2 successful release cycles" without an owner or date target. Pin both in SITES-42702.
  • shellcheck/actionlint on the new shell + workflow artifacts.
  • Workflow rename CI enforcement: header-comment-only today.
  • CODEOWNERS on .github/workflows/main.yaml and scripts/setup-npm-trusted-publishers.sh.

Assessment

Ready to merge? No, with fixes.

Reasoning: One Critical finding blocks merge. The advertised security model relies on the npm-publish GitHub Environment having a main-only deployment branch policy AND main being branch-protected. Neither is configured today. The workflow + setup-script changes are correct as code, but they describe a security boundary that does not exist on the repo. Once the environment is created with the right policy, branch protection is enabled, and the setup script's preflight gates against both, this becomes a strong supply-chain improvement. The Important findings are all small, mechanical, and worth landing alongside the Critical fix to remove the remaining gaps between the asserted and enforceable security model.

Next Steps

  1. Create the npm-publish GitHub Environment with a main-only deployment branch policy and required reviewers.
  2. Enable branch protection on main (PR review, dismiss stale approvals, disallow direct push, require test status check).
  3. Add the environment + branch-protection preflight to the setup script.
  4. Tighten the drift check to capture node exit code and fail closed on empty output.
  5. Add timeout-minutes: 15 and a concurrency group to the release job.
  6. Add the SR_NO_NPM_AUTH lifecycle tripwire (marker file + CI step), or file a follow-up issue with a concrete date and owner.
  7. Add the CI lint for the 24-file SR_NO_NPM_AUTH guard.
  8. Re-run the setup script after items 1-3 land (npm trust is idempotent).
  9. Minor items are optional but cheap; bundle into the same fix commit if pursuing.

runs-on: ubuntu-latest
needs: test
if: github.ref == 'refs/heads/main'
environment: npm-publish
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Critical: The npm-publish GitHub Environment does not exist server-side, and main has no branch protection. The trust-binding security model this PR describes is currently decorative, not enforced.

Verified empirically:

$ gh api repos/adobe/spacecat-shared/environments
{"total_count":0,"environments":[]}
$ gh api repos/adobe/spacecat-shared/branches/main/protection
{"message":"Branch not protected","status":"404"}

When this job first runs environment: npm-publish, GitHub auto-creates the environment with no deployment_branch_policy. The OIDC token will carry the environment: npm-publish claim regardless of branch (the claim is set by the workflow declaration, not validated by env policy). The trust binding --environment filter therefore enforces nothing beyond "the workflow declared this string." Combined with no branch protection on main, write access to the repo equals direct push to main equals OIDC publish capability.

Required order of operations before merge: create the npm-publish environment with a main-only deployment_branch_policy + required reviewer, enable branch protection on main, add the preflight check to the setup script, then re-run the setup script (npm trust is idempotent). See main review body for the full attack chain and required steps.

if [ ! -f "${REPO_ROOT}/.github/workflows/${WORKFLOW}" ]; then
echo "ERROR: workflow file not found: .github/workflows/${WORKFLOW}"
echo " Update the WORKFLOW constant in this script if main.yaml was renamed."
exit 1
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Important: missing GitHub Environment + branch-protection preflight.

Today's preflight verifies npm version, registry URL, whoami == adobe-bot, and the workflow file's presence on disk. It does not verify the things that make the security model correct: that npm-publish exists on the repo with a main-only branch policy, and that main is branch-protected. Without these, the script registers 22 bindings whose security boundary does not exist (Critical 1).

Add at the end of the preflight block:

ENV_JSON=$(gh api "repos/${REPO}/environments/${ENVIRONMENT}" 2>/dev/null) || {
  echo "ERROR: GitHub Environment '${ENVIRONMENT}' does not exist on ${REPO}."
  echo "       Create it with a main-only deployment_branch_policy before running this script."
  exit 1
}
gh api "repos/${REPO}/branches/main/protection" >/dev/null 2>&1 || {
  echo "ERROR: branch protection on 'main' is not configured."
  exit 1
}

Comment thread scripts/setup-npm-trusted-publishers.sh Outdated
WORKSPACE_PACKAGES=$(node -e '
const fs = require("fs");
const path = require("path");
const dir = path.join(process.cwd(), "packages");
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Important: PACKAGES drift check silently passes if node fails.

WORKSPACE_PACKAGES=$(node -e '...' 2>/dev/null) captures stdout but swallows stderr. Combined with set -uo pipefail (no -e) and path.join(process.cwd(), "packages") instead of process.env.REPO_ROOT, any failure inside node (wrong cwd, syntax error, malformed package.json) yields an empty WORKSPACE_PACKAGES, an empty DRIFT, and a silent pass that proceeds straight to trust registration. The drift check becomes a no-op precisely when the workspace state is unreadable - exactly when the safety net is needed.

Fix: pass REPO_ROOT explicitly, capture exit code, and fail closed on empty output:

if ! WORKSPACE_PACKAGES=$(REPO_ROOT="${REPO_ROOT}" node -e '
    const dir = path.join(process.env.REPO_ROOT, "packages");
    ...'); then
  echo "ERROR: drift check failed to enumerate workspace packages"; exit 1
fi
[ -z "$WORKSPACE_PACKAGES" ] && { echo "ERROR: workspace package list empty"; exit 1; }

# SR_NO_NPM_AUTH skips @semantic-release/npm in the plugin chain (see all .releaserc.cjs).
# Remove this env var AND the matching guard from every .releaserc.cjs after 2 successful
# OIDC releases on main confirm publishing works without a token (tracked in SITES-42702).
SR_NO_NPM_AUTH: 'true'
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Important: SR_NO_NPM_AUTH cleanup has no forcing function.

The "remove after 2 successful OIDC releases" instruction lives only in this comment and on SITES-42702. Lifecycle markers in comments reliably outlive memory. After the trigger date the shim becomes dead supply-chain-adjacent code: 24 release configs coupled to an env var that should no longer exist, plus a one-line escape hatch that re-enables token-based publishing if ADOBE_BOT_NPM_TOKEN is ever re-added.

Cheapest fix: drop a marker file (e.g. .npm-oidc-bootstrap-complete containing the date of the second successful OIDC release) and add a 5-line CI step that fails if the marker exists AND any .releaserc.cjs still contains SR_NO_NPM_AUTH. Alternative: file a follow-up issue with a concrete removal date assigned to a person, referenced from this comment.

# OIDC releases on main confirm publishing works without a token (tracked in SITES-42702).
SR_NO_NPM_AUTH: 'true'

release:
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Important: release job lacks timeout-minutes and concurrency group.

With OIDC token exchange, sigstore signing, and npm publish now in the critical path, the number of upstream hang points grew but the GitHub Actions default per-job timeout (6 hours) is unchanged. A hung release ties up a runner and pushes triage well past the failure. Separately, if two PRs land on main faster than the first release completes, two release jobs run in parallel - two simultaneous OIDC mints, two sigstore submissions, and semantic-release's tag-then-publish flow assumes serialized execution on main.

Fix:

release:
  timeout-minutes: 15
  concurrency:
    group: npm-publish-main
    cancel-in-progress: false

alinarublea and others added 3 commits May 12, 2026 17:15
main.yaml:
- Add timeout-minutes: 15 + concurrency group (npm-publish-main, no cancel)
  to release job so semantic-release tag-then-publish cannot race on parallel
  merges and a hung release cannot tie up a runner.
- Add SR_NO_NPM_AUTH guard consistency check as a test-job step. Fails CI
  if any .releaserc.cjs is missing the === 'true' guard. Also includes a
  cleanup tripwire: if .npm-oidc-bootstrap-complete exists, fails CI until
  the shim is fully removed from all .releaserc.cjs and the workflow.
- Trim stale 'npm OIDC dry-run auth' justification from test job
  id-token: write — SR_NO_NPM_AUTH=true removes @semantic-release/npm
  from the dry-run plugin chain entirely.
- Add environment-rename hazard to workflow header (same silent-break mode
  as workflow-file rename).
- Document sigstore break-glass next to NPM_CONFIG_PROVENANCE.

setup-npm-trusted-publishers.sh:
- Add GitHub Environment + branch-protection preflight before any npm trust
  call. Verifies env exists, has main-only deployment_branch_policy, and
  main has branch protection. Without these the trust binding's
  '--environment' filter is decorative.
- Make drift check fail-closed: capture node exit code via if-let,
  pass REPO_ROOT explicitly (no cwd dependency), and abort on empty
  workspace enumeration. Previously a node failure silently passed.
- Add reverse drift detection — entries in PACKAGES no longer in the
  workspace surface as a warning so orphaned trust bindings on npmjs.com
  are visible (non-blocking; they don't break the current run).
- Write audit trail to a sibling log file (scripts/npm-trust-audit-
  TIMESTAMP.log) in addition to stdout, so a piped/redirected run still
  leaves an artifact on disk for SITES-42702.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
main.yaml — Redesign SR_NO_NPM_AUTH guard step to address:
- Self-referential tripwire (HIGH): the previous marker-file branch grepped
  for 'SR_NO_NPM_AUTH' in main.yaml, which always matched the step's own
  bash, trapping cleanup PRs in unkillable red CI.
- Fail-open glob (HIGH): bash without nullglob passed literal
  packages/*/.releaserc.cjs to grep, which suppressed stderr and produced
  empty output — silently passing when files moved or renamed.
- Undocumented marker contract (MEDIUM): .npm-oidc-bootstrap-complete had
  no creation procedure documented anywhere.

New design auto-detects phase via the YAML mapping pattern of the env-var
declaration (regex anchored to indented YAML key, syntactically distinct
from the JS strict-equality '=== true' the step checks for in configs).
No marker file. Phase 1 (shim active) requires all configs to have the
strict guard. Phase 2 (shim removed) requires no config to reference
SR_NO_NPM_AUTH — partial-cleanup detection comes for free.

Adds shopt -s nullglob + non-empty files-array assert to close fail-open.

setup-npm-trusted-publishers.sh:
- Branch-protection preflight now PARSES the policy with gh api --jq and
  asserts: required_approving_review_count >= 1, enforce_admins.enabled,
  no force-push, no deletion, 'Test' in required_status_checks.contexts.
  Closes a bypass where a repo admin could weaken protection to the
  minimum config that still returns 200 while leaving main effectively
  unprotected.
- Audit trail no longer silently swallows write failures (set -uo pipefail
  without -e meant tee errors continued). Builds content in a variable,
  prints to stdout unconditionally, attempts file write as a non-blocking
  artifact, surfaces a WARNING if the file is empty/unwritten.
- Audit filename gets $$ (PID) suffix to avoid same-second collision.

Verified locally:
- bash -n / yaml.safe_load: syntax OK
- Phase-detection regex: matches only the env-var line (exactly 1 match in
  main.yaml at line 112); does not match the bash code in the same step
- Phase 1 on real repo: passes (24 configs have strict guard)
- Synthetic drift: correctly flags missing guard
- Phase 2 full cleanup (in-memory sim): correctly detected
- Phase 2 partial cleanup (in-memory sim): correctly flags leftover refs
- nullglob regression: confirmed OLD code fail-open, NEW code closed
- Branch-protection preflight against live repo state: 'OK'
- Branch-protection preflight against 4 synthetic weak inputs: each
  flagged correctly (zero approvals / admins exempt / force-push allowed
  / no Test in contexts)
- Audit trail success + failure paths

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@alinarublea alinarublea requested a review from solaris007 May 12, 2026 15:40
Copy link
Copy Markdown
Member

@solaris007 solaris007 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @alinarublea,

Round-3 changes resolve 11 of 12 round-2 findings cleanly and introduce some genuinely good design: the SR_NO_NPM_AUTH phase auto-detection (no marker file), the five-attribute branch protection jq check, and the decoupled audit-log write are all better than what they replace. Verified server-side: npm-publish environment exists with main-only deployment branch policy plus required-reviewer (spacecat-admins team); main branch protection enforces PR review, dismiss-stale, enforce-admins, no force-push, no deletion, and "Test" status check. The migration approach is sound and the security improvement is real.

Three Important items before merge: one is a small jq addition that closes a real attack chain in the protection-state check, one is moving the rollback procedures from PR description into the repo for 3am-page accessibility, and one is committing to extract this pattern before it propagates to the ~10 sibling repos that need the same migration.

Strengths

Prior round-2 findings now addressed:

  • Critical 1 (env + branch protection don't exist server-side): both are configured. Environment created 2026-05-12T14:20:43Z with main-only deployment_branch_policy; branch protection passes the script's own 5-attribute check.
  • Important 2 (script preflight for env + branch protection): added in commit 76842cb, hardened in ac8975f with detailed jq attribute checks and current-state-on-failure dump. The "what is actually configured" output on failure is the right operator affordance.
  • Important 3 (drift check silent fail): fail-closed via explicit REPO_ROOT, if ! capture of node exit code, and empty-output rejection.
  • Important 4 (SR_NO_NPM_AUTH cleanup forcing function): auto-detection from the workflow file itself. Cleaner than the marker file. Phase-1 regex ^[[:space:]]+SR_NO_NPM_AUTH:[[:space:]]+'true' and the JS guard pattern SR_NO_NPM_AUTH === 'true' are syntactically distinct, so the step cannot self-reference.
  • Important 5 (release job timeout + concurrency): timeout-minutes: 15 plus concurrency: { group: npm-publish-main, cancel-in-progress: false }. The "let in-flight finish" comment captures the rationale exactly - cancellation here would be the dangerous setting given semantic-release's tag-then-publish flow.
  • Important 6 (CI lint for SR_NO_NPM_AUTH guard across 24 files): the same tripwire step.
  • Minor 7 (drift one-directional): reverse drift detection added with explicit npm trust revoke remediation line in the warning.
  • Minor 8 (audit trail stdout-only): sibling log file with PID suffix for collision safety; decoupled file write so a failed write does not lose the audit content from stdout.
  • Minor 9 (sigstore break-glass): inline comment naming the NPM_CONFIG_PROVENANCE: 'false' override.
  • Minor 11 (env rename hazard): header comment now covers both file rename and environment rename.
  • Minor 12 (stale dry-run auth comment): trimmed.

Beyond addressed findings:

  • Workflow-level permissions: {} with per-job scoping is correct least-privilege. id-token: write only where needed.
  • persist-credentials: 'false' on both checkout steps prevents GITHUB_TOKEN persisting into git config.
  • NPM_CONFIG_PROVENANCE: 'true' ties sigstore attestations to every publish - real supply-chain integrity gain.
  • The setup script is well-engineered fail-closed code with idempotent re-run semantics.

Issues

Important (Should Fix)

1. Protection-state check misses dismiss_stale_reviews and bypass_pull_request_allowances (scripts/setup-npm-trusted-publishers.sh jq block - also posted inline).

Verified attack chain without dismiss_stale_reviews: true:

  1. Author opens PR with benign change, gets approval.
  2. Author force-pushes the PR branch (force-push to PR branches is allowed by default; only target-branch protection blocks it).
  3. PR retains the stale approval, satisfying the required_approving_review_count check.
  4. PR merges to main.
  5. Release job publishes attacker code via OIDC with a valid sigstore attestation.

dismiss_stale_reviews is currently true on main, but the script's preflight does not assert it - if a future admin disables it, the script still prints "preflight OK" and the trust boundary the binding depends on degrades silently.

bypass_pull_request_allowances.users currently lists adobe-bot (required for semantic-release commits - acknowledged as residual elsewhere). If anyone is added to that list later, the OIDC publish path becomes reachable from a non-PR push by the new actor. The script should at minimum detect unexpected additions.

Suggested additions to the jq chain, sibling to enforce_admins:

elif (.required_pull_request_reviews.dismiss_stale_reviews // false) != true then "weak: dismiss_stale_reviews disabled"
elif ((.required_pull_request_reviews.bypass_pull_request_allowances.users // []) +
      (.required_pull_request_reviews.bypass_pull_request_allowances.teams // []) +
      (.required_pull_request_reviews.bypass_pull_request_allowances.apps // []) | length) > 1 then "weak: unexpected bypass actors"

One-line concept, two-line fix. Worth landing in this PR.

2. Rollback procedures should live in the repo, not the PR description (new file: docs/RELEASE-RUNBOOK.md or RUNBOOK.md - also posted inline at the workflow header).

The PR description currently holds the only documented procedures for:

  • First-release OIDC failure -> revert PR, NPM_TOKEN resumes
  • 2-release dual-write window before token rotation
  • Trust-binding break-glass (rename of main.yaml or npm-publish environment)
  • Sigstore break-glass (flip NPM_CONFIG_PROVENANCE, with the realistic timing: open PR -> get review -> Test passes -> merge -> re-trigger release, which is roughly 30-60 minutes minimum given enforce_admins.enabled = true)

PR descriptions are hidden, edited, lost in archive search, and require GitHub auth to retrieve. For a 3am page, an in-repo runbook is greppable from the checkout the on-call already has open. The work is small (lift from the PR description, add the sigstore-timing reality, add a partial-publish recovery note for the case where the 15-minute timeout kills a release mid-way through 22 packages and leaves a git tag for an incomplete release). This was Minor in round 2 and remained unresolved; calling it out again as Important from the operational lens.

3. Extract before propagating to the sibling repos (PR description mentions this pattern will need to land in ~10 other repos).

The current code shape is monolithic per-repo: a ~320-line setup script with hardcoded REPO/WORKFLOW/PACKAGES/protection-schema plus a ~50-line workflow guard step with the same regex contract. Drift across 10 copies is supply-chain risk, not test-suite tedium - any of those copies regressing on the protection-state check or the phase-detection regex is a silent security weakening, not a build failure.

Either (a) extract the setup script into a parameterised tool (@adobe/oidc-trusted-publishers-setup or a workspace skill) and the workflow guard into a composite action (adobe-actions/verify-sr-no-npm-auth-guard@v1) before this lands in the second repo, or (b) commit to extraction explicitly with a tracking issue and a hard cap (e.g. "extract by repo 3"). Either is fine; the current shape with no plan to extract is what to avoid. Option (b) is acceptable for closing this finding - file the tracking issue and link it from the PR.

Minor (Nice to Have)

Phase-1 grep masks "file not found" exit code (.github/workflows/main.yaml SR_NO_NPM_AUTH check). missing=$(grep -L "..." "${files[@]}" || true) swallows exit code 2 (grep file-not-found) the same way it swallows exit 1 (all matched). If root .releaserc.cjs is deleted, grep writes No such file or directory to stderr but the tripwire reports OK. The tripwire's stated job is consistency assertion; a missing input file should fail loud. Either guard with [ -f "$f" ] per path, or capture grep's actual exit code separately.

Phase-detection regex brittle to YAML quoting variants (same step). SR_NO_NPM_AUTH: "true" (double quotes) and SR_NO_NPM_AUTH: true (unquoted YAML boolean) both produce the same env value at runtime but neither matches ^[[:space:]]+SR_NO_NPM_AUTH:[[:space:]]+'true'. Result: a Prettier/yamllint reformat that flips quote style trips the workflow into phase 2 and fails on "leftover references". Fails closed (CI fails), so not a security issue, but a confusing maintenance signal. Broaden the regex to accept any of 'true'|"true"|true, or pin quote style in the inline comment as documented contract.

JS guard pattern brittle to JS quote-style (same step). The phase-1 search for SR_NO_NPM_AUTH === 'true' would miss === "true" if a Prettier config flips JS string quotes. Tighten to === ['"]true['"], or pin quote style in lint config.

Audit log file write [ -s ] check passes on truncated writes (scripts/setup-npm-trusted-publishers.sh audit block). Disk-full mid-write leaves a partial file; [ -s ] returns true; the "Audit log also written to" message lies. Atomic mktemp + mv is safer:

TMPFILE=$(mktemp -t npm-trust-audit.XXXXXX) && \
  printf '%s\n' "${AUDIT_CONTENT}" > "$TMPFILE" && \
  mv "$TMPFILE" "${AUDIT_FILE}"

Protection-state check reports only the first weakness. If main is missing approval requirement AND admin exemption AND Test status check, the operator fixes one, re-runs, sees the next, etc. UX cost only, no security change. The aggregator pattern would report all weaknesses at once.

npm version preflight fails open on non-numeric prefix (scripts/setup-npm-trusted-publishers.sh early preflight). CURRENT_NPM=v11.13.0 (some npm --version variants emit the v prefix) -> cut -d. -f1 yields v11 -> [ "$CURRENT_MAJOR" -lt 11 ] errors with "integer expression expected" -> conditional reads as false -> preflight passes. Add a regex sanity check on CURRENT_NPM before the integer comparison.

gh auth status not preflight-checked. The script verifies command -v gh (installed) but not auth state. A user with gh installed but not logged in hits cryptic gh api errors during environment + protection checks. One-line addition.

Audit log files in scripts/ are not gitignored. The audit log writes to ${SCRIPT_DIR}/npm-trust-audit-${TIMESTAMP}-$$.log, inside the repo. An operator who runs the script then does git add scripts/ could accidentally commit one. Add scripts/npm-trust-audit-*.log to .gitignore, or write to ${TMPDIR:-/tmp}/ instead.

Recommendations

  • GH App token migration before propagation: the adobe-bot bypass on main branch protection is a residual hole in the post-OIDC threat model. Pre-OIDC: compromise NPM_TOKEN -> publish malicious code directly. Post-OIDC: compromise adobe-bot's GitHub token -> push to main bypassing PR review -> next release publishes via the trust binding with a valid sigstore attestation that says "this was built from main" (true but compromised). Not introduced by this PR; the PR materially reduces overall risk by removing the NPM_TOKEN path. Track as a separate hardening (options: PR-based version bumps with auto-merge, split adobe-bot into separate npm and GitHub identities, push-time signatures). Sequence ahead of the 10 sibling repos so they inherit the cleaner trust shape.
  • Periodic protection-drift detection cron: weekly job that runs the same 5-attribute jq check (now expanded per Important 1) against main's branch protection. Catches an admin tweaking protection rules and forgetting to revert - a real failure mode for long-lived security baselines.
  • --preflight-only flag on the setup script: the preflight portion is already complete enough to stand alone. A dry-run flag enables periodic cron re-verification and local iteration when modifying the script. Pairs with the cron above.
  • Extract the CI tripwire into a fixture-tested script (scripts/verify-sr-no-npm-auth-guard.sh with tests/fixtures/ covering phase-1-clean, phase-1-missing-guard, phase-2-clean, phase-2-leftover-guard). The tripwire is now the permanent regression guard for the shim cleanup; making it self-tested future-proofs it against the regex brittleness flagged in the Minor section.
  • Lift the protection-schema contract into the extracted tool, not into each repo's copy. When the org's branch-protection model evolves (CODEOWNERS, signed commits, ruleset migration), one update propagates instead of N.

Pushed-back from prior rounds (resolved)

  • Dry-run never exercises OIDC publish path (round-1 Important 3): pushback held in round 2, audit-trail substitutes; not re-raising.

Assessment

Ready to merge? No, with fixes.

Reasoning: This is a clear net-positive supply-chain change. It eliminates the long-lived NPM_TOKEN compromise vector, narrows the publish path to a server-enforced main-only environment guarded by branch protection, adds sigstore provenance to every release, and makes the publish path observable. Round 3 cleanly addresses 11 of 12 round-2 findings, with the new auto-detection tripwire being a real improvement over the marker-file approach it replaced.

The three Important items above (protection-check additions, in-repo runbook, extraction commitment) are all small, mechanical, and worth landing in this PR or in fast-follows so the as-built configuration and the as-tested configuration stay coupled and so the migration to 10 sibling repos starts from a clean baseline.

Next Steps

  1. Add dismiss_stale_reviews and bypass_pull_request_allowances checks to the protection-state jq chain.
  2. Lift the rollback procedures into docs/RELEASE-RUNBOOK.md (include sigstore-timing reality and partial-publish recovery).
  3. File a tracking issue for setup-script + workflow-guard extraction, link from this PR.
  4. Minor items are optional; the Phase-1 grep silent-pass and audit-log truncated-write cases are the two most worth picking up if you are already touching the file.

Comment thread scripts/setup-npm-trusted-publishers.sh Outdated
else "OK" end
' 2>/dev/null || echo "missing")

if [ "$PROTECTION_STATE" != "OK" ]; then
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Important: Protection-state check misses dismiss_stale_reviews and bypass_pull_request_allowances.

Verified attack chain without dismiss_stale_reviews: true:

  1. Author opens PR, gets approval.
  2. Author force-pushes the PR branch (allowed by default for PR branches).
  3. Stale approval is retained, satisfying required_approving_review_count.
  4. PR merges. Release job publishes attacker code via OIDC with valid sigstore attestation.

bypass_pull_request_allowances is not checked at all. adobe-bot is currently the only entry (required for semantic-release); silent additions to this list would degrade the OIDC trust boundary.

Suggested additions to the jq chain:

elif (.required_pull_request_reviews.dismiss_stale_reviews // false) != true then "weak: dismiss_stale_reviews disabled"
elif ((.required_pull_request_reviews.bypass_pull_request_allowances.users // []) +
      (.required_pull_request_reviews.bypass_pull_request_allowances.teams // []) +
      (.required_pull_request_reviews.bypass_pull_request_allowances.apps // []) | length) > 1 then "weak: unexpected bypass actors"

# Renaming the 'npm-publish' GitHub Environment referenced below has the same silent-break
# failure mode — re-run the setup script after.
#
# Trust binding security relies on the 'npm-publish' GitHub Environment, which must be
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Important: Rollback procedures should live in the repo, not the PR description.

The PR description currently holds the only documented procedures for first-release OIDC failure, dual-write rotation window, trust-binding break-glass (file/env rename), and sigstore break-glass (flip NPM_CONFIG_PROVENANCE - realistic timing is 30-60 minutes given enforce_admins.enabled = true).

For a 3am page, an in-repo docs/RELEASE-RUNBOOK.md is greppable from the checkout the on-call already has open. PR descriptions get lost in archive search and require GitHub auth.

Work is small (lift from the PR description, add the sigstore-timing reality, add a partial-publish recovery note for the case where the 15-minute timeout kills a release mid-way through 22 packages).

alinarublea and others added 2 commits May 13, 2026 10:34
Setup script (scripts/setup-npm-trusted-publishers.sh):
- npm version preflight now strips leading 'v' and rejects non-semver
  input rather than falling through silently (Minor 9: 'v11.13.0' would
  yield 'v11' from cut, integer comparison errors, conditional reads as
  false, preflight bypass).
- Added 'gh auth status' preflight (Minor 10): catches gh-installed-but-
  not-authenticated case before it surfaces as cryptic api errors.
- Branch-protection preflight expanded to aggregator pattern (Important 1):
  reports every weakness in one pass instead of one-at-a-time; adds
  dismiss_stale_reviews check (closes the stale-approval + force-push-to-
  PR-branch attack chain the round-3 reviewer documented); adds
  bypass_pull_request_allowances check (catches unexpected actors added
  to the bypass list beyond adobe-bot).
- Audit-trail write now atomic via mktemp + mv (Minor 8): a partial write
  on disk-full no longer leaves a half-finished file that '[ -s ]' would
  accept as success.

Workflow (.github/workflows/main.yaml):
- Per-file existence check before grep in the SR_NO_NPM_AUTH guard step
  (Minor 5): catches grep exit code 2 (file not found) that '|| true'
  previously swallowed. Stale glob expansion or symlink-to-nowhere now
  fails the tripwire loudly.
- Phase-detection YAML regex now accepts 'true' | "true" | true (Minor 6):
  a Prettier/yamllint reformat flipping quote style no longer trips
  phase-2 cleanup detection.
- Phase-1 JS guard regex now accepts both ' and " (Minor 6): same
  robustness across JS quote-style reformats.

Documentation (docs/RELEASE-RUNBOOK.md, new):
- Lifts rollback procedures out of the PR description for 3am-page
  accessibility (Important 2). Covers the 5 failure modes: first-OIDC-
  fails revert path, sigstore-down break-glass (with realistic 30-60min
  timing per branch protection), partial-publish recovery, workflow/env
  rename hazard, setup-script preflight refusal. Plus the full
  ADOBE_BOT_NPM_TOKEN rotation sequence with which steps the CI tripwire
  enforces atomicity over.

Verified locally:
- bash -n / yaml parse — OK
- npm version parser against 6 inputs (incl. 'v'-prefix, garbage) — all
  classify correctly
- Aggregator jq against 5 synthetic protection states (live OK, single
  weakness, multiple weaknesses) — every weakness reported
- Phase-detection regex against 6 YAML variants (3 quote styles + 3
  malformed) — only matches valid forms
- JS guard regex against 4 variants — accepts both quote styles
- New step body against real repo — PHASE 1 PASS, 24 configs
- Atomic audit write success + failure paths

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@alinarublea alinarublea requested a review from solaris007 May 13, 2026 08:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants