DEVOP-560: add org-wide daily Shai-Hulud IOC sweep workflow #8

Open
srt0422 wants to merge 10 commits into main from scott/devop-560-shai-hulud-sweep

@srt0422 srt0422 commented May 25, 2026

Summary

Adds the first workflow under .github/workflows/ in this repo — a scheduled
daily sweep for Shai-Hulud indicators of compromise across every repo in the
allora-network org, plus a rolling GitHub Issue and Slack page on findings.

Closes DEVOP-560.

What ships

  • .github/workflows/shai-hulud-sweep.yml, with schedule: '7 4 * * *'
    (04:07 UTC daily, off-peak + off-minute) plus workflow_dispatch for manual
    runs. Permissions limited to contents: read + issues: write. Pinned SHAs
    for actions/checkout@v4.2.2 and actions/upload-artifact@v4.4.3, matching
    the convention in allora-network/ci-workflows-private.
  • scripts/shai-hulud-ioc-sweep.sh — canonical detection logic, vendored
    verbatim from allora-network/skills@71aeefb (skills/shai-hulud-defense/scripts/).
    See file header for refresh procedure / pinned commit.
  • docs/plans/2026-05-25-devop-560-shai-hulud-sweep.md — design notes for
    why we vendor the script, how the rolling issue is maintained, when Slack
    fires, and the GH_ORG_READ_TOKEN follow-up.

Detection coverage (per the vendored script)

  • Lockfile entries (npm/pip/Go) matching .github/security/ioc-packages.txt.
  • Any .js/.cjs/.mjs file ≤ 2 MB whose SHA-256 matches
    .github/security/ioc-hashes.txt (filename-agnostic — bundle.js rename
    doesn't bypass).
  • Persistence: */.github/workflows/shai-hulud*.{yml,yaml} at repo root.
  • npm install/postinstall/preinstall lifecycle scripts matching
    node …bundle.js, curl|sh, wget|sh, base64 -d|--decode|-D,
    eval $(…), or npx … bundle.
  • Go replace directives: untrusted-host RHS, absolute-path RHS,
    top-level-path mismatch (Scenario C in-org redirect), and local replacements
    (./ / ../) flagged for human review.
  • Go workflow env settings (GOSUMDB=off, GONOSUMCHECK, GOINSECURE,
    GOFLAGS=*-insecure) — direct and indirect (vars/secrets/env/inputs).
  • Public exfil repos matching ^[Ss]hai-[Hh]ulud under org:allora-network
    AND under each org member (rate-limited).
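
The content-hash check in the second bullet can be sketched roughly as follows. This is an illustration only, not the vendored script: the function name `scan_js_hashes`, its arguments, the reading of "2 MB" as 2 MiB, and the assumption that the hash list holds one lowercase hex SHA-256 per line are all hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical sketch: report .js/.cjs/.mjs files of at most 2 MiB under a
# clone directory whose SHA-256 appears in a hash list. Assumes GNU find and
# sha256sum; the real script's hash-list format may differ.
scan_js_hashes() {
  local root="$1" hashes_file="$2" file digest
  # -size -2097153c selects files smaller than 2097153 bytes, i.e. <= 2 MiB.
  find "$root" -type f \( -name '*.js' -o -name '*.cjs' -o -name '*.mjs' \) \
      -size -2097153c -print0 |
    while IFS= read -r -d '' file; do
      # Hash the content via stdin so the file name never affects the digest.
      digest="$(sha256sum < "$file" | awk '{print $1}')"
      # Whole-line fixed-string match; comment lines can never equal a digest.
      if grep -qxF -- "$digest" "$hashes_file"; then
        printf 'ioc_bundle_hash\t%s\t%s\n' "$file" "$digest"
      fi
    done
}
```

Because the match is on content rather than name, a renamed copy of a known-bad bundle is still reported, which is the property the bullet calls filename-agnostic.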

Outputs

| Sweep result | Script exit | Rolling issue | Slack |
| --- | --- | --- | --- |
| Clean (no findings) | 0 | no-op | no-op |
| Operational (clone_failed / check_skipped / go_local_replace) | 2 | comment appended (or issue opened with label shai-hulud-sweep) | no-op |
| IOC-grade | 1 | comment appended (or issue opened) | paged via ${{ secrets.SLACK_SECURITY_WEBHOOK }} |

Forensic evidence (clones of repos that produced IOC findings) is uploaded as
a workflow artifact for 30 days so humans can inspect the matched file without
re-cloning point-in-time evidence.

The workflow never auto-closes the rolling issue; humans drive close/reopen so
triage state survives across daily runs.
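
As a rough illustration of the exit-code contract in the table above, the mapping from script exit code to the two output channels could look like this. The helper name `route_sweep_result` and its output format are hypothetical, and how the real workflow treats an unexpected exit code is not stated here, so the sketch simply rejects it.

```bash
#!/usr/bin/env bash
# Hypothetical helper: map the sweep script's exit code to the actions taken
# on the rolling issue and on Slack. Prints "<issue-action> <slack-action>".
route_sweep_result() {
  case "$1" in
    0) echo "noop noop" ;;      # clean: nothing to report
    1) echo "comment page" ;;   # IOC-grade: update the issue and page Slack
    2) echo "comment noop" ;;   # operational: update the issue only
    *) echo "unexpected exit code: $1" >&2
       return 1 ;;              # assumption: not covered by the table
  esac
}
```

Keeping the mapping in one place makes the intent easy to audit: only exit code 1 ever reaches the paging channel.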

Secrets used

  • SLACK_SECURITY_WEBHOOK — org secret; payload only delivered on IOC-grade
    findings. No-ops gracefully if unset (warning, not failure).
  • GH_ORG_READ_TOKEN (optional org secret). When present, preferred over
    the default GITHUB_TOKEN for org-wide enumeration so private repos and the
    org-members exfil search are covered. When absent, member enumeration emits
    check_skipped operational findings — visible partial coverage, never a
    silent false-clean.
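
The token preference described above can be sketched in shell as below. In the workflow itself this is presumably a GitHub Actions expression rather than shell, so treat the function names `pick_gh_token` and `coverage_mode` as hypothetical stand-ins for illustration.

```bash
#!/usr/bin/env bash
# Hypothetical sketch of the token preference: use GH_ORG_READ_TOKEN when it
# is set and non-empty, otherwise fall back to the default GITHUB_TOKEN.
pick_gh_token() {
  printf '%s\n' "${GH_ORG_READ_TOKEN:-${GITHUB_TOKEN:-}}"
}

# Report whether member enumeration is expected to be covered. "partial"
# corresponds to the check_skipped operational findings described above.
coverage_mode() {
  if [ -n "${GH_ORG_READ_TOKEN:-}" ]; then
    echo full
  else
    echo partial
  fi
}
```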

Verification

  • actionlint .github/workflows/shai-hulud-sweep.yml — clean (no findings).
  • python3 -c 'yaml.safe_load(...)' — parses.
  • Manual workflow_dispatch recommended after merge to verify
    gh issue and Slack paths end-to-end against the live org.

Followups (intentionally out-of-scope for this PR)

  • Provision GH_ORG_READ_TOKEN (fine-grained PAT or GitHub App token with
    read:org + repo:read) once org-admin signs off — the workflow already
    prefers it when present.
  • Quarterly review of the trusted Go module-path allowlist (GO_TRUSTED_HOSTS_RE
    in the script) to keep go_suspicious_replace false-positive rate low.

Made with Cursor


Summary by cubic

Adds a daily org‑wide Shai‑Hulud IOC sweep that scans all allora-network repos, updates a rolling issue, and pages Slack on incident‑grade findings with alert‑dedup and a weekly re‑page. Meets DEVOP-560 requirements: scheduled workflow at .github/workflows/shai-hulud-sweep.yml, org repo iteration, IOC list checks, member exfil search, rolling issue updates, and Slack notifications with minimal perms.

  • New Features

    • Added .github/workflows/shai-hulud-sweep.yml (cron 7 4 * * * + workflow_dispatch; minimal perms; serialized concurrency). Pins actions/checkout@v4.2.2 and actions/upload-artifact@v4.4.3.
    • Vendored scripts/shai-hulud-ioc-sweep.sh with SHA‑256 sidecar verification; reads .github/security/ioc-packages.txt and .github/security/ioc-hashes.txt (# schema:v1). Detection covers lockfiles (incl. structured package-lock.json), sub‑2MB JS SHA‑256 hashes, exact Shai‑Hulud persistence workflow filenames, suspicious npm lifecycle scripts, Go replace/path‑mismatch/unsafe‑env, and org/member public exfil search.
    • Outputs: maintains a rolling issue labeled shai-hulud-sweep; Slack via SLACK_SECURITY_WEBHOOK on IOC with dedup gating; prefers GH_ORG_READ_TOKEN, else falls back to GITHUB_TOKEN and emits check_skipped; uploads only findings.json, summary.md, repos.txt for 30 days.
    • Added .github/CODEOWNERS to require @allora-network/security (and @allora-network/devops for the workflow). Added plan doc with action SHA‑pin rotation and follow‑ups.
  • Bug Fixes

    • Slack: IOC alert dedup by IOC hash‑stamp (first‑seen/changed/≥7‑day re‑page); filter dedup markers to github-actions[bot]; write the paged-at marker only after a successful Slack send; fail‑open if the stamp can’t be computed; 3‑attempt retry with backoff and Retry‑After honoring; fixed HTTP code capture.
    • Rolling issue: new shared “Find rolling issue” step (oldest‑open via --search sort:created-asc) reused by dedup, updates, and paged‑marker steps; IOC comments include visible page‑decision plus hidden stamp markers; Slack still fires on IOC even if the issue update fails.
    • Safety: sanitize untrusted strings in Slack and issue bodies; verify the vendored script via a locked‑path SHA‑256 sidecar before execution.
    • Artifacts/robustness: restrict uploads to structured outputs; suffix artifact name with ${{ github.run_attempt }}; placeholder summary on pre‑aggregation failure; final run summary surfaces the Slack‑dedup tri‑state explicitly.
    • Detection tuning: narrow persistence detection to exact IOC filenames (avoids self‑alerts); add # schema:v1 headers and assertions; extend GO_TRUSTED_HOSTS_RE to include cometbft; refresh the .sha256 sidecar.

Written for commit c10d0dc.

Adds the first workflow under .github/workflows/ for the allora-network/.github
repo: a scheduled daily sweep (04:07 UTC) plus workflow_dispatch that scans
every repo in the org for Shai-Hulud indicators of compromise, maintains a
rolling GitHub issue labelled `shai-hulud-sweep`, and pages Slack via the
SLACK_SECURITY_WEBHOOK secret on incident-grade findings.

Detection logic lives in scripts/shai-hulud-ioc-sweep.sh, vendored verbatim
from allora-network/skills@71aeefb (skills/shai-hulud-defense). The script is
vendored rather than cloned at workflow time because that repo is private and
the workflow's default GITHUB_TOKEN cannot read it; vendoring also keeps the
daily sweep working through upstream rename/outage. See the script header for
the refresh procedure.

Key design choices documented in docs/plans/2026-05-25-devop-560-shai-hulud-sweep.md:
- IOC inputs read from .github/security/ioc-packages.txt + ioc-hashes.txt
  (DEVOP-561, merged in PR #2). Script validates the `# schema:v1` header
  before running so a silent seed-list format change fails closed.
- Rolling issue: workflow finds an existing open issue with the label and
  appends a comment, else opens a new one. Humans drive close/reopen so
  triage state survives across daily runs.
- Slack page only fires on IOC-grade findings (script exit 1). Operational
  findings (exit 2 — clone_failed / check_skipped / go_local_replace) update
  the issue but do not page after-hours.
- Permissions: `contents: read` + `issues: write` only.
- Pinned SHAs for actions/checkout (v4.2.2) and actions/upload-artifact
  (v4.4.3) match the convention in allora-network/ci-workflows-private.
- Prefers a `GH_ORG_READ_TOKEN` secret if present (private-repo + member
  enumeration); falls back to GITHUB_TOKEN. In the fallback path, member
  enumeration emits `check_skipped` operational findings so the partial
  coverage is visible in the rolling issue.
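
A minimal sketch of the fail-closed schema check from the first bullet above, assuming the `# schema:v1` header must be the exact first line of each seed list. The function name `assert_schema_v1` is hypothetical, and the vendored script may locate or parse the header differently; the return code 2 mirrors the operational exit code used elsewhere in this PR.

```bash
#!/usr/bin/env bash
# Hypothetical sketch: refuse to run against an IOC list whose first line is
# not the expected schema header, so a format change fails closed.
assert_schema_v1() {
  local file="$1" first_line=""
  if [ ! -r "$file" ]; then
    echo "cannot read IOC list: $file" >&2
    return 2
  fi
  # read returns non-zero on a final line without a newline; the value is
  # still captured, so the status is ignored here.
  IFS= read -r first_line < "$file" || true
  if [ "$first_line" != "# schema:v1" ]; then
    echo "unexpected schema header in $file" >&2
    return 2
  fi
}
```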

Linear: https://linear.app/alloralabs/issue/DEVOP-560
Co-authored-by: Cursor <cursoragent@cursor.com>
srt0422 added the shai-hulud (Shai-Hulud supply-chain defense work) and needs-human-review labels on May 25, 2026
@cubic-dev-ai cubic-dev-ai Bot left a comment

cubic analysis

1 issue found

Linked issue analysis

Linked issue: DEVOP-560: Create org-wide daily IOC sweep workflow in .github repo

| Status | Acceptance criteria | Notes |
| --- | --- | --- |
| | .github/workflows/shai-hulud-sweep.yml exists in this repo | PR adds the workflow file at .github/workflows/shai-hulud-sweep.yml. |
| | Cron: daily at an off-peak local time, off-minute (e.g. '7 4 * * *') | Workflow schedule is set to '7 4 * * *' and includes workflow_dispatch for manual runs. |
| | Iterate all org repos via gh api orgs/allora-network/repos --paginate | The vendored script enumerates repos using gh api --paginate /orgs/$ORG/repos and the workflow runs that script. |
| | For each repo: shallow clone, run sweep checks, report findings | Script uses git clone --depth 1 to shallow-clone each repo, runs the collection/checks, and aggregates findings; workflow uploads artifacts and updates the rolling issue. |
| | Compare against IOC lists at .github/security/ioc-packages.txt and .github/security/ioc-hashes.txt | Workflow passes those two paths to the script; the script validates the schema header and uses both lists in its matching logic. |
| ⚠️ | Search the GitHub API for public repos under org members named ^[Ss]hai-[Hh]ulud | Script implements org-scoped and per-member searches and records findings, but per-member enumeration can be limited without a provisioned GH_ORG_READ_TOKEN and will emit check_skipped; member-side search is rate-limited with sleeps. Implementation exists but effective coverage depends on token provisioning and rate limits. |
| | Maintain a single rolling GitHub Issue in .github repo; append new findings | Workflow finds an open issue with label shai-hulud-sweep (oldest first), appends a comment if present, or creates a labeled rolling issue otherwise. |
| ⚠️ | If any new finding fires, post to a Slack incoming webhook (SLACK_SECURITY_WEBHOOK org secret) | Workflow posts to the Slack webhook, but only for IOC-grade runs (script exit code 1). Operational findings (exit 2) update the rolling issue but do not page Slack. The step also no-ops if SLACK_SECURITY_WEBHOOK is unset. |
| | Sweep checks: lockfile entries matching ioc-packages.txt (name@version) | Script builds per-ecosystem needle lists and performs structured and substring matching against npm/pip/Go lockfiles. |
| | Sweep checks: bundle.js files anywhere with SHA-256 matching ioc-hashes.txt | Script collects .js/.cjs/.mjs files ≤ 2 MB and compares SHA-256 against the provided hashes list, emitting ioc_bundle_hash findings. |
| | Sweep checks: .github/workflows/shai-hulud*.yaml / shai-hulud-workflow.yml persistence detection | Script scopes workflow-file scanning to the repo-root .github/workflows directory and flags shai-hulud*.{yml,yaml} as persistence_workflow findings. |
| | Sweep checks: postinstall patterns: node bundle.js, curl\|sh, wget\|sh, base64-decode chains in package.json scripts | Script inspects package.json scripts for install/postinstall/preinstall and matches a broad regex covering node …bundle.js, curl\|sh, wget\|sh, base64 -d/--decode/-D, eval $(…), npx … bundle patterns. |
| | Sweep checks: the webhook.site exfil URL substring | The acceptance criteria list webhook.site substring matching, but I cannot find a specific check for the webhook.site URL substring in the vendored script or workflow; other exfil detection (public-exfil repo name, suspicious curl targets) exists, but no explicit webhook.site pattern match is present. |
| | Workflow uses minimal permissions: contents: read, issues: write | Workflow permissions block sets contents: read and issues: write and no broader permissions are granted. |
| | PR merged (closure of DEVOP-560) | Acceptance requires merging the PR; the PR is open (this is the review), so the 'merged' criterion is not yet satisfied by the current state. |
Architecture diagram
sequenceDiagram
    participant Sched as Cron Schedule (04:07 UTC)
    participant Action as GitHub Actions Workflow
    participant Script as shai-hulud-ioc-sweep.sh
    participant GHAPI as GitHub API (REST/GraphQL)
    participant Repos as Org Repos (clone targets)
    participant Issue as Rolling Issue (.github repo)
    participant Slack as Slack Security Webhook
    participant Artifact as Uploaded Artifact

    Note over Sched,Artifact: NEW: Daily org-wide IOC sweep

    alt Scheduled trigger (cron '7 4 * * *')
        Sched->>Action: Trigger workflow
    else Manual trigger (workflow_dispatch)
        Action->>Action: Manual run started
    end

    Action->>Action: concurrency serialization (no cancel-in-progress)
    Action->>Action: Set GH_TOKEN (GH_ORG_READ_TOKEN || GITHUB_TOKEN)
    Action->>Action: Checkout .github repo (IOC lists + script)
    Action->>Script: Run with ORG, IOC package/hash files

    Note over Script: Detection logic (vendored from allora-network/skills)

    Script->>GHAPI: List all repos in org (public + private if token allows)
    GHAPI-->>Script: Repo list

    Script->>GHAPI: List org members (for exfil repo search)
    alt GH_TOKEN has read:org
        GHAPI-->>Script: Member list
    else GH_TOKEN lacks read:org
        Script->>Script: Emit check_skipped operational finding
    end

    loop Per repo in org
        Script->>Repos: git clone (via gh auth credential helper)
        alt Clone succeeds
            Script->>Script: Scan for IOC patterns
            alt IOC package match found
                Script->>Script: finding() - ioc_package_match
            end
            alt IOC hash match found (.js/.cjs/.mjs <= 2MB)
                Script->>Script: finding() - ioc_bundle_hash
            end
            alt Suspicious workflow files found
                Script->>Script: finding() - persistence_workflow
            end
            alt Suspicious npm lifecycle scripts found
                Script->>Script: finding() - suspicious_lifecycle_script
            end
            alt Go replace directives anomalies found
                Script->>Script: finding() - go_suspicious_replace
            end
            alt Go unsafe CI env vars found
                Script->>Script: finding() - go_unsafe_env
            end
            opt Finding detected (non-operational)
                Script->>Script: Preserve clone as forensic evidence
                Script->>Script: Append to dirty-repos list
            end
        else Clone fails
            Script->>Script: finding() - clone_failed (operational)
        end
    end

    loop Check public exfil repos (org scope)
        Script->>GHAPI: Search repos matching ^[Ss]hai-[Hh]ulud under org
        GHAPI-->>Script: Exfil repo list
        alt Exfil repos found
            Script->>Script: finding() - public_exfil_repo
        end
    end

    loop Check public exfil repos (member scope)
        alt GH_TOKEN supports member search
            Script->>GHAPI: Search repos per member
            GHAPI-->>Script: Exfil repo list per member
            alt Exfil repos found
                Script->>Script: finding() - public_exfil_repo_member
            end
        else Token lacks permission
            Script->>Script: finding() - check_skipped (operational)
        end
    end

    Script->>Script: Aggregate findings to findings.ndjson + summary.md
    alt Exit code 0 (clean)
        Script-->>Action: rc=0
    else Exit code 1 (IOC findings)
        Script-->>Action: rc=1
    else Exit code 2 (operational only)
        Script-->>Action: rc=2
    end

    Action->>Artifact: Upload sweep output + forensic evidence clones (30-day retention)

    alt rc == 1 or rc == 2 (findings exist)
        Action->>Issue: Find open issue with label "shai-hulud-sweep"
        alt Existing issue found
            Action->>Issue: Append comment with run summary
        else No existing issue
            Action->>Issue: Create new issue with label + summary
        end
    end

    alt rc == 1 (IOC-grade findings only)
        alt SLACK_SECURITY_WEBHOOK is set
            Action->>Slack: POST run summary (capped at ~2.8 KB)
            Slack-->>Action: 200 OK
        else Webhook not set
            Action->>Action: Log warning, skip (no failure)
        end
    end

Comment thread .github/workflows/shai-hulud-sweep.yml Outdated
srt0422 and others added 4 commits May 25, 2026 00:50
Four mechanical fixes flagged as safe_auto by the ce-code-review headless
pass on PR #8:

- Add `--connect-timeout 5 --max-time 15` to the Slack webhook curl so a
  hung incoming-webhook endpoint cannot stall the workflow up to the
  60-minute job timeout.
- Gate the Page-Slack step with `always() && rc == '1'` so an IOC-grade
  finding still pages Slack when the preceding Update-rolling-issue step
  failed (the rolling issue is a redundant channel — Slack is the primary
  page).
- Suffix the upload-artifact name with `${{ github.run_attempt }}` so a
  re-run does not 409 on actions/upload-artifact@v4's unique-name rule.
- Hoist the `output_dir` GITHUB_OUTPUT write to immediately after
  `mkdir -p "$OUTPUT_DIR"` so the upload-artifact step's `always() &&
  output_dir != ''` gate survives a mid-sweep crash (script abort would
  otherwise leave output_dir unset and skip evidence upload).

Re-validated with `actionlint`. No behavior change beyond the four
hardening fixes; the rc-based control flow and rolling-issue / Slack
semantics are unchanged.

Linear: https://linear.app/alloralabs/issue/DEVOP-560
Co-authored-by: Cursor <cursoragent@cursor.com>
… and silent failures (DEVOP-560)

Apply six ce-code-review findings to the daily Shai-Hulud IOC sweep:

- Finding B (P1): rolling-issue lookup now uses --search with
  sort:created-asc, so a long-running incident issue stays canonical
  even when a duplicate is filed later (`gh issue list` defaults to
  newest-first and exposes no --sort flag).
- Finding C (P1): wrap the Slack webhook POST in a 3-attempt retry
  loop with 5s/15s/45s backoff. Retries on 408/429/5xx + curl-level
  failures; honors Retry-After on 429; bails on terminal 4xx.
- Finding E (P1): strip backtick/<>|*_ from the attacker-controllable
  IOC `detail` field before wrapping it in a Slack code fence or
  inlining it in the GitHub issue body. The uploaded findings.json
  remains the canonical un-sanitized source for forensic review.
- Finding D (P1): restrict the uploaded artifact path to the
  structured outputs (findings.json, summary.md, repos.txt) instead
  of the entire output_dir. Anyone with actions:read on this repo
  can download artifacts, and the previous wildcard included raw
  clones / preserved evidence trees of private org repos.
- Finding H (P2): when the script exits rc != 0 without producing
  summary.md (precondition failure / pre-aggregation crash), emit a
  minimal placeholder so the rc != 0 -> rolling-issue contract holds
  and the failure surfaces in triage instead of being silently
  dropped.
- Finding G (P2): commit scripts/shai-hulud-ioc-sweep.sh.sha256 and
  verify it as the first action of the sweep step. The vendored
  script's canonical source lives upstream in allora-network/skills;
  the sidecar is the in-repo integrity gate. A PR that modifies the
  script body without refreshing the sidecar fails this step loudly
  instead of executing a tampered detector.
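
The retry policy in Finding C depends on sorting each response into success, transient, or terminal. A minimal sketch of that classification is shown below; the function name `classify_slack_response` is hypothetical, and the backoff loop itself is left out because the exact attempt-to-delay mapping is not spelled out here. The code `000` is what curl reports via `%{http_code}` when no HTTP response was received.

```bash
#!/usr/bin/env bash
# Hypothetical sketch: classify an HTTP status from the Slack webhook POST.
#   ok        -> delivered, stop
#   transient -> worth retrying (curl-level failure, 408, 429, any 5xx)
#   terminal  -> retrying will not help (other 4xx, or unparseable codes)
classify_slack_response() {
  case "$1" in
    2??) echo ok ;;
    000|408|429|5??) echo transient ;;
    *) echo terminal ;;
  esac
}
```

Anything that is not a well-formed three-digit code lands in the terminal branch, so the value passed in has to be captured cleanly for the retry path to trigger.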

Workflow validated with `actionlint` and `python3 -c "import yaml;
yaml.safe_load(...)"` after each edit. 346 lines (under the 350 cap).
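
The sanitization described in Finding E can be approximated with a single `tr` call. This is a sketch under the assumption that deleting the listed characters is sufficient; the function name `sanitize_detail` is hypothetical and the workflow may implement the stripping differently.

```bash
#!/usr/bin/env bash
# Hypothetical sketch: delete characters that could break out of a Slack code
# fence or inject markup into an issue body (backtick, <, >, |, *, _).
sanitize_detail() {
  printf '%s' "$1" | tr -d '`<>|*_'
}
```

Only the rendered copies are sanitized; as the commit message notes, the uploaded findings.json keeps the original string for forensic review.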

Linear: https://linear.app/alloralabs/issue/DEVOP-560
Co-authored-by: Cursor <cursoragent@cursor.com>
…EVOP-560)

A single PR that modifies the daily Shai-Hulud sweep workflow, the
vendored detector script, the SHA-256 integrity sidecar, or the IOC seed
lists can silently disable detection if no human review is enforced.
This adds an in-repo CODEOWNERS rule requiring `@allora-network/security`
approval on those paths (with `@allora-network/devops` co-owning the
workflow file for routine operational tweaks). CODEOWNERS itself is
self-owned so a single PR cannot rewrite the rules + disable detection
in lockstep.

Team slugs were verified via `gh api orgs/allora-network/teams/security`
and `.../devops` on 2026-05-25.

The complementary "Require review from Code Owners" branch-protection
rule is an org-admin task and is documented as a follow-up in the plan
doc; this commit only handles the in-repo half.

Linear: https://linear.app/alloralabs/issue/DEVOP-560
Co-authored-by: Cursor <cursoragent@cursor.com>
Two ce-code-review findings against the plan doc:

- Finding I (P2): document the rotation procedure for the third-party
  action SHA pins (actions/checkout, actions/upload-artifact). Names
  owner (@allora-network/devops, with security co-review via the
  workflow's CODEOWNERS rule), cadence (quarterly + on CVE), canonical
  source for the latest release SHA per action, and a 4-step rotation
  procedure. Notes `.github/dependabot.yml` as the automation follow-up.
- Finding F (P1, deferred): document missed-daily-run / cron-disabled
  observability as out-of-scope for this PR. The fix is materially
  additive (separate watchdog workflow or external healthcheck) and
  doesn't belong inline with the initial sweep ship.

Also adds the branch-protection "Require review from Code Owners"
follow-up surfaced by Finding A — the in-repo CODEOWNERS rule was
landed in the prior commit but the actual blocking gate is org-admin
territory.

Linear: https://linear.app/alloralabs/issue/DEVOP-560
Co-authored-by: Cursor <cursoragent@cursor.com>
srt0422 and others added 5 commits May 26, 2026 10:07
Co-authored-by: Cursor <cursoragent@cursor.com>
- P0 (script): Narrow persistence_workflow glob to exact known IOC
  filenames (shai-hulud.yml / shai-hulud.yaml / shai-hulud-workflow.yml /
  shai-hulud-workflow.yaml) so the legitimate defense workflow
  .github/workflows/shai-hulud-sweep.yml no longer self-detects as
  an IOC on every daily sweep — guaranteed false page → alert fatigue.

- P1 (seed files): Add '# schema:v1' header to ioc-packages.txt and
  ioc-hashes.txt. Without the packages header the new schema-version
  assertion in the detector exits 2 at startup every run, leaving the
  sweep structurally inert.

- P2 (script): Add parallel '# schema:v1' assertion against
  HASHES_FILE — mirrors the packages-file gate so a future reformat of
  the hashes seed list fails loud instead of silently zero-matching.

- P2 (script): Add cometbft to default GO_TRUSTED_HOSTS_RE so Cosmos/
  CometBFT same-path version pins (replace github.com/cometbft/cometbft
  => github.com/cometbft/cometbft <version>) no longer trip
  go_suspicious_replace once the sweep is unblocked from the schema:v1
  gate above.

- Regenerate scripts/shai-hulud-ioc-sweep.sh.sha256 in lockstep with
  the detector edits so the workflow's integrity-gate passes.
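
For context on why the sidecar must be regenerated in lockstep, here is a minimal sketch of an integrity gate of this kind. It assumes the sidecar sits next to the script as `<script>.sha256` and that its first whitespace-separated field is the hex digest (the `sha256sum` output format); the function name `verify_sidecar` is hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical sketch: succeed only if the script's SHA-256 matches the digest
# recorded in its .sha256 sidecar. Any read error or mismatch fails the gate.
verify_sidecar() {
  local script="$1" sidecar="${1}.sha256" expected actual
  [ -r "$script" ] && [ -r "$sidecar" ] || return 1
  expected="$(awk 'NR==1 {print $1}' "$sidecar")"
  actual="$(sha256sum < "$script" | awk '{print $1}')"
  [ -n "$expected" ] && [ "$expected" = "$actual" ]
}
```

Any edit to the detector body changes `actual`, so the gate fails until the sidecar is refreshed in the same change.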

Co-authored-by: Cursor <cursoragent@cursor.com>
…-560)

Without this gate the bare `if: steps.sweep.outputs.rc == '1'` Slack step
pages on every IOC-grade run, so a standing unresolved IOC pages the
channel daily and conditions responders to mute it — classic alert
fatigue. Raised by cubic at PRRT_kwDOLZ5Xss6Ee5gN and independently by
four ce-code-review reviewers (P1, anchor 100).

Implementation:

- New `ioc-dedup` step (rc=1 only) computes a stable IOC stamp as the
  sha256 of the sorted `{repo,rule,path,detail}` TSV of IOC-grade rows
  in findings.json (`ts` deliberately excluded so an identical IOC set
  produces an identical stamp across daily runs).
- Looks up the rolling issue's full comment history (cross-page sort)
  for the most recent `<!-- shai-hulud-ioc-stamp: ... -->` and
  `<!-- shai-hulud-paged-at: ... -->` markers.
- Decides `should_page`:
    * first IOC-grade run after clean (no prior stamp)        → page
    * IOC set differs from previous stamp                     → page
    * same IOC set but >= 7d since last Slack page            → page
    * otherwise                                               → skip
  Fail-open when findings.json is missing/empty on rc=1: page so an
  unknown-state run surfaces visibly rather than dedup-silencing.
- Rolling-issue update step now embeds the stamp marker on every rc=1
  comment and the paged-at marker only when Slack actually fires, so a
  deduped comment carries forward the older real paged-at timestamp and
  the weekly re-page window stays honest.
- Slack step gated on `should_page == 'true'`. A new `Slack page
  suppressed by IOC dedup` step emits a workflow notice for visibility,
  and the final-run-summary step surfaces the dedup decision too.
- Visible `- **Slack page:** yes|suppressed (reason: ...)` footer in the
  rolling-issue comment body makes the decision obvious to humans
  scanning the issue, alongside the hidden HTML markers used by the
  next run's dedup lookup.
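
The `should_page` decision above reduces to a small pure function. The sketch below is an assumption-laden illustration: the name `decide_page`, the argument order, and the use of epoch seconds (the real markers carry timestamps inside HTML comments) are all hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical sketch of the dedup decision. Arguments:
#   $1 previous IOC stamp (empty if none was found)
#   $2 current IOC stamp (empty if it could not be computed)
#   $3 epoch seconds of the last successful Slack page (empty if unknown)
#   $4 current epoch seconds
# Prints "page <reason>" or "skip <reason>".
decide_page() {
  local prev_stamp="$1" cur_stamp="$2" last_paged="$3" now="$4"
  local week=$((7 * 24 * 60 * 60))
  if [ -z "$cur_stamp" ]; then echo "page fail-open"; return; fi
  if [ -z "$prev_stamp" ]; then echo "page first-seen"; return; fi
  if [ "$prev_stamp" != "$cur_stamp" ]; then echo "page changed"; return; fi
  # Same IOC set: page again only if no page is on record or 7 days passed.
  if [ -z "$last_paged" ] || [ $((now - last_paged)) -ge "$week" ]; then
    echo "page repage-due"
    return
  fi
  echo "skip duplicate"
}
```

Every uncertain input (no stamp, no recorded page) resolves toward paging, so the only way to suppress a page is a positive match against a recent, recorded one.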

Plan doc: the Slack-alert-path decision now spells out the dedup +
weekly-repage policy and warns explicitly against regressing to a bare
`rc == '1'` gate, so the next reviewer doesn't reintroduce the
alert-fatigue regression. IOC_RULES_RE drift between workflow and
script is called out as a coupling that must stay in sync.

Refs: DEVOP-560, PRRT_kwDOLZ5Xss6Ee5gN (cubic), ce-code-review anchor 100
Co-authored-by: Cursor <cursoragent@cursor.com>
- (P2) Append `|| true` to the ioc-dedup current_stamp `jq | sha256sum
  | awk` pipeline so a malformed findings.json (or mid-run mutation)
  routes through the documented fail-open guard at lines 230-244 instead
  of aborting the step under `set -euo pipefail` and silently
  fail-CLOSING the Slack page. Mirrors the `|| true` already present
  on the four sibling pipelines in the same step.

- (P2) Fix Slack curl http_code capture: replace
  `... || echo 000)` with `... || true)` followed by
  `: "${http_code:=000}"`. The prior form appended an extra '000'
  to curl's own '%{http_code}' output, producing the literal '000000'
  which fell through the `000|408|429|5*` transient-classification
  case to terminal=0 and disabled the curl-level retry path the loop
  exists for.

- (P3) Replace the two-branch `if [ "${SHOULD_PAGE:-true}" = "true" ]`
  in the Final run summary with an explicit three-way `case`
  (true / false / *) so the unknown-state branch emits an
  `::error::` rather than defaulting to a false "Slack paged" claim
  when the ioc-dedup step crashed before writing $GITHUB_OUTPUT.
  Resolves the three-way contradiction between the Slack gate
  (strict ==true), the suppression-notice gate (!=true), and this
  summary.
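
The http_code capture bug in the second bullet above is easy to reproduce in isolation. In this sketch `fake_curl` is a hypothetical stand-in for a curl invocation that fails to connect: curl still writes `000` through `-w '%{http_code}'` and then exits non-zero.

```bash
#!/usr/bin/env bash
# Hypothetical stand-in for a failed `curl -w '%{http_code}'` call: it prints
# 000 and exits with a non-zero status (7 is curl's "failed to connect").
fake_curl() { printf '000'; return 7; }

# Prior form: the fallback echo runs inside the same command substitution, so
# its output is appended to what curl already printed.
capture_buggy() {
  local http_code
  http_code="$(fake_curl || echo 000)"
  printf '%s\n' "$http_code"
}

# Fixed form: swallow the failure, then default only if nothing was captured.
capture_fixed() {
  local http_code
  http_code="$(fake_curl || true)"
  : "${http_code:=000}"
  printf '%s\n' "$http_code"
}
```

Calling `capture_buggy` prints `000000`, while `capture_fixed` prints `000`, which is the value the transient-classification pattern expects.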

Co-authored-by: Cursor <cursoragent@cursor.com>
- (P2 #5) Extract a new `Find rolling issue` step (gated on
  `rc=='1' || rc=='2'`) that resolves the rolling-issue number ONCE
  per run via the canonical `gh issue list ... sort:created-asc`
  query and exposes it as `steps.find-rolling-issue.outputs.issue_num`.
  Replace the duplicated inline `gh issue list` calls in the ioc-dedup
  and rolling-issue-update steps with the shared output. Removes the
  drift-hazard `# same query as the update step below — keep in sync`
  coupling and closes the TOCTOU window where a human could close the
  rolling issue between the two independent lookups.

- (P1 #1) Filter the ioc-dedup comment scan to `github-actions[bot]`
  authorship. Previously the `gh api ... --jq '.[] | {body, created_at}'`
  projection accepted markers from ANY commenter, so anyone with
  `issues: write` (or anyone able to social-engineer a maintainer into
  pasting attacker-supplied marker text) could forge
  `<!-- shai-hulud-ioc-stamp: <sha256> -->` or
  `<!-- shai-hulud-paged-at: <iso8601> -->` into the rolling issue and
  silently suppress real Slack pages by poisoning the dedup chain.
  Only this workflow (running as GITHUB_TOKEN) emits canonical markers,
  and its comments are attributed to `github-actions[bot]` — restrict
  the source set accordingly. Defense-in-depth follow-up (binding
  markers to the emitting run_id and verifying via gh api) deferred.

- (P1 #2) Move paged-at marker emission to a dedicated post-Slack step
  (`Persist Slack-paged marker`) gated on
  `success() && rc=='1' && should_page=='true'` so a failed Slack
  delivery never writes a paged-at timestamp. The rolling-issue update
  step keeps writing the IOC stamp marker (which represents the dedup
  decision input, NOT the Slack-delivery outcome — that's correct
  gating). The dedup reader already scans the most-recent paged-at
  marker across ALL bot-authored comments, so splitting the markers
  across two comments composes correctly with no parser change.
  Previously the paged-at marker was committed BEFORE the Slack page
  ran, so a failed Slack send would still record a paged-at timestamp
  and silently corrupt the dedup chain for up to 7 days (next
  IOC-grade run would believe Slack had paged, suppress its own page,
  and the standing IOC would stop alerting until the weekly re-page
  window expired).

  The new step has a `gh issue list` fallback for the rare case where
  the update step created a fresh rolling issue this run (so
  find-rolling-issue's output was empty); fail-OPEN warning if no
  issue is resolvable at all so a missing paged-at marker just forces
  the next run to page conservatively.

Verification: actionlint clean; YAML parses (11 steps in canonical
order: checkout → verify-tools → sweep → upload → find-rolling-issue
→ ioc-dedup → update-rolling-issue → slack-page → persist-paged-at →
slack-suppressed-notice → final-summary).

Refs: DEVOP-560, ce-code-review run 20260526-101810-4793bf13
findings #1 (anchor 100, security+adversarial), #2 (anchor 100,
correctness+adversarial+reliability), #5 (anchor 75, maintainability).
Co-authored-by: Cursor <cursoragent@cursor.com>
Labels

needs-human-review shai-hulud Shai-Hulud supply-chain defense work
