Skip to content

feat(ci): scheduled QA-stuck issue check with Slack notifications#35802

Open
nollymar wants to merge 4 commits into
mainfrom
worktree-qa-stuck-prs-automation
Open

feat(ci): scheduled QA-stuck issue check with Slack notifications#35802
nollymar wants to merge 4 commits into
mainfrom
worktree-qa-stuck-prs-automation

Conversation

@nollymar
Copy link
Copy Markdown
Member

Summary

  • New scheduled workflow cicd_scheduled_qa-stuck-check.yml runs every 3 days at 13:00 UTC and pings each team's Slack channel about issues stuck in QA.
  • An issue is considered stuck when, on the dotCMS - Product Planning project (testing assign #7), it has Status QA or Done, lacks both QA : Passed and QA : Not Needed labels, and the project item has not changed in 3+ days.
  • Team is resolved from the issue's Team : <name> label only. Team→channel mapping was added to .claude/triage-config.json.

Behavior

  • No issues stuck → the notify job is skipped entirely; no Slack messages are sent.
  • Manual dispatch → defaults to dry_run=true; rendered messages are printed to the job summary instead of posting. Override stuck_days to test other thresholds.
  • Scheduled run → posts to each team channel via chat.postMessage using SLACK_BOT_TOKEN. Per-team posts run in a matrix so one Slack failure doesn't block the others.

Auth note

The workflow queries an org-owned ProjectV2 board, which the default GITHUB_TOKEN cannot read. It uses secrets.CI_MACHINE_TOKEN (already used by cicd_scheduled_notify-seated-prs.yml and the slack-channel-resolver). If that token lacks the read:project scope, the first dry-run will surface a 401 and we'll need to grant it.

Test plan

  • node --check on the script
  • actionlint on the workflow
  • Local smoke test (node .github/scripts/qa-stuck-check/test-find-stuck-issues.js) with a fake project payload — verifies Status / label / staleness / team-resolution filters and the per-team grouping
  • Trigger workflow_dispatch with dry_run=true on this branch and review the job summary against the actual project state
  • Confirm Status field options on project testing assign #7 are exactly QA and Done (case-insensitive match in the script, but spelling must agree)
  • Flip dry_run=false once for a live end-to-end Slack test against the team channels before letting the schedule run unattended

🤖 Generated with Claude Code

Adds a workflow that runs every 3 days at 13:00 UTC, queries the
dotCMS - Product Planning project (#7) for issues in QA/Done status
that lack a QA : Passed / QA : Not Needed label and have not changed
in 3+ days, and pings the responsible team's Slack channel.

Team is resolved from the issue's "Team : <name>" label; team→channel
mapping lives in .claude/triage-config.json. Manual dispatch supports
dry-run (default) so the rendered messages can be reviewed before any
real Slack post. If no issues qualify, the notify job is skipped and
no message is sent.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions
Copy link
Copy Markdown
Contributor

❌ Issue Linking Required

This PR could not be linked to an issue. All PRs must be linked to an issue for tracking purposes.

How to fix this:

Option 1: Add keyword to PR body (Recommended - auto-removes this comment)
Edit this PR description and add one of these lines:

  • This PR fixes #123 or Fixes: #123

  • This PR closes #123 or Closes: #123

  • This PR resolves #123 or Resolves: #123

  • Other supported keywords: fix, fixed, close, closed, resolve, resolved
    Option 2: Link via GitHub UI (Note: won't clear the failed check)

  1. Go to the PR → Development section (right sidebar)

  2. Click "Link issue" and select an existing issue

  3. Push a new commit or re-run the workflow to clear the failed check
    Option 3: Use branch naming
    Create a new branch with one of these patterns:

  • 123-feature-description (number at start)

  • issue-123-feature-description (issue-number at start)

  • feature-issue-123 (issue-number anywhere)

Why is this required?

Issue linking ensures proper tracking, documentation, and helps maintain project history. It connects your code changes to the problem they solve.---

This comment was automatically generated by the issue linking workflow

@github-actions github-actions Bot added the Area : CI/CD PR changes GitHub Actions/workflows label May 21, 2026
@claude
Copy link
Copy Markdown
Contributor

claude Bot commented May 21, 2026

Claude finished @nollymar's task in 2m 29s —— View job


PR Review

  • Read the changed files
  • Analyze the find-stuck-issues.js script for bugs / edge cases
  • Analyze the workflows
  • Check the test file coverage
  • Check the triage-config.json change

Overall the script is well-defended (loud failures for misconfigured Status field / null project / unparseable STUCK_DAYS, labels-pagination skip, multi-team grouping, CLOSED+QA drop). Tests cover the main risk surfaces. Below are the things worth a second look.


🟠 assertStatusFieldVisible ratio is computed over all project items, not just Issues

find-stuck-issues.js:173-190 checks withStatus / items.length, where items is the full ProjectV2 page including PullRequest and DraftIssue content. The filter to __typename === 'Issue' happens later, in the per-item loop.

If project #7 contains a meaningful number of PRs or draft items that don't have a Status field value set, the denominator gets diluted and the run can fail with a misleading "field falls outside the GraphQL fieldValues page" error even though every actual Issue has Status.

Two options:

  1. Filter to item.content?.__typename === 'Issue' before computing the ratio, or
  2. Loosen the failure mode (warn instead of fail) and rely on the per-item silent skip.

I'd lean toward (1) — keeps the loud-failure behavior but scopes it to the population the script actually cares about. Fix this →


🟡 workflow_dispatch input stuck_days: 0 short-circuits validation

cicd_scheduled_qa-stuck-check.yml:37:

STUCK_DAYS: ${{ inputs.stuck_days || '3' }}

In GitHub Actions expression syntax, 0 || '3''3'. So a user dispatching with stuck_days=0 silently gets the default 3 instead of the loud STUCK_DAYS must be >= 1 failure that validateConfig would produce. The test testInvalidStuckDaysFailsLoudly covers '0' because it sets the env var directly — it doesn't exercise the YAML coalesce path.

Minor (since validating 0 mostly catches typos), but the test gives a false sense of coverage for that specific code path. Either drop the || '3' (rely on default: 3) or use ${{ inputs.stuck_days != '' && inputs.stuck_days || '3' }}.


🟡 Slack esc only handles the three required chars; titles with * / _ / backticks render with formatting

cicd_scheduled_qa-stuck-check.yml:102-105 escapes &, <, >. Slack docs do say those are the only mandatory escapes, but a real-world issue title like Fix *thing* or add __init__.py will get rendered as bold/italic in the message. Cosmetic, not a bug — flagging because the PR description called out the escaping work explicitly. If you want belt-and-braces, the usual trick is to wrap titles in a <{url}|{text}> link (already done — link text is parsed as mrkdwn but with fewer interpretations) or replace `*_~`` with their entity/zero-width equivalents.


🟡 Job-summary markdown isn't escaped at all

find-stuck-issues.js:209-211 builds:

- [#${e.number}](${e.url}) — ${e.title} · status: ${e.status} ...

A title containing ] followed by ( (rare but possible: Tracking issue [follow-up] (see #X)), or unbalanced backticks, will mangle the GH job summary rendering. Lower stakes than Slack since this is internal-only — worth knowing but probably fine to leave.


🟢 Post to Slack step lacks set -euo pipefail

cicd_scheduled_qa-stuck-check.yml:147-159 — already acknowledged in PR description point (6). The script will still surface failures via the ok != true check, but a curl-level failure (DNS, TLS, transient 5xx) produces an empty $response, which jq parses fine and the message becomes Slack post failed: (empty). Adding set -euo pipefail + checking curl's exit would give a clearer signal. Cheap fix.


🟢 No Slack retry on 429

A chat.postMessage 429 with Retry-After fails the matrix job (fail-fast: false, so other teams still post). The scheduled cadence is Mon/Thu so it's unlikely to hit rate limits across ~4 teams, but if Slack returns 429 the only recovery is manual re-dispatch. Acceptable for this volume; flagging only because the workflow has no observability on it beyond the failed job.


🟢 Minor observations (no action needed)

  • Number.isFinite(STUCK_DAYS) && STUCK_DAYS >= 1 accepts 3.7parseInt floors it to 3. Already acknowledged.
  • mkItem test helper defaults state based on status, which is convenient but means the existing tests don't exercise OPEN+Done (a legitimate-but-unusual state — open issue parked in Done). Script reports it; test doesn't assert.
  • assertStatusFieldVisible short-circuits when items.length === 0, so the null-project / empty-board case correctly skips the ratio check.
  • parseInt(PROJECT_NUMBER || '7', 10) is fine since the value is hardcoded in the workflow; acknowledged.

Verdict

Tightly written. The one item I'd actually fix before merging is (1)assertStatusFieldVisible over the unfiltered population is the most likely cause of a future "scheduled run fails for no apparent reason" page. The rest are nice-to-haves.
· Branch

- Cron now '0 13 * * 1,4' (Mon/Thu) instead of '*/3' day-of-month,
  which produced uneven cadence across month boundaries.
- Bumped GraphQL caps: fieldValues 20->50, labels 30->50, assignees
  10->20, so the Status / Team labels can't silently fall off the page.
- Added a Status-field presence assertion: fails the run if fewer than
  half of items expose a readable Status (surfaces schema mismatches in
  dry-run instead of returning empty results).
- Slack message wording now says "no project activity for Nd" / "last
  project update Nd ago" rather than "stuck for Nd", reflecting that
  ProjectV2Item.updatedAt is a proxy and any project-field edit resets
  the clock. Same caveat in the job summary.
- Dropped "(cc @user)" framing: Slack rendered it as plain text, not
  an actual mention. Assignees are still listed for context.
- Multiple "Team : X" labels on one issue now report the issue under
  each matched team (previously only the first label-order match got
  notified).
- Added a guard for projectV2 == null (renamed project / token without
  read:project scope) so dry-runs fail with a clear error rather than
  a "Cannot read properties of null" stack.
- stuck_days is now embedded on each group; matrix builder no longer
  needs to re-resolve the env var.
- Test suite expanded to cover: empty project, multi-team labels,
  missing Status field (fail loudly), partial-missing Status (warn,
  continue), null projectV2, and a team configured without a slack
  channel.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@nollymar
Copy link
Copy Markdown
Member Author

Addressed in 93afc3f:

  • 🔴 Cron0 13 */3 * *0 13 * * 1,4 (Mon/Thu 13:00 UTC). Even cadence, no month-boundary glitches.
  • 🟠 fieldValues(first: 20) — bumped to 50; labels 30→50; assignees 10→20. Added a presence assertion that fails the run when fewer than 50% of items expose a readable Status (surfaces "wrong field name" or "still capped" in dry-run rather than silently returning empty).
  • 🟠 updatedAt semantics — Slack header now reads "N QA/Done issue(s) with no project activity for Xd+ days" and each line "last project update Nd ago"; trailing paragraph spells out the caveat: "Metric: time since the project item last changed (any field). May undercount if other fields were edited recently." Same wording in the job-summary.
  • 🟡 Slack @-cc — dropped, since @name was plain text not a mention. Assignees still listed (· assignees: foo, bar).
  • 🟡 Multi Team label — issue now reports under every matched team, not just first-in-label-order. Test case added.
  • 🟡 STUCK_DAYS propagation — embedded on each group (stuck_days: N); matrix job reads group.stuck_days instead of re-resolving via workflow env.
  • 🟢 Null-project guard — explicit setFailed with a "token may lack read:project scope" hint instead of Cannot read properties of null. Test case added.
  • 🟢 Fragile channel deref — channel is now stored on the team accumulator at insert time, removing the teamChannels[team] re-lookup in grouping.
  • 🟢 Tests — expanded from 1 scenario to 7: main, empty project, multi-team labels, missing-Status loud failure, partial-Status warn-and-continue, null project, team-without-slack-channel ignored. node test-find-stuck-issues.js runs them all; actionlint clean.

Second pass on review feedback.

- Validate STUCK_DAYS up front: must be a finite integer >= 1 or the
  run fails. Without this, parseInt('abc') -> NaN -> cutoff invalid ->
  every QA/Done item passes the staleness gate (false positive
  storm). Same risk for 0 / -1, which slid cutoff to now/future.
- workflow_dispatch.inputs.stuck_days is now type: number so GH
  rejects non-numeric input at dispatch time.
- Labels page-size: bumped to 100 and the query now returns pageInfo.
  Issues whose labels overflow page 1 are skipped with a warning
  rather than evaluated against an incomplete label set (the
  QA : Passed gate could otherwise produce false positives).
- Concurrency group "qa-stuck-check" with cancel-in-progress: false
  so a manual dry-run can't overlap a scheduled live post.
- Escape Slack mrkdwn-sensitive chars (&, <, >) in team / status /
  title / assignee strings so a title like "Fix <X> & <Y>" doesn't
  break the <url|text> link syntax. jq --arg still handles JSON
  escaping; this is the orthogonal mrkdwn layer.
- Tests: added invalid-STUCK_DAYS path and labels-overflow path.
  Suite is now 9 cases.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@nollymar
Copy link
Copy Markdown
Member Author

Second-review pass landed in 18197c7:

Bugs / risks

  1. parseInt(STUCK_DAYS) unguarded — added validateConfig: finite integer ≥ 1 or the run fails up front. workflow_dispatch.inputs.stuck_days switched from string to number so GH rejects bad input at dispatch time. Test asserts abc / 0 / -1 all throw (empty stays valid since it's how scheduled triggers signal "use default").
  2. Labels pagination — bumped labels(first: 50)labels(first: 100), query now returns pageInfo, and issues with hasNextPage: true are skipped with a warning rather than evaluated against an incomplete label set. Test asserts the skip + warning.
  3. Concurrency — added concurrency: { group: qa-stuck-check, cancel-in-progress: false }. Manual dry-run can't race the scheduled run.
  4. updatedAt proxy false-negative shape (Priority/Iteration bumps reset the clock) — acknowledged limitation; no signal change short of a different data source. Open to revisiting cadence (weekly × 7-day vs Mon/Thu × 3-day) but leaving as-is for now since the wording already calls out "any project-field edit resets the clock".
  5. Slack mrkdwn escaping&, <, > now escaped in team / status / title / assignee text via an esc helper. jq --arg continues to handle JSON-level escaping; this fixes the orthogonal mrkdwn layer (e.g. titles like Fix <X> & <Y>).

Smaller stuff
6. ✅ inputs.stuck_days type → number.
7. ⏸ STATUS_PRESENCE_MIN_RATIO left hardcoded — one knob, internal heuristic, not worth widening the env surface.
8. ⏸ DST drift in daysStuck — acknowledged, no action.
9-10. ⏸ No action (existing convention).
11. ✅ Tests for NaN/invalid stuck_days and >100 labels both added. Suite is now 9 cases, all green.

actionlint and node --check clean; full test suite passes.

Addresses the two items the second review flagged as fix-before-merge.

- find-stuck-issues now drops items where state == 'CLOSED' and the
  project Status is anything other than Done. A closed issue parked
  in Status=QA is board-cleanup (closed-as-completed without moving
  the card, or closed-as-not-planned without a QA : Not Needed label)
  — pinging the team about it is noise. Added a test case asserting
  CLOSED+QA is dropped while OPEN+QA and CLOSED+Done are kept.
- New workflow cicd_pr_qa-stuck-check-validate.yml runs
  `node .github/scripts/qa-stuck-check/test-find-stuck-issues.js`
  on PRs that touch the script, scheduled workflow, this validator
  workflow, or the triage config. Sub-second job (zero npm deps).
  Without this, regressions would only surface on the next Mon/Thu
  13:00 UTC scheduled run.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@nollymar
Copy link
Copy Markdown
Member Author

Addressed the two items called out as "actually fix before merging" in b0f674cb07:

2. ✅ Issue.state drift — find-stuck-issues now drops items where state == 'CLOSED' and project Status is anything other than Done. Rationale: a closed issue parked in Status=QA is board-cleanup (closed-as-completed without moving the card, or closed-as-not-planned without QA : Not Needed), not pending QA. Test case asserts CLOSED+QA is dropped while OPEN+QA and CLOSED+Done are kept.

3. ✅ Tests not in CI — added cicd_pr_qa-stuck-check-validate.yml. Path-filtered pull_request trigger on the script/workflow/config; checks out the relevant files, sets up Node from .nvmrc, runs the smoke test. Zero npm deps, sub-second job. Concurrency group cancels in-progress on synchronize.

The 🟡/🟢 items I'm leaving as judgment calls (happy to address any if you want):

  • (1) YAML comment clarifying schedule-doesn't-dry-run — fair, can add.
  • (4) assignees first: 20 no-overflow guard — display-only impact.
  • (5) parseInt(PROJECT_NUMBER) validation — env is hard-coded in the YAML, lower likelihood of a typo than user input.
  • (6) set -euo pipefail in the Slack post step — defensive, harmless.
  • (8) Number.isInteger(STUCK_DAYS) to reject decimals — minor.
  • (9) Project-not-found error fires after retries: 3 — known limitation of the github-script wrapper; acceptable since the result is "fail with clear message after a few seconds."

Test suite is now 10 cases, all green; actionlint clean on both workflows.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

AI: Safe To Rollback Area : CI/CD PR changes GitHub Actions/workflows

Projects

Status: No status

Development

Successfully merging this pull request may close these issues.

1 participant