feat(ci): scheduled QA-stuck issue check with Slack notifications#35802
nollymar wants to merge 4 commits into
Adds a workflow that runs every 3 days at 13:00 UTC, queries the dotCMS - Product Planning project (#7) for issues in QA/Done status that lack a QA : Passed / QA : Not Needed label and have not changed in 3+ days, and pings the responsible team's Slack channel.

Team is resolved from the issue's "Team : <name>" label; team→channel mapping lives in .claude/triage-config.json.

Manual dispatch supports dry-run (default) so the rendered messages can be reviewed before any real Slack post. If no issues qualify, the notify job is skipped and no message is sent.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
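The qualifying rule and team resolution described in this commit can be sketched roughly as below. This is an illustration, not the PR's script: the item shape (`status`, `updatedAt`, `labels`), the helper names `isStuck` and `groupByTeam`, and the team names are all assumptions. The sketch also ignores issues that carry no "Team : <name>" label, which the source does not specify.

```javascript
const DAY_MS = 24 * 60 * 60 * 1000;
// Labels that clear an issue from the check (names taken from the PR text).
const QA_CLEARED = ['QA : Passed', 'QA : Not Needed'];

// Hypothetical item shape: { number, status, updatedAt, labels: string[] }.
function isStuck(item, stuckDays, now) {
  const status = (item.status || '').toLowerCase();
  if (status !== 'qa' && status !== 'done') return false;
  if (item.labels.some((label) => QA_CLEARED.includes(label))) return false;
  const cutoff = now.getTime() - stuckDays * DAY_MS;
  // updatedAt is only a proxy for "no project activity".
  return new Date(item.updatedAt).getTime() <= cutoff;
}

// Report an issue under every "Team : <name>" label it carries.
function groupByTeam(items) {
  const groups = new Map();
  for (const item of items) {
    for (const label of item.labels) {
      const match = /^Team : (.+)$/.exec(label);
      if (!match) continue;
      const team = match[1].trim();
      if (!groups.has(team)) groups.set(team, []);
      groups.get(team).push(item);
    }
  }
  return groups;
}

// Usage with made-up data and a fixed clock so the result is reproducible.
const sampleNow = new Date('2024-01-10T13:00:00Z');
const sampleItems = [
  { number: 101, status: 'QA', updatedAt: '2024-01-05T09:00:00Z', labels: ['Team : Alpha', 'Team : Beta'] },
  { number: 102, status: 'Done', updatedAt: '2024-01-01T09:00:00Z', labels: ['Team : Alpha', 'QA : Passed'] },
  { number: 103, status: 'QA', updatedAt: '2024-01-09T09:00:00Z', labels: ['Team : Beta'] },
];
const sampleStuck = sampleItems.filter((item) => isStuck(item, 3, sampleNow));
for (const [team, issues] of groupByTeam(sampleStuck)) {
  console.log(team, issues.map((issue) => issue.number));
}
```

With this sample data only issue 101 passes all three gates, and it is listed once under each of its two team labels.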
❌ Issue Linking Required
This PR could not be linked to an issue. All PRs must be linked to an issue for tracking purposes.
How to fix this:
Option 1: Add keyword to PR body (Recommended - auto-removes this comment)
Why is this required?
Issue linking ensures proper tracking, documentation, and helps maintain project history. It connects your code changes to the problem they solve.
This comment was automatically generated by the issue linking workflow
|
Claude finished @nollymar's task in 2m 29s
PR Review
Overall the script is well-defended (loud failures for misconfigured Status field / null project / unparseable 🟠
|
- Cron now '0 13 * * 1,4' (Mon/Thu) instead of '*/3' day-of-month,
  which produced uneven cadence across month boundaries.
- Bumped GraphQL caps: fieldValues 20->50, labels 30->50, assignees
  10->20, so the Status / Team labels can't silently fall off the page.
- Added a Status-field presence assertion: fails the run if fewer than
  half of items expose a readable Status (surfaces schema mismatches in
  dry-run instead of returning empty results).
- Slack message wording now says "no project activity for Nd" / "last
  project update Nd ago" rather than "stuck for Nd", reflecting that
  ProjectV2Item.updatedAt is a proxy and any project-field edit resets
  the clock. Same caveat in the job summary.
- Dropped "(cc @user)" framing: Slack rendered it as plain text, not an
  actual mention. Assignees are still listed for context.
- Multiple "Team : X" labels on one issue now report the issue under
  each matched team (previously only the first label-order match got
  notified).
- Added a guard for projectV2 == null (renamed project / token without
  read:project scope) so dry-runs fail with a clear error rather than a
  "Cannot read properties of null" stack.
- stuck_days is now embedded on each group; matrix builder no longer
  needs to re-resolve the env var.
- Test suite expanded to cover: empty project, multi-team labels,
  missing Status field (fail loudly), partial-missing Status (warn,
  continue), null projectV2, and a team configured without a slack
  channel.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
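The Status-field presence assertion from this commit might look something like the sketch below. The function name `assertStatusReadable`, the flattened item shape, and the message wording are assumptions made for illustration; only the "fewer than half fails, partial gaps warn" behavior comes from the commit message and its test list.

```javascript
// Hypothetical helper: each item has `status` set to a string when the
// project's Status field was readable, or null/undefined when it was not.
function assertStatusReadable(items, warn = console.warn) {
  if (items.length === 0) return; // empty project: nothing to assert

  const readable = items.filter(
    (item) => typeof item.status === 'string' && item.status !== ''
  ).length;

  // Fewer than half readable: treat as a schema or permission mismatch.
  if (readable * 2 < items.length) {
    throw new Error(
      `Status readable on only ${readable}/${items.length} items; ` +
        'check the field name and the token scope'
    );
  }

  // Some gaps, but most items are fine: warn and continue.
  if (readable < items.length) {
    warn(`Status missing on ${items.length - readable}/${items.length} items; continuing`);
  }
}

// Usage: two of three items expose a Status, so this warns instead of throwing.
assertStatusReadable([{ status: 'QA' }, { status: 'Done' }, { status: null }]);
```

Failing loudly here means a renamed Status field shows up as an error in a dry-run rather than as a misleading "no stuck issues" result.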
|
Addressed in 93afc3f:
|
Second pass on review feedback.
- Validate STUCK_DAYS up front: must be a finite integer >= 1 or the
run fails. Without this, parseInt('abc') -> NaN -> cutoff invalid ->
every QA/Done item passes the staleness gate (false positive
storm). Same risk for 0 / -1, which slid cutoff to now/future.
- workflow_dispatch.inputs.stuck_days is now type: number so GH
rejects non-numeric input at dispatch time.
- Labels page-size: bumped to 100 and the query now returns pageInfo.
Issues whose labels overflow page 1 are skipped with a warning
rather than evaluated against an incomplete label set (the
QA : Passed gate could otherwise produce false positives).
- Concurrency group "qa-stuck-check" with cancel-in-progress: false
so a manual dry-run can't overlap a scheduled live post.
- Escape Slack mrkdwn-sensitive chars (&, <, >) in team / status /
title / assignee strings so a title like "Fix <X> & <Y>" doesn't
break the <url|text> link syntax. jq --arg still handles JSON
escaping; this is the orthogonal mrkdwn layer.
- Tests: added invalid-STUCK_DAYS path and labels-overflow path.
Suite is now 9 cases.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
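Two of the fixes above, up-front STUCK_DAYS validation and mrkdwn escaping, can be sketched as follows. The helper names `parseStuckDays` and `escapeMrkdwn` and the example URL are hypothetical. The digits-only check is stricter than `parseInt` alone (which would accept inputs such as "3abc") and is an assumption about how the validation could be done, not a copy of the PR's code.

```javascript
// Validate the threshold before computing any cutoff date.
function parseStuckDays(raw) {
  const text = String(raw).trim();
  // Digits only: rejects "abc", "-1", "1.5", "3abc" and the empty string.
  if (!/^\d+$/.test(text)) {
    throw new Error(`STUCK_DAYS must be an integer >= 1, got "${raw}"`);
  }
  const days = Number(text);
  if (!Number.isSafeInteger(days) || days < 1) {
    throw new Error(`STUCK_DAYS must be an integer >= 1, got "${raw}"`);
  }
  return days;
}

// Escape the three characters Slack treats as control characters in mrkdwn.
// The ampersand goes first so already-produced entities are not re-escaped.
function escapeMrkdwn(text) {
  return String(text)
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;');
}

// Usage: the title can no longer break the <url|text> link syntax.
const exampleUrl = 'https://example.com/issues/1'; // placeholder URL
const exampleTitle = 'Fix <X> & <Y>';
console.log(`<${exampleUrl}|${escapeMrkdwn(exampleTitle)}>`);
console.log(parseStuckDays('3'));
```

JSON escaping of the final payload is a separate layer; the escaping shown here only protects Slack's own link and mention syntax.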
|
Second-review pass landed in 18197c7:
Bugs / risks
Smaller stuff
|
Addresses the two items the second review flagged as fix-before-merge.
- find-stuck-issues now drops items where state == 'CLOSED' and the
  project Status is anything other than Done. A closed issue parked in
  Status=QA is board-cleanup (closed-as-completed without moving the
  card, or closed-as-not-planned without a QA : Not Needed label), so
  pinging the team about it is noise. Added a test case asserting
  CLOSED+QA is dropped while OPEN+QA and CLOSED+Done are kept.
- New workflow cicd_pr_qa-stuck-check-validate.yml runs
  `node .github/scripts/qa-stuck-check/test-find-stuck-issues.js` on
  PRs that touch the script, scheduled workflow, this validator
  workflow, or the triage config. Sub-second job (zero npm deps).
  Without this, regressions would only surface on the next Mon/Thu
  13:00 UTC scheduled run.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
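The closed-issue rule in the first bullet can be illustrated with a small filter. The function name `dropClosedOutsideDone` and the item shape are assumptions for this sketch; the three cases in the usage mirror the test case the commit describes.

```javascript
// Drop items whose issue is CLOSED unless the project Status is Done.
// Status comparison is case-insensitive, as the PR description notes.
function dropClosedOutsideDone(items) {
  return items.filter((item) => {
    const status = (item.status || '').toLowerCase();
    return !(item.state === 'CLOSED' && status !== 'done');
  });
}

// Usage with made-up issue numbers.
const closedSample = [
  { number: 1, state: 'OPEN', status: 'QA' },
  { number: 2, state: 'CLOSED', status: 'QA' },
  { number: 3, state: 'CLOSED', status: 'Done' },
];
console.log(dropClosedOutsideDone(closedSample).map((item) => item.number));
```

Only the CLOSED item parked in QA is removed; the OPEN+QA and CLOSED+Done items stay eligible for the later label and staleness checks.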
|
Addressed the two items called out as "actually fix before merging" in
2. ✅
3. ✅ Tests not in CI: added
The 🟡/🟢 items I'm leaving as judgment calls (happy to address any if you want):
Test suite is now 10 cases, all green; |
Summary
- `cicd_scheduled_qa-stuck-check.yml` runs every 3 days at 13:00 UTC and pings each team's Slack channel about issues stuck in QA.
- An issue qualifies when its project Status is `QA` or `Done`, it lacks both `QA : Passed` and `QA : Not Needed` labels, and the project item has not changed in 3+ days.
- Team is resolved from the `Team : <name>` label only. Team→channel mapping was added to `.claude/triage-config.json`.

Behavior
- Manual dispatch defaults to `dry_run=true`; rendered messages are printed to the job summary instead of posting. Override `stuck_days` to test other thresholds.
- Live runs post via `chat.postMessage` using `SLACK_BOT_TOKEN`. Per-team posts run in a matrix so one Slack failure doesn't block the others.

Auth note
The workflow queries an org-owned ProjectV2 board, which the default `GITHUB_TOKEN` cannot read. It uses `secrets.CI_MACHINE_TOKEN` (already used by `cicd_scheduled_notify-seated-prs.yml` and the slack-channel-resolver). If that token lacks the `read:project` scope, the first dry-run will surface a 401 and we'll need to grant it.

Test plan
- `node --check` on the script
- `actionlint` on the workflow
- Test suite (`node .github/scripts/qa-stuck-check/test-find-stuck-issues.js`) with a fake project payload: verifies Status / label / staleness / team-resolution filters and the per-team grouping
- Run `workflow_dispatch` with `dry_run=true` on this branch and review the job summary against the actual project state
- Confirm the `Status` field options on project #7 are exactly `QA` and `Done` (case-insensitive match in the script, but spelling must agree)
- Run with `dry_run=false` once for a live end-to-end Slack test against the team channels before letting the schedule run unattended

🤖 Generated with Claude Code