feat(ci): scheduled QA-stuck issue check with Slack notifications by nollymar · Pull Request #35802 · dotCMS/core

nollymar · 2026-05-21T22:51:45Z

Summary

New scheduled workflow cicd_scheduled_qa-stuck-check.yml runs every 3 days at 13:00 UTC and pings each team's Slack channel about issues stuck in QA.
An issue is considered stuck when, on the dotCMS - Product Planning project (testing assign #7), it has Status QA or Done, lacks both QA : Passed and QA : Not Needed labels, and the project item has not changed in 3+ days.
Team is resolved from the issue's Team : <name> label only. Team→channel mapping was added to .claude/triage-config.json.

Behavior

No issues stuck → the notify job is skipped entirely; no Slack messages are sent.
Manual dispatch → defaults to dry_run=true; rendered messages are printed to the job summary instead of posting. Override stuck_days to test other thresholds.
Scheduled run → posts to each team channel via chat.postMessage using SLACK_BOT_TOKEN. Per-team posts run in a matrix so one Slack failure doesn't block the others.

Auth note

The workflow queries an org-owned ProjectV2 board, which the default GITHUB_TOKEN cannot read. It uses secrets.CI_MACHINE_TOKEN (already used by cicd_scheduled_notify-seated-prs.yml and the slack-channel-resolver). If that token lacks the read:project scope, the first dry-run will surface a 401 and we'll need to grant it.

Test plan

node --check on the script
actionlint on the workflow
Local smoke test (node .github/scripts/qa-stuck-check/test-find-stuck-issues.js) with a fake project payload — verifies Status / label / staleness / team-resolution filters and the per-team grouping
Trigger workflow_dispatch with dry_run=true on this branch and review the job summary against the actual project state
Confirm Status field options on project testing assign #7 are exactly QA and Done (case-insensitive match in the script, but spelling must agree)
Flip dry_run=false once for a live end-to-end Slack test against the team channels before letting the schedule run unattended

🤖 Generated with Claude Code

Adds a workflow that runs every 3 days at 13:00 UTC, queries the dotCMS - Product Planning project (#7) for issues in QA/Done status that lack a QA : Passed / QA : Not Needed label and have not changed in 3+ days, and pings the responsible team's Slack channel. Team is resolved from the issue's "Team : <name>" label; team→channel mapping lives in .claude/triage-config.json. Manual dispatch supports dry-run (default) so the rendered messages can be reviewed before any real Slack post. If no issues qualify, the notify job is skipped and no message is sent. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

github-actions · 2026-05-21T22:51:55Z

❌ Issue Linking Required

This PR could not be linked to an issue. All PRs must be linked to an issue for tracking purposes.

How to fix this:

Option 1: Add keyword to PR body (Recommended - auto-removes this comment)
Edit this PR description and add one of these lines:

This PR fixes #123 or Fixes: #123
This PR closes #123 or Closes: #123
This PR resolves #123 or Resolves: #123
Other supported keywords: fix, fixed, close, closed, resolve, resolved
Option 2: Link via GitHub UI (Note: won't clear the failed check)

Go to the PR → Development section (right sidebar)
Click "Link issue" and select an existing issue
Push a new commit or re-run the workflow to clear the failed check
Option 3: Use branch naming
Create a new branch with one of these patterns:

123-feature-description (number at start)
issue-123-feature-description (issue-number at start)
feature-issue-123 (issue-number anywhere)

Why is this required?

Issue linking ensures proper tracking, documentation, and helps maintain project history. It connects your code changes to the problem they solve.---

This comment was automatically generated by the issue linking workflow

claude · 2026-05-21T22:52:33Z

Claude finished @nollymar's task in 2m 29s —— View job

PR Review

Read the changed files
Analyze the find-stuck-issues.js script for bugs / edge cases
Analyze the workflows
Check the test file coverage
Check the triage-config.json change

Overall the script is well-defended (loud failures for misconfigured Status field / null project / unparseable STUCK_DAYS, labels-pagination skip, multi-team grouping, CLOSED+QA drop). Tests cover the main risk surfaces. Below are the things worth a second look.

🟠 `assertStatusFieldVisible` ratio is computed over all project items, not just Issues

find-stuck-issues.js:173-190 checks withStatus / items.length, where items is the full ProjectV2 page including PullRequest and DraftIssue content. The filter to __typename === 'Issue' happens later, in the per-item loop.

If project #7 contains a meaningful number of PRs or draft items that don't have a Status field value set, the denominator gets diluted and the run can fail with a misleading "field falls outside the GraphQL fieldValues page" error even though every actual Issue has Status.

Two options:

Filter to item.content?.__typename === 'Issue' before computing the ratio, or
Loosen the failure mode (warn instead of fail) and rely on the per-item silent skip.

I'd lean toward (1) — keeps the loud-failure behavior but scopes it to the population the script actually cares about. Fix this →

🟡 `workflow_dispatch` input `stuck_days: 0` short-circuits validation

cicd_scheduled_qa-stuck-check.yml:37:

STUCK_DAYS: ${{ inputs.stuck_days || '3' }}

In GitHub Actions expression syntax, 0 || '3' → '3'. So a user dispatching with stuck_days=0 silently gets the default 3 instead of the loud STUCK_DAYS must be >= 1 failure that validateConfig would produce. The test testInvalidStuckDaysFailsLoudly covers '0' because it sets the env var directly — it doesn't exercise the YAML coalesce path.

Minor (since validating 0 mostly catches typos), but the test gives a false sense of coverage for that specific code path. Either drop the || '3' (rely on default: 3) or use ${{ inputs.stuck_days != '' && inputs.stuck_days || '3' }}.

🟡 Slack `esc` only handles the three required chars; titles with `*` / `_` / backticks render with formatting

cicd_scheduled_qa-stuck-check.yml:102-105 escapes &, <, >. Slack docs do say those are the only mandatory escapes, but a real-world issue title like Fix *thing* or add __init__.py will get rendered as bold/italic in the message. Cosmetic, not a bug — flagging because the PR description called out the escaping work explicitly. If you want belt-and-braces, the usual trick is to wrap titles in a <{url}|{text}> link (already done — link text is parsed as mrkdwn but with fewer interpretations) or replace `*_~`` with their entity/zero-width equivalents.

🟡 Job-summary markdown isn't escaped at all

find-stuck-issues.js:209-211 builds:

- [#${e.number}](${e.url}) — ${e.title} · status: ${e.status} ...

A title containing ] followed by ( (rare but possible: Tracking issue [follow-up] (see #X)), or unbalanced backticks, will mangle the GH job summary rendering. Lower stakes than Slack since this is internal-only — worth knowing but probably fine to leave.

🟢 `Post to Slack` step lacks `set -euo pipefail`

cicd_scheduled_qa-stuck-check.yml:147-159 — already acknowledged in PR description point (6). The script will still surface failures via the ok != true check, but a curl-level failure (DNS, TLS, transient 5xx) produces an empty $response, which jq parses fine and the message becomes Slack post failed: (empty). Adding set -euo pipefail + checking curl's exit would give a clearer signal. Cheap fix.

🟢 No Slack retry on 429

A chat.postMessage 429 with Retry-After fails the matrix job (fail-fast: false, so other teams still post). The scheduled cadence is Mon/Thu so it's unlikely to hit rate limits across ~4 teams, but if Slack returns 429 the only recovery is manual re-dispatch. Acceptable for this volume; flagging only because the workflow has no observability on it beyond the failed job.

🟢 Minor observations (no action needed)

Number.isFinite(STUCK_DAYS) && STUCK_DAYS >= 1 accepts 3.7 → parseInt floors it to 3. Already acknowledged.
mkItem test helper defaults state based on status, which is convenient but means the existing tests don't exercise OPEN+Done (a legitimate-but-unusual state — open issue parked in Done). Script reports it; test doesn't assert.
assertStatusFieldVisible short-circuits when items.length === 0, so the null-project / empty-board case correctly skips the ratio check.
parseInt(PROJECT_NUMBER || '7', 10) is fine since the value is hardcoded in the workflow; acknowledged.

Verdict

Tightly written. The one item I'd actually fix before merging is (1) — assertStatusFieldVisible over the unfiltered population is the most likely cause of a future "scheduled run fails for no apparent reason" page. The rest are nice-to-haves.
· Branch

@user

- Cron now '0 13 * * 1,4' (Mon/Thu) instead of '*/3' day-of-month, which produced uneven cadence across month boundaries. - Bumped GraphQL caps: fieldValues 20->50, labels 30->50, assignees 10->20, so the Status / Team labels can't silently fall off the page. - Added a Status-field presence assertion: fails the run if fewer than half of items expose a readable Status (surfaces schema mismatches in dry-run instead of returning empty results). - Slack message wording now says "no project activity for Nd" / "last project update Nd ago" rather than "stuck for Nd", reflecting that ProjectV2Item.updatedAt is a proxy and any project-field edit resets the clock. Same caveat in the job summary. - Dropped "(cc @user)" framing: Slack rendered it as plain text, not an actual mention. Assignees are still listed for context. - Multiple "Team : X" labels on one issue now report the issue under each matched team (previously only the first label-order match got notified). - Added a guard for projectV2 == null (renamed project / token without read:project scope) so dry-runs fail with a clear error rather than a "Cannot read properties of null" stack. - stuck_days is now embedded on each group; matrix builder no longer needs to re-resolve the env var. - Test suite expanded to cover: empty project, multi-team labels, missing Status field (fail loudly), partial-missing Status (warn, continue), null projectV2, and a team configured without a slack channel. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

nollymar · 2026-05-21T23:07:56Z

Addressed in 93afc3f:

🔴 Cron — 0 13 */3 * * → 0 13 * * 1,4 (Mon/Thu 13:00 UTC). Even cadence, no month-boundary glitches.
🟠 fieldValues(first: 20) — bumped to 50; labels 30→50; assignees 10→20. Added a presence assertion that fails the run when fewer than 50% of items expose a readable Status (surfaces "wrong field name" or "still capped" in dry-run rather than silently returning empty).
🟠 updatedAt semantics — Slack header now reads "N QA/Done issue(s) with no project activity for Xd+ days" and each line "last project update Nd ago"; trailing paragraph spells out the caveat: "Metric: time since the project item last changed (any field). May undercount if other fields were edited recently." Same wording in the job-summary.
🟡 Slack @-cc — dropped, since @name was plain text not a mention. Assignees still listed (· assignees: foo, bar).
🟡 Multi Team label — issue now reports under every matched team, not just first-in-label-order. Test case added.
🟡 STUCK_DAYS propagation — embedded on each group (stuck_days: N); matrix job reads group.stuck_days instead of re-resolving via workflow env.
🟢 Null-project guard — explicit setFailed with a "token may lack read:project scope" hint instead of Cannot read properties of null. Test case added.
🟢 Fragile channel deref — channel is now stored on the team accumulator at insert time, removing the teamChannels[team] re-lookup in grouping.
🟢 Tests — expanded from 1 scenario to 7: main, empty project, multi-team labels, missing-Status loud failure, partial-Status warn-and-continue, null project, team-without-slack-channel ignored. node test-find-stuck-issues.js runs them all; actionlint clean.

Second pass on review feedback. - Validate STUCK_DAYS up front: must be a finite integer >= 1 or the run fails. Without this, parseInt('abc') -> NaN -> cutoff invalid -> every QA/Done item passes the staleness gate (false positive storm). Same risk for 0 / -1, which slid cutoff to now/future. - workflow_dispatch.inputs.stuck_days is now type: number so GH rejects non-numeric input at dispatch time. - Labels page-size: bumped to 100 and the query now returns pageInfo. Issues whose labels overflow page 1 are skipped with a warning rather than evaluated against an incomplete label set (the QA : Passed gate could otherwise produce false positives). - Concurrency group "qa-stuck-check" with cancel-in-progress: false so a manual dry-run can't overlap a scheduled live post. - Escape Slack mrkdwn-sensitive chars (&, <, >) in team / status / title / assignee strings so a title like "Fix <X> & <Y>" doesn't break the <url|text> link syntax. jq --arg still handles JSON escaping; this is the orthogonal mrkdwn layer. - Tests: added invalid-STUCK_DAYS path and labels-overflow path. Suite is now 9 cases. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

nollymar · 2026-05-21T23:44:15Z

Second-review pass landed in 18197c7:

Bugs / risks

✅ parseInt(STUCK_DAYS) unguarded — added validateConfig: finite integer ≥ 1 or the run fails up front. workflow_dispatch.inputs.stuck_days switched from string to number so GH rejects bad input at dispatch time. Test asserts abc / 0 / -1 all throw (empty stays valid since it's how scheduled triggers signal "use default").
✅ Labels pagination — bumped labels(first: 50) → labels(first: 100), query now returns pageInfo, and issues with hasNextPage: true are skipped with a warning rather than evaluated against an incomplete label set. Test asserts the skip + warning.
✅ Concurrency — added concurrency: { group: qa-stuck-check, cancel-in-progress: false }. Manual dry-run can't race the scheduled run.
⏸ updatedAt proxy false-negative shape (Priority/Iteration bumps reset the clock) — acknowledged limitation; no signal change short of a different data source. Open to revisiting cadence (weekly × 7-day vs Mon/Thu × 3-day) but leaving as-is for now since the wording already calls out "any project-field edit resets the clock".
✅ Slack mrkdwn escaping — &, <, > now escaped in team / status / title / assignee text via an esc helper. jq --arg continues to handle JSON-level escaping; this fixes the orthogonal mrkdwn layer (e.g. titles like Fix <X> & <Y>).

Smaller stuff
6. ✅ inputs.stuck_days type → number.
7. ⏸ STATUS_PRESENCE_MIN_RATIO left hardcoded — one knob, internal heuristic, not worth widening the env surface.
8. ⏸ DST drift in daysStuck — acknowledged, no action.
9-10. ⏸ No action (existing convention).
11. ✅ Tests for NaN/invalid stuck_days and >100 labels both added. Suite is now 9 cases, all green.

actionlint and node --check clean; full test suite passes.

Addresses the two items the second review flagged as fix-before-merge. - find-stuck-issues now drops items where state == 'CLOSED' and the project Status is anything other than Done. A closed issue parked in Status=QA is board-cleanup (closed-as-completed without moving the card, or closed-as-not-planned without a QA : Not Needed label) — pinging the team about it is noise. Added a test case asserting CLOSED+QA is dropped while OPEN+QA and CLOSED+Done are kept. - New workflow cicd_pr_qa-stuck-check-validate.yml runs `node .github/scripts/qa-stuck-check/test-find-stuck-issues.js` on PRs that touch the script, scheduled workflow, this validator workflow, or the triage config. Sub-second job (zero npm deps). Without this, regressions would only surface on the next Mon/Thu 13:00 UTC scheduled run. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

nollymar · 2026-05-21T23:58:32Z

Addressed the two items called out as "actually fix before merging" in b0f674cb07:

2. ✅ Issue.state drift — find-stuck-issues now drops items where state == 'CLOSED' and project Status is anything other than Done. Rationale: a closed issue parked in Status=QA is board-cleanup (closed-as-completed without moving the card, or closed-as-not-planned without QA : Not Needed), not pending QA. Test case asserts CLOSED+QA is dropped while OPEN+QA and CLOSED+Done are kept.

3. ✅ Tests not in CI — added cicd_pr_qa-stuck-check-validate.yml. Path-filtered pull_request trigger on the script/workflow/config; checks out the relevant files, sets up Node from .nvmrc, runs the smoke test. Zero npm deps, sub-second job. Concurrency group cancels in-progress on synchronize.

The 🟡/🟢 items I'm leaving as judgment calls (happy to address any if you want):

(1) YAML comment clarifying schedule-doesn't-dry-run — fair, can add.
(4) assignees first: 20 no-overflow guard — display-only impact.
(5) parseInt(PROJECT_NUMBER) validation — env is hard-coded in the YAML, lower likelihood of a typo than user input.
(6) set -euo pipefail in the Slack post step — defensive, harmless.
(8) Number.isInteger(STUCK_DAYS) to reject decimals — minor.
(9) Project-not-found error fires after retries: 3 — known limitation of the github-script wrapper; acceptable since the result is "fail with clear message after a few seconds."

Test suite is now 10 cases, all green; actionlint clean on both workflows.

nollymar requested a review from a team as a code owner May 21, 2026 22:51

github-project-automation Bot added this to dotCMS - Product Planning May 21, 2026

github-actions Bot added the Area : CI/CD PR changes GitHub Actions/workflows label May 21, 2026

claude Bot added the AI: Safe To Rollback label May 21, 2026

github-actions Bot removed the AI: Safe To Rollback label May 21, 2026

claude Bot added the AI: Safe To Rollback label May 21, 2026

github-actions Bot removed the AI: Safe To Rollback label May 21, 2026

claude Bot added the AI: Safe To Rollback label May 21, 2026

github-actions Bot removed the AI: Safe To Rollback label May 21, 2026

claude Bot added the AI: Safe To Rollback label May 22, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(ci): scheduled QA-stuck issue check with Slack notifications#35802

feat(ci): scheduled QA-stuck issue check with Slack notifications#35802
nollymar wants to merge 4 commits into
mainfrom
worktree-qa-stuck-prs-automation

nollymar commented May 21, 2026

Uh oh!

github-actions Bot commented May 21, 2026

Uh oh!

claude Bot commented May 21, 2026 •

edited

Loading

Uh oh!

nollymar commented May 21, 2026

Uh oh!

nollymar commented May 21, 2026

Uh oh!

nollymar commented May 21, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

nollymar commented May 21, 2026

Summary

Behavior

Auth note

Test plan

Uh oh!

github-actions Bot commented May 21, 2026

❌ Issue Linking Required

How to fix this:

Why is this required?

Uh oh!

claude Bot commented May 21, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR Review

🟠 assertStatusFieldVisible ratio is computed over all project items, not just Issues

🟡 workflow_dispatch input stuck_days: 0 short-circuits validation

🟡 Slack esc only handles the three required chars; titles with * / _ / backticks render with formatting

🟡 Job-summary markdown isn't escaped at all

🟢 Post to Slack step lacks set -euo pipefail

🟢 No Slack retry on 429

🟢 Minor observations (no action needed)

Verdict

Uh oh!

nollymar commented May 21, 2026

Uh oh!

nollymar commented May 21, 2026

Uh oh!

nollymar commented May 21, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

claude Bot commented May 21, 2026 •

edited

Loading

🟠 `assertStatusFieldVisible` ratio is computed over all project items, not just Issues

🟡 `workflow_dispatch` input `stuck_days: 0` short-circuits validation

🟡 Slack `esc` only handles the three required chars; titles with `*` / `_` / backticks render with formatting

🟢 `Post to Slack` step lacks `set -euo pipefail`