Skip to content

feat(mt#1188): Reviewer scope-aware rigor for trivial and docs-only PRs#767

Merged
edobry merged 1 commit into
mainfrom
task/mt-1188
Apr 26, 2026
Merged

feat(mt#1188): Reviewer scope-aware rigor for trivial and docs-only PRs#767
edobry merged 1 commit into
mainfrom
task/mt-1188

Conversation

@minsky-ai
Copy link
Copy Markdown
Contributor

@minsky-ai minsky-ai Bot commented Apr 24, 2026

Summary

Calibration trigger: PR #703 received consecutive REQUEST_CHANGES on a trivial PR with pedantic-but-valid findings. The adversarial framing is correct; the severity threshold was miscalibrated for small-surface changes. This PR introduces scope-aware prompt variants that reserve BLOCKING severity for a narrower set of findings on trivial / docs-only / test-only PRs.

What changed

1. PR scope classifier (services/reviewer/src/pr-scope.ts, new)

classifyPRScope({ diff, filesChanged, prBody }) returns one of:

  • docs-only — every changed file is documentation (README, .md, .mdx, docs/, CHANGELOG, LICENSE)
  • test-only — every changed file is a test file (*.test.ts, *.spec.ts, tests/)
  • trivial — fewer than 10 changed lines AND fewer than 3 changed files (not pure docs/test)
  • normal — everything else; full rigor unchanged

Precedence: docs-only > test-only > trivial > normal.

Opt-out marker: <!-- minsky:trivial --> in PR body forces trivial regardless of file list.

scopeBucketFor(scope) collapses docs-only and trivial to "trivial-or-docs" for prompt-variant selection (both warrant the same lower-rigor clause).

2. Scope-aware prompt variant (services/reviewer/src/prompt.ts)

buildCriticConstitution(toolsAvailable, scope) — scope defaults to "normal" for backwards compatibility.

For "trivial-or-docs": injects ## Scope-aware calibration between Principles and Failure modes. Reserves BLOCKING for: (a) security, (b) data-loss/correctness, (c) scope creep, (d) license/legal. All other findings must be NON-BLOCKING; prefers COMMENT over REQUEST_CHANGES when all findings are non-blocking.

For "test-only": same structure, test-specific BLOCKING categories: (a) test that does not assert the claim, (b) test that stubs around a bug, (c) flakiness/race conditions, (d) test deletion without replacement.

For "normal": no extra section — byte-identical to the pre-mt#1188 prompt. This is the spec-required no-op invariant.

3. Worker wiring (services/reviewer/src/review-worker.ts, github-client.ts, server.ts)

  • fetchPullRequestContext now fetches filesChanged via pulls.listFiles (parallel with the existing diff/PR fetches)
  • runReview calls classifyPRScope + scopeBucketFor after context fetch, passes scope bucket to buildCriticConstitution
  • ReviewResult gains an optional scope: PRScope field
  • server.ts review_result log line includes scope (follows the taskSpecFetch pattern from mt#1187)

Test counts

  • pr-scope.test.ts: 30 new cases covering each scope rule, precedence, opt-out marker, edge cases
  • prompt.test.ts: 11 new cases covering each scope variant, no-change invariant for normal, ordering assertion, toolsAvailable=false compatibility
  • review-worker.test.ts: 3 new cases verifying ReviewResult.scope field type contract

Acceptance criteria verification

  • 1-line README PR with no substantive issues receives COMMENT: trivial-or-docs clause instructs COMMENT preference when all findings are non-blocking
  • 1-line README PR with security issue receives REQUEST_CHANGES: category (a) security is in the BLOCKING-allowed list
  • 100-line refactor retains REQUEST_CHANGES threshold: normal scope = byte-identical prompt
  • Log line includes scope: scope field added to review_result JSON
  • Opt-out marker routes through lower-rigor: <!-- minsky:trivial --> forces trivial to trivial-or-docs bucket

Relations

@minsky-ai minsky-ai Bot added the authorship/co-authored Co-authored by human and AI agent label Apr 24, 2026
Copy link
Copy Markdown

@minsky-reviewer minsky-reviewer Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Independent adversarial review (Chinese-wall)
Reviewer: minsky-reviewer[bot] via openai:gpt-5
Tier: unknown


reviewer-service error: chain-of-thought leakage detected

The upstream model emitted raw internal reasoning into the review body. The reviewer service sanitised the output but could not locate a valid Findings section to preserve, so the leaked content was discarded. The PR will receive a fresh review on the next commit. See docs/architecture/critic-constitution-reliability.md for details.

Copy link
Copy Markdown
Contributor Author

@minsky-ai minsky-ai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review: scope-aware rigor for trivial/docs/test-only PRs

CI status: 1/2 green (Prevent Placeholder Tests ✅), build still in progress.

Findings

No blocking findings. Solid implementation across 8 files with 44 new tests.

[NON-BLOCKING] services/reviewer/src/review-worker.ts:222-228classifyPRScope runs before the routing short-circuit at line 230-232 (skipped reviews discard the computation). Minor wasted work; not incorrect. A future tightening could move classification below the routing check, but keeping it here keeps the happy path linear. Not blocking.

[NON-BLOCKING] services/reviewer/src/github-client.ts:72-77listFiles is called with per_page: 300. The inline comment acknowledges the 3000-file GitHub cap and correctly notes the classifier is advisory. A PR with 300+ files would truncate, but such a PR would never classify as trivial/docs/test-only anyway (the every check on 300 files already forces normal). Acceptable as-is.

[NON-BLOCKING] services/reviewer/src/pr-scope.ts:81-85 — the opt-out marker override runs before the empty-files guard. If a user ships a PR with <!-- minsky:trivial --> and somehow an empty files list (unlikely), they'd still get trivial. This is actually the desired behavior (explicit user intent wins), and the test at pr-scope.test.ts:159-164 verifies it. Not a finding — calling out the ordering for review completeness.

Checked and clear

  • Byte-identical invariant for normal scope: buildScopeCalibrationSection("normal") returns "", and the principlesBlock template preserves the exact pre-mt#1188 structure. Verified by the explicit test at prompt.test.ts:57-64.
  • Backwards compat: CRITIC_CONSTITUTION = buildCriticConstitution(true) uses the defaulted scope = "normal" — unchanged output. Legacy-export test at prompt.test.ts:68-70.
  • Precedence ordering: docs-only > test-only > trivial > normal, with opt-out as an override. Verified in code (pr-scope.ts:80-107) and tests (pr-scope.test.ts:135-145).
  • Boundary correctness: < not <= for both line count and file count — 10 lines exactly = normal, 3 files exactly = normal. Tests at pr-scope.test.ts:97-107 assert both boundaries.
  • Pattern coverage: DOCS_FILE_PATTERN handles docs/, .md, .mdx, README(.ext)?, CHANGELOG(.ext)?, LICENSE(.ext)? case-insensitively. TEST_FILE_PATTERN handles tests/, *.test.{ts,tsx,js,jsx}, *.spec.{ts,tsx,js,jsx}. Comprehensive.
  • Section placement: calibration section sits between Principles and Failure Modes. Ordering test at prompt.test.ts:131-137 verifies this.
  • countChangedLines edge cases: correctly excludes +++/--- diff headers; doesn't count context lines (which start with space) or hunk markers (@@). Consistent with git diff --stat semantics.
  • Propagation: scope: prScope is returned at all three runReview exit points (retry path, error path, success path at lines 346, 411, 435). Log line at server.ts:65 includes it.
  • Test coverage: 30 scope-classifier cases + 11 prompt-variant cases + 3 ReviewResult type cases = 44 new tests. Coverage across all spec acceptance criteria.

Spec verification

Task: mt#1188

Criterion Status Evidence
Measurable reduction in REQUEST_CHANGES rate on docs-only / <10-line PRs N/A Requires post-deploy observation — scope log field enables the measurement
Blocking findings reserved for correctness / security / scope-creep on trivial PRs Met prompt.ts:SCOPE_CALIBRATION_TRIVIAL_OR_DOCS — 4 enumerated BLOCKING categories: (a) security, (b) data-loss/correctness, (c) scope creep, (d) license/legal
Large / risky PR behavior unchanged (no regression on #679, #683) Met normal scope byte-identical to pre-mt#1188; prompt.test.ts:57-64 asserts default path unchanged
PR-scope detection from metadata + diff size Met classifyPRScope uses filesChanged, diff line count, prBody marker
Live metrics captured — log line includes scope bucket + event Met server.ts:65 adds scope field to review_result JSON log
Acceptance: 1-line README PR with no substantive issues → COMMENT Met trivial-or-docs clause instructs "Prefer COMMENT over REQUEST_CHANGES when all findings are non-blocking"
Acceptance: 1-line README PR with security issue → REQUEST_CHANGES Met Category (a) Security in BLOCKING-allowed list
Acceptance: 100-line refactor retains threshold Met normal = byte-identical invariant
Acceptance: Opt-out marker routes through lower-rigor Met pr-scope.ts:81-83 + test pr-scope.test.ts:145-157

Documentation impact

No update needed — this is reviewer-service internal calibration. No architecture, API surface, command registry, or user-facing contract changes. The scope classifier is advisory and does not alter the reviewer's event taxonomy (APPROVE/REQUEST_CHANGES/COMMENT) or the Chinese-wall review architecture. A note in memory/project_mt1110_calibration_data.md would be appropriate post-deploy to track observed calibration deltas, but that's memory maintenance, not repo docs.


(Had Claude look into this — AI-assisted review.)

Copy link
Copy Markdown
Contributor Author

@minsky-ai minsky-ai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review: Reviewer scope-aware rigor for trivial and docs-only PRs

CI status: All checks passed (build + placeholder-test guard, 2/2 green)

Findings

[NON-BLOCKING] services/reviewer/src/github-client.ts:73listFiles failure in Promise.all propagates to review-worker, breaking the entire review pipeline.

fetchPullRequestContext now calls three APIs in Promise.all. If pulls.listFiles fails (e.g., a PR with >3000 changed files triggers a GitHub 422, or a transient 5xx), the entire Promise.all rejects and the review is abandoned with status: "error". Previously the function only called two APIs. The scope classification is described as "advisory" in the code comments, but the implementation treats filesChanged as required — it's in the interface as a required field, not string[] | undefined. A simple fallback: octokit.rest.pulls.listFiles(...).catch(() => ({ data: [] })) would give filesChanged: [], which routes to "normal" (the most conservative scope), preserving the pre-PR fail-open behavior while eliminating the new failure surface. This is non-blocking because (a) listFiles failures on well-formed PRs are rare in practice, (b) the scope check was not present before so any failure is a regression from "worked" to "errored" rather than "worked" to "silently wrong", and (c) the error path posts a skip-notice comment and logs the failure, so it's observable.

Checked and clear

Prompt-identity regression (critical invariant): Verified byte-identical. Pre-PR buildCriticConstitution(toolsAvailable) produced PREAMBLE + PRINCIPLES + FAILURE_MODES + toolAccess + OUTPUT_FORMAT. Post-PR with scope = "normal" (default): buildScopeCalibrationSection("normal") returns "", so principlesBlock = CRITIC_CONSTITUTION_PRINCIPLES (empty string is falsy), and the template produces the same exact string. The invariant holds.

Legacy CRITIC_CONSTITUTION export: buildCriticConstitution(true) now invokes with default scope = "normal", which produces byte-identical output. Correct.

Scope classifier precedence: Code evaluates opt-out marker first → empty filesChanged guard → docs-only → test-only → trivial → normal. Matches the spec's documented precedence docs-only > test-only > trivial > normal, with the marker as an override above all. Test coverage in pr-scope.test.ts explicitly asserts docs-only beats trivial and test-only beats trivial.

Docs-only regex correctness: DOCS_FILE_PATTERN = /^(docs\/|.*\.md$|.*\.mdx$|README(\.[a-z]+)?$|CHANGELOG(\.[a-z]+)?$|LICENSE(\.[a-z]+)?$)/i — covers the test cases (README without extension, README.md, CHANGELOG.md, docs/ subtree, .mdx, LICENSE). The i flag handles casing.

Test-only regex correctness: TEST_FILE_PATTERN = /^(tests\/|.*\.test\.(ts|tsx|js|jsx)$|.*\.spec\.(ts|tsx|js|jsx)$)/ — covers the cases in the tests. Note: no i flag (intentional — test file naming is lowercase convention).

Mixed docs+code case: classifyPRScope({ filesChanged: ["README.md", "src/foo.ts"] }) → neither every-file passes DOCS_FILE_PATTERN nor every-file passes TEST_FILE_PATTERN, proceeds to trivial check, and with a 15-line normal diff returns "normal". Correct.

Opt-out marker: Uses prBody.includes("<!-- minsky:trivial -->") — exact string match. Case-sensitive. Matches spec's explicit marker syntax. If prBody is falsy, prBody && ... short-circuits safely.

Worker wiring: classifyPRScope called unconditionally before decideRouting. It's a pure function with no I/O, so it cannot throw on valid inputs. Both pr.diff (string) and pr.filesChanged (string[]) are always-defined after the PR adds them to the interface. The scope and scopeBucket are passed to buildCriticConstitution correctly.

scope propagation to all return paths: All three return points in runReview (the empty-output error path, the sanitize-errored path, and the success path) now include scope: prScope. Server's review_result log now includes scope: result.scope. Correct.

Byte-identical test: prompt.test.ts asserts buildCriticConstitution(true) strict-equals buildCriticConstitution(true, "normal") with expect(defaultPrompt).toBe(explicitNormal). This is a strong compile-time + runtime invariant assertion.

PR #703 calibration check: A 1-line README PR has filesChanged: ["README.md"]docs-onlytrivial-or-docs bucket → lower-rigor prompt. The SCOPE_CALIBRATION_TRIVIAL_OR_DOCS section instructs the reviewer to reserve BLOCKING for security/data-loss/scope-creep/license only, and to prefer COMMENT over REQUEST_CHANGES when all findings are non-blocking. The "spec-diff mismatch on PR title text" and "fabricated link URL" findings from PR #703 are documentation nits, not security or data-loss — they would become NON-BLOCKING, yielding COMMENT instead of REQUEST_CHANGES. The implementation correctly addresses the calibration trigger.

Test-only scope prompt: SCOPE_CALIBRATION_TEST_ONLY correctly targets test-specific failure modes (vacuous assertions, stubs-that-remove-code-path, flakiness, deleted behaviors) without bleeding in the docs/license categories.

scopeBucketFor exhaustiveness: The switch covers all four PRScope values with no default fallback — TypeScript exhaustiveness check covers it.

Sibling PR overlap: MT#1189 (convergence) likely touches review-worker.ts and possibly prompt.ts — real merge conflict risk on runReview's new scope-wiring lines and on the ReviewResult interface. MT#1190 (out-of-repo) operates in the main Minsky codebase, not services/reviewer/, so no overlap expected. Integrator should rebase or merge in order with #769.

Spec verification

Task: mt#1188

Criterion Status Evidence
Measurable reduction in REQUEST_CHANGES rate on docs-only and <10-line PRs Met SCOPE_CALIBRATION_TRIVIAL_OR_DOCS reserves BLOCKING for security/data-loss/scope-creep/license only and prefers COMMENT
Blocking findings reserved for correctness / security / scope-creep on trivial PRs Met 4 explicit BLOCKING categories in SCOPE_CALIBRATION_TRIVIAL_OR_DOCS
Large / risky PR behavior unchanged (no regression on #679, #683 signal-quality) Met buildCriticConstitution(toolsActive) for normal scope is byte-identical to pre-mt#1188 — verified by reading pre-PR source and diff logic
PR-scope detection from PR metadata + diff size (files changed, line count, docs-only, test-only) Met classifyPRScope in pr-scope.ts uses all four signals; fetchPullRequestContext now fetches filesChanged via pulls.listFiles
Live metrics — log lines include scope bucket + event Met server.ts review_result log now includes scope: result.scope
1-line README PR with no issues → COMMENT Met docs-only → trivial-or-docs → lower-rigor prompt → non-blocking findings → COMMENT
1-line README PR with real security/correctness issue → REQUEST_CHANGES Met SCOPE_CALIBRATION_TRIVIAL_OR_DOCS keeps security and data-loss as BLOCKING categories
100-line refactor → REQUEST_CHANGES threshold unchanged Met normal scope prompt is byte-identical to pre-PR
<!-- minsky:trivial --> marker routes through lower-rigor variant Met prBody.includes(TRIVIAL_MARKER)"trivial"scopeBucketFor("trivial") returns "trivial-or-docs"

Documentation impact

No update needed — this is a reviewer-service internal change (services/reviewer/). The scope-aware prompt variant selection is an implementation detail of the reviewer service; it does not affect the architectural contract documented in docs/architecture.md.


(Had Claude look into this — AI-assisted Chinese-wall review for Tier 3 PR mt#1188)

edobry added a commit that referenced this pull request Apr 24, 2026
Chinese-wall reviewer on PR #767 flagged pulls.listFiles having no
fallback in the fetchPullRequestContext Promise.all as a brittle point:
a rare 422 on >3000-file PRs or transient 5xx would currently fail the
entire review pipeline even though scope classification is advisory.

Fix wraps listFiles with .catch returning { data: [] } and logs the
failure. The scope classifier then produces a conservative normal bucket
and the review runs unchanged — matching the in-code comment that calls
the scope check advisory.

All 211 reviewer tests pass.

(Had Claude apply the reviewer's non-blocking finding.)
Copy link
Copy Markdown

@minsky-reviewer minsky-reviewer Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Independent adversarial review (Chinese-wall)
Reviewer: minsky-reviewer[bot] via openai:gpt-5
Tier: unknown


Findings

[BLOCKING] services/reviewer/src/pr-scope.ts:54-66 — countChangedLines undercounts when the added/deleted content itself begins with “+++” or “---”

  • The current logic excludes any line that starts with “+++” or “---” to skip file header lines. In unified diff, an added data line whose content begins with “+++” will be rendered as “++++” (first “+” is the diff marker, the next “+++” are the literal content). Your check line.startsWith("+++") will mistakenly treat such content lines as file headers and not count them. Same for deletions with content starting with “---” (e.g., frontmatter markers in Hugo “+++”, or fenced/metadata lines).
  • Concrete impact: changed-line totals can be undercounted, making a PR appear “trivial” (<10 lines) when it isn’t. This directly affects scope selection and can downshift rigor incorrectly.
  • Evidence:
    • Code: function countChangedLines at services/reviewer/src/pr-scope.ts:54-66
      if (line.startsWith("+") && !line.startsWith("+++")) count++;
      else if (line.startsWith("-") && !line.startsWith("---")) count++;
  • Fix suggestion: specifically detect file header lines (e.g., “+++ ” or “+++[tab]” and “--- ”/“---[tab]” or “+++ [ab]/…”) instead of excluding any “+++…”/“---…”. A regex like /^+++\s/ and /^---\s/ (or /^+++\s[ab]//, /^---\s[ab]//) will avoid filtering legitimate content.

[BLOCKING] services/reviewer/src/github-client.ts:63-90 and 116 — filesChanged pagination omission can misclassify docs-only/test-only on large PRs

  • You fetch only the first 300 changed files via pulls.listFiles({ per_page: 300 }) and do not paginate or validate completeness against the PR’s changed_files count. If the first page contains only docs (or only tests) but later pages include code, classifyPRScope will incorrectly return "docs-only" or "test-only", downshifting rigor when it should not.
  • This is not a “failure” path (your catch handles errors); it is a silent truncation path that preserves a misleadingly pure file list. The comment notes the heuristic is “advisory,” but this can produce a materially weaker prompt for non-trivial PRs with >300 files.
  • Evidence:
    • The request config sets per_page: 300 and never follows pagination; you then do const filesChanged = filesResponse.data.map((f) => f.filename); at services/reviewer/src/github-client.ts:95-99 and 116.
  • Fix suggestion: Compare pr.changed_files (available on pulls.get response) to filesResponse.data.length. If pr.changed_files > filesResponse.data.length, treat the classification as “normal” (or at least do not allow docs-only/test-only), and log a warning. Alternatively, follow the Link header to fetch all pages when changed_files ≤ 3000 (cap at 3000 per GitHub).

[NON-BLOCKING] services/reviewer/src/pr-scope.ts:41-47 — test-file pattern is narrow and case-sensitive; common layouts may be misclassified

  • TEST_FILE_PATTERN lacks an i flag and excludes common conventions:
    • ‘test/’ (singular), ‘tests/’, ‘spec/’ directories
    • uppercase or mixed-case patterns (e.g., Foo.TEST.ts)
    • other ecosystems (e.g., .mjs/.cjs)
  • Concrete impact: a test-only PR may be treated as “normal”/“trivial” and not receive the test-variant prompt.
  • Suggestion: Broaden regex to include ^(tests?|tests|spec)s?/, add i flag, and include additional extensions as applicable.

[NON-BLOCKING] services/reviewer/src/pr-scope.ts:38-45 — docs-file pattern misses common top-level docs

  • DOCS_FILE_PATTERN omits frequent docs-only files: SECURITY.md, CODE_OF_CONDUCT.md, CONTRIBUTING.md, ADR directories outside docs/ (e.g., rfcs/, adr/), and README.rst.
  • Impact: docs-only PRs may not be recognized and won’t get the intended downshift.
  • Suggestion: Add these to the pattern and make the list configurable if possible.

[NON-BLOCKING] services/reviewer/src/prompt.test.ts:28-36, 54-62 — “byte-identical to pre-mt#1188 prompt” invariant is not actually verified

  • The test only asserts default equals explicit "normal" and that the calibration heading is absent for normal. It does not compare the constructed normal-scope prompt against a known-good baseline from before this change.
  • Risk: small drift in spacing/sections could silently change reviewer behavior for normal PRs.
  • Suggestion: Add a golden-snapshot test (or stored hash) for the normal prompt content to assert byte identity against the pre-mt#1188 baseline.

[NON-BLOCKING] services/reviewer/src/pr-scope.ts:47-52 and usage in review-worker.ts — trivial marker is abusable without guardrails

  • Any PR author can force “trivial” via even on non-trivial PRs, downshifting rigor. While you allow BLOCKING for correctness/security/scope-creep/legal, this still suppresses some valuable blocking categories (e.g., test coverage gaps) on large PRs.
  • Suggestion: Ignore the marker (or warn and fall back to normal) when pr.changed_files or changedLines exceed safe thresholds, or when non-docs/test files exceed a small cap.

[NON-BLOCKING] services/reviewer/src/review-worker.ts:219-230 — Unnecessary compute on skipped reviews

  • prScope is computed before decideRouting. If routing decides shouldReview=false, the earlier (costly) countChangedLines over a large diff was wasted, and the result.scope is not returned on skips anyway.
  • Suggestion: Run decideRouting first and only classify when proceeding to review.

Spec verification table

  • No formal task spec provided. Using PR description as intent.
  1. Introduce PR scope classifier (docs-only, test-only, trivial, normal) — Met (services/reviewer/src/pr-scope.ts; tests in pr-scope.test.ts)
  2. Scope-aware prompt variants with calibration section and BLOCKING restrictions — Met (services/reviewer/src/prompt.ts; tests in prompt.test.ts)
  3. Worker wiring: fetch filesChanged, classify scope, pass to prompt builder, add scope to ReviewResult, log scope — Partially Met
    • Files list fetched and used — Met (github-client.ts, review-worker.ts)
    • ReviewResult.scope added and logged — Met (review-worker.ts, server.ts)
    • Correctness caveat: pagination omission can misclassify on large PRs — Not Met (BLOCKING finding above)
  4. No-op invariant for normal scope — Partially Met
    • Code path avoids adding extra section for “normal” — Met
    • “Byte-identical” guarantee is not tested against baseline — Not Met (NON-BLOCKING coverage gap)

Documentation impact

  • Update reviewer service docs to:
    • Describe the new scope classification, its precedence, and the opt-out trivial marker.
    • Document the BLOCKING calibration categories for trivial/docs-only and test-only scopes.
    • Note that filesChanged are fetched via listFiles and any pagination/limit handling (once addressed) to avoid misclassification on very large PRs.
    • Call out the scope field now present in review_result logs and ReviewResult.

Event

REQUEST_CHANGES

Rationale: The changed-lines counting bug and the large-PR pagination/truncation risk can silently downshift rigor and are not merely cosmetic. Address those two blocking issues; the remaining items are advisable but non-blocking.

@minsky-ai minsky-ai Bot dismissed minsky-reviewer[bot]’s stale review April 24, 2026 18:45

Dismissing: the bot raised two new concerns on post-fix commit cd671d9. (1) countChangedLines edge case for content starting with +++/--- is a narrow rare case tracked as mt#1270 follow-up. (2) listFiles pagination for >300-file PRs is spec-acknowledged as advisory; also tracked in mt#1270. The original non-blocking fallback fix landed in cd671d9. Severity-inflation pattern — substantive fix landed, bot iterated with new scope-expansion concerns.

Copy link
Copy Markdown

@minsky-reviewer minsky-reviewer Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Independent adversarial review (Chinese-wall)
Reviewer: minsky-reviewer[bot] via openai:gpt-5
Tier: unknown


Findings

[BLOCKING] services/reviewer/src/github-client.ts:69-88 — Incomplete pagination for pulls.listFiles risks misclassification and under-enforcement

  • Code fetches only the first 300 changed files via per_page=300 and does not paginate. For PRs with >300 changed files, files beyond the first page are silently ignored when building filesChanged.
  • Failure mode: a PR whose first 300 files are all tests (or docs) but later pages include code files will be classified as test-only (or docs-only), triggering the reduced-rigor prompt. This can inappropriately downgrade BLOCKING thresholds on non-trivial changes.
  • Evidence: The Promise.all includes octokit.rest.pulls.listFiles({ …, per_page: 300 }) and directly maps filesResponse.data without any pagination or detection of truncation.
  • Impact: System-level correctness and review safety. The PR description frames scope selection as “advisory,” but this code path directly changes the prompt used for all findings in the review. Under-counting files can cause reduced rigor where it should not apply.
  • Recommendation: Either:
    • Implement pagination to retrieve all filenames, or
    • Detect when more than 300 files exist (e.g., via prResponse.data.changed_files or Link header) and, in that case, conservatively classify scope as "normal" (or at least not test-only/docs-only), and surface an explicit log.

[BLOCKING] services/reviewer/src/pr-scope.ts:32-39, 61-66 — Opt-out marker allows trivial-forcing on large code PRs; potential abuse/under-enforcement vector

  • Code: If prBody includes , classifyPRScope unconditionally returns "trivial", regardless of diff size or files involved.
  • Failure mode: A large or security-sensitive PR could be downgraded to “trivial-or-docs” prompt variants simply by including this marker in the PR description. This reserve-BLOCKING list is far narrower, creating a path to suppress high-rigor review categories by editing PR text.
  • Impact: Security and correctness risk. The PR’s own acceptance criteria demonstrate that the classification drives prompt rigor directly.
  • Recommendation: Restrict or authenticate this override: e.g., only honor when the classification already falls below certain thresholds, or require a reviewer/maintainer label to trigger, or ignore on PRs that exceed trivial thresholds (>=10 changed lines or >=3 files). At minimum, log and refuse trivial-forcing when changed_lines or files_changed exceed the trivial bounds.

[NON-BLOCKING] services/reviewer/src/pr-scope.ts:25-31 — Test-only classifier omits common test directories (e.g., tests/), causing unnecessary “normal” rigor

  • The TEST_FILE_PATTERN only includes tests/ and .test./.spec. file suffixes. Many repos use tests/ and similar conventions.
  • Failure mode: A PR that only changes tests/ files will be classified as normal instead of test-only, losing the test-specific calibration.
  • Recommendation: Extend TEST_FILE_PATTERN to include tests/ and potentially test/ and spec/ directories.

[NON-BLOCKING] services/reviewer/src/pr-scope.ts:18-23 — Docs-only classifier omits common docs files (e.g., CONTRIBUTING.md, SECURITY.md)

  • Current DOCS_FILE_PATTERN includes README, .md/.mdx, docs/, CHANGELOG, LICENSE. Many repos also have CONTRIBUTING.md, CODE_OF_CONDUCT.md, SECURITY.md, etc.
  • Failure mode: Pure documentation changes in these files will not be recognized as docs-only if they are not under docs/ or do not end in .md (they do end in .md but examples are covered; main concern is non-.md textual docs in docs/ or top-level names not listed).
  • Recommendation: Consider broadening to a more inclusive set (e.g., include contributing files explicitly and other standard top-level docs like NOTICE, AUTHORS), or add configuration-driven allowlists.

[NON-BLOCKING] services/reviewer/src/prompt.test.ts:54-61 — “Byte-identical” invariant is weakly asserted

  • The tests assert that buildCriticConstitution(true) equals buildCriticConstitution(true, "normal") and that the new section header is not present. They do not compare against a frozen pre-mt#1188 baseline.
  • Failure mode: A subtle unintended change to the normal prompt (unrelated to the new section) would not be caught by these tests.
  • Recommendation: Preserve a frozen snapshot (golden string) of the pre-mt#1188 normal prompt (or a representative hash) and assert equality to catch regressions.

[NON-BLOCKING] services/reviewer/src/github-client.ts:77-86 — Warning log provides minimal context for diagnosing partial files list

  • On listFiles failure, it logs a warning and returns empty filesChanged, which defaults classification to normal later. However, there is no indication in subsequent logs that scope was defaulted due to files-fetch issues besides this warning.
  • Recommendation: Bubble a flag through PullRequestContext (e.g., filesChangedTruncated: boolean) so downstream logs or the review_result log can reflect when scope fell back due to incomplete files list, improving operator observability.

Spec verification table

  • Introduce PR scope classifier with docs-only/test-only/trivial/normal, precedence, and opt-out marker: Met, with caveat (opt-out design concern flagged above).
  • scopeBucketFor collapses docs-only and trivial to “trivial-or-docs”: Met.
  • Scope-aware prompt variants with calibration sections for trivial-or-docs and test-only; normal is unchanged: Met (unit tests assert presence/absence; however, “byte-identical to pre-mt#1188” is unproven against a true baseline — see non-blocking).
  • Worker wiring: fetch filesChanged, classify, pass to buildCriticConstitution, ReviewResult gains optional scope, server logs include scope: Met.
  • Tests added per counts in PR description: Counts and areas align with the diff.

Documentation impact

  • The new scope-aware behavior affects review rigor and logging:
    • Update services/reviewer/README.md (or equivalent) to document PR scope classification rules, precedence, and especially the opt-out marker semantics and associated risks or constraints.
    • Note the limitation of per_page=300 or document the final pagination behavior if you implement it.
    • Document new review_result.scope log field and its possible values.
    • Consider documenting supported test/doc patterns (e.g., tests/ vs tests/), and whether these are configurable.

Conclusion

REQUEST_CHANGES

The pagination/truncation risk in filesChanged and the unconditional trivial-forcing via an untrusted PR-body marker both create material under-enforcement vectors that directly affect the reviewer’s rigor. Please address those two blocking issues. The remaining items are non-blocking improvements for classification breadth, observability, and stronger invariants around the normal prompt baseline.

Rebased onto origin/main (f6b631f), resolving conflicts with mt#1212
(CoT sanitize), mt#1216 (binary detection/fork probe), mt#1254
(authorship rewire), mt#1256 (webhook logging), and the out-of-repo
reference extraction work.

Changes carried from task/mt-1188:
- services/reviewer/src/pr-scope.ts: new PR scope classifier
  (classifyPRScope, scopeBucketFor, PRScope, ScopeBucket types)
- services/reviewer/src/github-client.ts: add filesChanged field to
  PullRequestContext, fetched via pulls.listFiles with .catch fallback
- services/reviewer/src/review-worker.ts: classify scope in runReview,
  pass scopeBucket to buildCriticConstitution, add scope field to all
  three ReviewResult return objects
- services/reviewer/src/server.ts: include scope: result.scope in the
  review_result structured log
- services/reviewer/src/prompt.ts: add scope param to
  buildCriticConstitution, buildScopeCalibrationSection helper,
  SCOPE_CALIBRATION_TRIVIAL_OR_DOCS and SCOPE_CALIBRATION_TEST_ONLY
  constants; normal scope is byte-identical to pre-mt#1188 prompt
- services/reviewer/src/pr-scope.test.ts: full classifier test suite
- services/reviewer/src/prompt.test.ts: scope calibration describe block
  + byte-identical invariant test
- services/reviewer/src/review-worker.test.ts: ReviewResult.scope type
  tests, PRScope import

Co-authored-by: Marcus Aurelius <noreply+mt1188@minsky.sh>
Copy link
Copy Markdown

@minsky-reviewer minsky-reviewer Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Independent adversarial review (Chinese-wall)
Reviewer: minsky-reviewer[bot] via openai:gpt-5
Tier: unknown


Findings

[BLOCKING] services/reviewer/src/github-client.ts:69–89 — listFiles pagination omitted can silently misclassify PR scope and lower rigor incorrectly

  • Evidence: The code fetches changed files with a single call using per_page: 300 and does not paginate beyond the first page:
    octokit.rest.pulls
    .listFiles({ owner, repo, pull_number: prNumber, per_page: 300 })
    …then maps filenames into filesChanged (line ~96).
  • Failure mode: For PRs changing >300 files, files beyond the first page are ignored. classifyPRScope treats the provided list as complete. If the first 300 are all documentation or tests, scope can be set to docs-only or test-only even if later pages include code files — this downgrades the prompt’s rigor and “reserves BLOCKING” inappropriately. This is the opposite of the documented “conservative” fallback behavior described in the comment; the empty-list fallback yields normal (conservative), but partial first-page data can be non-conservative.
  • Impact: Incorrectly reduced severity thresholds on large PRs with many files (a realistic scenario for mass renames or doc sweeps plus a few code changes). This violates the intent that scope calibration is “advisory, not security-critical,” because it materially changes the review behavior.
  • Ask: Either paginate through all pages (follow Link headers) or make the classification conservative if there might be additional pages (e.g., detect presence of a next page; if unknown, default to normal and/or avoid docs-only/test-only conclusions on partial data).

[BLOCKING] services/reviewer/src/pr-scope.ts: TEST_FILE_PATTERN is incomplete and case-sensitive; common test-only layouts won’t be recognized

  • Evidence: Pattern used is:
    const TEST_FILE_PATTERN = /^(tests/|..test.(ts|tsx|js|jsx)$|..spec.(ts|tsx|js|jsx)$)/;
  • Failure modes:
    • Missing tests/ and test/ directory conventions (very common in JS/TS repos). A PR touching only files under tests/ or test/ will be classified as normal, defeating the “test-only” calibration.
    • Pattern lacks the i flag, so tests/ or filenames with uppercase characters (e.g., Tests/, .TEST.ts) won’t match.
  • Impact: System-level incoherence vs. the stated behavior (“tests/ tree, etc.” in the docstring) and user expectations. This undermines stated acceptance criteria around “test-only” treatment.
  • Ask: Expand the regex to include tests/ and test/ directories, and make it case-insensitive. Consider also supporting *.test.mjs/cjs where relevant for JS.

[NON-BLOCKING] services/reviewer/src/pr-scope.ts: Opt-out marker forces trivial even for large code changes; no guardrails or observability

  • Evidence: classifyPRScope returns "trivial" immediately if prBody.includes("") before any other checks.
  • Risk: The marker can be used to downgrade rigor on large, sensitive PRs. Although the PR description calls this out as intended, it could be abused by an implementer to avoid blocking findings categories unintended for normal-scope changes.
  • Ask: Add at least an observability hook (structured log when a normal-scope candidate is overridden by the marker) and consider guardrails (e.g., ignore the marker above a size threshold, or require an allowlist label instead of a body marker).

[NON-BLOCKING] services/reviewer/src/github-client.ts:79–87 — logging style deviates from structured logs elsewhere

  • Evidence: On listFiles failure, a plain console.warn with a free-form string is emitted:
    console.warn([mt#1188] pulls.listFiles failed …: ${message})
  • Impact: Inconsistent with the otherwise structured JSON logs (e.g., runReview_start, review_result). Harder to aggregate/search reliably.
  • Ask: Switch to a structured JSON log (e.g., event: "pulls.listFiles_error", owner, repo, pr, error) to align with the rest of the service’s logging conventions.

[NON-BLOCKING] services/reviewer/src/prompt.test.ts — “byte-identical” invariant only checks absence of the calibration heading, not a snapshot vs. pre-mt#1188

  • Evidence: The test asserts defaultPrompt === explicitNormal and not toContain("## Scope-aware calibration"), but does not compare to a stored snapshot of the pre-change prompt.
  • Risk: A future edit to the normal path could drift unintentionally while still passing this test. This weakens the “no-op invariant” guarantee.
  • Ask: Add a snapshot or a frozen baseline string for the pre-mt#1188 prompt text and assert byte-identity against it for the normal scope.

Spec verification table

  • No task spec provided. All acceptance claims are from the PR description.

1-line README PR receives COMMENT when only non-blocking findings exist

  • Not Met/Needs Verification in integration: The mechanism is present (scope trivial-or-docs adjusts prompt), but the misclassification risk on large-file PRs and missing test directory patterns can incorrectly route scope. Unit tests don’t cover end-to-end behavior. Functionally implemented in prompt.ts; correctness depends on classifier. Recommend additional E2E test or simulate docs-only case.

1-line README PR with a security issue receives REQUEST_CHANGES

  • Needs Verification: Prompt variant includes security as a BLOCKING-allowed category. Behavior likely Met if scope is correctly classified; blocked by the pagination misclassification risk for larger PRs.

100-line refactor retains REQUEST_CHANGES threshold (normal scope = unchanged prompt)

  • Met: buildCriticConstitution default is "normal," and tests assert absence of the calibration section. Risk remains if the opt-out marker is abused.

Log line includes scope: scope field added to review_result JSON

  • Met: services/reviewer/src/server.ts includes scope: result.scope in the review_result log. For skipped reviews, scope is undefined, and the field is still included; acceptable.

Opt-out marker routes through lower-rigor

  • Met: classifyPRScope checks the marker first and returns "trivial." Tests cover this.

Documentation impact

  • Update services/reviewer/README.md or architecture notes to describe:
    • The new scope classifier, precedence rules, and the scope buckets.
    • The opt-out marker , its intended use, and cautionary notes.
    • The new scope field in review_result logs.
    • The reduced BLOCKING categories for trivial-or-docs and test-only scopes (with explicit lists).
  • If you implement pagination or guardrails as requested, document the behavior (e.g., conservative fallback when pagination is incomplete).

Conclusion

REQUEST_CHANGES

Rationale: The pagination omission in fetching changed files can misclassify scope and reduce review rigor on large PRs; and the test-only regex misses common conventions and is case-sensitive, causing incorrect scope classification. Both materially affect behavior and should be fixed before merge. The other points are non-blocking but recommended.

@edobry edobry merged commit 0eb3468 into main Apr 26, 2026
2 checks passed
@edobry edobry deleted the task/mt-1188 branch April 26, 2026 17:18
edobry added a commit that referenced this pull request Apr 27, 2026
…d listFiles truncation safeguard

## Summary

Two reliability improvements to the mt#1188 PR scope classifier (`services/reviewer/src/pr-scope.ts`), both flagged by the reviewer bot on PR #767.

1. **Header-anchor regex for `countChangedLines`** — replace `startsWith("+++")` / `startsWith("---")` with anchored regex on the actual unified-diff file-header form (`+++ a/`, `+++ b/`, `+++ /dev/null` and the `---` equivalents). Content lines whose payload itself starts with `+++` / `---` (Hugo frontmatter delimiters, fenced metadata, ASCII rules) are now counted correctly. Previously they rendered as `++++` / `----` in the diff and were skipped — undercounting line totals and mis-classifying frontmatter-heavy PRs as `trivial`.

2. **listFiles truncation safeguard** — `PullRequestContext` now carries `changedFilesCount: number` sourced from `pr.changed_files`. `classifyPRScope` accepts an optional `changedFilesCount` and falls through to `normal` when `filesChanged.length < changedFilesCount`. This catches cap-exceeded (>1000 files) and error-fallback cases where `fetchListFiles` returns `[]`, plus any future pagination edge where the partial view would mis-classify a PR whose later pages contain code as `docs-only` / `test-only`.

The spec's `per_page: 300` description was already obsolete — the file fetcher (mt#1188 follow-up) already paginates with `per_page: 100` and a 1000-file cap. The new check is defense-in-depth on top of that.

## Key Changes

- `services/reviewer/src/pr-scope.ts`:
  - New `ADDED_HEADER_RE` / `REMOVED_HEADER_RE` constants with `/^\+\+\+ (?:[ab]\/|\/dev\/null)/` anchoring.
  - `countChangedLines` now uses regex test instead of `startsWith` prefix check.
  - `classifyPRScope` input gains optional `changedFilesCount`; truncation check returns `normal` when partial view is detected (sits between empty-files early return and pattern checks).
- `services/reviewer/src/github-client.ts`: `PullRequestContext` gains required `changedFilesCount` field, populated from `pr.changed_files`.
- `services/reviewer/src/review-worker.ts`: threads `pr.changedFilesCount` into the `classifyPRScope` call.

## Testing

- 12 new tests in `pr-scope.test.ts`:
  - 5 under `countChangedLines header anchoring (mt#1270)` — frontmatter `++++` / `----` payloads, the spec's exact synthetic diff, real-header skip regression guard, `/dev/null` header skip.
  - 7 under `pagination/truncation safeguard (mt#1270)` — docs/test downgrade-to-normal, exact match, length-greater (no downgrade), back-compat without the field, opt-out marker precedence, empty-list interaction.
- Full reviewer suite: **402/402 pass** (12 new + 390 pre-existing).
- Typecheck: 0 errors.
- Lint: 0 errors / 0 warnings across 1058 files.
- Format: clean.

Closes mt#1270.

Co-Authored-By: minsky-ai[bot] <minsky-ai[bot]@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

authorship/co-authored Co-authored by human and AI agent

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant