[MISC] Add tox-driven test rig foundation (unit + integration + e2e)#1971
[MISC] Add tox-driven test rig foundation (unit + integration + e2e)#1971hari-kuriakose wants to merge 15 commits into
Conversation
Introduces a single tox-driven entry point that brings every existing test suite under one rig, adds an e2e tier on top of the existing docker-compose stack, and surfaces critical-path gaps + regressions in CI reports. - tests/groups.yaml declares 15 groups across unit/integration/e2e with a depends_on graph; tests/critical_paths.yaml registers 10 critical flows. - tests/rig is the dispatcher: python -m tests.rig run [groups|all|--from-file] [--tier] [--changed-only] [--coverage/--no-coverage] [--parallel]. - E2E supports docker compose (CI default) or testcontainers + local subprocesses (dev default), selectable via UNSTRACT_E2E_RUNTIME. - Reports: per-group junit.xml + md, plus reports/summary.md with regressions and critical-path gaps; htmlcov + coverage.xml when --coverage. - New ci-test-e2e.yaml lane runs on main, run-e2e-labeled PRs, and nightly; ci-test.yaml rewritten to call the rig and update the main baseline cache. - Pre-commit hook fires only when developers create .test-selection. - Existing per-service tests stay co-located; legacy tox envs (runner, sdk1, prompt-service) preserved as thin aliases. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
WalkthroughAdds a Python "tests.rig" test rig: group and critical-path manifests and loaders, selection resolution, E2E runtime drivers, a CLI orchestrator to run groups, reporting and coverage combining, CI/workflow integration, pre-commit hook, helper combine script, extensive docs, and self-tests. ChangesTest Rig Implementation
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes 🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 19
🧹 Nitpick comments (1)
tests/rig/groups.py (1)
91-94: ⚡ Quick winValidate manifest schema version during load.
The manifest is versioned (
version: 1) butload_groups()does not enforce it. Failing fast on unknown versions avoids silent misparsing when schema evolves.Suggested fix
def load_groups(path: Path | None = None) -> GroupManifest: @@ raw = yaml.safe_load(manifest_path.read_text()) if not isinstance(raw, dict) or "groups" not in raw: raise ValueError(f"{manifest_path}: expected top-level `groups:` mapping") + if raw.get("version") != 1: + raise ValueError( + f"{manifest_path}: unsupported manifest version {raw.get('version')!r}; expected 1" + )🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/rig/groups.py` around lines 91 - 94, load_groups() currently accepts any manifest without checking its version; add an explicit validation after the top-level raw check to read raw.get("version") and if it's missing or not equal to 1 raise a ValueError referencing manifest_path (e.g. "{manifest_path}: unsupported or missing manifest version, expected version: 1"). Make the check inside the same function (load_groups) right before reading defaults/raw["groups"] so unknown schema versions fail fast.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In @.github/workflows/ci-test-e2e.yaml:
- Around line 39-45: The cache step "Restore main-branch test baseline" uses a
fixed key ("unstract-test-baseline-main-e2e") and a wrong restore-key
("unstract-test-baseline-main"); change the key to include a dynamic identifier
(e.g., github.ref or github.sha) so updates create new cache entries (for
example interpolate "${{ github.ref }}-unstract-test-baseline-main-e2e" or "${{
github.sha }}-unstract-test-baseline-main-e2e"), and make the restore-keys use
the correct e2e namespace (e.g., "unstract-test-baseline-main-e2e" as the
prefix) to avoid cross-lane contamination; apply the same change to the other
equivalent step referenced (lines 51-53).
- Around line 57-63: The "Capture docker compose logs on failure" step redirects
output to reports/docker-compose-logs.txt but doesn't ensure the reports/
directory exists; update the failing-step's run sequence to create the reports
directory (e.g., run mkdir -p reports) before executing the docker compose ... >
reports/docker-compose-logs.txt command so log redirection never fails and logs
are preserved.
In @.github/workflows/ci-test.yaml:
- Around line 44-50: The cache key "unstract-test-baseline-main" in the "Restore
main-branch test baseline (for regression detection)" step is constant and
prevents new baseline uploads; change the key to include a run-unique or
branch-specific variable (for example use unstract-test-baseline-main-${{
github.run_id }} or unstract-test-baseline-${{ github.ref_name }}-${{
github.run_id }}) so updates (--update-baseline) will write a new cache, and
keep the existing restore-keys prefix "unstract-test-baseline-" so prior
baselines can still be discovered; apply the same change to the equivalent step
referenced around lines 60-62.
In `@tests/e2e/smoke/test_smoke.py`:
- Around line 17-18: The current smoke test uses a lax assertion
(response.status_code < 500) which allows non-success codes like 404; update the
assertion after the requests.get call that fetches
f"{platform.backend_url.rstrip('/')}/health/" to require a successful response
(e.g., assert response.status_code == 200 or assert response.ok) so the health
endpoint must return an explicit success status to pass the test.
In `@tests/README.md`:
- Around line 13-37: The three fenced code blocks showing the test tree and
CLI/help snippets are missing language identifiers; update the three opening
backtick fences that precede the blocks starting with "tests/" (the directory
tree), "positional GROUPS ∪ --from-file ..." (the CLI/usage snippet), and
"reports/" (the reports directory tree) to use a language tag (use "text") so
the fences read ```text instead of just ``` to satisfy markdownlint MD040.
- Around line 174-180: Update the opening sentence that currently reads
"reports/summary.md has three sections:" to match the enumerated list by
changing "three" to "four" (or alternatively remove the numbered count), so the
header and the listed items (Per-group results, ❌ Regressions, ⚠️ Critical paths
not yet covered, ✅ Covered critical paths) are consistent; locate the phrase
"reports/summary.md has three sections:" in the README and adjust it
accordingly.
In `@tests/rig/cli.py`:
- Around line 265-269: When the JUnit parsing returns None the current logic
(using result, group_results, and overall_exit) silently ignores the failure;
update the block handling result so that if result is None you still treat the
group as failed: append a placeholder/failure entry to group_results (or
otherwise record the failed parse) and set overall_exit = overall_exit or 1 (or
another non-zero exit) so the failure is propagated; keep existing behavior for
valid result objects (preserve the check against result.exit_code).
- Around line 207-297: cmd_run is too large and should be split into three clear
phases to reduce cognitive complexity: extract a plan_phase that performs
manifest/registry validation (using load_groups, cp.load_critical_paths,
cp.validate_registry_against_manifest), resolves groups (resolve,
_is_missing_placeholder) and prepares runnable/skipped lists and runtime
selection (pick_runtime) but does not start the runtime; extract an
execute_phase that receives runnable, manifest, endpoints, reports_dir and runs
each group via _execute_group collecting GroupResult and overall_exit
(preserving the dry_run behavior and parallel/timeout/workers args); extract a
finalize_phase that handles runtime teardown (runtime.down) in a single place,
coverage combine_and_report, baseline loading via cp.load_baseline, evaluation
via cp.evaluate, write_summary, reporting regressions/gaps and optionally
cp.emit_baseline, and returns overall_exit; ensure try/finally still guarantees
runtime.down only if runtime was started and not dry_run, thread through needed
values (runtime, endpoints, group_results, reports_dir, args) and keep existing
exit-code logic and logging messages unchanged.
- Line 60: Create a module-level constant (e.g., PREVIOUS_SUMMARY_FILENAME =
"previous-summary.json") in tests/rig/cli.py and replace the repeated literal
occurrences (referenced in the add_argument call where
pr.add_argument("--baseline", type=Path, default=REPO_ROOT / "reports" /
"previous-summary.json") and the other two occurrences at the same file) with
Path(REPO_ROOT / "reports" / PREVIOUS_SUMMARY_FILENAME) or Path(REPO_ROOT,
"reports", PREVIOUS_SUMMARY_FILENAME) so all uses reference the single constant
and avoid duplication/drift.
- Around line 252-253: The dry-run path currently does a per-plan "continue"
(args.dry_run) but still falls through to the post-plan evaluation that computes
gaps/regressions (the block that inspects the runs/gaps and returns non-zero),
causing false failures; change the control flow so that after planning is
complete (i.e., after the code that builds the runs/plan set) if args.dry_run is
true the function immediately short-circuits to success (return 0 or
sys.exit(0)) and does not execute the gaps/regressions evaluation block (the
code that inspects runs and computes regressions), and remove/replace the
per-iteration continue so dry-run does not leave an empty runs set for later
checks.
In `@tests/rig/coverage.py`:
- Around line 30-36: The subprocess calls that run "coverage combine/xml/html"
should be made path-safe and fail-fast: build the combine arguments using
absolute paths from reports_dir (e.g., list(map(str, [p.resolve() for p in
reports_dir.glob(".coverage.*")]))) and only call subprocess.run for combine
when that list is non-empty; remove cwd=reports_dir and instead pass absolute
output paths like str((reports_dir / "coverage.xml").resolve()) and
str((reports_dir / "htmlcov").resolve()); also set check=True on all
subprocess.run calls so failures raise instead of being silently ignored. Ensure
these changes are applied to the subprocess.run invocations shown (the calls to
subprocess.run for "coverage combine", "coverage xml", and "coverage html").
In `@tests/rig/critical_paths.py`:
- Around line 91-93: The baseline handling for regression evaluation is fragile:
update the previously_covered computation to defensively validate baseline and
its "covered_paths" before using it — check that baseline is a dict, read
covered = baseline.get("covered_paths", []) only if isinstance(baseline, dict),
ensure covered is an iterable/list (otherwise treat as empty), then build
previously_covered = set(str(p) for p in covered if isinstance(p, (str, int)))
so non-list shapes or non-string entries won't raise or silently miscompute
regressions; refer to the previously_covered variable and the
baseline["covered_paths"] access in tests/rig/critical_paths.py when making the
change.
- Around line 50-55: The code currently converts p.get("covered_by") directly to
tuple when building the CriticalPath, which turns a scalar string into a tuple
of characters; before converting, check the type of p.get("covered_by") (call it
covered = p.get("covered_by")) and if covered is not None and not an iterable
sequence like list/tuple, raise a clear ValueError (or TypeError) stating that
"covered_by must be a sequence", otherwise set covered_by=tuple(covered) (or
tuple() when covered is falsy); update the CriticalPath construction site to use
this validated/converted value so downstream validation receives a proper
sequence.
In `@tests/rig/groups.py`:
- Around line 159-165: The validation currently only checks existence for wd
(from g.absolute_workdir()) and each p in g.absolute_paths(), but doesn't ensure
they are inside the repository root; update the checks to resolve symlinks and
enforce repository boundaries: resolve wd and each p via
Path.resolve(strict=False) (to handle non-existing candidates if needed),
resolve the repository root once (repo_root.resolve()), then verify that
resolved_wd and each resolved_p are children of repo_root (using
Path.is_relative_to or comparing parents in a loop); if any resolved path is
outside repo_root, raise the same ValueError(f"group {name!r}: test path does
not exist: {p}") or a clearer message; keep the existing existence check
(p.exists()) but perform the containment check after resolving symlinks.
In `@tests/rig/reporting.py`:
- Around line 48-50: The current check returns None when junit_path doesn't
exist, which hides groups that have exit.txt; change the branch so that if
junit_path.exists() is False but exit_path.exists() is True you return a
zero-count GroupResult (preserving the group's name/status) instead of None.
Locate the code that references junit_path and exit_path and instantiate/return
a GroupResult (using the same constructor/signature used elsewhere in this file)
with an empty test list or zero counts so the group remains visible in
summaries.
- Line 18: Replace the stdlib XML parser import "import xml.etree.ElementTree as
ET" with the hardened parser "from defusedxml import ElementTree as ET" and
ensure all subsequent uses of ET (e.g., ET.parse, ET.fromstring) in the module
use this defusedxml.ElementTree API so JUnit artifacts are parsed with
XXE/XML-bomb protections; install defusedxml as a dependency if not present.
In `@tests/rig/runtime.py`:
- Around line 208-216: The helper _responds currently returns True when the
requests package is missing, which falsely marks services healthy; update the
ImportError branch in the _responds function so that if importing requests fails
it returns False (or raises a clear error) instead of True, keeping the existing
behavior for requests.get and requests.RequestException handling intact;
reference the _responds function to locate and change the ImportError handling.
- Around line 214-216: The health check currently treats any response with
resp.status_code < 500 as healthy, which wrongly accepts 4xx errors; update the
check so only successful responses are considered healthy (e.g., use resp.ok or
test 200 <= resp.status_code < 400) after the requests.get call and before
returning; keep the existing exception handling for requests.RequestException
unchanged but return False on exceptions so failures are not reported as
healthy.
In `@tests/rig/selection.py`:
- Around line 68-69: Replace the silent swallow in the except block that
currently returns an empty set when git diff fails (the except
(subprocess.CalledProcessError, FileNotFoundError): return set() in
tests/rig/selection.py) with raising a RuntimeError that includes the command
context and original exception message so callers can distinguish "no changes"
from a git error; ensure the raised error mentions the --changed-only scenario
and the underlying exception (e.g., include subprocess.CalledProcessError or
FileNotFoundError info) so callers like cmd_expand and cmd_run in cli.py will
surface a clear terminal error.
---
Nitpick comments:
In `@tests/rig/groups.py`:
- Around line 91-94: load_groups() currently accepts any manifest without
checking its version; add an explicit validation after the top-level raw check
to read raw.get("version") and if it's missing or not equal to 1 raise a
ValueError referencing manifest_path (e.g. "{manifest_path}: unsupported or
missing manifest version, expected version: 1"). Make the check inside the same
function (load_groups) right before reading defaults/raw["groups"] so unknown
schema versions fail fast.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 562d1a0f-86eb-454d-a64d-6eb6c1276fef
⛔ Files ignored due to path filters (1)
uv.lockis excluded by!**/*.lock
📒 Files selected for processing (28)
.github/scripts/combine-test-reports.sh.github/workflows/ci-test-e2e.yaml.github/workflows/ci-test.yaml.gitignore.pre-commit-config.yamlpyproject.tomltests/README.mdtests/__init__.pytests/compose/docker-compose.test.yamltests/conftest.pytests/critical_paths.yamltests/e2e/__init__.pytests/e2e/conftest.pytests/e2e/smoke/__init__.pytests/e2e/smoke/test_smoke.pytests/fixtures/.gitkeeptests/groups.yamltests/integration/__init__.pytests/rig/__init__.pytests/rig/__main__.pytests/rig/cli.pytests/rig/coverage.pytests/rig/critical_paths.pytests/rig/groups.pytests/rig/reporting.pytests/rig/runtime.pytests/rig/selection.pytox.ini
|
| Filename | Overview |
|---|---|
| tests/rig/cli.py | Core dispatch layer; _write_synthetic_junit reports phantom passed=1 for empty hurl groups (exit 5). |
| tests/rig/runtime.py | ComposeRuntime and TestcontainersRuntime; x2text readiness probe added; minio image still unpinned. |
| tests/rig/coverage.py | Coverage merge carries prior-tier data forward via .coverage.prior rename; timeout guard prevents hangs. |
| tests/rig/critical_paths.py | Scope-aware regression vs gap distinction; baseline merge uses union semantics. |
| tests/rig/groups.py | Manifest loading with cycle detection and structural platform-gate invariant. |
| tests/rig/selection.py | Group resolver; from_file read twice for tier-intersection detection — minor redundancy, no correctness bug. |
| .github/workflows/ci-test.yaml | Rewritten to use rig tiers with separate baseline cache key from e2e workflow. |
| .github/workflows/ci-test-e2e.yaml | New slow-lane e2e workflow; e2e-smoke cross-tier depends_on still pulls unit groups into this workflow. |
| tests/groups.yaml | 15 groups defined; e2e-smoke cross-tier depends_on contaminates e2e baseline (prior review). |
| tox.ini | Tier-shaped envs with legacy aliases; clean delegation pattern. |
| tests/rig/reporting.py | JUnit parsing and markdown summary generation; handles malformed XML and missing counters correctly. |
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[tox -e unit/integration/e2e/groups] --> B[tests.rig run]
B --> C[resolve: positional + tier + file + changed-only]
C --> D[manifest.expand: topo-sort + dep closure]
D --> E{needs_platform?}
E -- yes --> F[pick_runtime: compose/testcontainers/local]
F --> G[runtime.up: start stack + wait_ready]
E -- no --> H[for group in runnable]
G --> H
H --> I{runner?}
I -- pytest --> J[uv run --with plugins pytest]
I -- hurl --> K[hurl --test *.hurl]
J --> L[junit.xml + report.md + exit.txt]
K --> M[_write_synthetic_junit + exit.txt]
L & M --> N[parse_junit GroupResult]
N --> O{more groups?}
O -- yes --> H
O -- no --> P[combine_and_report: coverage union]
P --> Q[evaluate: covered / gap / regression]
Q --> R{--update-baseline AND exit==0?}
R -- yes --> S[merge_into_baseline]
R -- no --> T[write_summary]
S --> T
T --> U[overall_exit]
Prompt To Fix All With AI
Fix the following 1 code review issue. Work through them one at a time, proposing concise fixes.
---
### Issue 1 of 1
tests/rig/cli.py:736-747
When `exit_code == 5` (hurl ran but collected no tests), the synthetic JUnit still records `tests="1"` with one `<testcase>`. `parse_junit` then computes `passed = max(1 - 0 - 0 - 0, 0) = 1`, so the summary table shows `⚪ | e2e-hurl | e2e | **1** | 0 | …` — a non-zero passed count for a group that exercised nothing. The aggregate totals row is inflated by one "phantom" pass per empty hurl group. Using `tests="0"` with no `<testcase>` element when there is nothing to report fixes the count without changing the `status = "empty"` outcome (which is driven by `exit.txt`, not the counter).
```suggestion
is_empty = exit_code == 5
is_failure = exit_code != 0 and not is_empty
failures = 1 if is_failure else 0
failure_tag = f'<failure message="hurl exit {exit_code}"/>' if is_failure else ""
name = saxutils.escape(group_name, {'"': """})
# When no tests were collected (exit 5) emit tests="0" with no <testcase>
# element so parse_junit computes passed=0 rather than a phantom 1.
if is_empty:
tests_count = 0
testcase_body = ""
else:
tests_count = 1
testcase_body = f' <testcase classname="{name}" name="hurl-suite">{failure_tag}</testcase>\n'
path.write_text(
'<?xml version="1.0" encoding="UTF-8"?>\n'
f'<testsuite name="{name}" tests="{tests_count}" failures="{failures}" '
f'errors="0" skipped="0" time="0">\n'
f'{testcase_body}'
f"</testsuite>\n"
)
```
Reviews (12): Last reviewed commit: "[MISC] Preserve coverage across sequenti..." | Re-trigger Greptile
Addresses the 5-agent review of PR #1971. Highlights: Critical fixes: - Wire --cov per group via new `coverage_source` field; pytest-cov now actually produces htmlcov/ and coverage.xml. - Switch CI to run unit and integration tiers as separate steps and merge (rather than overwrite) baselines so the second tier doesn't erase the first's covered paths. - Include github.run_id in baseline cache key so actions/cache writes a fresh entry every main build (it only saves on miss). - Hurl exit 5 (no tests collected) now synthesises failures=0 instead of 1. - Move runtime.up() inside try/finally so partial container stacks get torn down on failure. - Always fold subprocess exit codes into overall_exit, even when junit.xml is missing (segfault / OOM / missing binary). Important fixes: - Shim script uses python3 (CI runners don't ship python). - Drop `all` from ci-test-e2e.yaml so the env name matches what runs; resolve() now intersects `all` with --tier rather than unioning. - Disable --md-report under xdist (incompatible). - Add [tool.coverage.paths] mapping so combined XML/HTML resolves correctly across per-workdir runs. - README: testcontainers stub status is now stated honestly; section count fixed. - Honest covered_by: [] on adapter-register-llm. - Add `tox -e rig -- validate` step to both workflows. - Smoke test asserts UNSTRACT_BACKEND_URL matches the fixture URL. - Surface git stderr on --changed-only failures. - Guard parse_junit against malformed/truncated XML and missing counters. - Log testcontainers down() failures instead of swallowing. - ruff format + check clean on tests/. Type / design: - Literal types for Tier, Runner, CriticalPathState, ResultStatus. - ClassVar[str] on PlatformRuntime; freeze GroupResult. - Drop PlatformEndpoints.extras (never read) for typed InfraEndpoints. - Merge PlatformAccess into PlatformEndpoints (single source of truth). - by_id cache in CriticalPathRegistry.__post_init__. - Tighten unit-backend.paths from `[.]` to explicit per-app tests/ dirs. Self-tests: - New unit-rig group covering tests/rig/* logic (25 tests, ~0.1s): cycle detection, unknown dep, topo expansion, tier/runner validation, platform-smoke-dep invariant, real-manifest sanity, evaluate state machine, baseline merge (not overwrite), junit edge cases (testsuites wrapper, exit 5, missing counters, malformed XML), selection precedence (all+tier intersect, empty selection, from-file merging). Verified locally: - `python -m tests.rig validate` → OK (16 groups, 10 critical paths). - `python -m tests.rig run --no-parallel unit-rig` → 25 passed; emits htmlcov/, coverage.xml (42.5%), summary.{md,json}. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
♻️ Duplicate comments (7)
tests/rig/runtime.py (2)
265-267:⚠️ Potential issue | 🟠 Major | ⚡ Quick winTighten health criteria to avoid false-positive readiness.
Line 266 accepts 4xx responses as healthy; a misrouted or unauthorized
/healthendpoint can incorrectly pass readiness.💡 Suggested fix
def _responds(url: str, requests_mod) -> bool: try: resp = requests_mod.get(url, timeout=2) - return resp.status_code < 500 + return 200 <= resp.status_code < 300 except requests_mod.RequestException: return False🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/rig/runtime.py` around lines 265 - 267, The health check currently treats any status_code < 500 as healthy which allows 4xx responses; in the try block where you call requests_mod.get(url, timeout=2) and assign resp, tighten the success criterion to only accept 2xx responses (e.g., check resp.status_code >= 200 and resp.status_code < 300) and return False for other statuses; keep the existing except requests_mod.RequestException handler unchanged so network errors still return False.
243-247:⚠️ Potential issue | 🟠 Major | ⚡ Quick winDon’t skip readiness checks when
requestsis unavailable.Line 246 currently turns a broken rig environment into a soft pass for readiness, which can mask startup failures until much later.
💡 Suggested fix
try: import requests - except ImportError: - log.warning("`requests` not installed; skipping platform readiness probe") - return + except ImportError as exc: + raise RuntimeError( + "requests is required for platform readiness probes in ComposeRuntime" + ) from exc🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/rig/runtime.py` around lines 243 - 247, The readiness probe swallows a missing `requests` dependency by logging a warning and returning, which masks startup failures; in the except ImportError block around the `import requests` in tests/rig/runtime.py (the try/except that currently calls `log.warning("`requests` not installed; skipping platform readiness probe")`), change behavior to fail fast: replace the soft return with an explicit failure (e.g., log an error and raise ImportError or RuntimeError with a clear message instructing to install `requests`) so the readiness check cannot be skipped silently.tests/rig/coverage.py (1)
61-63:⚠️ Potential issue | 🟠 Major | ⚡ Quick winMake
coverage combinepath handling robust for relativereports_dir.When
reports_diris relative, the currentcwd=reports_dir+ prefixed file args can point to the wrong location (e.g.,reports/reports/.coverage.*), so combine may fail and coverage artifacts are dropped.💡 Suggested fix
result = subprocess.run( - [*base, "combine", *[str(p) for p in files]], - cwd=reports_dir, + [*base, "combine", *[str(p.resolve()) for p in files]], + cwd=REPO_ROOT, capture_output=True, text=True, )🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/rig/coverage.py` around lines 61 - 63, The coverage combine invocation can end up with duplicated/incorrect paths when reports_dir is relative; update the subprocess.run call around result, base, files, and reports_dir so the file arguments are either basenames when using cwd=reports_dir or are absolute paths and cwd is omitted. Concretely: either resolve reports_dir to an absolute Path and pass file args as [p.name for p in files] when calling subprocess.run([*base, "combine", *...], cwd=reports_dir), or keep cwd None and pass absolute paths for each file (e.g., map each file through Path(reports_dir).joinpath(p.name).resolve()) so subprocess.run(..., cwd=...) no longer produces wrong combined paths. Ensure changes reference subprocess.run, result, base, files, and reports_dir.tests/rig/reporting.py (2)
18-18:⚠️ Potential issue | 🟠 Major | ⚡ Quick winUse a hardened XML parser for JUnit artifact parsing.
xml.etree.ElementTreeis not the right default for untrusted CI artifacts. Switch todefusedxml.ElementTreehere.💡 Proposed fix
-import xml.etree.ElementTree as ET +from defusedxml import ElementTree as ET#!/bin/bash # Verify XML parser usage and whether defusedxml is already declared. set -euo pipefail rg -n "xml\.etree\.ElementTree|defusedxml" tests/rig/reporting.py pyproject.toml requirements*.txt tox.ini 2>/dev/null || trueAlso applies to: 73-74
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/rig/reporting.py` at line 18, Replace the insecure import "import xml.etree.ElementTree as ET" with the hardened parser "import defusedxml.ElementTree as ET" (and update any other occurrences around the other import at lines ~73-74) so JUnit artifact parsing uses defusedxml; also ensure defusedxml is added to project dependencies (pyproject.toml/requirements) so the import resolves at runtime.
67-68:⚠️ Potential issue | 🟠 Major | ⚡ Quick winKeep groups visible when
exit.txtexists butjunit.xmlis missing.Returning
Nonehere drops the group from summaries, which can hide failed/aborted runs.💡 Proposed fix
junit_path = reports_dir / group_name / "junit.xml" exit_path = reports_dir / group_name / "exit.txt" if not junit_path.exists(): - return None + if not exit_path.exists(): + return None + return GroupResult( + name=group_name, + tier=tier, + exit_code=_read_exit_code(exit_path), + passed=0, + failed=0, + errors=0, + skipped=0, + duration_seconds=0.0, + )🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/rig/reporting.py` around lines 67 - 68, The current check returns None when junit_path doesn't exist, which drops the group from summaries; instead, update the code that reads JUnit to preserve the group when exit.txt exists by not returning None for a missing junit file—use a placeholder/empty JUnit result or a minimal object indicating “no junit” so downstream summary logic still includes the group; modify the branch around junit_path.exists() (the junit_path check and the function that returns its result) to return an empty/placeholder test-report structure or a sentinel that keeps the group visible rather than None.tests/README.md (1)
13-13:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winAdd language identifiers to fenced code blocks.
These fences should use an explicit language (e.g.,
text) to satisfy markdownlint MD040.💡 Proposed fix
-``` +```text tests/ ... -``` +``` -``` +```text positional GROUPS ∪ --from-file lines ∪ --tier filter ∪ --changed-only diff -``` +``` -``` +```text reports/ ... -``` +```Also applies to: 131-131, 162-162
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/README.md` at line 13, Fenced code blocks in tests/README.md are missing language identifiers (markdownlint MD040); add the explicit language specifier `text` to each triple-backtick fence that wraps the examples (the blocks containing "tests/", the one with "positional GROUPS ∪ --from-file lines ∪ --tier filter ∪ --changed-only diff", and the block with "reports/"), and apply the same change to the other occurrences noted (around the blocks at the other two locations referenced in the comment).tests/rig/cli.py (1)
319-320:⚠️ Potential issue | 🟠 Major | ⚡ Quick winShort-circuit
--dry-runbefore critical-path evaluation.Plan-only mode currently can still return failure from regression/gap evaluation despite executing nothing.
💡 Proposed fix
finally: if runtime is not None and not args.dry_run: print(f"[rig] tearing platform down (runtime={runtime.name})") runtime.down() + if args.dry_run: + return 0 + if args.coverage and not args.dry_run: combine_and_report(reports_dir)Also applies to: 347-379
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/rig/cli.py` around lines 319 - 320, The loop currently checks args.dry_run only after starting critical-path checks, which allows regression/gap evaluation to set failure state; move the args.dry_run short-circuit to the top of the loop (or before any calls that perform regression/gap evaluation) so the iteration immediately continues/returns without invoking those evaluators, and ensure any status-aggregating variables (exit codes, failure flags) are not modified when args.dry_run is true; apply the same change for the other similar block covering the regression/gap evaluation region (the section corresponding to lines 347-379).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Duplicate comments:
In `@tests/README.md`:
- Line 13: Fenced code blocks in tests/README.md are missing language
identifiers (markdownlint MD040); add the explicit language specifier `text` to
each triple-backtick fence that wraps the examples (the blocks containing
"tests/", the one with "positional GROUPS ∪ --from-file lines ∪ --tier
filter ∪ --changed-only diff", and the block with "reports/"), and apply the
same change to the other occurrences noted (around the blocks at the other two
locations referenced in the comment).
In `@tests/rig/cli.py`:
- Around line 319-320: The loop currently checks args.dry_run only after
starting critical-path checks, which allows regression/gap evaluation to set
failure state; move the args.dry_run short-circuit to the top of the loop (or
before any calls that perform regression/gap evaluation) so the iteration
immediately continues/returns without invoking those evaluators, and ensure any
status-aggregating variables (exit codes, failure flags) are not modified when
args.dry_run is true; apply the same change for the other similar block covering
the regression/gap evaluation region (the section corresponding to lines
347-379).
In `@tests/rig/coverage.py`:
- Around line 61-63: The coverage combine invocation can end up with
duplicated/incorrect paths when reports_dir is relative; update the
subprocess.run call around result, base, files, and reports_dir so the file
arguments are either basenames when using cwd=reports_dir or are absolute paths
and cwd is omitted. Concretely: either resolve reports_dir to an absolute Path
and pass file args as [p.name for p in files] when calling
subprocess.run([*base, "combine", *...], cwd=reports_dir), or keep cwd None and
pass absolute paths for each file (e.g., map each file through
Path(reports_dir).joinpath(p.name).resolve()) so subprocess.run(..., cwd=...) no
longer produces wrong combined paths. Ensure changes reference subprocess.run,
result, base, files, and reports_dir.
In `@tests/rig/reporting.py`:
- Line 18: Replace the insecure import "import xml.etree.ElementTree as ET" with
the hardened parser "import defusedxml.ElementTree as ET" (and update any other
occurrences around the other import at lines ~73-74) so JUnit artifact parsing
uses defusedxml; also ensure defusedxml is added to project dependencies
(pyproject.toml/requirements) so the import resolves at runtime.
- Around line 67-68: The current check returns None when junit_path doesn't
exist, which drops the group from summaries; instead, update the code that reads
JUnit to preserve the group when exit.txt exists by not returning None for a
missing junit file—use a placeholder/empty JUnit result or a minimal object
indicating “no junit” so downstream summary logic still includes the group;
modify the branch around junit_path.exists() (the junit_path check and the
function that returns its result) to return an empty/placeholder test-report
structure or a sentinel that keeps the group visible rather than None.
In `@tests/rig/runtime.py`:
- Around line 265-267: The health check currently treats any status_code < 500
as healthy which allows 4xx responses; in the try block where you call
requests_mod.get(url, timeout=2) and assign resp, tighten the success criterion
to only accept 2xx responses (e.g., check resp.status_code >= 200 and
resp.status_code < 300) and return False for other statuses; keep the existing
except requests_mod.RequestException handler unchanged so network errors still
return False.
- Around line 243-247: The readiness probe swallows a missing `requests`
dependency by logging a warning and returning, which masks startup failures; in
the except ImportError block around the `import requests` in
tests/rig/runtime.py (the try/except that currently calls
`log.warning("`requests` not installed; skipping platform readiness probe")`),
change behavior to fail fast: replace the soft return with an explicit failure
(e.g., log an error and raise ImportError or RuntimeError with a clear message
instructing to install `requests`) so the readiness check cannot be skipped
silently.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 74802a43-b372-4ed8-a7bd-f3c2cd741b61
⛔ Files ignored due to path filters (1)
uv.lockis excluded by!**/*.lock
📒 Files selected for processing (26)
.github/scripts/combine-test-reports.sh.github/workflows/ci-test-e2e.yaml.github/workflows/ci-test.yamlpyproject.tomltests/README.mdtests/compose/docker-compose.test.yamltests/conftest.pytests/critical_paths.yamltests/e2e/conftest.pytests/e2e/smoke/test_smoke.pytests/groups.yamltests/rig/__init__.pytests/rig/__main__.pytests/rig/cli.pytests/rig/coverage.pytests/rig/critical_paths.pytests/rig/groups.pytests/rig/reporting.pytests/rig/runtime.pytests/rig/selection.pytests/rig/tests/__init__.pytests/rig/tests/conftest.pytests/rig/tests/test_critical_paths.pytests/rig/tests/test_groups.pytests/rig/tests/test_reporting.pytests/rig/tests/test_selection.py
✅ Files skipped from review due to trivial changes (2)
- tests/rig/tests/conftest.py
- tests/conftest.py
🚧 Files skipped from review as they are similar to previous changes (9)
- tests/rig/main.py
- tests/e2e/smoke/test_smoke.py
- tests/compose/docker-compose.test.yaml
- .github/workflows/ci-test.yaml
- .github/workflows/ci-test-e2e.yaml
- tests/critical_paths.yaml
- pyproject.toml
- tests/rig/groups.py
- tests/rig/critical_paths.py
…egrity Addresses the 5 new findings from the second review pass: N1+N2 (critical) — cross-tier regression false positives: - Namespace baseline caches per workflow (`unstract-test-baseline-ut-*` for ci-test.yaml; `unstract-test-baseline-e2e-*` for ci-test-e2e.yaml). Previously, ci-test.yaml's restore-keys prefix matched the e2e cache, pulling e2e-tier coverage into unit/integration where those groups never ran — every e2e-covered path lit up as a regression. - Add `scope_groups` to evaluate(); paths whose covered_by is fully outside the current invocation's scope are classified as `gap` (out of scope), not `regression`. cmd_run passes ordered groups as the scope. N3+N4 (critical) — silent baseline corruption: - merge_into_baseline and load_baseline now raise BaselineCorruptError on parse failure instead of treating the file as empty. The old behavior would erase the other tier's coverage on the next merge and demote real regressions to gaps on the build that most needed detection. N5 — smoke test tautology: - Old assertion compared platform.backend_url against the env var it was built from — guaranteed to pass. Replaced with a session-sentinel check: the rig stamps UNSTRACT_RIG_SESSION_ID (uuid4, set unconditionally so setdefault can't let a stale value win) into every group's env, and the smoke test asserts its presence. N6 — unit-rig measuring its own tests as covered: - Add [tool.coverage.run].omit for `**/test_*.py`, `**/*_tests.py`, `**/tests.py`. Keeps the source modules under tests/rig/ measurable while excluding the test files themselves. N7 — gate validator silently disabled if smoke renamed: - `_validate_platform_groups_depend_on_gate` now raises when platform groups exist but the gate is undefined. Gate name overridable via `defaults.platform_gate_group` for forks. N8 — private cross-package import: - Promote `_endpoints_from_env` to `PlatformEndpoints.from_env()` classmethod; conftest.py imports the public name. N9 — teardown exceptions masking the original: - Wrap `runtime.down()` in cmd_run's finally with its own try/except so a teardown failure no longer hides the upstream platform-startup error. N10 — InfraEndpoints invariant: - __post_init__ asserts redis_host/redis_port (and rabbitmq_*) are set together; partial specs raise. N11 — comment polish: - Lift "smoke gates every platform group" rationale to a section banner above the e2e tier in groups.yaml. - Remove dead `emit_baseline` alias. - Rewrite unit-backend paths comment as a positive rule. Self-tests: 25 → 31 passing in 0.05s. New tests cover baseline corruption detection (both load and merge), scope-aware regression demotion, gate validation error paths, and custom gate names via defaults. Verified locally: - `python -m tests.rig validate` → OK. - `python -m tests.rig run --no-parallel unit-rig` → 31 pass; coverage XML + HTML emitted at ~32.5% on the rig itself (down from 42% because broader source files are now in scope). - `uvx ruff check --line-length 90 tests/` → clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
♻️ Duplicate comments (6)
tests/e2e/smoke/test_smoke.py (1)
33-35:⚠️ Potential issue | 🟠 Major | ⚡ Quick winRequire a success response for the smoke health check.
This currently passes on 4xx responses, which can mask a broken deployment path in e2e smoke.
Suggested fix
def test_backend_health(platform) -> None: response = requests.get(f"{platform.backend_url.rstrip('/')}/health/", timeout=10) - assert response.status_code < 500, f"backend /health returned {response.status_code}" + assert response.status_code == 200, f"backend /health returned {response.status_code}"🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/e2e/smoke/test_smoke.py` around lines 33 - 35, The health-check assertion in test_backend_health currently allows 4xx responses; change it to require a successful 2xx response by asserting response.status_code is in the 200–299 range (e.g., response.status_code >= 200 and response.status_code < 300) when calling requests.get(f"{platform.backend_url.rstrip('/')}/health/", timeout=10) inside the test_backend_health function so failures surface correctly.tests/rig/groups.py (1)
185-190:⚠️ Potential issue | 🟠 Major | ⚡ Quick winEnforce repository-boundary checks for workdir and test paths.
Existence-only validation still allows resolved paths outside
REPO_ROOT.Suggested fix
def _validate_paths(groups: dict[str, GroupDefinition]) -> None: + repo_root = REPO_ROOT.resolve() for name, g in groups.items(): @@ - wd = g.absolute_workdir() + wd = g.absolute_workdir().resolve() + if not wd.is_relative_to(repo_root): + raise ValueError(f"group {name!r}: workdir escapes repo root: {wd}") if not wd.exists(): raise ValueError(f"group {name!r}: workdir does not exist: {wd}") - for p in g.absolute_paths(): + for p in g.absolute_paths(): + p = p.resolve() + if not p.is_relative_to(repo_root): + raise ValueError(f"group {name!r}: test path escapes repo root: {p}") if not p.exists(): raise ValueError(f"group {name!r}: test path does not exist: {p}")🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/rig/groups.py` around lines 185 - 190, The current validation in the block using g.absolute_workdir() and g.absolute_paths() only checks existence and allows paths outside REPO_ROOT; update the checks to also ensure the resolved workdir and each resolved test path is inside the repository root (e.g., use Path.resolve() and Path.is_relative_to(REPO_ROOT) or compare pathlib.Path(REPO_ROOT).resolve() with p.resolve().absolute() via commonpath), and if a path is outside REPO_ROOT raise a ValueError with a clear message referencing the offending path and group name; apply this logic for wd (from g.absolute_workdir()) and every p from g.absolute_paths() to enforce repository-boundary constraints.tests/rig/cli.py (1)
345-346:⚠️ Potential issue | 🟠 Major | ⚡ Quick winShort-circuit
--dry-runbefore baseline evaluation and gating.Plan-only runs currently fall through into regression/gap checks and can return non-zero without executing any group.
Suggested fix
finally: if runtime is not None and not args.dry_run: @@ except Exception as exc: print( f"[rig] teardown failed (runtime={runtime.name}): {exc}", file=sys.stderr, ) + + if args.dry_run: + return 0 if args.coverage and not args.dry_run: combine_and_report(reports_dir)Also applies to: 382-421
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/rig/cli.py` around lines 345 - 346, Short-circuit handling of the --dry-run flag: in the main CLI loop (where args.dry_run is checked) move/extend the check so that if args.dry_run is true you skip baseline evaluation and gating logic (the calls to evaluate_baseline and gate_groups) and exit successfully (return/exit 0) instead of falling through; ensure you reference and guard both places mentioned (around the current args.dry_run check and the block that calls evaluate_baseline and gate_groups) so plan-only runs never perform regression/gap checks or return non-zero.tests/rig/critical_paths.py (2)
69-75:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winValidate
covered_bytype before tuple conversion.If YAML provides a scalar (e.g.,
"unit-api"), this becomes a tuple of characters and corrupts coverage semantics.Suggested fix
from typing import Any, Literal @@ def load_critical_paths(path: Path | None = None) -> CriticalPathRegistry: @@ CriticalPath( id=str(p["id"]), description=str(p.get("description", "")), entry=str(p.get("entry", "")), - covered_by=tuple(p.get("covered_by") or ()), + covered_by=_parse_covered_by(p.get("covered_by"), path_id=str(p.get("id", "<unknown>"))), ) for p in raw["paths"] ) ) + + +def _parse_covered_by(value: Any, *, path_id: str) -> tuple[str, ...]: + if value is None: + return () + if not isinstance(value, (list, tuple)): + raise ValueError(f"critical path {path_id!r}: `covered_by` must be a list") + return tuple(str(v) for v in value)🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/rig/critical_paths.py` around lines 69 - 75, The generator building CriticalPath instances incorrectly converts a scalar covered_by (e.g., "unit-api") into a tuple of characters; update the covered_by expression in the CriticalPath construction to validate the type: if p.get("covered_by") is already a list/tuple use tuple(value), if it's a non-empty scalar (like str) wrap it as (str(value),), and if None/empty produce an empty tuple. Locate the generator over raw["paths"] and change the covered_by handling accordingly so only iterable lists/tuples are expanded and scalars become single-element tuples.
121-123:⚠️ Potential issue | 🟠 Major | ⚡ Quick winDefensively validate baseline shape before consuming
covered_paths.Current logic assumes a dict with list-like
covered_paths; malformed but parseable JSON can miscompute regressions or raise unexpected exceptions during merge.Suggested fix
def evaluate( @@ - previously_covered: set[str] = set( - (baseline or {}).get("covered_paths", []) if baseline else [] - ) + covered_raw: list[Any] = [] + if isinstance(baseline, dict): + candidate = baseline.get("covered_paths", []) + if isinstance(candidate, list): + covered_raw = candidate + previously_covered: set[str] = {p for p in covered_raw if isinstance(p, str)} @@ def merge_into_baseline(statuses: list[CriticalPathStatus], destination: Path) -> None: @@ - parsed = json.loads(destination.read_text()) - existing = set(parsed.get("covered_paths") or []) + parsed = json.loads(destination.read_text()) + if not isinstance(parsed, dict): + raise BaselineCorruptError( + f"refusing to merge into corrupt baseline {destination}: " + "expected JSON object with `covered_paths` list. Delete cache and re-run." + ) + raw = parsed.get("covered_paths") or [] + if not isinstance(raw, list): + raise BaselineCorruptError( + f"refusing to merge into corrupt baseline {destination}: " + "`covered_paths` must be a list. Delete cache and re-run." + ) + existing = {p for p in raw if isinstance(p, str)}Also applies to: 169-171
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/rig/critical_paths.py` around lines 121 - 123, The code assumes `baseline` is a dict with a list-like `covered_paths` which can raise or corrupt results; update the logic around the `previously_covered` assignment to defensively validate the shape: check `isinstance(baseline, dict)` and `isinstance(baseline.get("covered_paths"), list)` before using it, coerce into a set filtering only string entries (e.g., {p for p in baseline["covered_paths"] if isinstance(p, str)}) and otherwise use an empty set; apply the same defensive pattern where `covered_paths` is consumed again (the same block referenced at lines 169-171) to avoid exceptions or bad data propagation.tests/rig/runtime.py (1)
264-268:⚠️ Potential issue | 🟠 Major | ⚡ Quick winMake readiness probing fail-closed instead of pass-when-broken.
Skipping probes when
requestsis unavailable and accepting 4xx as healthy both create false-positive “ready” states.Suggested fix
def _wait_ready(endpoints: PlatformEndpoints, *, timeout_seconds: int = 300) -> None: @@ try: import requests except ImportError: - log.warning("`requests` not installed; skipping platform readiness probe") - return + raise RuntimeError( + "`requests` is required for platform readiness checks in e2e runtime" + ) @@ def _responds(url: str, requests_mod) -> bool: try: resp = requests_mod.get(url, timeout=2) - return resp.status_code < 500 + return 200 <= resp.status_code < 300 except requests_mod.RequestException: return FalseAlso applies to: 287-287
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/rig/runtime.py` around lines 264 - 268, The readiness probe currently skips when the requests package is unavailable and treats 4xx responses as healthy; change it to fail-closed by making the ImportError path error out (log.error and return/raise a non-ready result) instead of returning early, and update the HTTP-status handling so only 2xx responses are considered healthy (treat 3xx/4xx/5xx as unhealthy). Specifically, modify the try/except that imports requests to report failure and propagate non-ready status, and adjust the probe logic that currently accepts 4xx responses so it only accepts status codes in the 200-299 range.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Duplicate comments:
In `@tests/e2e/smoke/test_smoke.py`:
- Around line 33-35: The health-check assertion in test_backend_health currently
allows 4xx responses; change it to require a successful 2xx response by
asserting response.status_code is in the 200–299 range (e.g.,
response.status_code >= 200 and response.status_code < 300) when calling
requests.get(f"{platform.backend_url.rstrip('/')}/health/", timeout=10) inside
the test_backend_health function so failures surface correctly.
In `@tests/rig/cli.py`:
- Around line 345-346: Short-circuit handling of the --dry-run flag: in the main
CLI loop (where args.dry_run is checked) move/extend the check so that if
args.dry_run is true you skip baseline evaluation and gating logic (the calls to
evaluate_baseline and gate_groups) and exit successfully (return/exit 0) instead
of falling through; ensure you reference and guard both places mentioned (around
the current args.dry_run check and the block that calls evaluate_baseline and
gate_groups) so plan-only runs never perform regression/gap checks or return
non-zero.
In `@tests/rig/critical_paths.py`:
- Around line 69-75: The generator building CriticalPath instances incorrectly
converts a scalar covered_by (e.g., "unit-api") into a tuple of characters;
update the covered_by expression in the CriticalPath construction to validate
the type: if p.get("covered_by") is already a list/tuple use tuple(value), if
it's a non-empty scalar (like str) wrap it as (str(value),), and if None/empty
produce an empty tuple. Locate the generator over raw["paths"] and change the
covered_by handling accordingly so only iterable lists/tuples are expanded and
scalars become single-element tuples.
- Around line 121-123: The code assumes `baseline` is a dict with a list-like
`covered_paths` which can raise or corrupt results; update the logic around the
`previously_covered` assignment to defensively validate the shape: check
`isinstance(baseline, dict)` and `isinstance(baseline.get("covered_paths"),
list)` before using it, coerce into a set filtering only string entries (e.g.,
{p for p in baseline["covered_paths"] if isinstance(p, str)}) and otherwise use
an empty set; apply the same defensive pattern where `covered_paths` is consumed
again (the same block referenced at lines 169-171) to avoid exceptions or bad
data propagation.
In `@tests/rig/groups.py`:
- Around line 185-190: The current validation in the block using
g.absolute_workdir() and g.absolute_paths() only checks existence and allows
paths outside REPO_ROOT; update the checks to also ensure the resolved workdir
and each resolved test path is inside the repository root (e.g., use
Path.resolve() and Path.is_relative_to(REPO_ROOT) or compare
pathlib.Path(REPO_ROOT).resolve() with p.resolve().absolute() via commonpath),
and if a path is outside REPO_ROOT raise a ValueError with a clear message
referencing the offending path and group name; apply this logic for wd (from
g.absolute_workdir()) and every p from g.absolute_paths() to enforce
repository-boundary constraints.
In `@tests/rig/runtime.py`:
- Around line 264-268: The readiness probe currently skips when the requests
package is unavailable and treats 4xx responses as healthy; change it to
fail-closed by making the ImportError path error out (log.error and return/raise
a non-ready result) instead of returning early, and update the HTTP-status
handling so only 2xx responses are considered healthy (treat 3xx/4xx/5xx as
unhealthy). Specifically, modify the try/except that imports requests to report
failure and propagate non-ready status, and adjust the probe logic that
currently accepts 4xx responses so it only accepts status codes in the 200-299
range.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: cacbb976-fbc3-43a1-8c48-556550c3014f
⛔ Files ignored due to path filters (1)
uv.lockis excluded by!**/*.lock
📒 Files selected for processing (12)
.github/workflows/ci-test-e2e.yaml.github/workflows/ci-test.yamlpyproject.tomltests/e2e/conftest.pytests/e2e/smoke/test_smoke.pytests/groups.yamltests/rig/cli.pytests/rig/critical_paths.pytests/rig/groups.pytests/rig/runtime.pytests/rig/tests/test_critical_paths.pytests/rig/tests/test_groups.py
🚧 Files skipped from review as they are similar to previous changes (4)
- .github/workflows/ci-test-e2e.yaml
- pyproject.toml
- tests/groups.yaml
- .github/workflows/ci-test.yaml
R1 (critical) — cmd_run returned early on BaselineCorruptError BEFORE write_summary, so a green build with one corrupt baseline lost all per-group visibility. Restructured so write_summary always runs; baseline-corrupt now just flips overall_exit and falls through. R2 (important) — TestcontainersRuntime.up() built InfraEndpoints AFTER the cleanup try/except, so a __post_init__ failure leaked the four started containers. Moved construction inside the try (and the return with it) so the class invariant is self-cleaning at its construction site. R3 (important) — added tests/rig/tests/test_cli.py covering three previously- untested behaviors of cmd_run: - scope_groups is the dep-expanded selection (guards N2 wiring) - down() failure doesn't mask up() exception (guards N9 try/except) - write_summary runs even when load_baseline raises (guards R1) R4 (important) — evaluate's groups_run_green and scope_groups annotations loosened from set[str] to Collection[str]. Function only reads them (`g in ...`), so accepting frozenset/list/tuple from callers is fine. Hot path still pays the O(1) tax via internal set() conversion. R5 (suggestion, deferred) — InfraEndpoints shape asymmetry (redis/rabbitmq host+port pairs vs. postgres/minio collapsed URL/endpoint strings). Picking a single shape would be cleaner but neither option has a clear win without a real second consumer of the infra block; revisit when one appears. R7 (suggestion) — fixed misleading comment: scope_groups includes dep-expanded groups, not just "what the user asked for". R8 (suggestion) — InfraEndpoints docstring now honestly describes which fields the __post_init__ invariant applies to. R9 (suggestion) — strengthened scope test with a "straddling" path (covered_by=[unit, e2e]) so a weaker implementation checking against groups_run_green instead of covered_by would be caught. R10 (suggestion) — InfraEndpoints redis_port/rabbitmq_port: str → int. Cast at the testcontainers boundary instead of pre-stringifying. Verified: - 34/34 self-tests pass (up from 31). - `python -m tests.rig validate` → OK. - `python -m tests.rig run --no-parallel unit-rig` → 34 pass; coverage emitted. - `uvx ruff check --line-length 90 tests/` → clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
for more information, see https://pre-commit.ci
There was a problem hiding this comment.
🧹 Nitpick comments (1)
tests/rig/runtime.py (1)
288-293: 💤 Low valueHealth check accepts 4xx responses as healthy.
resp.status_code < 500treats 4xx (e.g., 404, 401) as ready. Consider tightening toresp.okor200 <= resp.status_code < 300to catch endpoint misconfiguration early.def _responds(url: str, requests_mod) -> bool: try: resp = requests_mod.get(url, timeout=2) - return resp.status_code < 500 + return resp.ok except requests_mod.RequestException: return False🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/rig/runtime.py` around lines 288 - 293, The health check in _responds currently treats any status_code < 500 as healthy, which accepts 4xx responses; update the function (inside _responds where requests_mod.get is called and exceptions are caught) to return only true for successful 2xx responses—use resp.ok or explicitly check 200 <= resp.status_code < 300—instead of resp.status_code < 500, preserving the timeout and existing RequestException handling.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@tests/rig/runtime.py`:
- Around line 288-293: The health check in _responds currently treats any
status_code < 500 as healthy, which accepts 4xx responses; update the function
(inside _responds where requests_mod.get is called and exceptions are caught) to
return only true for successful 2xx responses—use resp.ok or explicitly check
200 <= resp.status_code < 300—instead of resp.status_code < 500, preserving the
timeout and existing RequestException handling.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 9868dd1c-57b8-4c78-8ef6-d886af0692e6
📒 Files selected for processing (5)
tests/rig/cli.pytests/rig/critical_paths.pytests/rig/runtime.pytests/rig/tests/test_cli.pytests/rig/tests/test_critical_paths.py
🚧 Files skipped from review as they are similar to previous changes (2)
- tests/rig/critical_paths.py
- tests/rig/tests/test_critical_paths.py
…rage
R4-A (important) — summary.md misrepresented regressions on corrupt baseline.
write_summary now accepts a baseline_corrupt flag; when set, summary.md (and
the sticky PR comment) emit a "Baseline cache was corrupt; regression
detection disabled this run" banner so reviewers don't read "gap" entries as
first-time gaps when they may actually be regressions hidden by the cache
failure. summary.json also exposes the flag.
R4-B (important) — added test_cmd_run_does_not_update_baseline_on_red_build:
proves the `--update-baseline` guard refuses to bake red-build state into
the cache. Without the guard, the next build wouldn't surface regressions.
R4-C (suggestion) — baseline corruption now flips overall_exit regardless of
whether the build was otherwise green. Previously red→corrupt was swallowed.
R4-D (suggestion) — added tests/rig/tests/test_runtime.py covering
InfraEndpoints partial-pair rejection (4 cases) and PlatformEndpoints.from_env
defaults. Closes the zero-coverage gap left by R5 deferral.
R4-E (suggestion) — monkeypatch rationale documented at top of test_cli.py
so the pattern's safety assumptions are explicit for future readers.
R4-F (suggestion) — simplified dead-code ternary in evaluate's
previously_covered initialization.
R4-G (suggestion) — scope_groups docstring "set" → "collection" to match
the R4 signature loosening.
R4-H (suggestion) — replaced internal review-tag references in test
docstrings ("guards N2/N9/R1") with invariant descriptions that stand on
their own post-merge.
R4-I (suggestion) — strengthened the corrupt-baseline test to assert the
summary.md content includes the banner, not just that the file exists; a
zero-byte file would no longer pass.
Verified:
- 40/40 self-tests pass (up from 34).
- `python -m tests.rig validate` → OK.
- End-to-end run with corrupt baseline emits the banner correctly.
- `uvx ruff check/format --line-length 90 tests/` → clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@tests/rig/cli.py`:
- Around line 236-242: The current except block can hide the original exception
from runtime.up() if runtime.down() also raises; modify the handler around
runtime.up()/runtime.down() so you capture the original exception (e.g., store
it as exc_up), then call runtime.down() inside its own try/except, and always
re-raise the original exc_up regardless of whether runtime.down() fails
(optionally log the down() error). Specifically update the try/except that
invokes runtime.up() and runtime.down() to preserve and re-raise the original
up() exception while isolating any errors from runtime.down().
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 31752086-90af-4796-aeab-f10b9e9a6e59
📒 Files selected for processing (5)
tests/rig/cli.pytests/rig/critical_paths.pytests/rig/reporting.pytests/rig/tests/test_cli.pytests/rig/tests/test_runtime.py
jaseemjaskp
left a comment
There was a problem hiding this comment.
Multi-agent review summary
Ran six review passes (Code Reviewer, Code Simplifier, Comment Analyzer, PR Test Analyzer, Silent Failure Hunter, Type Design Analyzer) against the rig. Existing coderabbitai/greptile threads cover most of the headline bugs; the inline comments below are net-new findings.
Top themes uncovered:
- Silent prep failures —
_prepare_group_envand synthetic-junit writes both fail-soft in ways that letpytestproduce a misleading red/green that does not reflect the actual state of the build. - Subprocess hygiene —
_spawntimeout does not kill the process group, andcoverageinvocations have no timeout. Both can hang or leak processes between groups. - Type invariants —
GroupDefinition.envis mutable inside a frozen dataclass;CriticalPathRegistrysilently last-wins on duplicate ids;CriticalPathStatusallows contradictorystate="covered"with emptycovering_groups_run. - Pre-commit hook regression — uses bare
pythondespitecombine-test-reports.shbeing explicitly fixed forpython3-only environments. - Docstring drift — three docstrings claim guarantees the surrounding code does not actually provide (URL sentinel, x2text wiring, scope-aware re-aggregation in
cmd_report).
Findings overlapping with existing bot threads (e.g. _responds() laxity, smoke < 500, previous-summary.json duplication, cmd_run complexity, JUnit defusedxml hardening) were deliberately dropped to avoid noise.
Low-impact items not posted inline:
runtime.py:67-73x2text_url is declared onPlatformEndpointsbut never exported incli.py:486-490. Either delete the field or wireenv.setdefault("UNSTRACT_X2TEXT_URL", ...).tests/README.md:144-149testcontainers row should warn thatdown()is best-effort and leaked containers can block the next run on port conflict.tests/critical_paths.yaml:22-24perpetual gap TODO has no tracking ticket; consider linking an issue.tox.inipassenv = UNSTRACT_*glob is wide enough to leak adapter API keys into CI logs via compose_run(nocapture_output=True).
No files were modified.
When the rig runs under tox, tox exports VIRTUAL_ENV=.tox/<env>. The rig's per-group `uv run` then warns "VIRTUAL_ENV=.tox/unit does not match the project environment path .venv and will be ignored" before ignoring it anyway — noise in CI logs, and a latent ambiguity since each group is meant to resolve its own workdir's project venv, never tox's. Strip VIRTUAL_ENV (and UV_PROJECT_ENVIRONMENT) from the environment passed to every uv/pytest/coverage subprocess via a shared `_subprocess_env()` in cli.py and `_clean_env()` in coverage.py. uv then does clean per-group project discovery with no warning. Verified: with VIRTUAL_ENV=.tox/unit set, a raw `uv run` emits the warning while `python -m tests.rig run unit-rig` emits zero. 40/40 self-tests pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Critical: - B1: XML-escape group name in synthetic JUnit. A group key with "/&/< would produce malformed XML that parse_junit reads as a phantom error on a green hurl run. - B2: pre-commit hook uses `$(command -v python3 || command -v python)` so the opt-in test hook works on stock Ubuntu/containers without a `python` symlink (matches the combine-script fix). Major: - B3: fold junit-reported errors/failures into overall_exit. A truncated junit (errors=1) with pytest exit 0 rendered ❌ but exited green. - B4: coverage subprocesses (combine/xml/html) now run with timeout=300. They run after the per-group loop so per-group timeouts don't apply; a slow `uv run --with` resolve could otherwise hang to the CI ceiling. - B5: --changed-only detects HEAD == base (push-to-main) and falls back to HEAD~1...HEAD with a clear message instead of selecting nothing and exiting 2. - dry-run no longer fails on critical-path gaps/regressions: no groups ran, so every covered path looks uncovered; a dry run is plan-only. - Q1: duplicate critical-path ids now raise at load instead of silent last-wins. - Q2: CriticalPathStatus enforces covered⇒non-empty / gap|regression⇒empty covering_groups_run, making the contradictory state unrepresentable. - Q3: GroupDefinition.env coerced to MappingProxyType in __post_init__ so the frozen record can't be mutated through its dict. - Q5: synthetic-junit write wrapped in try/except OSError, matching exit.txt. Minor: - Q4: reworded _rig_session_id docstring — the sentinel proves the rig ran; URL ownership is intentionally cooperative via setdefault. Tests: 40 → 46. New coverage: synthetic-junit escaping + exit-5, duplicate-id rejection, status-invariant enforcement, frozen env, and cmd_report re-aggregation (previously zero coverage). Deferred to follow-up (replied on threads): _spawn process-group kill on timeout (heavy Popen refactor), cmd_run cognitive-complexity split. Verified: 46/46 self-tests pass; validate OK; end-to-end emits coverage.xml + htmlcov; ruff check/format clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PlatformEndpoints collects x2text_url from UNSTRACT_X2TEXT_URL but the rig never forwarded it to group subprocesses and never probed it for readiness: - _execute_group now setdefaults UNSTRACT_X2TEXT_URL into the group env, so e2e tests hitting x2text use the rig's URL instead of a stale/default value. - _wait_ready now includes x2text /health, so ComposeRuntime.up() waits for it rather than returning early and letting the first x2text call hit ConnectionRefusedError. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Resolves the failing CI test check, SonarCloud hotspot, and the open Greptile P1 review thread. - optional: true now also means non-blocking exit. Such groups still run and surface red in summary.md, but no longer gate the overall exit code. This is what was failing CI: unit-connectors runs live-DB tests that need infra we don't provision in CI. Honors developer intent for groups that need unavailable infra or are pluggable/cloud-only. - scope_groups now uses set(runnable) not set(ordered): skipped optional placeholders are excluded from scope so their critical paths classify as gap, not regression (Greptile P1 — cross-tier false positives). - SHA-pin astral-sh/setup-uv@v7 -> v7.6.0 commit in both test workflows (SonarCloud new_security_hotspots_reviewed). - Fix docker_client_with_sidecar fixture to mock DockerClient.from_env, mirroring the sibling docker_client fixture — the constructor reached for a real daemon (green on CI runners, red on daemon-less environments). Self-tests: 48 pass (+2 new: optional-failure-non-blocking, missing-placeholder-excluded-from-scope). tox -e unit exits 0 with optional groups still showing red in the summary. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Hari John Kuriakose <hari@zipstack.com>
The docstring on _validate_platform_groups_depend_on_gate and the matching groups.yaml comment promised that a failed gate group (e2e-smoke) makes its dependents "skip cleanly rather than running against a half-up stack." That runtime skip is not implemented — cmd_run runs every runnable group in topo order unconditionally. Today it's moot (all platform dependents are optional: true placeholders, so a red gate gates nothing), but the docs overpromised. Foundation PR, so correct the docs rather than rush the runtime feature: - Reframe the validator docstring as the structural invariant it actually enforces (the dep edge must exist; gate runs first in topo order). - Align the groups.yaml comment. - Add a TODO in cmd_run describing the deferred runtime skip-on-dep-failure and the design questions it raises (optional/non-failing-exit cascade, transitive depth) so the future PR commits to answers deliberately. No behavior change. Self-tests: 48 pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
combine_and_report runs once per tier (tox -e unit, then tox -e integration, then the final cmd_report combine), and `coverage combine` consumes its inputs. The old code unlinked the existing reports/.coverage before combining only the current tier's .coverage.* files, so each tier's combine discarded the previous tier's merged data — the final coverage.xml uploaded to SonarCloud reflected only the last tier. Latent today (integration groups are optional exit-5 stubs that write no coverage) but fires the moment any integration group produces data. Fix: carry the prior tier's merged data forward by renaming reports/.coverage to a suffixed .coverage.__prior__ and feeding it back in as a combine input. The rename is required — `coverage combine` only picks up suffixed data files, never a bare .coverage (verified: `combine .coverage .coverage.x` reports "Combined 1 file" and drops the .coverage data). On success combine consumes __prior__ too, so no stale file accumulates; if combine fails after the rename, __prior__ remains and the next call's glob picks it up — no data loss. Adds tests/rig/tests/test_coverage.py (4 tests): cross-tier preservation, same-file line union, no-files no-op, and idempotent re-combine with only prior data (the cmd_run + cmd_report double-combine case). Self-tests: 52 pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
Unstract test resultsPer-group results
Critical paths
|



Summary
This is the foundation only — no new tests beyond a single e2e smoke. Existing co-located unit tests stay where they are; the rig just discovers and runs them.
What's new
tests/groups.yaml— single source of truth for 15 groups (9 unit, 1 integration placeholder, 5 e2e) with adepends_ongraph.tests/critical_paths.yaml— 10 critical flows registered, each pointing at the group(s) that cover it.tests/rig/— the dispatcher (python -m tests.rig run|list-groups|expand|validate|platform|report).tests/e2e/—platformsession fixture + one smoke test exercising the e2e plumbing end-to-end.tox.ini— rewritten with tier-shaped envs (unit,integration,e2e,groups,rig); legacyrunner|sdk1|prompt-serviceenvs preserved as thin aliases.ci-test.yamlrewritten to call the rig; newci-test-e2e.yamlslow lane (main +run-e2e-labeled PRs + nightly).unstract-testsfires only when developers opt in by creating.test-selection.Usage
Full docs:
tests/README.md.Reports
reports/<group>/{junit.xml, report.md, exit.txt}reports/{summary.md, summary.json, combined-test-report.md}— samecombined-test-report.mdpath keeps the existing sticky-PR-comment workflow working.--coverage):reports/{coverage.xml, htmlcov/}—coverage.xmlis consumable by the existing SonarCloud integration.Critical paths
CI on
mainruns with--fail-on-critical-gap --update-baseline, so:Test plan
python -m tests.rig validatesucceeds (OK — 15 groups, 10 critical paths).python -m tests.rig list-groupsandexpand <group>print correct dep-expanded ordering.validatecorrectly fails on an injected unknown-group reference incritical_paths.yaml.python -m tests.rig run --no-coverage --no-parallel unit-coreend-to-end: creates per-group venv, installs rig pytest plugins, runs pytest, writesjunit.xml+report.md+exit.txt, and aggregates intosummary.md/summary.jsonwith critical-path gaps section.bash .github/scripts/combine-test-reports.shshim works (delegates topython -m tests.rig report combineand preserves the legacycombined-test-report.mdpath).ci-test.yamlon this PR and confirm the sticky comment renders.run-e2eto smoke-testci-test-e2e.yamlagainst docker compose.Notes for reviewers
unstract/core/tests/(LogHelpernot exported;account_servicessubmodule missing). Those are not rig issues — the rig correctly captured them as errors. Fixing them is out of scope here.unstract/connectors/uv.lockhad unrelated churn (litellm source switch); intentionally left out of this PR.tests/.🤖 Generated with Claude Code