Merged
32 changes: 10 additions & 22 deletions .gitignore
@@ -13,11 +13,14 @@ src/report.ts
src/verify.ts
src/vision-helpers.ts
src/types.ts
src/types-migration.ts
src/gates/
src/runners/
src/store/
src/extractor/
src/parsers/
src/action/
src/action-v2/

# Substrate scripts
scripts/harvest/
@@ -33,26 +36,9 @@ scripts/filesystem-smoke-test.ts
scripts/message-smoke-test.ts
scripts/sync-backup.sh

# Substrate mvp-migration files (detectors stay, runners don't)
scripts/mvp-migration/agent-corpus*.ts
scripts/mvp-migration/backtest.ts
scripts/mvp-migration/deploy-window-gate.ts
scripts/mvp-migration/django-parser.py
scripts/mvp-migration/django-runner.ts
scripts/mvp-migration/dm28-*.ts
scripts/mvp-migration/full-corpus-reverts.ts
scripts/mvp-migration/historical-followup.ts
scripts/mvp-migration/parse-one.ts
scripts/mvp-migration/prefix-runner.ts
scripts/mvp-migration/replay-engine.ts
scripts/mvp-migration/repo-adapter.ts
scripts/mvp-migration/test-*.ts
scripts/mvp-migration/corpus/
scripts/mvp-migration/fixtures/
scripts/mvp-migration/reports/*
# Published calibration evidence — un-ignore files MEASURED-CLAIMS.md references.
!scripts/mvp-migration/reports/calibration-postfix-2026-04-12.jsonl
scripts/mvp-migration/corpus/_repos/
# Substrate mvp-migration files (whole subtree — migration-era, no longer
# part of the public Gate A receipt surface)
scripts/mvp-migration/

# Substrate data
data/
@@ -96,11 +82,13 @@ bun.lock
tsconfig.json
tsconfig.build.json

# Dist (except the Action bundle which is tracked)
# Dist (except the Action bundle and its slim calibration which are tracked)
dist/*
!dist/action/
!dist/action/index.cjs
!dist/action/libpg-query.wasm
!dist/action/calibration/
!dist/action/calibration/shapes.json
!dist/action/calibration/attempts.jsonl

# Nightly workflow (private)
.github/workflows/nightly-improve.yml
56 changes: 31 additions & 25 deletions CLAUDE.md
@@ -1,43 +1,49 @@
# CLAUDE.md — verify (public product surface)
# CLAUDE.md — Born14/verify (public release)

Orientation for any Claude session opening this repo.
Orientation for any Claude session that opens this repo. Read this before changing anything.

## What this repo is

This is the **public product surface** for verify. It contains:
This is the **public release** of the Verify GitHub Action. It is a product surface, not a development workspace.

- The shipped GitHub Action (`dist/action/index.cjs`)
- The Action's manifest (`action.yml`)
- The README and methodology documentation
- The published calibration registry (`calibration/shapes.json`, `calibration/corpora.json`, `calibration/attempts.jsonl`)
- Published calibration evidence (`scripts/mvp-migration/reports/calibration-postfix-2026-04-12.jsonl` — the 19 DM-18 findings)
- Readable detector source (the key files that back the shipped precision claim)
Everything here is one of three things:

## Where development happens
- The Action itself: [action.yml](action.yml) and the bundled runtime at [dist/action/index.cjs](dist/action/index.cjs).
- User-facing docs: [README.md](README.md), [METHODOLOGY.md](METHODOLOGY.md), [docs/GITHUB-ACTION-MVP.md](docs/GITHUB-ACTION-MVP.md), [docs/VERIFY-RECEIPT-SAMPLE.md](docs/VERIFY-RECEIPT-SAMPLE.md).
- The public calibration ledger: [calibration/shapes.json](calibration/shapes.json), [calibration/attempts.jsonl](calibration/attempts.jsonl), [calibration/corpora.json](calibration/corpora.json).

**Development does NOT happen here.** All substrate work — new detectors, corpus scanning, experiments, planning, harness work — lives in the private `Born14/verify-engine` repo at `c:/Users/mccar/verify-engine`.
That is the entire surface. If something else is here, it is either build output or a leftover from a previous era and should be cleaned up rather than extended.

The flow is: build and test in verify-engine → rebuild the Action bundle → copy `dist/action/index.cjs` here → update README / registry entries if the user-visible surface changed → commit and push both repos → move `v1` tag in this repo when a user-facing change ships.
## What the Action does

If a future Claude session lands here and is asked to add a detector, implement a feature, run tests, or do calibration work, the correct response is: **switch to the verify-engine repo.** This repo should only receive: the Action bundle after rebuild, README/docs updates, calibration registry entries for newly-calibrated shapes, and published evidence artifacts.
It posts a PR change receipt showing what was checked, what was found, and what was not checked. The receipt also pins the result to a SHA-256 digest. Coverage is Kubernetes, Dockerfile, and GitHub Actions; seven checks at the moment, each one calibrated against a pinned third-party corpus. The full list lives in [README.md](README.md) and ships in every receipt.

## What verify is (brief)
## Where the work happens

Verify is a **harness for agent output** with a published calibrated taxonomy of failure shapes. It currently deploys as:
Development does not happen in this repo. The detectors, calibration corpora, rubrics, and experiments live in a separate private repo (`Born14/verify-engine`) on the operator's machine. When a new shape calibrates or the Action's behaviour changes, the flow is:

1. A GitHub Action that runs on every PR touching SQL migration files. Shipped.
2. A Claude Code CLI hook (in development, v0 being built). Not yet shipped as of 2026-04-17.
1. Build and test in the engine repo.
2. Rebuild the Action bundle there.
3. Copy the new bundle and the slim calibration files into this repo.
4. Update the README and the public ledger if user-visible behaviour changed.
5. Commit and push. Move the `v1` tag only after the new bundle has been smoke-tested on a real PR.

The shipped shape is DM-18 (NOT NULL without DEFAULT), calibrated at 19 TP / 0 FP / 0 ambiguous on 761 production migrations. Evidence JSONL is published at `scripts/mvp-migration/reports/calibration-postfix-2026-04-12.jsonl` and is independently verifiable.
If a Claude session lands in this repo and is asked to write a detector, run a calibration, or add a feature: the right answer is to switch to the engine repo. This repo only receives finished, calibrated output.

DM-28 (deploy-window race) runs at INFO severity in the Action — it surfaces past revert patterns in the repo's migration history as a "Historical context" section in the PR comment. Never blocks. Uncalibrated; first calibration attempt held-to-bar at 28.6%.
## What you can safely do here

## Load-bearing conventions
- Edit user-facing documentation (README, METHODOLOGY, the docs/ files) for clarity.
- Update copy in [action.yml](action.yml).
- Replace the Action bundle when the engine ships a new build.
- Refresh [calibration/](calibration/) when a new attempt ratifies in the engine.

- **The calibration ledger is the primary asset.** Every shape has honest status (calibrated / held-to-bar / shipped / designed). Held-to-bar negatives are recorded as prominently as promotions.
- **Deterministic, not LLM-as-judge.** No LLM runs in the check path.
- **Documentation stance.** Posts to public channels are timestamps on work, not pitches. Don't add "marketing"-style copy to the README or `action.yml`. Match the tone of the existing files: honest, specific, falsifiable.
## What you should not do here

## If in doubt
- Add new detectors, gates, or runtime logic.
- Run calibration measurements.
- Restore or reintroduce files from earlier eras (DM-18 migration detector, 26-gate pipeline, harness, etc.). Those have moved out of this repo intentionally.
- Make claims in the README or `action.yml` that are not backed by a row in [calibration/attempts.jsonl](calibration/attempts.jsonl). Every precision number in user-facing copy must trace to a ledger row.

Open `c:/Users/mccar/verify-engine/CLAUDE.md` for the full development context. Most work doesn't belong in this repo.
## Tone

Plain, specific, falsifiable. Match the existing copy. The product's promise is that every claim is checkable; the docs need to honour that. No marketing voice, no comparative jabs, no claims that go beyond what the ledger supports.
92 changes: 47 additions & 45 deletions METHODOLOGY.md
@@ -1,72 +1,74 @@
# Verify Methodology
# Verify methodology

How Verify's claims are made and how you can check them.
How Verify makes its claims, and how you can check them.

## The problem
## What Verify does

AI agents and humans write database migrations that are syntactically valid but operationally unsafe. A migration that adds `NOT NULL` without a `DEFAULT` will succeed in development (empty table) and fail in production (millions of rows). No test suite catches this. No code reviewer sees it without knowing the production schema state.
Verify reads the files changed by a pull request, runs a small set of structural checks against them, and posts a single receipt summarizing what it found. The receipt names:

Verify catches these failures deterministically by parsing the migration SQL against the accumulated schema state from prior migrations.
- which checks ran,
- which fired and where,
- which ran and were clear,
- what was deliberately not checked, and
- a SHA-256 digest pinning the result to a specific commit.

## Why deterministic

A deterministic detector produces the same output for the same input, every time. No randomness, no model calls, no "confidence scores." When Verify says a migration is unsafe, the reason is a specific SQL pattern matched against a specific schema state. You can read the detector source, trace the logic, and agree or disagree.

This property is what allows Verify to sit in a blocking CI gate. Probabilistic tools (LLM-based code review) produce false positives that vary between runs. Engineers disable them within a week. Deterministic tools produce consistent verdicts that engineers can evaluate once and trust going forward.
The receipt is the product. Everything else is supporting evidence.

## The tier lifecycle

Every failure shape in Verify's taxonomy has a tier that tells you how much to trust it.

### Observed
## Why deterministic

A failure pattern has been identified and named. No detector exists yet. The shape lives in the taxonomy as a candidate for future development.
The checks are deterministic. Same files in, same receipt out, every time. No machine learning model in the check path, no random sampling, no "confidence score." When a check fires, the reason is a specific structural pattern in the file. You can read the detector source, trace the logic, and decide for yourself whether you agree.

### Shipped
That property is what allows the receipt to be useful in CI. Probabilistic tools produce different verdicts on different runs and engineers turn them off within a week. A deterministic receipt produces a verdict you can evaluate once and trust going forward.

A detector exists, has been tested against internal fixtures, and runs in the GitHub Action. Shipped shapes produce **warnings** in PR comments but do not block merges. They may produce false positives -- that's expected and acceptable at this tier.
## What "calibrated" means here

### Calibrated
Every check Verify ships goes through the same pipeline before it lands in a receipt:

The detector has been measured against a real-world corpus of production-merged migrations. The measurement produces a precision number (true positives vs false positives) and is published in the calibration registry. Only calibrated shapes with acceptable precision are promoted to **blocking** severity.
1. **A pre-registered rubric** is written before any measurement runs. It says exactly what the check should fire on, what counts as a true positive, what counts as a false positive, and what counts as ambiguous. Once measurement starts, the rubric does not move.
2. **A pinned third-party corpus** is selected. The corpus is a real open-source codebase frozen at a specific commit. Synthetic fixtures do not count.
3. **The detector runs** against the corpus and emits findings.
4. **Every finding is classified** against the rubric as true positive, false positive, or ambiguous.
5. **A precision number is computed:** true positives divided by (true positives + false positives).
6. **The attempt is recorded** in [calibration/attempts.jsonl](calibration/attempts.jsonl) — whether it promoted, whether it held to the bar, or whether it failed.
7. **A check is promoted to "calibrated"** only if the precision clears a pre-set threshold on the corpus.
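
The arithmetic in steps 4 and 5 can be sketched as follows. This is a minimal illustration, not the engine's code: the `Label` and `Tally` types and the function names are hypothetical, and the ambiguity rate is assumed to be the share of ambiguous findings among all findings, which the text above does not define.

```typescript
// Hypothetical labels for classified findings; the real ledger format may differ.
type Label = "tp" | "fp" | "ambiguous";

interface Tally {
  tp: number;
  fp: number;
  ambiguous: number;
}

// Step 4: count how many findings fall under each label.
function tallyLabels(labels: Label[]): Tally {
  const t: Tally = { tp: 0, fp: 0, ambiguous: 0 };
  for (const label of labels) {
    if (label === "tp") t.tp += 1;
    else if (label === "fp") t.fp += 1;
    else t.ambiguous += 1;
  }
  return t;
}

// Step 5: true positives / (true positives + false positives).
// Ambiguous findings are left out of the denominator.
// Returns null when there are no decided findings to divide by.
function precision(t: Tally): number | null {
  const decided = t.tp + t.fp;
  return decided === 0 ? null : t.tp / decided;
}

// Assumption: ambiguity is ambiguous findings / all findings.
function ambiguityRate(t: Tally): number | null {
  const total = t.tp + t.fp + t.ambiguous;
  return total === 0 ? null : t.ambiguous / total;
}

// Made-up tally: 18 true positives, 2 false positives, 5 ambiguous.
const exampleTally: Tally = { tp: 18, fp: 2, ambiguous: 5 };
console.log(precision(exampleTally), ambiguityRate(exampleTally));
```

Returning `null` for an empty denominator keeps "no decided findings" distinct from "zero precision", which matters when an attempt is recorded regardless of outcome.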

Calibration is the gate for blocking merges. Nothing else.
Recording held-to-bar attempts as prominently as successful ones is the discipline that makes the ledger trustworthy. Anyone can publish wins. Publishing the misses is what proves the bar is real.

## The calibration bar
## The promotion paths

To promote a shape from shipped to calibrated:
A check can promote in one of three ways. Each is defined before measurement; none is invented after the fact.

1. **Pre-register the bar.** Before running the measurement, write down what precision is required for promotion. The bar does not move after the run.
2. **Select a corpus.** The corpus must be production-merged migrations from real open-source projects. Synthetic fixtures do not count.
3. **Run the detector against the corpus.** Record every finding.
4. **Label every finding.** Each finding is manually reviewed and labeled true positive, false positive, or ambiguous by the author.
5. **Compute precision.** True positives / (true positives + false positives).
6. **Record the attempt.** The attempt is recorded in [attempts.jsonl](calibration/attempts.jsonl) regardless of outcome -- including failures and held-to-bar negatives.
7. **Promote or hold.** If precision meets the pre-registered bar, the shape is promoted to calibrated and its severity changes to blocking. If not, the shape stays at shipped/warning and the held-to-bar negative is published.
- **Two-corpus standard.** The check clears the precision threshold on at least two independently pinned corpora, with ambiguity below 50% on each. This is the default path; it shows the check generalizes.
- **Strong-single-corpus.** The check clears the threshold on one corpus with at least 30 findings and ambiguity below 40%. This path exists for shapes whose base rate is naturally low across most corpora — rejecting them outright would hide real signals.
- **Aggregate-rare-signal.** The check is summed across corpora with a tighter precision floor (95%) and a tighter ambiguity cap (25%). This path requires the rubric to declare aggregate evaluation in advance — it cannot be invoked after the data is in.
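
One way to read the three paths as a decision procedure is sketched below. Everything here is illustrative: `CorpusResult`, `PromotionPath`, and `clearsPath` are hypothetical names, "clears the threshold" is assumed to mean greater than or equal, "below" is read as strictly less than, "cap" as less than or equal, "findings" as the sum of all three labels, and ambiguity as ambiguous findings over all findings. The path is passed in as an argument because the text requires it to be chosen before measurement.

```typescript
// Hypothetical per-corpus counts; field names are illustrative.
interface CorpusResult {
  corpus: string;
  tp: number;
  fp: number;
  ambiguous: number;
}

type PromotionPath = "two-corpus" | "strong-single-corpus" | "aggregate-rare-signal";

const findingCount = (r: CorpusResult): number => r.tp + r.fp + r.ambiguous;

// TP / (TP + FP); a corpus with no decided findings is treated as precision 0.
const precisionOf = (r: CorpusResult): number =>
  r.tp + r.fp === 0 ? 0 : r.tp / (r.tp + r.fp);

// Assumption: ambiguous / all findings; an empty corpus counts as fully ambiguous.
const ambiguityOf = (r: CorpusResult): number =>
  findingCount(r) === 0 ? 1 : r.ambiguous / findingCount(r);

// Evaluate the path the rubric declared in advance against the measured results.
function clearsPath(
  path: PromotionPath,
  results: CorpusResult[],
  threshold: number,
): boolean {
  switch (path) {
    case "two-corpus":
      // At least two corpora clear the threshold with ambiguity below 50%.
      return (
        results.filter((r) => precisionOf(r) >= threshold && ambiguityOf(r) < 0.5)
          .length >= 2
      );
    case "strong-single-corpus":
      // One corpus with at least 30 findings and ambiguity below 40%.
      return results.some(
        (r) =>
          findingCount(r) >= 30 &&
          precisionOf(r) >= threshold &&
          ambiguityOf(r) < 0.4,
      );
    case "aggregate-rare-signal": {
      // Sum across corpora, then apply the 95% floor and the 25% cap.
      const sum = results.reduce<CorpusResult>(
        (acc, r) => ({
          corpus: "aggregate",
          tp: acc.tp + r.tp,
          fp: acc.fp + r.fp,
          ambiguous: acc.ambiguous + r.ambiguous,
        }),
        { corpus: "aggregate", tp: 0, fp: 0, ambiguous: 0 },
      );
      return precisionOf(sum) >= 0.95 && ambiguityOf(sum) <= 0.25;
    }
  }
}
```

The function deliberately does not try every path and return the first that passes; doing so would amount to picking the path after the data is in.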

Publishing held-to-bar negatives is the discipline that makes the registry trustworthy. Anyone can publish successes. Publishing failures proves the bar is real.
## The published ledger

## The calibration registry
Three files in this repo:

Three files, all public:
- [calibration/shapes.json](calibration/shapes.json) — every shape Verify ships, its current tier, its severity in the receipt.
- [calibration/corpora.json](calibration/corpora.json) — every corpus referenced by a calibration attempt, with the source repo and the pinned commit SHA.
- [calibration/attempts.jsonl](calibration/attempts.jsonl) — every calibration attempt: shape, corpus, precision, ambiguity, disposition, and the reason for the disposition.

- **[shapes.json](calibration/shapes.json)** -- every shape in the taxonomy, its current tier, its detector status, and its severity in the Action.
- **[corpora.json](calibration/corpora.json)** -- every corpus used for calibration, its sources, its limitations, and its suitability for specific shapes. Includes commit SHAs for reproducibility.
- **[attempts.jsonl](calibration/attempts.jsonl)** -- every calibration attempt. Each line records: the shape, the corpus, the date, the precision, the disposition (promoted or held-to-bar), and the reason.
Detector source and per-finding evidence stay private. The aggregate counts and dispositions are public so the receipt's claims are independently checkable from the ledger alone.
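
A reader checking a claim against the ledger might do something like the sketch below. The field names on `AttemptRow` are guesses taken from the prose description of `attempts.jsonl` above, not from the file itself, and the shape IDs and numbers in the sample are made up for illustration.

```typescript
// Hypothetical row shape, inferred from the description of attempts.jsonl;
// the real file may use different keys or additional fields.
interface AttemptRow {
  shape: string;
  corpus: string;
  precision: number;
  ambiguity: number;
  disposition: string;
  reason: string;
}

// JSONL is one JSON object per line; blank lines are skipped.
function parseAttempts(jsonl: string): AttemptRow[] {
  return jsonl
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0)
    .map((line) => JSON.parse(line) as AttemptRow);
}

// A precision claim counts as backed if some ledger row for that shape
// records exactly that precision.
function claimIsBacked(
  rows: AttemptRow[],
  shape: string,
  claimedPrecision: number,
): boolean {
  return rows.some((r) => r.shape === shape && r.precision === claimedPrecision);
}

// Made-up ledger content, for illustration only.
const sampleLedger = [
  '{"shape":"EX-1","corpus":"corpus-a","precision":0.95,"ambiguity":0.1,"disposition":"promoted","reason":"cleared bar"}',
  '{"shape":"EX-2","corpus":"corpus-a","precision":0.5,"ambiguity":0.3,"disposition":"held-to-bar","reason":"below bar"}',
  "",
].join("\n");

const sampleRows = parseAttempts(sampleLedger);
console.log(claimIsBacked(sampleRows, "EX-1", 0.95));
```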

Per-finding evidence (individual TP/FP labels for each finding) is kept private. The aggregate counts and dispositions are public.
## Reproducing a receipt

## Reproducing a claim
The receipt is byte-deterministic. Identical inputs (scan root, source commit, generated-at timestamp, Action bundle version) produce a byte-identical artifact.

Every calibrated shape has a reproducibility section in [MEASURED-CLAIMS.md](scripts/mvp-migration/MEASURED-CLAIMS.md) that tells you how to re-run the measurement yourself. The corpus sources are public repositories. The detector source is readable. The schema replay logic is in [schema-loader.ts](scripts/mvp-migration/schema-loader.ts).
```
git clone <repo> && cd <repo> && git checkout <commit>
bun scripts/iac/change-receipt/cli.ts . \
--out .verify --repo owner/name --pr 123 --source-commit <sha>
```

If you get different numbers, [open an issue](https://github.com/Born14/verify/issues). The claim is falsifiable by design.
If you get a different digest, the inputs differ — file an issue with what you changed.
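
To illustrate why pinning every input matters, here is a minimal sketch of a digest over a canonical serialization. It is not the Action's implementation: the `canonicalize` and `receiptDigest` helpers and the receipt fields in the example are hypothetical, and the sorted-key serialization is one assumed way to make the bytes independent of property order. Only `createHash` from Node's built-in `node:crypto` module is a real API here.

```typescript
import { createHash } from "node:crypto";

// Serialize JSON-compatible values with object keys sorted, so that
// property insertion order cannot change the resulting bytes.
function canonicalize(value: unknown): string {
  if (Array.isArray(value)) {
    return "[" + value.map(canonicalize).join(",") + "]";
  }
  if (value !== null && typeof value === "object") {
    const obj = value as Record<string, unknown>;
    const keys = Object.keys(obj).sort();
    return (
      "{" +
      keys.map((k) => JSON.stringify(k) + ":" + canonicalize(obj[k])).join(",") +
      "}"
    );
  }
  return JSON.stringify(value);
}

// SHA-256 over the canonical bytes, as a lowercase hex string.
function receiptDigest(receipt: unknown): string {
  return createHash("sha256").update(canonicalize(receipt), "utf8").digest("hex");
}

// Hypothetical receipt inputs; the field names are illustrative.
const receiptA = { sourceCommit: "abc123", generatedAt: "2026-01-01T00:00:00Z", findings: [] };
const receiptB = { findings: [], generatedAt: "2026-01-01T00:00:00Z", sourceCommit: "abc123" };
const receiptC = { ...receiptA, generatedAt: "2026-01-02T00:00:00Z" };

console.log(receiptDigest(receiptA) === receiptDigest(receiptB)); // same inputs, same digest
console.log(receiptDigest(receiptA) === receiptDigest(receiptC)); // timestamp differs, digest differs
```

Because the generated-at timestamp is part of the hashed content in this sketch, a rerun only matches when the timestamp is supplied rather than taken from the clock, which is consistent with it being listed among the inputs above.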

## What Verify is not

- **Not a security scanner.** Verify does not check for SQL injection, secrets, or vulnerabilities.
- **Not a code reviewer.** Verify does not read application code or evaluate logic.
- **Not a linter.** Verify does not check SQL style or formatting.
- **Not a migration runner.** Verify does not execute migrations. It parses them statically.
- Not a security scanner. Verify does not check for secrets, vulnerabilities, or runtime cloud state.
- Not a code reviewer. Verify does not read application code or evaluate logic.
- Not a linter. Verify does not check style.
- Not a complete-coverage tool. Verify checks a small set of calibrated shapes and names everything else explicitly in the receipt's "Not checked" block.

Verify checks one thing: whether a database migration is structurally safe to run against a production schema. It checks this deterministically, publishes its precision, and lets you verify the claim yourself.
The product is the receipt: a short, honest record of what Verify did and didn't do, pinned to a digest you can verify yourself.