Skip to content
Draft
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
139 changes: 139 additions & 0 deletions .agents/skills/ln-witness/SKILL.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,139 @@
---
name: ln-witness
description: "Audit a test suite for what it actually proves — attribute tests to behavioral kernels, place each on the progressive-checkability ladder, surface unwitnessed proof obligations, and generate contrastive rivals the tests fail to rule out. Use when a slice has tests but verification confidence is unclear, when tests pass but the spec feels under-witnessed, or when the user asks what these tests prove."
argument-hint: "[test files, directory, or frontier item to audit; include relevant invariants or kernels if known]"
---

# Ln Witness

Audit a test suite for evidentiary strength, not line coverage.

A passing test is not yet a witness. A witness is a test that exercises a named claim under the mutations its kernel cares about, sits on a known rung of the progressive-checkability ladder, and rules out at least one plausible rival interpretation. `ln-witness` makes those three properties explicit and surfaces where they are missing.

Sibling skills:
- [`ln-oracles`](../ln-oracles/SKILL.md) chooses verification strategy *before* tests exist; `ln-witness` audits what the resulting tests actually prove.
- [`ln-design`](../ln-design/SKILL.md) forces rival module shapes into view before commitment; `ln-witness` forces rival *interpretations* into view to expose witness gaps. Same epistemic stance, different artifact.
- [`ln-review`](../ln-review/SKILL.md) audits code shape; `ln-witness` audits test evidence.

Do not create standalone audit documents without explicit permission. Findings reconcile back into `memory/SPEC.md` (invariants, blind spots), `memory/PLAN.md` (frontier verification notes), or the active scope card.

Read the [witness rubric](assets/witness-rubric.md) before starting. It defines the progressive-checkability ladder and the proof obligations per behavioral kernel.

## Input

What to audit: $ARGUMENTS

Read in this order:
1. `memory/SPEC.md` — invariants, acceptance criteria, verification design, and any §Active Kernels notes.
2. `memory/PLAN.md` — frontier definitions whose verification notes reference the tests under audit.
3. `docs/design/BEHAVIORAL_KERNELS.md` — the kernel taxonomy, proof obligations, and example test shapes the audit will measure against.
4. The test files themselves and the code under test.

If no intent graph or kernel annotation exists yet, the audit drops to heuristic mode: infer active kernels from test names, assertions, and code under test, and flag this softness in the report.

## Procedure

This is an **interactive process**. Present each step's findings and grill the user before moving on. Do not produce a finished audit in one pass.

### 1. Identify active kernels

For the tests in scope, name which of the fifteen behavioral kernels appear to be exercised. Use the signal-phrase routing table from `BEHAVIORAL_KERNELS.md` against test names, assertion targets, and the code under test. Keep the active set small (typically two to four kernels per scope).

**Grill**: Are these the kernels the spec considers active for this scope, or has the test suite drifted toward a different set than the spec intends? If the spec is silent on active kernels, is this an opportunity to promote them into `memory/SPEC.md`?

### 2. Audit mode — attribute and rate

For each test in scope, fill three columns:

| Column | Question |
| --- | --- |
| **Kernel** | Which kernel(s) does this test probe? `none` is a valid answer (incidental regression catcher). |
| **Witnessed claim** | Which `memory/SPEC.md` invariant, criterion, or example item does this test witness? `none` is valid but should be flagged. |
| **Ladder rung** | Where on the [progressive-checkability ladder](assets/witness-rubric.md) does this test sit? (positive example → counterexample → regression → property → runtime contract → state-machine rule → invariant → proof obligation) |

Present the table to the user. Tests with `kernel: none` or `witnessed claim: none` are either incidental or symptoms of spec gaps; both deserve a sentence of justification.

**Grill**: For tests stuck at the "positive example" rung, ask: is this the appropriate strength, or should it be promoted to a property/contract? For tests with no witnessed claim, ask: should the claim be added to `memory/SPEC.md`, or is this test load-bearing only as a regression net?

### 3. Audit mode — unwitnessed obligations

For each active kernel, list its canonical proof obligations from the [rubric](assets/witness-rubric.md) and mark which are covered by at least one test at the appropriate rung. Examples:

- Containment & topology: `add / move / delete / reorder preserves topology` — four obligations.
- State & lifecycle: `every state reachable / every transition exercised / terminal states are sinks / forbidden transitions rejected` — four obligations.
- Authority & capability: `permitted action succeeds / forbidden action rejected / delegated capability flows / revocation propagates` — four obligations.

A kernel with three of four obligations covered is an honest report; a kernel with one of four covered is a finding.

**Grill**: For each gap, ask: is this an acceptable deferral (cost exceeds value, deferred to outer loop, not currently in scope) or a real blind spot? What would trigger writing the missing test?

### 4. Rivalry mode — contrastive alternatives

This is where `ln-witness` inherits the design-it-twice DNA. For each *witnessed* invariant, generate two to four plausible rival interpretations the test suite would also satisfy. Borrow the contrastive-question shapes from `BEHAVIORAL_KERNELS.md` §Contrastive questions.

Worked shape:

```
Invariant: Deleting a project archives its tasks.

Tests witness this by asserting: after delete, tasks.status === 'archived'.

Rivals the tests fail to rule out:
A. Tasks are archived only at the moment of delete; later mutations to
the deleted project's tasks are silently accepted.
B. Archived tasks remain editable through the API even though the UI
hides them.
C. Archive is a soft delete; a second delete on the project hard-deletes
the tasks without warning.

Discriminating tests:
→ For A: mutate an archived task after parent delete; assert rejection.
→ For B: attempt PATCH on archived task; assert 403 or equivalent.
→ For C: double-delete the project; assert second call is idempotent.
```

Present the rivals to the user. Each rival is one of: **close** (write the discriminating test), **accept** (mark the invariant as under-witnessed with explicit scope), or **escalate** (the rival reveals real spec ambiguity — route back to `ln-disambiguate` or `ln-spec`).

**Grill**: For each rival, ask: is this a plausible interpretation in this domain, or a strawman? Plausibility matters — rivalry mode loses value if the rivals are not interpretations a reasonable reader would actually entertain.

### 5. Reconcile findings

Aggregate the audit into three buckets:

- **Strong witnesses** — tests at property/contract/invariant rung tied to named claims with no plausible uneliminated rivals.
- **Weak witnesses** — tests at example/regression rung, or tied to claims with uneliminated rivals, where promotion is feasible and worth it.
- **Honest gaps** — unwitnessed obligations and uneliminated rivals the user explicitly accepts as deferrals, with revisit triggers.

A test suite with zero weak witnesses or zero honest gaps is either trivial or dishonest.

## Output

Present the audit as a structured report in chat. Do not write a standalone audit document unless the user explicitly asks for one.

Update durable docs only where findings warrant:

- **`memory/SPEC.md`** — add or strengthen invariants the audit surfaced; record acknowledged blind spots in §Verification Design with revisit triggers.
- **`memory/PLAN.md`** — refresh `Verification` annotations on affected frontier items; queue follow-up frontier items for closing significant gaps.
- **Active scope card** — note discriminating tests to write in the current slice if they fit; do not let audit findings silently widen the slice.

### Cross-reference integrity

After writing, verify:
- Every promoted invariant has at least one named witnessing test
- Every acknowledged blind spot has a revisit trigger
- No rival marked **escalate** is left without a routing recommendation

## Routing

After presenting the audit, present these options to the user (use `tool-ask-question`):

| # | Label | Target | Why |
| --- | ---------------------- | ----------------- | ------------------------------------------------------------ |
| 1 | Close gaps now | `ln-scope` | Audit surfaced discriminating tests worth a focused slice |
| 2 | Disambiguate spec | `ln-disambiguate` | A rival revealed real ambiguity in intent |
| 3 | Revise spec | `ln-spec` | Audit promoted invariants or new blind spots into the spec |
| 4 | Revisit oracle strategy| `ln-oracles` | Gaps suggest the verification design itself is under-powered |
| 5 | Refactor tests | `ln-refactor` | Tests are correctly aimed but structurally weak |
| 6 | Back to triage | `ln-consult` | Findings reshape direction; reassess |

Recommended: **1** if gaps are local and cheap; **2** if a rival surfaced spec ambiguity; **4** if multiple kernels show systematic under-witnessing.
121 changes: 121 additions & 0 deletions .agents/skills/ln-witness/assets/witness-rubric.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,121 @@
# Witness Rubric

The rubric `ln-witness` measures tests against. Two parts: the **progressive-checkability ladder** (how strong is this test as evidence?) and the **per-kernel proof obligations** (what must be witnessed for each active kernel?).

Source synthesis: `docs/design/BEHAVIORAL_KERNELS.md` (kernel taxonomy, example tests), `archive/docs/archive/design/INTENT_SPEC_EVOLUTION.md` §2 (Progressive Checkability).

## Progressive-checkability ladder

A test's evidentiary strength is named by its rung, not scored 0–100. Higher rungs subsume lower ones; do not double-count.

| # | Rung | What it proves | Typical shape |
| --- | --- | --- | --- |
| 1 | **Positive example** | One concrete input produces an expected output. | `expect(f(x)).toBe(y)` |
| 2 | **Counterexample** | One concrete input is correctly rejected. | `expect(() => f(bad)).toThrow()` |
| 3 | **Regression test** | A previously-broken case stays fixed. Same shape as 1 or 2 plus provenance. | Named after the bug/issue it captures |
| 4 | **Property test** | A relation holds across many generated inputs (round-trip, idempotence, metamorphic, invariant, model-based). | `fc.assert(fc.property(...))` |
| 5 | **Runtime contract** | A predicate is enforced at the boundary every time the boundary is crossed in production, not only in tests. | `assert`, schema validation at I/O, type-narrowing guard |
| 6 | **State-machine rule** | A transition is permitted iff its guard holds; forbidden transitions are rejected by construction. | State-machine library, exhaustive transition test |
| 7 | **Invariant** | A property holds across *all* reachable states and mutations, enforced structurally (types, schema, encapsulation). | "Cannot construct an invalid X" — illegal states unrepresentable |
| 8 | **Proof obligation** | A formal property a verifier (Dafny / Lean / TLA+ / property-based with adversarial generators) discharges. | Discharged spec, model-checked transition system |

### Reading the ladder

- A test at rung 1 is not a defect — most tests live at rungs 1–3. The question is whether the *claim* deserves a higher rung. A claim labeled "invariant" in `memory/SPEC.md` witnessed only at rung 1 is the gap to surface.
- Rungs 5–7 are about *structural enforcement*: the system makes the bad state hard or impossible to reach, not just detected after the fact. Promotion from 4 to 5 is often the highest-leverage move.
- Rung 8 is rare in product code. It belongs in the audit only when the spec explicitly names a `proof_candidate` claim.

## Per-kernel proof obligations

For each behavioral kernel, the canonical mutations a sufficient test suite must witness. Drawn from `BEHAVIORAL_KERNELS.md` Proof obligations / Example tests sections; phrased as binary checks the audit can apply.

### Structural

**Identity & reference**
- Every entity has a stable identifier preserved across mutations.
- Dangling references are rejected at the boundary that creates them.
- Reference equality is distinguishable from value equality where it matters.

**Containment & topology**
- `add` preserves the topological invariant (acyclicity, single-parent, ordering, uniqueness — whichever apply).
- `move` preserves the invariant; an item appears in exactly one place after move.
- `delete` preserves the invariant; the declared cascade policy fires.
- `reorder` preserves the invariant; order is well-defined.

**Validation & normalization**
- Valid inputs accepted; canonical form is what reaches downstream code.
- Invalid inputs rejected at the boundary with a diagnosable error.
- Equivalent-but-non-canonical inputs normalize to the same canonical form (round-trip).

### Behavioral

**State & lifecycle**
- Every declared state is reachable from the initial state.
- Every declared transition is exercised under its guard.
- Terminal states are sinks (no transition leaves them, or only declared escape transitions do).
- Forbidden transitions are rejected, not silently no-op'd.

**Temporal history**
- Operations declared monotonic do not regress.
- Undo restores the prior state exactly; redo restores the post-state exactly.
- Audit / expiration policies fire on the declared schedule.

**Optimization & preference**
- The chosen outcome is valid (satisfies constraints) before it is optimal.
- Tie-breaking is deterministic and matches the declared rule.

### Multi-actor

**Authority & capability**
- Permitted action by an authorized actor succeeds.
- Same action by an unauthorized actor is rejected (not silently dropped).
- Delegated capability flows to the delegatee; bounded scope is enforced.
- Revocation propagates; previously-permitted actions are now rejected.

**Concurrency & collaboration**
- Concurrent compatible operations both succeed and converge to the same state.
- Concurrent conflicting operations resolve per the declared policy (LWW / FWW / conflict surface / merge), not by accident.
- Stale operations (based on outdated state) are detected and handled per policy.

### System

**Transactions & atomicity**
- A multi-object update either lands entirely or not at all.
- Partial failure leaves no observable intermediate state.
- Concurrent transactions do not interleave observably.

**Resource accounting**
- Conservation holds: sum of accounts before equals sum after, for every operation that claims to conserve.
- Limits are enforced at the boundary that creates demand, not only at observation time.
- Capacity exhaustion is rejected with a diagnosable error, not silently dropped.

**Derived data & views**
- After any source mutation, the derived view reflects it within the declared freshness window.
- Cache / index / projection cannot diverge from source without detection.

**Error & recovery**
- Declared retry policy fires under the declared trigger.
- Compensation runs when rollback is impossible (external effect already issued).
- Degraded mode is reachable and exits when health returns.

**External effects**
- Outbound call shape matches the contract the boundary declares.
- Inbound payload is validated before reaching domain code.
- Side effects are at-least-once / at-most-once / exactly-once per the declared guarantee.

### Evolution

**Change & migration**
- Old-format data round-trips through the migration without loss.
- Forward and backward compatibility hold for the declared compatibility window.

**Observability & evidence**
- Every operation the spec declares auditable produces a log/event with sufficient provenance.
- Logs do not contain prohibited content (secrets, PII per declared policy).

## How `ln-witness` uses this rubric

1. **Audit mode** maps each test to a ladder rung (column 3 of the audit table) and to the kernel obligations it satisfies (step 3 of the procedure).
2. **Rivalry mode** uses each kernel's obligations as the source of contrastive scenarios: a missing obligation often *is* the rival the tests fail to rule out, expressed as a discriminating scenario rather than as a checklist line.

A kernel obligation may be acceptably unwitnessed — but only with an explicit note in `memory/SPEC.md` §Verification Design saying *why* and *when to revisit*. The rubric does not produce a score; it produces a structured set of named gaps the user has to either close or knowingly defer.
1 change: 1 addition & 0 deletions archive/docs/archive/PLAN_HISTORY.md
Original file line number Diff line number Diff line change
Expand Up @@ -191,3 +191,4 @@ Archived out of `memory/PLAN.md` during `ln-sync` so the live plan keeps only th
- [2026-05-08] FE-698 Anthropic scenario adapter — Added a probe-only Anthropic AI SDK adapter behind the existing `PromptScenarioModelAdapter` seam. Web-research prompt scenarios now map rendered prompts to AI SDK system content and rendered context packs to user prompt content under mocked tests, with unsupported providers rejected before model construction. Verified: `npm run verify`. Watch: this is not the shared AI runtime provider seam; OpenRouter/provider-neutral routing, credential UX, Pi, web tools, CLI/UI, persistence, and Brunch mutations remain out of scope.
- [2026-05-08] FE-698 prompt scenario execution probe — Web-research prompt scenarios can now execute through an injected fakeable model adapter and serialize `succeeded` / `failed` execution results with raw output or deterministic error text, while no-provider artifacts remain deterministic `not-run` snapshots. Structured parsing is explicitly `not-applicable` for this prose-only web-research path. Verified: `npm run verify`. Watch: real provider adapters, Pi, web tools, CLI/UI, persistence, and mutating Brunch handlers remain out of scope for this foundation slice.
- [2026-05-06] Multi-chat substrate + reconciliation needs (FE-697) — `chat` table with one interview chat per spec, nullable `turn.chat_id`, `specification.primary_chat_id`, mirrored `chat.active_turn_id`, plus the `reconciliation_need` queue with directed source/target items, narrow `kind`/`status`, partial unique index on open rows, cascade FK. Spec creation inserts spec + interview chat in one transaction; `advanceHead` is transactional. No user-visible change. Verified: `npm run verify` (673 tests) plus manual fixture playback (39 specs / 81 turns / dual-pointer equivalence). A82 / A83 validated for Phase 1.
- 2026-05-20 — **Pre-POC archive and reseed** — razed pre-POC implementation, archived legacy docs and planning memory under `archive/`, tagged `next-baseline`, and reseeded `memory/SPEC.md` and `memory/PLAN.md` from the three canonical POC architecture docs. Phase 3 infra bootstrap was folded into `walking-skeleton` rather than remaining an independent frontier.
Loading
Loading