Merged
1 change: 1 addition & 0 deletions README.md
@@ -303,6 +303,7 @@ Kapi uses Pi extension surfaces as thin safety rails rather than a separate orch
- `docs/ilchul-naming-policy.md` — product naming, compatibility, and active `.ilchul` storage policy.
- `docs/ilchul-runtime-config.md` — design contract for `.ilchul/` runtime layout, adapter config defaults, worker retention states, and safe cleanup boundaries.
- `docs/learning-runtime-boundaries.md` — design contract for future `RunState`, objective, policy selection, task graph, worker runtime, evidence/evaluation, integration/repair, and reward-ledger boundaries.
- `docs/learning-runtime-verification-matrix.md` — verification matrix for schema/events/DAG/claims/workers/policy/reward/integration/retention/storage readiness before learning-runtime default claims.
- `docs/ralph-live-qa.md` — operator live QA checklist for proving `/kapi-ralph` start, planning, approval, build, evidence, closeout, and resume behavior in a real Pi/Kapi runtime.
- `skills/kapi-workflow/SKILL.md` — active-workflow behavior reminders for agents.
- `prompts/` — Kapi prompt resources exposed to Pi.
118 changes: 118 additions & 0 deletions docs/learning-runtime-verification-matrix.md
@@ -0,0 +1,118 @@
# Learning runtime verification matrix

Issue: #191
Parent: #167

This matrix defines the evidence required before Ilchul claims the objective-driven learning parallel runtime is implementation-ready or changes runtime defaults. It is a design contract, not a CI policy flip.

## Scope and rule

The runtime is considered ready only when every MVP invariant maps to at least one unit, integration, E2E smoke, or failure-mode check. Design-only issues can close with documented acceptance evidence, but runtime implementation issues must point at executable tests, fixtures, or smoke records.

Verification layers:

1. **Unit** — pure domain functions, schema parsers, state/event reducers, deterministic policy math.
2. **Integration** — service or adapter seams with fake stores/substrates; no real agent authority required.
3. **E2E smoke** — one local substitute or fake-worker run that exercises the full runtime path and records evidence refs.
4. **Failure mode** — fail-closed behavior for malformed state, stale claims, missing evidence, unsafe storage, and conflicts.
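
The readiness rule above can be sketched as a coverage check. This is a minimal illustration, assuming a flat list of invariants and checks; the `Layer`, `Check`, and `uncoveredInvariants` names and the invariant labels are hypothetical and not part of the Ilchul codebase.

```typescript
// Hypothetical names throughout; the matrix does not define these types.
type Layer = "unit" | "integration" | "e2e-smoke" | "failure-mode";

interface Check {
  invariant: string; // MVP invariant the check covers
  layer: Layer;
  ref: string; // test file, fixture, or smoke record
}

// Returns the invariants that no check covers; an empty result means every invariant is mapped.
function uncoveredInvariants(invariants: string[], checks: Check[]): string[] {
  const covered = new Set(checks.map((c) => c.invariant));
  return invariants.filter((label) => !covered.has(label));
}

const invariants = ["schema fails closed", "claims are exclusive", "evidence gates completion"];
const checks: Check[] = [
  { invariant: "schema fails closed", layer: "unit", ref: "test/runtime-state.test.ts" },
  { invariant: "claims are exclusive", layer: "failure-mode", ref: "test/task-graph.test.ts" },
];

// One invariant has no check, so a readiness claim would stay blocked.
const missing = uncoveredInvariants(invariants, checks);
```

Here `missing` contains only `"evidence gates completion"`, which is the kind of gap the matrix is meant to surface before any default changes.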

## Unit test matrix

| Runtime area | Required unit coverage | Existing or target surface |
|---|---|---|
| Runtime schema validation | `RuntimeState.schemaVersion`, enum fields, optional nested records, and artifact/evidence refs are validated; unknown newer versions fail closed. | `test/runtime-state.test.ts` |
| Runtime event replay | event envelope validation, monotonic `seq`, exact duplicate idempotency, conflicting duplicate rejection, and sealed-run mutation boundary. | `test/runtime-events.test.ts` |
| Phase and preset contracts | phase preset schemas, graph-execution boundaries, thin phase outputs, and side-effect request separation. | `test/phase-preset.test.ts`, `test/graph-execution-components.test.ts` |
| TaskGraph readiness | duplicate ids, missing dependencies, cycles, topological order, explicit ready transition, downstream block after dependency failure, and repair supersession. | `test/task-graph.test.ts` |
| Claim and lease ownership | token creation, duplicate active ownership rejection, lease renewal, release, expiry, explicit recovery, and completion with matching unexpired claim. | `test/task-graph.test.ts` |
| Worker execution state | readiness nonce, claimed-task dispatch, duplicate dispatch rejection, heartbeat refresh, stale projection, structured report capture, and evidence-gated completion. | `test/task-graph.test.ts` |
| Policy simulation determinism | stable policy ids, deterministic scores for identical inputs, invalid objective rejection, exploration caps, blocked policy ids, and human override trails. | `test/policy-selector.test.ts` |
| Objective/evaluation guardrails | explicit success/failure/repair criteria, metric direction, non-finite score rejection, anti-Goodhart flags, and advisory-only quality output. | `test/objective.test.ts`, `test/quality-probe-matrix.test.ts` |
| RewardRecord and PolicyHint | prediction-vs-actual records, penalty taxonomy, calibration refs, advisory `PolicyHint`, and no silent policy mutation. | policy/reward unit fixture target |
| IntegrationCandidate and repair | candidate refs, dry-run/conflict state, superseded tasks, repair budget, and repair evidence requirements. | post-MVP target from #195/#190 |
| Retention and safe close | worker statuses `completed-retained`, `safe-to-close`, `stale-registry`, `cleanup-released`, and `closed`; no destructive cleanup from validation alone. | runtime/worker state tests |
| `.ilchul` compatibility | active storage root, unsafe `.kapi` mutation absence, artifact root validation, and compatibility diagnostics. | storage/config tests |
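
As an illustration of the "Runtime event replay" row, the following sketch shows one way a reducer could enforce monotonic `seq`, exact-duplicate idempotency, conflicting-duplicate rejection, and the sealed-run boundary. Everything except the `seq` field is an assumption: the envelope shape, the `run.sealed` event name, and the JSON-equality duplicate test (which assumes stable key order) are hypothetical.

```typescript
// Hypothetical event envelope; only `seq` is named in the matrix.
interface RuntimeEvent {
  seq: number;
  type: string;
  payload: Record<string, unknown>;
}

interface ReplayState {
  lastSeq: number;
  sealed: boolean;
  applied: RuntimeEvent[];
}

type ReplayResult =
  | { kind: "ok"; state: ReplayState }
  | { kind: "rejected"; reason: string };

function replay(events: RuntimeEvent[]): ReplayResult {
  let state: ReplayState = { lastSeq: 0, sealed: false, applied: [] };
  for (const ev of events) {
    const prior = state.applied.find((e) => e.seq === ev.seq);
    if (prior) {
      // Exact duplicates are skipped; a different body at the same seq is a conflict.
      if (JSON.stringify(prior) === JSON.stringify(ev)) continue;
      return { kind: "rejected", reason: `conflicting duplicate at seq ${ev.seq}` };
    }
    if (!Number.isInteger(ev.seq) || ev.seq <= state.lastSeq) {
      return { kind: "rejected", reason: `non-monotonic seq ${ev.seq}` };
    }
    if (state.sealed) {
      return { kind: "rejected", reason: "mutation after seal" };
    }
    state = {
      lastSeq: ev.seq,
      sealed: ev.type === "run.sealed", // assumed event name
      applied: [...state.applied, ev],
    };
  }
  return { kind: "ok", state };
}
```

Returning a rejection instead of a partially updated state keeps the caller's prior snapshot intact, which matches the fail-closed intent of the table.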

## Integration test matrix

| Integration seam | Required coverage | Minimum fixture |
|---|---|---|
| RunState + EventStore | commit transition writes durable intent before side-effect execution and replay reconstructs current snapshot. | temp `.ilchul` workspace with snapshot + events fixture |
| RunOrchestrator + GateEngine + Verifier | HardInvariantGate, PhasePresetGate, and RunObjectiveGate deny mutation on failure and record blocker evidence. | fake verifier returning pass/block/repair/human-decision |
| TaskGraph + worker substrate | two independent claimed tasks dispatch through a fake substrate and update worker/task state from reports only. | two fake workers with readiness nonces and evidence refs |
| Adapter contract | Codex, Pi, and Claude Code compatibility assumptions stay behind the `AgentAdapter` / `ExecutionSubstrate` contract. | fake adapter matrix; no real agent required |
| Evidence extraction | reports/logs/test output/diff/artifact refs produce bounded `EvidenceRef` values; missing refs deny completion. | synthetic report bundle with one valid and one missing ref |
| Policy/evaluation/reward | selected policy emits prediction id, evaluation records actual result, reward records delta, but no future policy changes happen without `policy.selected`. | policy-selector fixture plus reward-ledger fixture |
| Integration dry-run | clean candidate, conflict candidate, and repair candidate remain refs until an explicit integration gate passes. | fake candidate refs and conflict matrix |
| Retention lifecycle | terminal runs preserve retained worker inspection data and only mark safe close through explicit retention state. | fake worker registry with retained and stale handles |
| Storage compatibility | `.ilchul` runtime files are read/written under validated roots; legacy `.kapi` is never deleted or silently migrated. | temp workspace containing both roots |
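
The gate row above can be sketched with a fake verifier. This is an illustrative fixture only: the `Gate`, `GateOutcome`, `evaluateGates`, and `commitIfAllowed` names are hypothetical, and the real `GateEngine` contract may differ.

```typescript
// Verdicts mirror the fake verifier in the table; all type names are hypothetical.
type Verdict = "pass" | "block" | "repair" | "human-decision";

interface Gate {
  name: string;
  verify: () => Verdict;
}

interface GateOutcome {
  allowed: boolean;
  blockers: { gate: string; verdict: Verdict }[];
}

// Evaluates every gate so all blockers are recorded, then allows mutation only if all passed.
function evaluateGates(gates: Gate[]): GateOutcome {
  const blockers = gates
    .map((gate) => ({ gate: gate.name, verdict: gate.verify() }))
    .filter((entry) => entry.verdict !== "pass");
  return { allowed: blockers.length === 0, blockers };
}

// Applies the mutation only when the gates allow it; otherwise the state is returned untouched.
function commitIfAllowed<S>(
  state: S,
  gates: Gate[],
  mutate: (s: S) => S,
): { state: S; outcome: GateOutcome } {
  const outcome = evaluateGates(gates);
  return { state: outcome.allowed ? mutate(state) : state, outcome };
}

const gates: Gate[] = [
  { name: "HardInvariantGate", verify: () => "pass" },
  { name: "PhasePresetGate", verify: () => "repair" },
  { name: "RunObjectiveGate", verify: () => "pass" },
];

// The repair verdict denies the mutation and is recorded as blocker evidence.
const attempt = commitIfAllowed({ phase: "plan" }, gates, () => ({ phase: "build" }));
```

In this run `attempt.state.phase` stays `"plan"` and `attempt.outcome.blockers` names `PhasePresetGate`, so an integration test can assert on both the denied mutation and the recorded blocker.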

## E2E smoke path

A minimal runtime-readiness smoke must prove this path without relying on narrative agent claims:

1. Start from an approved `RunObjective` with success criteria, failure criteria, repair criteria, and constraints.
2. Select a deterministic policy and record `policy.selected` with prediction metadata.
3. Create a concrete `TaskGraph` with at least two independent ready branches and one downstream join task.
4. Claim both ready branches with lease tokens and dispatch them to fake or local-substitute workers.
5. Record worker readiness nonces, heartbeats, and structured reports.
6. Complete branch tasks only when matching unexpired claims and `EvidenceRef` values exist.
7. Evaluate the objective, record `EvaluationResult`, and emit a `RewardRecord` / advisory `PolicyHint`.
8. Produce an `IntegrationCandidate` or explicit post-MVP skip reason.
9. Seal the run with snapshot, event replay check, retained worker state, and closeout evidence.
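
Steps 3 through 6 can be sketched with an in-memory graph. This is a minimal sketch under stated assumptions: the `Task`, `Graph`, `projectReady`, `claim`, and `complete` names are hypothetical, timestamps are plain numbers, and the graph is assumed to be already validated (no missing dependencies or cycles).

```typescript
// Hypothetical shapes; the real TaskGraph, claim, and EvidenceRef types may differ.
type TaskStatus = "pending" | "ready" | "in_progress" | "completed";

interface Task {
  id: string;
  dependsOn: string[];
  status: TaskStatus;
  claimToken?: string;
  leaseExpiresAt?: number;
  evidenceRefs: string[];
}

type Graph = Record<string, Task>;

// Two independent branches and one downstream join task, as in step 3.
function smokeGraph(): Graph {
  return {
    a: { id: "a", dependsOn: [], status: "pending", evidenceRefs: [] },
    b: { id: "b", dependsOn: [], status: "pending", evidenceRefs: [] },
    join: { id: "join", dependsOn: ["a", "b"], status: "pending", evidenceRefs: [] },
  };
}

// Marks pending tasks ready once every dependency has completed.
function projectReady(graph: Graph): void {
  for (const id of Object.keys(graph)) {
    const task = graph[id];
    if (task.status === "pending" && task.dependsOn.every((dep) => graph[dep].status === "completed")) {
      task.status = "ready";
    }
  }
}

// Only ready tasks can be claimed, so a second claim on the same task is rejected.
function claim(graph: Graph, id: string, token: string, now: number, leaseMs: number): boolean {
  const task = graph[id];
  if (task.status !== "ready") return false;
  task.status = "in_progress";
  task.claimToken = token;
  task.leaseExpiresAt = now + leaseMs;
  return true;
}

// Completion needs the matching token, an unexpired lease, and at least one evidence ref.
function complete(graph: Graph, id: string, token: string, now: number, evidenceRefs: string[]): boolean {
  const task = graph[id];
  const leaseValid = task.leaseExpiresAt !== undefined && now < task.leaseExpiresAt;
  if (task.status !== "in_progress" || task.claimToken !== token || !leaseValid || evidenceRefs.length === 0) {
    return false;
  }
  task.status = "completed";
  task.evidenceRefs = evidenceRefs;
  return true;
}
```

A smoke test built on this shape would call `projectReady`, claim both branches, complete them with evidence refs, and call `projectReady` again to confirm the join task only becomes ready after both branches complete.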

Minimum E2E evidence bundle:

- `state.json` or equivalent runtime snapshot;
- `events.jsonl` with replayable event sequence;
- worker report fixture(s) with evidence refs;
- objective/evaluation artifact;
- reward-ledger or explicit shallow-first reward fixture;
- integration dry-run report or explicit post-MVP skip reason;
- verification command output.
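
The bundle list above could be checked mechanically. The sketch below assumes a flat manifest that maps each required entry to a path or an explicit skip reason; the key names, the `Manifest` type, and `missingArtifacts` are hypothetical.

```typescript
// Keys follow the bundle list above; the manifest shape itself is an assumption.
const REQUIRED_ARTIFACTS = [
  "snapshot",
  "events",
  "workerReports",
  "objectiveEvaluation",
  "reward",
  "integration",
  "verificationOutput",
] as const;

type ArtifactKey = (typeof REQUIRED_ARTIFACTS)[number];

// Each value is a path or, where the list allows it, an explicit skip reason.
type Manifest = Partial<Record<ArtifactKey, string>>;

// Returns the required entries that are absent or blank.
function missingArtifacts(manifest: Manifest): ArtifactKey[] {
  return REQUIRED_ARTIFACTS.filter((key) => {
    const value = manifest[key];
    return value === undefined || value.trim() === "";
  });
}

const manifest: Manifest = {
  snapshot: "state.json",
  events: "events.jsonl",
  workerReports: "reports/",
  objectiveEvaluation: "evaluation.json",
  reward: "shallow-first reward fixture",
  integration: "skip: post-MVP",
};

// verificationOutput is absent, so the bundle is incomplete.
const bundleGaps = missingArtifacts(manifest);
```

Here `bundleGaps` contains only `"verificationOutput"`; a smoke run would treat any non-empty result as a reason not to seal.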

## Failure-mode matrix

| Failure mode | Required expected behavior |
|---|---|
| Unknown newer schema version | Reject/fail closed; no downgrade mutation. |
| Malformed event or conflicting duplicate | Reject replay/commit; preserve prior snapshot. |
| Missing dependency or cycle | Reject graph validation; no readiness projection. |
| Failed dependency | Block downstream tasks unless an explicit repair supersedes it. |
| Duplicate active claim | Reject second claim; preserve original owner and lease. |
| Expired lease completion | Reject completion and require explicit recovery. |
| Stale worker heartbeat | Mark worker unhealthy/stale without completing or deleting the task. |
| Late duplicate worker report | Reject reports for non-`in_progress` tasks without mutating terminal state. |
| Missing evidence refs | Reject task, phase, and run completion. |
| Non-deterministic policy selector | Fail test; identical inputs must produce identical selected policy and score. |
| Non-finite reward/evaluation values | Reject record serialization and policy scoring. |
| Conflict during integration dry-run | Record blocked/repair state; no hidden merge or branch mutation. |
| Unsafe artifact or storage root | Refuse read/write; produce diagnostic blocker. |
| Retained or stale worker handle | Preserve inspectability; no cleanup unless explicit safe-close policy applies. |
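
Two of the rows above, non-deterministic selection and non-finite values, can be exercised with a selector like the following. This is a sketch, not the project's selector: `PolicyCandidate`, `Selection`, and `selectPolicy` are hypothetical names, and the tie-break on policy id is one assumed way to make the result independent of input order.

```typescript
// Hypothetical policy shape; scores are illustrative only.
interface PolicyCandidate {
  id: string;
  score: number;
}

type Selection =
  | { kind: "selected"; id: string; score: number }
  | { kind: "rejected"; reason: string };

// Deterministic for identical inputs: ties break on policy id, and the input array is not mutated.
function selectPolicy(candidates: PolicyCandidate[], blockedIds: string[]): Selection {
  const bad = candidates.find((c) => !Number.isFinite(c.score));
  if (bad) return { kind: "rejected", reason: `non-finite score for ${bad.id}` };

  const ranked = candidates
    .filter((c) => blockedIds.indexOf(c.id) === -1) // filter returns a new array
    .sort((x, y) => y.score - x.score || (x.id < y.id ? -1 : x.id > y.id ? 1 : 0));

  if (ranked.length === 0) return { kind: "rejected", reason: "no eligible policy" };
  return { kind: "selected", id: ranked[0].id, score: ranked[0].score };
}
```

A determinism test can call `selectPolicy` twice with the same candidates in different orders and require identical output, while a failure-mode test passes `NaN` or `Infinity` and expects a rejection.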

## Child-issue closeout evidence

| Issue | Minimum evidence before close |
|---:|---|
| #185 | runtime schema tests, unknown-newer fail-closed test, successful and repair-required examples. |
| #186 | event taxonomy tests, replay/idempotency/conflict tests, sealed mutation boundary. |
| #188 | adapter/substrate matrix, fake adapter pass, compatibility docs for Codex/Pi/Claude Code. |
| #194 | 5-task graph fixture, readiness reasons, dependency failure blocking, graph validation failures. |
| #197 | duplicate claim race, expired lease, renewal/release, stale recovery, evidence-gated completion. |
| #196 | two-worker dispatch fixture, heartbeat/stale projection, structured report capture, evidence-gated completion, stale/late report rejection. |
| #191 | this matrix linked from README and tested for unit/integration/E2E/failure/fixtures/closeout coverage. |
| #189 | RewardRecord, PredictionDelta, penalties, advisory PolicyHint, and no silent policy mutation. |
| #187 | deterministic simulator features, exploration caps, blocked policies, override trail. |
| #190 | repair/supersession semantics, budget limits, failure taxonomy, no hidden mutation. |
| #195 | IntegrationCandidate refs, dry-run evidence, conflict fixture, repair-loop fixture. |

## Readiness checklist

- [x] Unit test matrix is defined.
- [x] Integration test matrix is defined.
- [x] E2E smoke path is defined.
- [x] Failure-mode tests are defined.
- [x] Required fixtures/artifacts are listed.
- [x] Minimum evidence for closing each child issue is documented.

Default/runtime-readiness claims remain blocked until the matrix has executable coverage or recorded smoke evidence for every MVP-critical row.
48 changes: 48 additions & 0 deletions test/learning-runtime-verification-matrix.test.ts
@@ -0,0 +1,48 @@
import * as assert from "node:assert/strict";
import { readFile } from "node:fs/promises";
import { test } from "node:test";

const docPath = "docs/learning-runtime-verification-matrix.md";

async function doc(): Promise<string> {
  return readFile(docPath, "utf8");
}

test("learning runtime verification matrix covers required verification layers", async () => {
  const text = await doc();

  for (const heading of [
    "## Unit test matrix",
    "## Integration test matrix",
    "## E2E smoke path",
    "## Failure-mode matrix",
    "## Child-issue closeout evidence",
  ]) {
    assert.match(text, new RegExp(heading));
  }

  for (const requiredArea of [
    "Runtime schema validation",
    "Runtime event replay",
    "TaskGraph readiness",
    "Claim and lease ownership",
    "Worker execution state",
    "Policy simulation determinism",
    "RewardRecord and PolicyHint",
    "IntegrationCandidate and repair",
    "Retention and safe close",
    "`.ilchul` compatibility",
  ]) {
    assert.match(text, new RegExp(requiredArea.replace(/[.*+?^${}()|[\]\\]/g, "\\$&")));
  }
});

test("learning runtime verification matrix defines concrete smoke evidence and closeout gates", async () => {
  const [readme, text] = await Promise.all([readFile("README.md", "utf8"), doc()]);

  assert.match(readme, /docs\/learning-runtime-verification-matrix\.md/);

  for (const artifact of [
    "state.json",
    "events.jsonl",
    "worker report",
    "objective/evaluation",
    "reward-ledger",
    "integration dry-run",
  ]) {
    assert.match(text, new RegExp(artifact));
  }

  for (const issue of ["#185", "#186", "#188", "#194", "#197", "#196", "#191", "#189", "#187", "#190", "#195"]) {
    assert.match(text, new RegExp(issue));
  }

  assert.match(text, /Default\/runtime-readiness claims remain blocked/);
});