Merged
147 changes: 147 additions & 0 deletions evals/README.md
@@ -0,0 +1,147 @@
# Ricky Evals

This directory holds human-authored product evals for Ricky. The shared loading,
filtering, deterministic checks, human-review marking, and run artifact writing
come from `@agent-assistant/telemetry/evals`; Ricky keeps the domain-specific
cases, rubrics, and product executors here.

## Start Here

Write new evals in `evals/suites/<suite>/cases.md`. Each `## case-id` block is
compiled into that suite's generated `cases.jsonl` (edit `cases.md`, never the
generated file) by:

```sh
npm run evals:compile
```
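
As a rough illustration of what compiling a `## case-id` block involves, here is a minimal sketch. It is not the real `scripts/evals/compile-ricky-evals.mjs`: the function name `compileCases`, the `CompiledCase` type, and the parsing rules are assumptions, and it handles only the header fields, `### Message`, `### Must`, and `### Must Not`.

```typescript
// Hypothetical sketch of the cases.md -> cases.jsonl step; names are illustrative.
interface CompiledCase {
  id: string;
  suite: string;
  executor: string;
  kind: string;
  tags: string[];
  input: { message: string };
  expected: { must: string[]; mustNot: string[]; humanReviewRequired: boolean };
}

function compileCases(markdown: string, suite: string): CompiledCase[] {
  const compiled: CompiledCase[] = [];
  // Split on level-2 headings; the first chunk is the suite preamble.
  const blocks = markdown.split(/^## +/m).slice(1);
  for (const block of blocks) {
    // Split the block into its header lines and its "### Section" chunks.
    const parts = block.split(/^### +/m);
    const headLines = (parts[0] ?? "").split("\n");
    const id = (headLines[0] ?? "").trim();

    // Header fields look like "Executor: manual" or "Human Review: true".
    const fields = new Map<string, string>();
    for (const line of headLines.slice(1)) {
      const m = /^([A-Za-z ]+):\s*(.*)$/.exec(line);
      if (m) fields.set((m[1] ?? "").trim().toLowerCase(), (m[2] ?? "").trim());
    }

    // Section bodies are keyed by lower-cased title, e.g. "must not".
    const sections = new Map<string, string>();
    for (const chunk of parts.slice(1)) {
      const nl = chunk.indexOf("\n");
      const title = (nl === -1 ? chunk : chunk.slice(0, nl)).trim().toLowerCase();
      sections.set(title, nl === -1 ? "" : chunk.slice(nl + 1).trim());
    }
    const bullets = (title: string): string[] =>
      (sections.get(title) ?? "")
        .split("\n")
        .filter((l) => l.startsWith("- "))
        .map((l) => l.slice(2).trim());

    compiled.push({
      id,
      suite,
      executor: fields.get("executor") ?? "manual",
      kind: fields.get("kind") ?? "capability",
      tags: (fields.get("tags") ?? "").split(",").map((t) => t.trim()).filter(Boolean),
      input: { message: sections.get("message") ?? "" },
      expected: {
        must: bullets("must"),
        mustNot: bullets("must not"),
        humanReviewRequired: fields.get("human review") === "true",
      },
    });
  }
  return compiled;
}

// One JSON object per line, matching the JSONL shape.
function toJsonl(cases: CompiledCase[]): string {
  return cases.map((c) => JSON.stringify(c)).join("\n");
}
```

The real compiler also carries `### Deterministic Checks` and `### Mock` into the output; this sketch omits them to stay short.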

Run all current evals:

```sh
npm run evals
```

Useful filters:

```sh
npm run evals -- --suite workflow-authoring
npm run evals -- --case workflow-authoring.deterministic-gates
npm run evals -- --tag local
npm run evals:list
```
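
To make the filter semantics concrete, here is a small sketch of loading a generated `cases.jsonl` and narrowing it by suite, case id, or tag. The actual loading and filtering live in `@agent-assistant/telemetry/evals`; the names `parseCasesJsonl`, `filterCases`, and `CaseFilter` are hypothetical, and combining filters with AND is an assumption.

```typescript
// Hypothetical sketch; the shared package's real API may differ.
interface EvalCase {
  id: string;
  suite: string;
  tags: string[];
}

interface CaseFilter {
  suite?: string;
  caseId?: string;
  tag?: string;
}

// Generated cases.jsonl files start with "#" header lines, so skip those
// (and blank lines) before parsing each remaining line as JSON.
function parseCasesJsonl(text: string): EvalCase[] {
  return text
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line !== "" && !line.startsWith("#"))
    .map((line) => JSON.parse(line) as EvalCase);
}

// Assumption: every supplied filter must match (logical AND).
function filterCases(cases: EvalCase[], filter: CaseFilter): EvalCase[] {
  return cases.filter(
    (c) =>
      (filter.suite === undefined || c.suite === filter.suite) &&
      (filter.caseId === undefined || c.id === filter.caseId) &&
      (filter.tag === undefined || c.tags.includes(filter.tag)),
  );
}
```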

Run history and review worksheets are written under `.ricky/evals/runs/`, which
is intentionally ignored by git.

## Running Against OpenCode

Ricky can also run the human-review cases against a local OpenCode one-shot
model. This path does not need `OPENROUTER_API_KEY`; it shells out to
`opencode run -m <model> <prompt>` and captures the answer into the normal
human-review worksheet.

```sh
npm run evals:opencode -- --suite workflow-authoring
```

By default this uses `opencode/minimax-m2.5-free`. Override the local/free model or
binary with environment variables:

```sh
RICKY_EVAL_OPENCODE_MODEL=opencode/nemotron-3-super-free npm run evals:opencode -- --tag workflow-authoring
RICKY_EVAL_OPENCODE_BIN=/path/to/opencode npm run evals:opencode -- --case generation-quality.workflow-contract
```
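
The override behavior can be pictured as a small function that resolves the binary and model, then builds the `opencode run -m <model> <prompt>` argument list. This is a sketch rather than Ricky's executor: `buildOpencodeInvocation` is a hypothetical name, and falling back to a bare `opencode` resolved from `PATH` is an assumption.

```typescript
// Hypothetical sketch of resolving the OpenCode command from the environment.
interface OpencodeInvocation {
  bin: string;
  args: string[];
}

function buildOpencodeInvocation(
  prompt: string,
  env: Record<string, string | undefined>,
): OpencodeInvocation {
  // Assumption: an unset or empty variable falls back to the default.
  const bin = env.RICKY_EVAL_OPENCODE_BIN || "opencode";
  const model = env.RICKY_EVAL_OPENCODE_MODEL || "opencode/minimax-m2.5-free";
  // Mirrors `opencode run -m <model> <prompt>` from the text above.
  return { bin, args: ["run", "-m", model, prompt] };
}
```

Keeping the prompt as a single array element means it can be handed to Node's `spawnSync(bin, args)` from `node:child_process` without shell quoting.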

For a case-specific provider run, set `Executor: opencode` in the case. To run
the existing `Executor: manual` cases through OpenCode without editing them, use
`npm run evals:opencode`.

Agent Relay is still the better fit for heavier evals that need real worker
topology, tool-mediated execution, or multi-agent coordination. The direct
OpenCode executor is intentionally small so local quality sweeps stay cheap and
fast.

## Writing Manual Cases

Use `Executor: manual` when you want to capture a Ricky behavior expectation for
humans to judge. Put the user request in `### Message`, then write concrete
`### Must` and `### Must Not` bullets. These become the human-review rubric.

To evaluate a real Ricky answer manually, paste it into `### Candidate Output`
or point to a file with `### Candidate Output Path`. If no output is supplied,
the run still creates a review worksheet so the expected behavior is visible.

Minimal manual case:

```text
## workflow-authoring.your-case-id
Executor: manual
Kind: capability
Tags: workflow-authoring
Human Review: true

### Message
Ask Ricky to do the thing you care about.

### Must
- State the behavior a good Ricky response must show.

### Must Not
- State the regression or product failure this eval should catch.
```

## Deterministic CLI Cases

Use `Executor: ricky-cli` for small command-surface checks. Put the command
arguments in `### Mock` as `argv: ...`; the runner invokes the source CLI through
local `tsx`.

```text
## cli.example
Executor: ricky-cli
Kind: regression
Tags: cli

### Message
--help

### Mock
argv: --help

### Deterministic Checks
ok: true
contentIncludes:
- ricky run <artifact>
forbidPhrases:
- TypeError
```

Keep deterministic cases narrow and cheap. Use human-review cases for planning
quality, workflow authoring judgment, and any behavior where a senior engineer
needs to read the output.
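
For intuition about the `### Deterministic Checks` fields, here is a minimal sketch of an evaluator. The real checks come from `@agent-assistant/telemetry/evals`; `runDeterministicChecks` and the `RunResult` shape are hypothetical. It assumes `contentIncludes` and `forbidPhrases` are case-sensitive substring matches and `contentMatches` entries are regular expressions.

```typescript
// Hypothetical sketch; the shared package's actual semantics may differ.
interface DeterministicChecks {
  ok?: boolean;
  contentIncludes?: string[];
  contentMatches?: string[];
  forbidPhrases?: string[];
}

interface RunResult {
  ok: boolean;
  content: string;
}

// Returns one message per failed check; an empty array means the case passed.
function runDeterministicChecks(result: RunResult, checks: DeterministicChecks): string[] {
  const failures: string[] = [];
  if (checks.ok !== undefined && result.ok !== checks.ok) {
    failures.push(`expected ok=${checks.ok}, got ok=${result.ok}`);
  }
  for (const needle of checks.contentIncludes ?? []) {
    if (!result.content.includes(needle)) failures.push(`missing: ${needle}`);
  }
  for (const pattern of checks.contentMatches ?? []) {
    if (!new RegExp(pattern).test(result.content)) failures.push(`no match: ${pattern}`);
  }
  for (const phrase of checks.forbidPhrases ?? []) {
    if (result.content.includes(phrase)) failures.push(`forbidden: ${phrase}`);
  }
  return failures;
}
```

Collecting every failure instead of stopping at the first keeps a run artifact useful when one case trips several checks at once.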

## Source Map

The current suites sweep the repo's existing product and architecture docs:

- `cli-behavior` covers `README.md`, `docs/product/ricky-cli-onboarding-ux-spec.md`,
`docs/product/ricky-cofounder-interactive-readiness-checklist.md`, and
`specs/cli-version-from-package-json.md`.
- `workflow-authoring` covers `AGENTS.md`,
`docs/workflows/WORKFLOW_STANDARDS.md`,
`workflows/shared/WORKFLOW_AUTHORING_RULES.md`, and
workflow authoring expectations in `SPEC.md`.
- `runtime-recovery` covers `SPEC.md`,
`docs/architecture/ricky-failure-taxonomy-and-unblockers.md`,
`docs/architecture/ricky-runtime-architecture.md`,
`specs/cli-auto-fix-and-resume.md`, and
`specs/in-process-workflow-runner.md`.
- `surfaces-ingress` covers `docs/architecture/ricky-surfaces-and-ingress.md`,
`docs/product/ricky-cli-onboarding-ux-spec.md`,
`specs/cloud-runtime-execute-artifact.md`, and
`specs/linear-integration.md`.
- `generation-quality` covers `SPEC.md`,
`specs/workflow-generation-quality.md`, and
`docs/product/ricky-skill-embedding-boundary.md`.
- `agent-assistant-boundary` covers the Agent Assistant adoption audit,
boundary, proof, live proof, and local execution reuse documents under
`docs/product/`.
1 change: 1 addition & 0 deletions evals/fixtures/transcripts/.gitkeep
@@ -0,0 +1 @@

7 changes: 7 additions & 0 deletions evals/suites/agent-assistant-boundary/cases.jsonl
@@ -0,0 +1,7 @@
# Generated by scripts/evals/compile-ricky-evals.mjs from cases.md.
# Do not edit this file directly; edit cases.md in this suite instead.
{"id":"agent-assistant-boundary.real-reuse-not-rhetorical","suite":"agent-assistant-boundary","executor":"manual","kind":"regression","input":{"message":"Update Ricky docs and code to say it uses Agent Assistant more deeply."},"expected":{"maxToolCalls":0,"must":["Ground claims in real package imports and runtime paths.","Distinguish current implementation from target architecture.","Identify which Agent Assistant primitive is actually exercised."],"mustNot":["Rename local code to sound Agent Assistant aligned and count that as adoption.","Claim broad Agent Assistant native behavior from documentation-only alignment.","Blur target architecture with landed behavior."],"humanReviewRequired":true},"tags":["agent-assistant","boundary"]}
{"id":"agent-assistant-boundary.turn-context-preserves-ricky-envelope","suite":"agent-assistant-boundary","executor":"manual","kind":"regression","input":{"message":"Evaluate the current Ricky `@agent-assistant/turn-context` adoption."},"expected":{"maxToolCalls":0,"must":["Preserve request id, source metadata, structured spec, invocation root, mode, stage mode, spec path, metadata, and spec text.","Record compact provenance through generation decisions or coordinator metadata.","Keep the shared turn context internal to the adapter boundary."],"mustNot":["Move LocalResponse, blocker taxonomy, recovery wording, or execution semantics into the shared turn-context package.","Drop Ricky-specific workflow metadata during envelope assembly.","Treat turn context as a product decision engine."],"humanReviewRequired":true},"tags":["agent-assistant","turn-context"]}
{"id":"agent-assistant-boundary.product-core-stays-ricky-owned","suite":"agent-assistant-boundary","executor":"manual","kind":"capability","input":{"message":"Decide whether workflow generation, validation, debugging, staged CLI UX, and blocker/evidence wording should move into Agent Assistant."},"expected":{"maxToolCalls":0,"must":["Keep product-defining workflow generation, validation, debugging, local UX, and evidence wording Ricky-owned until proof says otherwise.","Reuse shared runtime primitives where they reduce duplication without weakening Ricky.","Make extraction follow typed, tested, live product proof."],"mustNot":["Generalize workflow-specific behavior prematurely.","Adopt moving shared seams merely for architectural purity.","Lose the precise local-first staged workflow UX."],"humanReviewRequired":true},"tags":["agent-assistant","product-core"]}
{"id":"agent-assistant-boundary.one-slice-at-a-time","suite":"agent-assistant-boundary","executor":"manual","kind":"capability","input":{"message":"Plan the next Agent Assistant adoption slice for Ricky."},"expected":{"maxToolCalls":0,"must":["Pick exactly one real shared seam to evaluate or adopt.","Define a live Ricky product path that will prove the adoption.","Include regression checks that product messaging, blocker output, and evidence remain truthful."],"mustNot":["Bundle sessions, memory, policy, proactive behavior, and execution extraction into one vague migration.","Skip the comparison/evaluation step for mature Ricky-local seams.","Treat adoption as successful without a live product-path proof."],"humanReviewRequired":true},"tags":["agent-assistant","adoption"]}
{"id":"agent-assistant-boundary.future-surfaces-use-shared-runtime","suite":"agent-assistant-boundary","executor":"manual","kind":"capability","input":{"message":"Design future Slack or web support for Ricky using Agent Assistant packages."},"expected":{"maxToolCalls":0,"must":["Prefer shared surfaces, webhook-runtime, sessions, and routing primitives for future non-CLI interaction where mature.","Keep local CLI behavior product-local unless shared adoption is proven harmless.","Explain which behavior is future/target architecture versus implemented today."],"mustNot":["Preemptively add memory, policy, or proactive packages without a real Ricky product requirement.","Let future surface abstractions distort the current CLI contract.","Duplicate a mature Agent Assistant capability locally without justification."],"humanReviewRequired":true},"tags":["agent-assistant","surfaces"]}
114 changes: 114 additions & 0 deletions evals/suites/agent-assistant-boundary/cases.md
@@ -0,0 +1,114 @@
# Agent Assistant Boundary Cases

These cases come from the Agent Assistant audit, adoption boundary, local
execution contract evaluation, adoption proof, and live proof documents.

## agent-assistant-boundary.real-reuse-not-rhetorical
Executor: manual
Kind: regression
Tags: agent-assistant, boundary
Human Review: true

### Message
Update Ricky docs and code to say it uses Agent Assistant more deeply.

### Deterministic Checks
maxToolCalls: 0

### Must
- Ground claims in real package imports and runtime paths.
- Distinguish current implementation from target architecture.
- Identify which Agent Assistant primitive is actually exercised.

### Must Not
- Rename local code to sound Agent Assistant aligned and count that as adoption.
- Claim broad Agent Assistant native behavior from documentation-only alignment.
- Blur target architecture with landed behavior.

## agent-assistant-boundary.turn-context-preserves-ricky-envelope
Executor: manual
Kind: regression
Tags: agent-assistant, turn-context
Human Review: true

### Message
Evaluate the current Ricky `@agent-assistant/turn-context` adoption.

### Deterministic Checks
maxToolCalls: 0

### Must
- Preserve request id, source metadata, structured spec, invocation root, mode, stage mode, spec path, metadata, and spec text.
- Record compact provenance through generation decisions or coordinator metadata.
- Keep the shared turn context internal to the adapter boundary.

### Must Not
- Move LocalResponse, blocker taxonomy, recovery wording, or execution semantics into the shared turn-context package.
- Drop Ricky-specific workflow metadata during envelope assembly.
- Treat turn context as a product decision engine.

## agent-assistant-boundary.product-core-stays-ricky-owned
Executor: manual
Kind: capability
Tags: agent-assistant, product-core
Human Review: true

### Message
Decide whether workflow generation, validation, debugging, staged CLI UX, and blocker/evidence wording should move into Agent Assistant.

### Deterministic Checks
maxToolCalls: 0

### Must
- Keep product-defining workflow generation, validation, debugging, local UX, and evidence wording Ricky-owned until proof says otherwise.
- Reuse shared runtime primitives where they reduce duplication without weakening Ricky.
- Make extraction follow typed, tested, live product proof.

### Must Not
- Generalize workflow-specific behavior prematurely.
- Adopt moving shared seams merely for architectural purity.
- Lose the precise local-first staged workflow UX.

## agent-assistant-boundary.one-slice-at-a-time
Executor: manual
Kind: capability
Tags: agent-assistant, adoption
Human Review: true

### Message
Plan the next Agent Assistant adoption slice for Ricky.

### Deterministic Checks
maxToolCalls: 0

### Must
- Pick exactly one real shared seam to evaluate or adopt.
- Define a live Ricky product path that will prove the adoption.
- Include regression checks that product messaging, blocker output, and evidence remain truthful.

### Must Not
- Bundle sessions, memory, policy, proactive behavior, and execution extraction into one vague migration.
- Skip the comparison/evaluation step for mature Ricky-local seams.
- Treat adoption as successful without a live product-path proof.

## agent-assistant-boundary.future-surfaces-use-shared-runtime
Executor: manual
Kind: capability
Tags: agent-assistant, surfaces
Human Review: true

### Message
Design future Slack or web support for Ricky using Agent Assistant packages.

### Deterministic Checks
maxToolCalls: 0

### Must
- Prefer shared surfaces, webhook-runtime, sessions, and routing primitives for future non-CLI interaction where mature.
- Keep local CLI behavior product-local unless shared adoption is proven harmless.
- Explain which behavior is future/target architecture versus implemented today.

### Must Not
- Preemptively add memory, policy, or proactive packages without a real Ricky product requirement.
- Let future surface abstractions distort the current CLI contract.
- Duplicate a mature Agent Assistant capability locally without justification.
17 changes: 17 additions & 0 deletions evals/suites/agent-assistant-boundary/rubric.md
@@ -0,0 +1,17 @@
# Agent Assistant Boundary Rubric

Use this suite for Ricky's relationship to Agent Assistant packages and shared
assistant runtime seams.

## Human Review Questions

1. Does the answer separate current implementation from target architecture?
2. Is shared reuse real, narrow, and proven?
3. Does Ricky keep workflow-specific product behavior local where appropriate?
4. Is extraction gated by typed tests and live product proof?
5. Does adoption reduce product burden rather than add indirection?

## Suggested Pass Bar

Pass only when the boundary is honest, specific, and grounded in actual Ricky
runtime behavior.
8 changes: 8 additions & 0 deletions evals/suites/cli-behavior/cases.jsonl
@@ -0,0 +1,8 @@
# Generated by scripts/evals/compile-ricky-evals.mjs from cases.md.
# Do not edit this file directly; edit cases.md in this suite instead.
{"id":"cli.help-surfaces-local-cloud-and-run","suite":"cli-behavior","executor":"ricky-cli","kind":"regression","input":{"message":"--help"},"expected":{"ok":true,"contentIncludes":["ricky local --spec","ricky run <artifact>","ricky status"],"forbidPhrases":["TypeError","ReferenceError","stack trace"],"maxToolCalls":1,"must":["Show the user the local, Cloud, run, status, and connect surfaces without requiring interactive setup.","Keep the help output truthful to the implemented CLI commands."],"mustNot":["Print a stack trace or raw implementation failure for help.","Hide the local/BYOH run path behind Cloud-only language."],"humanReviewRequired":false},"tags":["cli","onboarding","local","cloud"],"mock":{"argv":"--help"}}
{"id":"cli.version-prints-package-version","suite":"cli-behavior","executor":"ricky-cli","kind":"regression","input":{"message":"version"},"expected":{"ok":true,"contentMatches":["^ricky 0\\.1\\.\\d+"],"forbidPhrases":["TypeError","ReferenceError"],"maxToolCalls":1,"must":["Print the package version as a short script-friendly value."],"mustNot":["Start the interactive onboarding flow for `version`."],"humanReviewRequired":false},"tags":["cli","packaging"],"mock":{"argv":"version"}}
{"id":"cli.generation-default-not-execution","suite":"cli-behavior","executor":"manual","kind":"regression","input":{"message":"A user runs `ricky --mode local --spec \"generate a workflow for package checks\"` without `--run`."},"expected":{"maxToolCalls":0,"must":["Say generation is the default and execution was not requested.","Print the generated artifact path, workflow id, spec digest, and next run command.","Avoid showing execution evidence for a generation-only request."],"mustNot":["Imply the workflow ran automatically.","Present a generation-only result as execution success.","Hide the opt-in commands for running the artifact."],"humanReviewRequired":true},"tags":["cli","onboarding","local"]}
{"id":"cli.first-run-copy-is-compact-and-truthful","suite":"cli-behavior","executor":"manual","kind":"capability","input":{"message":"Render Ricky's first-run CLI onboarding for a new user."},"expected":{"maxToolCalls":0,"must":["Show compact Ricky branding and clear Local / BYOH, Cloud, Both, and Just explore choices.","End every branch with a concrete next step.","Advertise only commands that are currently implemented."],"mustNot":["Sound like a launch page or documentation dump.","Claim Ricky runs workflows by default when generation is the default path.","Require web or Slack onboarding before CLI use."],"humanReviewRequired":true},"tags":["cli","onboarding"]}
{"id":"cli.recovery-guidance-no-stack-traces","suite":"cli-behavior","executor":"manual","kind":"regression","input":{"message":"A user gives Ricky an empty spec or a missing spec file."},"expected":{"maxToolCalls":0,"must":["Return a user-facing failure or guidance message with a real recovery command.","Distinguish generation failure from execution failure.","Show stack traces only when verbose diagnostic mode is requested."],"mustNot":["Crash with an uncaught exception in normal mode.","Suggest commands that do not exist.","Pretend a missing spec was accepted."],"humanReviewRequired":true},"tags":["cli","recovery"]}
{"id":"cli.status-does-not-invent-provider-state","suite":"cli-behavior","executor":"manual","kind":"regression","input":{"message":"Render `ricky status` when no provider checks have proven Google or GitHub are connected."},"expected":{"maxToolCalls":0,"must":["Report unknown or not-connected provider state honestly.","Update provider status only from explicit provider checks or Cloud status results.","Give concrete setup guidance for Cloud when relevant."],"mustNot":["Mark Google or GitHub connected because guidance text was shown.","Invent a provider connection URL or OAuth flow.","Show empty fields with no recovery guidance when config is missing."],"humanReviewRequired":true},"tags":["cli","status","cloud"]}