AgentWorkforce · khaliqgant · May 8, 2026
diff --git a/evals/README.md b/evals/README.md
@@ -0,0 +1,147 @@
+# Ricky Evals
+
+This directory holds human-authored product evals for Ricky. The shared loading,
+filtering, deterministic checks, human-review marking, and run artifact writing
+come from `@agent-assistant/telemetry/evals`; Ricky keeps the domain-specific
+cases, rubrics, and product executors here.
+
+## Start Here
+
+Write new evals in `evals/suites/<suite>/cases.md`. Each `## case-id` block is
+compiled into generated `cases.jsonl` by:
+
+```sh
+npm run evals:compile
+```
+
+Run all current evals:
+
+```sh
+npm run evals
+```
+
+Useful filters:
+
+```sh
+npm run evals -- --suite workflow-authoring
+npm run evals -- --case workflow-authoring.deterministic-gates
+npm run evals -- --tag local
+npm run evals:list
+```
+
+Run history and review worksheets are written under `.ricky/evals/runs/`, which
+is intentionally ignored by git.
+
+## Running Against OpenCode
+
+Ricky can also run the human-review cases against a local OpenCode one-shot
+model. This path does not need `OPENROUTER_API_KEY`; it shells out to
+`opencode run -m <model> <prompt>` and captures the answer into the normal
+human-review worksheet.
+
+```sh
+npm run evals:opencode -- --suite workflow-authoring
+```
+
+By default this uses `opencode/minimax-m2.5-free`. Override the local/free model or
+binary with environment variables:
+
+```sh
+RICKY_EVAL_OPENCODE_MODEL=opencode/nemotron-3-super-free npm run evals:opencode -- --tag workflow-authoring
+RICKY_EVAL_OPENCODE_BIN=/path/to/opencode npm run evals:opencode -- --case generation-quality.workflow-contract
+```
+
+For a case-specific provider run, set `Executor: opencode` in the case. To run
+the existing `Executor: manual` cases through OpenCode without editing them, use
+`npm run evals:opencode`.
+
+Agent Relay is still the better fit for heavier evals that need real worker
+topology, tool-mediated execution, or multi-agent coordination. The direct
+OpenCode executor is intentionally small so local quality sweeps stay cheap and
+fast.
+
+## Writing Manual Cases
+
+Use `Executor: manual` when you want to capture a Ricky behavior expectation for
+humans to judge. Put the user request in `### Message`, then write concrete
+`### Must` and `### Must Not` bullets. These become the human-review rubric.
+
+To evaluate a real Ricky answer manually, paste it into `### Candidate Output`
+or point to a file with `### Candidate Output Path`. If no output is supplied,
+the run still creates a review worksheet so the expected behavior is visible.
+
+Minimal manual case:
+
+```text
+## workflow-authoring.your-case-id
+Executor: manual
+Kind: capability
+Tags: workflow-authoring
+Human Review: true
+
+### Message
+Ask Ricky to do the thing you care about.
+
+### Must
+- State the behavior a good Ricky response must show.
+
+### Must Not
+- State the regression or product failure this eval should catch.
+```
+
+## Deterministic CLI Cases
+
+Use `Executor: ricky-cli` for small command-surface checks. Put the command
+arguments in `### Mock` as `argv: ...`; the runner invokes the source CLI through
+local `tsx`.
+
+```text
+## cli.example
+Executor: ricky-cli
+Kind: regression
+Tags: cli
+
+### Message
+--help
+
+### Mock
+argv: --help
+
+### Deterministic Checks
+ok: true
+contentIncludes:
+- ricky run <artifact>
+forbidPhrases:
+- TypeError
+```
+
+Keep deterministic cases narrow and cheap. Use human-review cases for planning
+quality, workflow authoring judgment, and any behavior where a senior engineer
+needs to read the output.
+
+## Source Map
+
+The current suites sweep the repo's existing product and architecture docs:
+
+- `cli-behavior` covers `README.md`, `docs/product/ricky-cli-onboarding-ux-spec.md`,
+  `docs/product/ricky-cofounder-interactive-readiness-checklist.md`, and
+  `specs/cli-version-from-package-json.md`.
+- `workflow-authoring` covers `AGENTS.md`,
+  `docs/workflows/WORKFLOW_STANDARDS.md`,
+  `workflows/shared/WORKFLOW_AUTHORING_RULES.md`, and
+  workflow authoring expectations in `SPEC.md`.
+- `runtime-recovery` covers `SPEC.md`,
+  `docs/architecture/ricky-failure-taxonomy-and-unblockers.md`,
+  `docs/architecture/ricky-runtime-architecture.md`,
+  `specs/cli-auto-fix-and-resume.md`, and
+  `specs/in-process-workflow-runner.md`.
+- `surfaces-ingress` covers `docs/architecture/ricky-surfaces-and-ingress.md`,
+  `docs/product/ricky-cli-onboarding-ux-spec.md`,
+  `specs/cloud-runtime-execute-artifact.md`, and
+  `specs/linear-integration.md`.
+- `generation-quality` covers `SPEC.md`,
+  `specs/workflow-generation-quality.md`, and
+  `docs/product/ricky-skill-embedding-boundary.md`.
+- `agent-assistant-boundary` covers the Agent Assistant adoption audit,
+  boundary, proof, live proof, and local execution reuse documents under
+  `docs/product/`.
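To make the `cases.md` to `cases.jsonl` mapping described in the README concrete, here is a minimal sketch of how one `## case-id` block could be turned into a record with the same field names that appear in the generated files in this diff. It is not the real compiler: `scripts/evals/compile-ricky-evals.mjs` is not shown here, so `compileCase`, the `CompiledCase` type, and the fallback defaults are hypothetical, and `### Deterministic Checks` and `### Mock` handling is left out.

```typescript
// Hypothetical sketch only; the real logic lives in
// scripts/evals/compile-ricky-evals.mjs, which this diff does not show.
interface CompiledCase {
  id: string;
  suite: string;
  executor: string;
  kind: string;
  input: { message: string };
  expected: { must: string[]; mustNot: string[]; humanReviewRequired: boolean };
  tags: string[];
}

function compileCase(suite: string, block: string): CompiledCase {
  const headers = new Map<string, string>(); // "Executor: manual" style lines
  const sections = new Map<string, string[]>(); // "### Message", "### Must", ...
  let id = "";
  let current: string | null = null;

  for (const raw of block.split("\n")) {
    const line = raw.trim();
    if (line.startsWith("### ")) {
      current = line.slice(4);
      sections.set(current, []);
    } else if (line.startsWith("## ")) {
      id = line.slice(3);
    } else if (current !== null) {
      if (line !== "") sections.get(current)?.push(line);
    } else {
      const colon = line.indexOf(":");
      if (colon > 0) headers.set(line.slice(0, colon).trim(), line.slice(colon + 1).trim());
    }
  }

  // Strip the leading "- " from rubric bullets.
  const bullets = (name: string): string[] =>
    (sections.get(name) ?? []).map((l) => l.replace(/^- /, ""));

  return {
    id,
    suite,
    executor: headers.get("Executor") ?? "manual", // assumed default
    kind: headers.get("Kind") ?? "capability", // assumed default
    input: { message: (sections.get("Message") ?? []).join("\n") },
    expected: {
      must: bullets("Must"),
      mustNot: bullets("Must Not"),
      humanReviewRequired: headers.get("Human Review") === "true",
    },
    tags: (headers.get("Tags") ?? "")
      .split(",")
      .map((t) => t.trim())
      .filter((t) => t !== ""),
  };
}

// The minimal manual case from the README, as a string.
const sample = [
  "## workflow-authoring.your-case-id",
  "Executor: manual",
  "Kind: capability",
  "Tags: workflow-authoring",
  "Human Review: true",
  "",
  "### Message",
  "Ask Ricky to do the thing you care about.",
  "",
  "### Must",
  "- State the behavior a good Ricky response must show.",
  "",
  "### Must Not",
  "- State the regression or product failure this eval should catch.",
].join("\n");

console.log(JSON.stringify(compileCase("workflow-authoring", sample)));
```

The sketch keeps header lines and `###` sections in separate maps so that a new header such as `Human Review` or a new section can be added without touching the loop.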
diff --git a/evals/fixtures/transcripts/.gitkeep b/evals/fixtures/transcripts/.gitkeep
@@ -0,0 +1 @@
+
diff --git a/evals/suites/agent-assistant-boundary/cases.jsonl b/evals/suites/agent-assistant-boundary/cases.jsonl
@@ -0,0 +1,7 @@
+# Generated by scripts/evals/compile-ricky-evals.mjs from cases.md.
+# Do not edit this file directly; edit cases.md in this suite instead.
+{"id":"agent-assistant-boundary.real-reuse-not-rhetorical","suite":"agent-assistant-boundary","executor":"manual","kind":"regression","input":{"message":"Update Ricky docs and code to say it uses Agent Assistant more deeply."},"expected":{"maxToolCalls":0,"must":["Ground claims in real package imports and runtime paths.","Distinguish current implementation from target architecture.","Identify which Agent Assistant primitive is actually exercised."],"mustNot":["Rename local code to sound Agent Assistant aligned and count that as adoption.","Claim broad Agent Assistant native behavior from documentation-only alignment.","Blur target architecture with landed behavior."],"humanReviewRequired":true},"tags":["agent-assistant","boundary"]}
+{"id":"agent-assistant-boundary.turn-context-preserves-ricky-envelope","suite":"agent-assistant-boundary","executor":"manual","kind":"regression","input":{"message":"Evaluate the current Ricky `@agent-assistant/turn-context` adoption."},"expected":{"maxToolCalls":0,"must":["Preserve request id, source metadata, structured spec, invocation root, mode, stage mode, spec path, metadata, and spec text.","Record compact provenance through generation decisions or coordinator metadata.","Keep the shared turn context internal to the adapter boundary."],"mustNot":["Move LocalResponse, blocker taxonomy, recovery wording, or execution semantics into the shared turn-context package.","Drop Ricky-specific workflow metadata during envelope assembly.","Treat turn context as a product decision engine."],"humanReviewRequired":true},"tags":["agent-assistant","turn-context"]}
+{"id":"agent-assistant-boundary.product-core-stays-ricky-owned","suite":"agent-assistant-boundary","executor":"manual","kind":"capability","input":{"message":"Decide whether workflow generation, validation, debugging, staged CLI UX, and blocker/evidence wording should move into Agent Assistant."},"expected":{"maxToolCalls":0,"must":["Keep product-defining workflow generation, validation, debugging, local UX, and evidence wording Ricky-owned until proof says otherwise.","Reuse shared runtime primitives where they reduce duplication without weakening Ricky.","Make extraction follow typed, tested, live product proof."],"mustNot":["Generalize workflow-specific behavior prematurely.","Adopt moving shared seams merely for architectural purity.","Lose the precise local-first staged workflow UX."],"humanReviewRequired":true},"tags":["agent-assistant","product-core"]}
+{"id":"agent-assistant-boundary.one-slice-at-a-time","suite":"agent-assistant-boundary","executor":"manual","kind":"capability","input":{"message":"Plan the next Agent Assistant adoption slice for Ricky."},"expected":{"maxToolCalls":0,"must":["Pick exactly one real shared seam to evaluate or adopt.","Define a live Ricky product path that will prove the adoption.","Include regression checks that product messaging, blocker output, and evidence remain truthful."],"mustNot":["Bundle sessions, memory, policy, proactive behavior, and execution extraction into one vague migration.","Skip the comparison/evaluation step for mature Ricky-local seams.","Treat adoption as successful without a live product-path proof."],"humanReviewRequired":true},"tags":["agent-assistant","adoption"]}
+{"id":"agent-assistant-boundary.future-surfaces-use-shared-runtime","suite":"agent-assistant-boundary","executor":"manual","kind":"capability","input":{"message":"Design future Slack or web support for Ricky using Agent Assistant packages."},"expected":{"maxToolCalls":0,"must":["Prefer shared surfaces, webhook-runtime, sessions, and routing primitives for future non-CLI interaction where mature.","Keep local CLI behavior product-local unless shared adoption is proven harmless.","Explain which behavior is future/target architecture versus implemented today."],"mustNot":["Preemptively add memory, policy, or proactive packages without a real Ricky product requirement.","Let future surface abstractions distort the current CLI contract.","Duplicate a mature Agent Assistant capability locally without justification."],"humanReviewRequired":true},"tags":["agent-assistant","surfaces"]}
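The generated `cases.jsonl` above starts with two `#` comment lines, so a strict line-by-line `JSON.parse` would throw on the first line. Whether the shared loader in `@agent-assistant/telemetry/evals` skips them is not visible in this diff; the sketch below only shows what any other consumer of these files would need to do. `parseCasesJsonl` and `sampleJsonl` are hypothetical names, and the sample record is abbreviated.

```typescript
// Hypothetical helper: parse a generated cases.jsonl while ignoring the
// "# Generated by ..." header lines and any blank lines.
function parseCasesJsonl(text: string): Array<Record<string, unknown>> {
  const records: Array<Record<string, unknown>> = [];
  for (const raw of text.split("\n")) {
    const line = raw.trim();
    if (line === "" || line.startsWith("#")) continue; // header comment or blank
    records.push(JSON.parse(line) as Record<string, unknown>);
  }
  return records;
}

// Two header lines plus one abbreviated record, mirroring the file layout.
const sampleJsonl = [
  "# Generated by scripts/evals/compile-ricky-evals.mjs from cases.md.",
  "# Do not edit this file directly; edit cases.md in this suite instead.",
  '{"id":"agent-assistant-boundary.one-slice-at-a-time","suite":"agent-assistant-boundary","executor":"manual"}',
].join("\n");

const parsedCases = parseCasesJsonl(sampleJsonl);
console.log(parsedCases.length, parsedCases[0]?.id);
```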
diff --git a/evals/suites/agent-assistant-boundary/cases.md b/evals/suites/agent-assistant-boundary/cases.md
@@ -0,0 +1,114 @@
+# Agent Assistant Boundary Cases
+
+These cases come from the Agent Assistant audit, adoption boundary, local
+execution contract evaluation, adoption proof, and live proof documents.
+
+## agent-assistant-boundary.real-reuse-not-rhetorical
+Executor: manual
+Kind: regression
+Tags: agent-assistant, boundary
+Human Review: true
+
+### Message
+Update Ricky docs and code to say it uses Agent Assistant more deeply.
+
+### Deterministic Checks
+maxToolCalls: 0
+
+### Must
+- Ground claims in real package imports and runtime paths.
+- Distinguish current implementation from target architecture.
+- Identify which Agent Assistant primitive is actually exercised.
+
+### Must Not
+- Rename local code to sound Agent Assistant aligned and count that as adoption.
+- Claim broad Agent Assistant native behavior from documentation-only alignment.
+- Blur target architecture with landed behavior.
+
+## agent-assistant-boundary.turn-context-preserves-ricky-envelope
+Executor: manual
+Kind: regression
+Tags: agent-assistant, turn-context
+Human Review: true
+
+### Message
+Evaluate the current Ricky `@agent-assistant/turn-context` adoption.
+
+### Deterministic Checks
+maxToolCalls: 0
+
+### Must
+- Preserve request id, source metadata, structured spec, invocation root, mode, stage mode, spec path, metadata, and spec text.
+- Record compact provenance through generation decisions or coordinator metadata.
+- Keep the shared turn context internal to the adapter boundary.
+
+### Must Not
+- Move LocalResponse, blocker taxonomy, recovery wording, or execution semantics into the shared turn-context package.
+- Drop Ricky-specific workflow metadata during envelope assembly.
+- Treat turn context as a product decision engine.
+
+## agent-assistant-boundary.product-core-stays-ricky-owned
+Executor: manual
+Kind: capability
+Tags: agent-assistant, product-core
+Human Review: true
+
+### Message
+Decide whether workflow generation, validation, debugging, staged CLI UX, and blocker/evidence wording should move into Agent Assistant.
+
+### Deterministic Checks
+maxToolCalls: 0
+
+### Must
+- Keep product-defining workflow generation, validation, debugging, local UX, and evidence wording Ricky-owned until proof says otherwise.
+- Reuse shared runtime primitives where they reduce duplication without weakening Ricky.
+- Make extraction follow typed, tested, live product proof.
+
+### Must Not
+- Generalize workflow-specific behavior prematurely.
+- Adopt moving shared seams merely for architectural purity.
+- Lose the precise local-first staged workflow UX.
+
+## agent-assistant-boundary.one-slice-at-a-time
+Executor: manual
+Kind: capability
+Tags: agent-assistant, adoption
+Human Review: true
+
+### Message
+Plan the next Agent Assistant adoption slice for Ricky.
+
+### Deterministic Checks
+maxToolCalls: 0
+
+### Must
+- Pick exactly one real shared seam to evaluate or adopt.
+- Define a live Ricky product path that will prove the adoption.
+- Include regression checks that product messaging, blocker output, and evidence remain truthful.
+
+### Must Not
+- Bundle sessions, memory, policy, proactive behavior, and execution extraction into one vague migration.
+- Skip the comparison/evaluation step for mature Ricky-local seams.
+- Treat adoption as successful without a live product-path proof.
+
+## agent-assistant-boundary.future-surfaces-use-shared-runtime
+Executor: manual
+Kind: capability
+Tags: agent-assistant, surfaces
+Human Review: true
+
+### Message
+Design future Slack or web support for Ricky using Agent Assistant packages.
+
+### Deterministic Checks
+maxToolCalls: 0
+
+### Must
+- Prefer shared surfaces, webhook-runtime, sessions, and routing primitives for future non-CLI interaction where mature.
+- Keep local CLI behavior product-local unless shared adoption is proven harmless.
+- Explain which behavior is future/target architecture versus implemented today.
+
+### Must Not
+- Preemptively add memory, policy, or proactive packages without a real Ricky product requirement.
+- Let future surface abstractions distort the current CLI contract.
+- Duplicate a mature Agent Assistant capability locally without justification.
diff --git a/evals/suites/agent-assistant-boundary/rubric.md b/evals/suites/agent-assistant-boundary/rubric.md
@@ -0,0 +1,17 @@
+# Agent Assistant Boundary Rubric
+
+Use this suite for Ricky's relationship to Agent Assistant packages and shared
+assistant runtime seams.
+
+## Human Review Questions
+
+1. Does the answer separate current implementation from target architecture?
+2. Is shared reuse real, narrow, and proven?
+3. Does Ricky keep workflow-specific product behavior local where appropriate?
+4. Is extraction gated by typed tests and live product proof?
+5. Does adoption reduce product burden rather than add indirection?
+
+## Suggested Pass Bar
+
+Pass only when the boundary is honest, specific, and grounded in actual Ricky
+runtime behavior.
diff --git a/evals/suites/cli-behavior/cases.jsonl b/evals/suites/cli-behavior/cases.jsonl
@@ -0,0 +1,8 @@
+# Generated by scripts/evals/compile-ricky-evals.mjs from cases.md.
+# Do not edit this file directly; edit cases.md in this suite instead.
+{"id":"cli.help-surfaces-local-cloud-and-run","suite":"cli-behavior","executor":"ricky-cli","kind":"regression","input":{"message":"--help"},"expected":{"ok":true,"contentIncludes":["ricky local --spec","ricky run <artifact>","ricky status"],"forbidPhrases":["TypeError","ReferenceError","stack trace"],"maxToolCalls":1,"must":["Show the user the local, Cloud, run, status, and connect surfaces without requiring interactive setup.","Keep the help output truthful to the implemented CLI commands."],"mustNot":["Print a stack trace or raw implementation failure for help.","Hide the local/BYOH run path behind Cloud-only language."],"humanReviewRequired":false},"tags":["cli","onboarding","local","cloud"],"mock":{"argv":"--help"}}
+{"id":"cli.version-prints-package-version","suite":"cli-behavior","executor":"ricky-cli","kind":"regression","input":{"message":"version"},"expected":{"ok":true,"contentMatches":["^ricky 0\\.1\\.\\d+"],"forbidPhrases":["TypeError","ReferenceError"],"maxToolCalls":1,"must":["Print the package version as a short script-friendly value."],"mustNot":["Start the interactive onboarding flow for `version`."],"humanReviewRequired":false},"tags":["cli","packaging"],"mock":{"argv":"version"}}
+{"id":"cli.generation-default-not-execution","suite":"cli-behavior","executor":"manual","kind":"regression","input":{"message":"A user runs `ricky --mode local --spec \"generate a workflow for package checks\"` without `--run`."},"expected":{"maxToolCalls":0,"must":["Say generation is the default and execution was not requested.","Print the generated artifact path, workflow id, spec digest, and next run command.","Avoid showing execution evidence for a generation-only request."],"mustNot":["Imply the workflow ran automatically.","Present a generation-only result as execution success.","Hide the opt-in commands for running the artifact."],"humanReviewRequired":true},"tags":["cli","onboarding","local"]}
+{"id":"cli.first-run-copy-is-compact-and-truthful","suite":"cli-behavior","executor":"manual","kind":"capability","input":{"message":"Render Ricky's first-run CLI onboarding for a new user."},"expected":{"maxToolCalls":0,"must":["Show compact Ricky branding and clear Local / BYOH, Cloud, Both, and Just explore choices.","End every branch with a concrete next step.","Advertise only commands that are currently implemented."],"mustNot":["Sound like a launch page or documentation dump.","Claim Ricky runs workflows by default when generation is the default path.","Require web or Slack onboarding before CLI use."],"humanReviewRequired":true},"tags":["cli","onboarding"]}
+{"id":"cli.recovery-guidance-no-stack-traces","suite":"cli-behavior","executor":"manual","kind":"regression","input":{"message":"A user gives Ricky an empty spec or a missing spec file."},"expected":{"maxToolCalls":0,"must":["Return a user-facing failure or guidance message with a real recovery command.","Distinguish generation failure from execution failure.","Show stack traces only when verbose diagnostic mode is requested."],"mustNot":["Crash with an uncaught exception in normal mode.","Suggest commands that do not exist.","Pretend a missing spec was accepted."],"humanReviewRequired":true},"tags":["cli","recovery"]}
+{"id":"cli.status-does-not-invent-provider-state","suite":"cli-behavior","executor":"manual","kind":"regression","input":{"message":"Render `ricky status` when no provider checks have proven Google or GitHub are connected."},"expected":{"maxToolCalls":0,"must":["Report unknown or not-connected provider state honestly.","Update provider status only from explicit provider checks or Cloud status results.","Give concrete setup guidance for Cloud when relevant."],"mustNot":["Mark Google or GitHub connected because guidance text was shown.","Invent a provider connection URL or OAuth flow.","Show empty fields with no recovery guidance when config is missing."],"humanReviewRequired":true},"tags":["cli","status","cloud"]}
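The `ricky-cli` cases above carry `ok`, `contentIncludes`, `contentMatches`, and `forbidPhrases` expectations. The shared deterministic checks come from `@agent-assistant/telemetry/evals` and are not shown in this diff, so the following is only a sketch of one plausible reading of those fields. `runDeterministicChecks`, `CliResult`, the multiline regex flag, and the sample output `ricky 0.1.3` are all assumptions for illustration.

```typescript
// Hypothetical sketch of deterministic checks; field names follow the
// "expected" objects in cases.jsonl, the semantics are assumed.
interface DeterministicExpectations {
  ok?: boolean;
  contentIncludes?: string[];
  contentMatches?: string[];
  forbidPhrases?: string[];
}

interface CliResult {
  ok: boolean; // assumed: whether the CLI invocation succeeded
  content: string; // assumed: captured CLI output
}

// Returns a list of failure descriptions; an empty list means the case passed.
function runDeterministicChecks(
  expected: DeterministicExpectations,
  result: CliResult,
): string[] {
  const failures: string[] = [];
  if (expected.ok !== undefined && expected.ok !== result.ok) {
    failures.push(`expected ok=${expected.ok}, got ok=${result.ok}`);
  }
  for (const needle of expected.contentIncludes ?? []) {
    if (!result.content.includes(needle)) failures.push(`missing: ${needle}`);
  }
  for (const pattern of expected.contentMatches ?? []) {
    // Assumption: patterns are matched per line, hence the "m" flag.
    if (!new RegExp(pattern, "m").test(result.content)) failures.push(`no match: ${pattern}`);
  }
  for (const phrase of expected.forbidPhrases ?? []) {
    if (result.content.includes(phrase)) failures.push(`forbidden: ${phrase}`);
  }
  return failures;
}

// Expectations copied from cli.version-prints-package-version.
const versionExpected: DeterministicExpectations = {
  ok: true,
  contentMatches: ["^ricky 0\\.1\\.\\d+"],
  forbidPhrases: ["TypeError", "ReferenceError"],
};

// Made-up outputs: one that should pass, one that should fail.
console.log(runDeterministicChecks(versionExpected, { ok: true, content: "ricky 0.1.3\n" }));
console.log(runDeterministicChecks(versionExpected, { ok: false, content: "TypeError: boom" }));
```

Returning a list of failures rather than a single boolean keeps the result usable in a review worksheet, where each unmet expectation can be shown on its own line.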