[codex] Add benchmark v2 experiment rig#34

Draft
devteapot wants to merge 1 commit into main from work/benchmark-v2

@devteapot
Owner

Summary

Adds the WIP benchmark v2 experiment rig under benchmarks/v2 and registers it as a root workspace package.

This includes:

  • OpenAI-compatible provider plumbing and smoke-test entrypoint
  • SLOP and MCP benchmark cells
  • Todo, file-browser, and CRM benchmark app fixtures
  • Sweep configs, prompt/encoding/optimization variants, metrics aggregation, and a static dashboard
  • Checked-in sample result JSONL/aggregate outputs for the current smoke runs
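For reviewers unfamiliar with the result format, here is a minimal sketch of how JSONL run records could be rolled up into per-variant aggregates. The record fields (`cell`, `variant`, `success`, `latencyMs`) and the function names are assumptions made for illustration; the actual schema and aggregation code in benchmarks/v2 may differ.

```typescript
// Hypothetical record shape, for illustration only; not the rig's real schema.
interface RunRecord {
  cell: string; // e.g. "slop" or "mcp"
  variant: string; // prompt/encoding/optimization variant label
  success: boolean;
  latencyMs: number;
}

interface Aggregate {
  runs: number;
  successRate: number;
  meanLatencyMs: number;
}

// Parse JSONL text: one JSON object per non-empty line.
function parseJsonl(text: string): RunRecord[] {
  return text
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0)
    .map((line) => JSON.parse(line) as RunRecord);
}

// Group records by "cell/variant" and compute simple summary metrics.
function aggregate(records: RunRecord[]): Map<string, Aggregate> {
  const sums = new Map<string, { runs: number; ok: number; latency: number }>();
  for (const r of records) {
    const key = `${r.cell}/${r.variant}`;
    const s = sums.get(key) ?? { runs: 0, ok: 0, latency: 0 };
    s.runs += 1;
    if (r.success) s.ok += 1;
    s.latency += r.latencyMs;
    sums.set(key, s);
  }
  const out = new Map<string, Aggregate>();
  sums.forEach((s, key) => {
    out.set(key, {
      runs: s.runs,
      successRate: s.ok / s.runs,
      meanLatencyMs: s.latency / s.runs,
    });
  });
  return out;
}
```

Keying on `cell/variant` keeps SLOP and MCP runs of the same variant separate, which is the comparison this rig is meant to support.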

Impact

This gives the repo a dedicated v2 benchmark workspace for comparing SLOP/MCP variants and collecting data to inform v0.2 protocol decisions, while keeping the existing benchmarks/mcp-vs-slop benchmark in place.

Validation

  • bunx tsc -p benchmarks/v2/tsconfig.json --noEmit

Notes:

  • bun run preflight --list --files benchmarks/v2 package.json bun.lock reports no configured checks for slop-benchmarks-v2; a full workspace preflight fans out because the root workspace and lockfile changed.
  • The DGX smoke script was not run here because it depends on the local SLOP_DGX_URL/Ollama endpoint described in benchmarks/v2/README.md.
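As a rough illustration of the endpoint dependency noted above, a smoke-test entrypoint could decide whether to run based on `SLOP_DGX_URL`. This is a sketch under assumptions: `decideSmokeRun` and `SmokeDecision` are hypothetical names, and the actual script in benchmarks/v2 may handle a missing endpoint differently.

```typescript
// Hypothetical helper, for illustration only; not taken from benchmarks/v2.
type SmokeDecision =
  | { run: true; baseUrl: string }
  | { run: false; reason: string };

// Takes the environment as a plain object so it can be exercised without
// touching the real process environment.
function decideSmokeRun(env: Record<string, string | undefined>): SmokeDecision {
  const raw = env["SLOP_DGX_URL"]?.trim();
  if (!raw) {
    return { run: false, reason: "SLOP_DGX_URL is not set" };
  }
  // Drop trailing slashes so request paths can be appended consistently.
  return { run: true, baseUrl: raw.replace(/\/+$/, "") };
}
```

A caller would pass `process.env` and skip the smoke run, rather than fail, when `run` is false.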

@vercel
Contributor

vercel Bot commented Apr 17, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| slop-demo | Ready | Preview, Comment | Apr 17, 2026 10:47am |
| slop-docs | Ready | Preview, Comment | Apr 17, 2026 10:47am |
| slop-landing | Ready | Preview, Comment | Apr 17, 2026 10:47am |
| slop-playground | Ready | Preview, Comment | Apr 17, 2026 10:47am |

