agent-eval exposes its evaluation logic over a versioned wire protocol so non-TypeScript clients (Python, Rust, Go, …) can drive it without a parallel implementation. The TypeScript runtime is the single source of truth; clients in other languages are transport adapters, not ports.
```
your code (any language)
        │
        ▼
thin transport client ──HTTP──▶ agent-eval serve ──┐
        │                                          │
        └─────subprocess────────▶ agent-eval rpc ──┤
                                                   ▼
                                   same TS handlers, same rubrics,
                                   same scoring code
```
Both transports talk to identical handlers. If you need a sustained connection (live agent paths, high-frequency calls), use HTTP. If you need a one-shot (cron, CI, batch), use stdio RPC. The wire shape is the same.
| | HTTP | stdio RPC |
|---|---|---|
| Start | `agent-eval serve --port 5005` | per-call: `agent-eval rpc <method>` |
| Latency | ~10 ms | ~500 ms (Node startup) |
| Best for | live calls, agent paths, dashboards | cron, CI, batch evaluation |
| Requires | running server | binary on PATH |
The current surface is the smallest useful slice. Adding a method is mechanical — see §Adding a method.
POST /v1/judge
Request body:

```json
{
  "rubricName": "anti-slop",
  "content": "We just shipped zero-copy IO between sandboxes",
  "context": { "platform": "x", "author": "drew", "impressions": 1240 }
}
```

Response:

```json
{
  "composite": 0.795,
  "dimensions": { "buyer_quality": 0.85, "voice": 0.7, "signal": 0.8 },
  "failureModes": [],
  "wins": ["specific-component", "earned-detail"],
  "rationale": "Specific architectural detail, no AI cadence, technical voice.",
  "rubricVersion": "anti-slop@a4f2b8c1",
  "model": "claude-sonnet-4-6",
  "durationMs": 1840
}
```

Pass either `rubricName` (built-in) or `rubric` (inline definition), not both. The handler:
- Resolves the rubric.
- Calls the judging LLM with a JSON-schema-constrained response.
- Computes `composite = Σ(weight_i × normalized_score_i) / Σ(weight_i)`.
- Returns a typed `JudgeResult`.
`rubricVersion` is the stable hash of the rubric used. Scores are only comparable across runs when this matches.
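The composite formula is a weighted mean, which is easy to check by hand. A minimal Python sketch, assuming the per-dimension scores are already normalized to the 0 to 1 range; the function name `composite_score` is hypothetical and not part of agent-eval, and the weights are the ones the `anti-slop` rubric reports:

```python
def composite_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted mean: sum(weight_i * score_i) / sum(weight_i)."""
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    weighted = sum(weights[dim] * scores[dim] for dim in weights)
    return weighted / total_weight


# Weights as listed for the anti-slop rubric; scores are illustrative.
weights = {"buyer_quality": 0.5, "voice": 0.3, "signal": 0.2}
scores = {"buyer_quality": 0.85, "voice": 0.7, "signal": 0.8}

composite = composite_score(scores, weights)
print(round(composite, 3))
```

With these inputs the weighted sum is 0.425 + 0.21 + 0.16, so the composite comes out at roughly 0.795. Dividing by the weight total only changes the result when a rubric's weights do not sum to 1.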
GET /v1/rubrics

```json
{
  "rubrics": [
    {
      "name": "anti-slop",
      "description": "Voice and signal quality for technical-buyer content.",
      "dimensions": [
        { "id": "buyer_quality", "description": "Would the target buyer care?", "weight": 0.5 },
        { "id": "voice", "description": "Builder voice, not AI/marketing?", "weight": 0.3 },
        { "id": "signal", "description": "Non-obvious detail or constraint?", "weight": 0.2 }
      ],
      "failureModes": ["ai-cadence", "marketing-tone", "vague-claim", "no-hook", "engagement-bait", "off-icp", "stale-claim"],
      "rubricVersion": "anti-slop@a4f2b8c1"
    }
  ]
}
```

GET /v1/version

```json
{
  "package": "@tangle-network/agent-eval",
  "version": "0.20.10",
  "wireVersion": "1.0.0",
  "apiSurface": ["judge", "listRubrics", "version"]
}
```

`version` matches the package version. `wireVersion` bumps independently, and only on breaking request/response schema changes. Package versions can differ between client and server as long as `wireVersion` matches.
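A client can turn this rule into a startup check. The sketch below is one way to do it, assuming the client was built against a known wire version; `CLIENT_WIRE_VERSION`, `WireVersionMismatch`, and `check_wire_version` are hypothetical names for illustration, not part of any shipped client:

```python
# Hypothetical: the wire version this client was generated or written against.
CLIENT_WIRE_VERSION = "1.0.0"


class WireVersionMismatch(RuntimeError):
    """Raised when the server speaks a different wire version than the client."""


def check_wire_version(version_payload: dict, expected: str = CLIENT_WIRE_VERSION) -> None:
    """Compare only wireVersion; the package version is allowed to differ."""
    actual = version_payload.get("wireVersion")
    if actual != expected:
        raise WireVersionMismatch(
            f"server speaks wire {actual!r}, client expects {expected!r}"
        )


# A newer package with the same wire version passes the check.
check_wire_version({"version": "0.21.0", "wireVersion": "1.0.0"})
```

The payload passed in would be the parsed body of `GET /v1/version`. Comparing for exact equality follows the rule as stated; a looser policy would be a design decision for the client.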
GET /healthz

For probing whether a server is up. Returns `{ "status": "ok", "uptimeSec": <number> }`.
The OpenAPI spec (`dist/openapi.json`) is auto-generated from the Zod schemas. This is what code generators consume to produce typed clients in other languages.
Every error response uses the same shape:
```json
{
  "error": {
    "code": "rubric_not_found",
    "message": "No built-in rubric named \"missing-name\".",
    "details": null
  }
}
```

| HTTP | code | meaning |
|---|---|---|
| 400 | `validation_error` | Request didn't match the schema. |
| 404 | `rubric_not_found` | Unknown `rubricName`. |
| 500 | `judge_error` | LLM returned malformed output. |
| 500 | `internal_error` | Unexpected server error. |
stdio RPC uses the same shape inside an envelope: `{"error": {...}}` instead of `{"result": {...}}`. The exit code is non-zero on error.
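Because the error shape is uniform, a client needs only one place to convert it into an exception. A minimal Python sketch, assuming the envelope has already been parsed from JSON; the `WireError` class here is a hypothetical Python mirror of the error shape and `unwrap_envelope` is an illustrative name:

```python
import json


class WireError(Exception):
    """Python-side mirror of the wire error shape (hypothetical class)."""

    def __init__(self, code: str, message: str, details=None):
        super().__init__(f"{code}: {message}")
        self.code = code
        self.message = message
        self.details = details


def unwrap_envelope(envelope: dict):
    """Return the result of a stdio RPC envelope, or raise WireError."""
    if "error" in envelope:
        err = envelope["error"]
        raise WireError(err["code"], err["message"], err.get("details"))
    return envelope["result"]


ok = unwrap_envelope(json.loads('{"result": {"status": "ok"}}'))

try:
    unwrap_envelope(
        {"error": {"code": "rubric_not_found", "message": "No such rubric.", "details": None}}
    )
except WireError as exc:
    failed_code = exc.code  # callers can branch on the stable code string
```

An HTTP error body has the same `error` key, so the raising branch applies to it as well; only the success path differs, since HTTP responses are not wrapped in `result`.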
```bash
agent-eval serve --port 5005 --host 127.0.0.1
```

Defaults to `127.0.0.1:5005`. Bind to `0.0.0.0` only if you trust the network.
```bash
# health
curl http://localhost:5005/healthz

# discover
curl http://localhost:5005/v1/rubrics | jq

# judge
curl -X POST http://localhost:5005/v1/judge \
  -H 'content-type: application/json' \
  -d '{"rubricName":"anti-slop","content":"We just shipped …"}'
```

```bash
# version
echo '{}' | agent-eval rpc version

# listRubrics
echo '{}' | agent-eval rpc listRubrics

# judge (one-shot)
echo '{"rubricName":"anti-slop","content":"…"}' | agent-eval rpc judge

# JSONL batch: one request per line
cat requests.jsonl | agent-eval rpc-batch judge > results.jsonl
```

Each invocation is one process, so Node startup adds ~500 ms per call. For more than a few calls, stand up a server.
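The same one-shot call can be made from another language by spawning the process and piping JSON to it. A rough Python sketch, assuming the envelope is written to stdout even when the exit code is non-zero (the text above does not say which stream carries it); `rpc_call` and its `command` parameter are hypothetical:

```python
import json
import subprocess
from typing import Sequence


def rpc_call(
    method: str,
    params: dict,
    command: Sequence[str] = ("agent-eval", "rpc"),
) -> dict:
    """Run one stdio RPC call and return the parsed envelope."""
    proc = subprocess.run(
        [*command, method],
        input=json.dumps(params),  # request body goes to the child's stdin
        capture_output=True,
        text=True,
        check=False,  # assumption: an error envelope still arrives on stdout
    )
    if not proc.stdout.strip():
        raise RuntimeError(
            f"rpc {method} produced no output "
            f"(exit {proc.returncode}): {proc.stderr.strip()}"
        )
    return json.loads(proc.stdout)
```

With the binary on PATH, `rpc_call("version", {})` would correspond to `echo '{}' | agent-eval rpc version`. Each call still pays the process startup cost, so this suits cron or CI jobs rather than hot paths.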
- Python: source lives in `clients/python`. Auto-detects HTTP, falls back to subprocess. Version-locked to npm.
- TypeScript: import directly from `@tangle-network/agent-eval` (no wire round-trip needed in-process).
- Rust / Go / Other: generate from `dist/openapi.json`. PRs welcome to add an officially-maintained client.
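The "auto-detect HTTP, fall back to subprocess" behaviour can be approximated in a few lines. This is a sketch of the idea only, not the code in `clients/python`; `server_is_up` and `choose_transport` are hypothetical names, and the probe assumes the health endpoint lives at `/healthz` on the default address:

```python
import json
import shutil
import urllib.request


def server_is_up(base_url: str, timeout: float = 0.5) -> bool:
    """Probe the health endpoint; any failure counts as 'not up'."""
    try:
        with urllib.request.urlopen(f"{base_url}/healthz", timeout=timeout) as resp:
            return json.load(resp).get("status") == "ok"
    except (OSError, ValueError, AttributeError):
        # OSError covers refused connections, timeouts, and HTTP errors;
        # ValueError covers malformed JSON; AttributeError covers non-object JSON.
        return False


def choose_transport(
    base_url: str = "http://127.0.0.1:5005",
    binary: str = "agent-eval",
) -> str:
    """Prefer a running server, otherwise fall back to the CLI if present."""
    if server_is_up(base_url):
        return "http"
    if shutil.which(binary):
        return "stdio"
    raise RuntimeError("no agent-eval server reachable and no binary on PATH")
```

Probing with a short timeout keeps the fallback cheap when no server is running, at the cost of one extra round-trip when one is.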
- Schema: define `XRequestSchema` and `XResponseSchema` in `src/wire/schemas.ts`. Every field gets a `.describe()` so docs flow through to OpenAPI.
- Handler: pure function in `src/wire/handlers.ts`. Throws `WireError` for caller-fixable issues.
- Server route: `app.post('/v1/x', …)` in `src/wire/server.ts`.
- RPC case: add `case 'x':` in `dispatchRpc` in `src/wire/rpc.ts`.
- OpenAPI route: register in `src/wire/openapi.ts` so it shows up in the spec.
- Test: add to `tests/wire/`. At minimum: schema validation, happy-path, error-path.
- Python client: add a method on `Client` in `clients/python/src/agent_eval_rpc/client.py`, plus pydantic models in `models.py` mirroring the new schemas.
The pattern is mechanical. When the surface grows past ~10 methods, swap the hand-written Python models for ones generated by datamodel-code-generator: `datamodel-codegen --input openapi.json --output models.py`.
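To give a sense of what "mirroring the schemas" means on the Python side, here is a hand-written model for the judge response. The real client uses pydantic; this sketch uses a stdlib dataclass as a dependency-free stand-in, and the `from_wire` helper and snake_case attribute names are assumptions rather than the client's actual API:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JudgeResult:
    composite: float
    dimensions: dict[str, float]
    failure_modes: list[str]
    wins: list[str]
    rationale: str
    rubric_version: str
    model: str
    duration_ms: int

    @classmethod
    def from_wire(cls, payload: dict) -> "JudgeResult":
        """Map the camelCase wire fields onto snake_case attributes."""
        return cls(
            composite=float(payload["composite"]),
            dimensions=dict(payload["dimensions"]),
            failure_modes=list(payload["failureModes"]),
            wins=list(payload["wins"]),
            rationale=payload["rationale"],
            rubric_version=payload["rubricVersion"],
            model=payload["model"],
            duration_ms=int(payload["durationMs"]),
        )


# Illustrative payload in the shape of a judge response.
result = JudgeResult.from_wire({
    "composite": 0.795,
    "dimensions": {"buyer_quality": 0.85, "voice": 0.7, "signal": 0.8},
    "failureModes": [],
    "wins": ["specific-component"],
    "rationale": "Specific architectural detail.",
    "rubricVersion": "anti-slop@a4f2b8c1",
    "model": "claude-sonnet-4-6",
    "durationMs": 1840,
})
```

Every new wire method needs a pair of models like this one, which is the repetitive part that a generator removes once the surface grows.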
`WIRE_VERSION` (in `src/wire/schemas.ts`) is a separate semver from the npm/PyPI package version. It bumps on breaking changes to a request/response schema. Additive changes (new optional fields, new methods) don't require a bump.
When `WIRE_VERSION` bumps, every language client gets a new major version; the dual-publish CI (see `.github/workflows/publish.yml`) enforces this lock-step.