This directory is the reproducible home for external benchmark harnesses, Cortex adapters, configs, and saved benchmark outputs.
It intentionally separates:
benchmark/: existing in-repo Cortex recall/metric scriptsbenchmarking/: external benchmark suites plus the glue needed to run Cortex against them
Cortex ships three measurement adapters. Pick based on what you want to measure.
| Mode | Adapter | When to use | What it measures |
|---|---|---|---|
| Pure | cortex-http-pure |
Default. Every new quality claim. | Core daemon recall quality only. Zero helpers. Zero pre-/post-processing. Single /recall call per query. |
| Base (deprecated) | cortex-http-base |
Historical comparison only. | Partial helpers. Retained for pre-v0.6.0 continuity. Do not use for new claims. |
| Tuned | cortex-http |
Regression testing only. | Full helper stack. Not a core-quality claim. Shows what's possible with adapter-layer tuning; not representative of the daemon itself. |
bash scripts/benchmark-triad.sh longmemeval-sWrites three JSON result files to benchmarking/results/. Pure is the canonical measurement.
Cortex commits to:
- Local-first -- measurements run against a local daemon; no cloud fallbacks.
- No oracle leakage -- queries never carry metadata the daemon would not see in production.
- No scoring without credentials -- scored runs require explicit answerer/judge API keys.
- No helper inflation -- every public recall-quality score claim comes from
cortex-http-pure.
Five CI gates enforce these in scripts/purity-gates/. CODEOWNERS protects the canonical adapter, CHANGELOG.md, and the gate scripts themselves from silent drift.
benchmarking/adversarial/cas-100.jsonl-- 100-item Cortex Adversarial Suite (15 categories). Wilson 95% CI half-width ±7pp at N=100.benchmarking/judges/triangle.py-- three-judge cross-family protocol (GPT-4o + Claude + local Qwen3-30B via Ollama). Pairwise Cohen's κ + Fleiss' κ.--answerer-familyoverlap refuses to run.
See benchmarking/adversarial/cas-100.spec.md for suite format + usage.
benchmarks.lock.json: pinned external benchmark repos and their target commitssetup-benchmarks.ps1: clone/update the external suites intobenchmarking/tools/tools/: ignored local clones of external benchmark reposadapters/: tracked Cortex adapters and harness glueconfigs/: tracked benchmark configs and run manifestsresults/: tracked summary outputs we want to keep in gitruns/: ignored raw run logs, temp files, datasets, and scratch outputs
agent-memory-benchmark: primary harness for baseline adapter workLongMemEval: long-context memory evaluationlocomo: long conversation memory benchmarkMemoryAgentBench: agent-memory benchmark suite
MemBenchdoes not need a separate clone right now because AMB already has first-class support for it.DMRis still unresolved as a standalone canonical repo.- Until we confirm a better upstream source, treat DMR coverage as something we may implement through the AMB adapter layer instead of cloning an unknown repo.
From repo root:
powershell -ExecutionPolicy Bypass -File benchmarking\setup-benchmarks.ps1That will clone or update the pinned suites into benchmarking/tools/ without polluting git history, because benchmarking/tools/ is ignored.
The first tracked adapter lives in benchmarking/adapters/ and is driven by:
python benchmarking\run_amb_cortex.py smokepython benchmarking\run_amb_cortex.py run --dataset longmemeval --split s --query-limit 20
Important constraints:
- The runner uses an isolated benchmark daemon by default.
- It does not point AMB at the live app daemon, so benchmark data never mixes with real user memory.
- Every run writes
run-manifest.jsoninto its run directory with the Cortex git commit, benchmark tool commits, dataset/mode settings, and whether oracle mode was used. - Every scored run also records the answer/judge provider selected from
OMB_ANSWER_LLM/OMB_JUDGE_LLM, or auto-detected from available API keys. --oracleis allowed only as a diagnostic ceiling. It should not be treated as a headline score.- A real scored run still needs one configured model provider key:
GEMINI_API_KEY,GOOGLE_API_KEY,OPENAI_API_KEY, orGROQ_API_KEY. - Retrieval budget defaults to
300tokens per query (--recall-budget), with strict quality gating enabled by default. - Each run emits
retrieval-metrics.jsonlandgate-report.jsonunder the run directory for auditability. - The default gate mode is
dynamic(--token-gate-mode dynamic), which applies provider-aware token limits (--provider-profile auto|claude|openai|codex|gemini|groq|default). - Saved baseline gates are loaded from
benchmarking/configs/token-gate-baselines.jsonby default:- keyed by provider profile + dataset/split/mode/category scenario
- enforced as non-regression floors/ceilings on top of dynamic profile limits
- ignored only when
--disable-baseline-gatesis set (diagnostics only)
- The quality gate fails the run if:
- accuracy is below
--min-accuracy(default0.90) - token gates fail (dynamic provider profile limits by default, or fixed limits in
absolutemode) - recall token telemetry is missing (unless
--allow-missing-recall-metricsis explicitly set)
- accuracy is below
- Use
--token-gate-mode absolutefor fixed limit enforcement (--max-recall-tokens,--max-avg-recall-tokens). - Use
--token-gate-mode offonly for diagnostics when you explicitly want quality-only gating. - Baselines auto-tighten after passing runs (
--no-auto-tighten-baselineto disable), but only when:- enforcement is enabled
- token gating is on
- query volume meets
--min-queries-for-baseline-update(default20) - run is not scoped to
--query-limit/--query-id
--no-enforce-gateis diagnostics-only and must not be used for headline benchmark claims.