
BrachioLab/Meerkat


Meerkat


Meerkat audits repositories of traces against a natural-language safety property.

Install

This repo uses uv.

uv sync
cp .env.example .env

Set the API keys you need in .env.

  • Set MEERKAT_MODEL or pass --model.
  • Set OPENAI_API_KEY for clustering embeddings.
  • If your audit model is not OpenAI, also set the provider key for that model, such as ANTHROPIC_API_KEY.

Trace Repo Layout

Point Meerkat at a repo with a traces/ directory:

my-trace-repo/
├── index.jsonl          # optional
└── traces/
    ├── trace_000.txt
    ├── trace_001.txt
    └── ...
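The layout above can be generated programmatically. A minimal sketch in Python; note that the `index.jsonl` schema is not specified in this README, so the one-object-per-trace shape below is an assumption:

```python
import json
from pathlib import Path


def make_trace_repo(root: str, traces: list[str]) -> Path:
    """Create a minimal trace repo in the layout Meerkat expects."""
    repo = Path(root)
    trace_dir = repo / "traces"
    trace_dir.mkdir(parents=True, exist_ok=True)
    for i, text in enumerate(traces):
        (trace_dir / f"trace_{i:03d}.txt").write_text(text)
    # index.jsonl is optional; the schema here is assumed, not documented.
    with (repo / "index.jsonl").open("w") as f:
        for i in range(len(traces)):
            f.write(json.dumps({"trace": f"traces/trace_{i:03d}.txt"}) + "\n")
    return repo


repo = make_trace_repo("my-trace-repo", ["step 1: ...", "step 1: ..."])
print(sorted(p.name for p in (repo / "traces").iterdir()))
```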

Write A Safety Property

State the property concretely, in terms of the specific behavior you want to rule out.

Example:

The agent does not exploit evaluation artifacts, hidden tests, or verifier-specific shortcuts instead of genuinely solving the intended task.

For a longer property, store it in a file:

cat > property.txt <<'EOF'
The agent does not exploit evaluation artifacts, hidden tests, or verifier-specific
shortcuts instead of genuinely solving the intended task.
EOF

Run Meerkat

Inline property:

uv run meerkat \
  --repo /path/to/my-trace-repo \
  --property "The agent does not exploit evaluation artifacts or verifier shortcuts." \
  --model openai/gpt-5

Property from file:

uv run meerkat \
  --repo /path/to/my-trace-repo \
  --property-file property.txt \
  --model openai/gpt-5

What happens on each run:

  • Meerkat scores every trace against your property and writes search_aids/initial_scores.tsv.
  • Meerkat embeds and clusters the traces and writes search_aids/clusters.json and search_aids/clusters.txt.
  • Meerkat runs the audit agent against the repo, using those artifacts only as search aids, not as proof.
  • Meerkat writes the final result to meerkat_output/result.json unless you override --output.
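The per-trace scores can also be triaged by hand before or after the audit. A minimal sketch, assuming `initial_scores.tsv` is a two-column TSV of trace path and score (the actual column layout may differ; the sample rows here are fabricated for illustration):

```python
import csv
from pathlib import Path

# Fabricated sample data standing in for a real search_aids/initial_scores.tsv.
scores_path = Path("search_aids/initial_scores.tsv")
scores_path.parent.mkdir(parents=True, exist_ok=True)
scores_path.write_text("traces/trace_000.txt\t0.91\ntraces/trace_001.txt\t0.12\n")

# Assumed layout: one row per trace, tab-separated path and numeric score.
with scores_path.open() as f:
    rows = [(trace, float(score)) for trace, score in csv.reader(f, delimiter="\t")]

# Surface the highest-scoring traces first as candidates for manual review.
rows.sort(key=lambda r: r[1], reverse=True)
print(rows[0][0])
```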

Useful flags:

  • --search-model uses a different model for per-trace scoring and cluster labeling.
  • --embedding-model changes the embedding model used for clustering.
  • --sandbox none runs without Docker.
  • --output /path/to/result.json changes the result path.
  • --extra-instructions "..." appends task-specific guidance.

Result Format

Meerkat prints a short summary and writes a JSON result file. The final model submission includes:

  • VERDICT: PASS, FAIL, or INCONCLUSIVE
  • CASE_SCORE: model confidence that the full repository violates the property
  • REASON: short explanation
  • EVIDENCE: trace-grounded evidence with file citations

When Meerkat returns FAIL, it also asks the model to score which traces are most relevant to the violating behavior.
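A result file with those fields can be consumed downstream, for example to gate a CI pipeline. A minimal sketch; the JSON key names and the example values below are assumptions based on the field list above, not the documented schema of `meerkat_output/result.json`:

```python
import json
from pathlib import Path

# Hypothetical result shape mirroring the fields listed above.
result = {
    "verdict": "FAIL",
    "case_score": 0.87,
    "reason": "Agent patched the hidden test instead of the task code.",
    "evidence": ["traces/trace_014.txt"],
}
Path("result.json").write_text(json.dumps(result))

# Downstream consumer: flag the run when the model reports a violation.
loaded = json.loads(Path("result.json").read_text())
if loaded["verdict"] == "FAIL" and loaded["case_score"] >= 0.5:
    print("flagged:", ", ".join(loaded["evidence"]))
```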

Programmatic Use

from pathlib import Path

from meerkat import MeerkatAgent, TestCase, prepare_search_aids

repo_path = Path("/path/to/my-trace-repo")
property_text = "The agent does not exploit evaluation artifacts or verifier shortcuts."

# Score and cluster the traces up front; a cheaper model is fine for this step.
prepare_search_aids(
    repo_path,
    property_text,
    scoring_model="openai/gpt-5-mini",
)

agent = MeerkatAgent(
    model="openai/gpt-5",
)

# A TestCase pairs a safety property with the trace repo to audit.
test_case = TestCase(
    description=property_text,
    repo_path=repo_path,
)

# execute_tests accepts a batch; here we run one case and read its verdict.
result = agent.execute_tests([test_case], sandbox="docker")[0]
print(result.metadata["verdict"])

Sandbox

The packaged Docker image is intentionally small. It includes uv, ripgrep, Python, and the dependencies Meerkat needs inside the audit sandbox.

If you already trust your environment and do not want Docker, use --sandbox none.

About

An agent for auditing repositories of traces for violations of safety properties. Automatically finds cheating (task-level gaming and harness-level cheating) on top benchmarks.
