14 changes: 13 additions & 1 deletion .gitignore
@@ -41,4 +41,16 @@ env/
history.json
*.log
demo_run.sh
output/

# UI — Node / Next.js
ui/frontend/node_modules/
ui/frontend/.next/
ui/frontend/.swc/
ui/frontend/next-env.d.ts
ui/frontend/tsconfig.tsbuildinfo
ui/frontend/package-lock.json
ui/frontend/.env.local

# UI — SQLite database (local data only)
*.db
81 changes: 57 additions & 24 deletions README.md
@@ -28,51 +28,84 @@ EvalMonkey natively supports evaluating ANY LLM: **AWS Bedrock**, **Azure**, **G

## 🚀 At a Glance
- **8 Agent Frameworks natively supported**: CrewAI, LangChain, OpenAI Agents, Microsoft AutoGen, AWS Bedrock, Ollama, Strands, and custom HTTP endpoints.
- **19 Standard Benchmarks out-of-the-box**: GSM8K, BIG-Bench Hard, HotpotQA, ToxiGen, MT-Bench, MBPP, and more — all categorised by the agent type they target.
- **23 Chaos Injections ready to run**: 12 client-side payload mutations + 11 server-side middleware injections — all text-based, no GPU or vision dependencies.
- **Automatic Eval Asset Generation**: Poor benchmark scores automatically produce `traces.json`, `evals.json`, and `improvement_prompt.md` — one `cat` command away from Claude Code or Cursor.
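
For example, a low chaos-run score drops those assets next to your results. A minimal loop looks like the sketch below — the `client_prompt_injection` profile name and the `output/` directory follow this repo's defaults, so adjust them to your setup:

```bash
# Run a benchmark with a client-side prompt-injection chaos profile
evalmonkey run-chaos --scenario mmlu --chaos-profile client_prompt_injection

# A poor score writes traces.json, evals.json, and improvement_prompt.md;
# paste the prompt into Claude Code or Cursor to start fixing the agent
cat output/improvement_prompt.md
```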

---

## ⚡️ Quick Start

### Option A — Let Claude Code or Cursor set it up for you (30 seconds)

Open Claude Code, Cursor, or any AI coding assistant and paste this prompt:

```
Set up EvalMonkey in my project so I can benchmark my AI agent.

1. Clone https://github.com/Corbell-AI/evalmonkey into a sibling folder
2. Run: pip install -e . inside that folder
3. Copy .env.example to .env and ask me which LLM provider I want to use as the benchmark judge (OpenAI, Anthropic, Bedrock, or Ollama) — then fill in the correct key
4. Run: evalmonkey init --framework <my_framework> --name "My Agent" --port <my_port>
Use the framework my agent is built with (crewai / langchain / openai / bedrock / autogen / ollama / strands / custom)
5. Show me the generated evalmonkey.yaml and ask me to confirm the agent URL and response path are correct
6. Run a quick smoke test: evalmonkey run-benchmark --scenario gsm8k --sample-agent rag_app --limit 2
to confirm everything is wired up correctly
7. Then run the real benchmark against my agent: evalmonkey run-benchmark --scenario mmlu --limit 5
8. Show me the score and explain what it means
```

> The agent will handle cloning, installing, configuring your `.env`, and running the first benchmark — all without you typing a single command.

---

### Option B — Manual Setup (5 minutes)

**1. Install**
```bash
git clone https://github.com/Corbell-AI/evalmonkey
cd evalmonkey
pip install -e .
```

**2. Configure your LLM key** (used only as the evaluation judge — never for your agent)
```bash
cp .env.example .env
```
Open `.env` and set **one** of these depending on your LLM provider:
```bash
EVAL_MODEL=gpt-4o
OPENAI_API_KEY=sk-... # OpenAI

# — OR —
EVAL_MODEL=anthropic/claude-haiku-4-5
ANTHROPIC_API_KEY=sk-ant-... # Anthropic

# — OR —
EVAL_MODEL=bedrock/anthropic.claude-3-haiku-20240307-v1:0
AWS_ACCESS_KEY_ID=...   # AWS Bedrock

# — OR — (no key needed)
EVAL_MODEL=ollama/llama3   # Local Ollama
```

**3. Smoke test with the built-in sample agent** (no agent of your own needed yet)
```bash
evalmonkey run-benchmark --scenario gsm8k --sample-agent rag_app --limit 3
```
You should see 3 samples run and a score printed. ✅

**4. Point it at your own agent**
```bash
cd /path/to/your/agent/project
evalmonkey init --framework crewai --name "My Agent" --port 8000
# Edit the generated evalmonkey.yaml to set your agent's URL and response format
evalmonkey run-benchmark --scenario mmlu --limit 5
```

> `evalmonkey.yaml` is discovered from the **current working directory** — same convention as `pytest` and `docker-compose`.
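
For reference, here is a minimal sketch of what the generated `evalmonkey.yaml` typically looks like for a CrewAI agent. The field names follow the CrewAI template; the URL, paths, and startup command are placeholders to replace with your own values:

```yaml
agent:
  name: "My Agent"
  framework: crewai
  url: http://localhost:8000/chat   # where your agent listens
  request_key: message              # key the benchmark question is sent under
  response_path: reply              # where the agent's answer is read from

# Optional: EvalMonkey can start the agent, benchmark it, then stop it
agent_command: "python src/agent.py"
agent_startup_wait: 3               # seconds to wait after launch
```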

---


## 🤝 Works With Any Agent — No Code Changes Required
32 changes: 28 additions & 4 deletions evalmonkey/scenarios/standard_benchmarks.py
@@ -101,7 +101,17 @@ def load_standard_benchmark(benchmark_name: str, limit: int = 5) -> List[EvalSce
Automatically downloads datasets and converts them to standard HTTP scenarios!
"""
try:
from datasets import load_dataset
import os
# Prevent PyTorch shared-memory multiprocessing on Mac.
# Even with streaming=True, HuggingFace datasets can invoke torch_shm_manager
# for internal caching — which fails on Mac with "Permission denied".
os.environ.setdefault("TOKENIZERS_PARALLELISM", "false")
os.environ.setdefault("OMP_NUM_THREADS", "1")
os.environ.setdefault("HF_DATASETS_OFFLINE", "0")

from datasets import load_dataset, disable_progress_bar, disable_caching
disable_progress_bar()
disable_caching() # prevents torch_shm from being invoked for cache writes
except ImportError:
raise ImportError("The 'datasets' library is required to run standard benchmarks. Please run 'pip install datasets'.")

@@ -132,7 +142,7 @@ def load_standard_benchmark(benchmark_name: str, limit: int = 5) -> List[EvalSce
elif benchmark_name.lower() == "xlam":
# A standard function calling benchmark
try:
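# trust_remote_code=True allows `datasets` to execute a dataset's own loading script when one is defined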
dataset = load_dataset("Salesforce/xlam-function-calling-60k", split="train", streaming=True, trust_remote_code=True)
for idx, item in enumerate(dataset):
if idx >= limit:
break
@@ -172,20 +182,34 @@ def load_standard_benchmark(benchmark_name: str, limit: int = 5) -> List[EvalSce
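# Each hf_map entry is (HF dataset path, config name, split, question column, answer column)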
path, name, split, q_col, a_col = hf_map[benchmark_name.lower()]
desc = SUPPORTED_BENCHMARKS[benchmark_name.lower()]["description"]
print(f"Loading {benchmark_name} from HuggingFace Datasets ({path})...")
dataset = load_dataset(path, name, split=split, streaming=True, trust_remote_code=True) if name else load_dataset(path, split=split, streaming=True, trust_remote_code=True)
for idx, item in enumerate(dataset):
if idx >= limit:
break

question_text = str(item.get(q_col, "No question"))
expected_answer = str(item.get(a_col, 'Unknown'))
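# Multiple-choice sets (MMLU, HellaSwag) store the gold answer as an index into the options;
# resolve it to the option text below so the judging rubric is self-contained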

if benchmark_name.lower() == "mmlu" and "choices" in item:
question_text += f"\nChoices: {item['choices']}"
try:
ans_idx = int(expected_answer)
expected_answer = f"Option {ans_idx}: {item['choices'][ans_idx]}"
except (ValueError, IndexError):
pass
elif benchmark_name.lower() == "hella-swag" and "endings" in item:
question_text += f"\nOptions:\n0: {item['endings'][0]}\n1: {item['endings'][1]}\n2: {item['endings'][2]}\n3: {item['endings'][3]}"
try:
ans_idx = int(expected_answer)
expected_answer = f"Option {ans_idx}: {item['endings'][ans_idx]}"
except (ValueError, IndexError):
pass

scenarios.append(EvalScenario(
id=f"{benchmark_name}_{idx}",
description=desc,
input_payload={"question": question_text},
expected_behavior_rubric=f"Agent MUST deduce or output this answer: {expected_answer}"
))
else:
print(f"Dataset mappings for {benchmark_name} are currently stubbed.")
57 changes: 57 additions & 0 deletions ui/README.md
@@ -0,0 +1,57 @@

# EvalMonkey UI

A professional web interface for running benchmarks, chaos tests, and tracking agent reliability over time.

## Quick Start

**Terminal 1 — Backend (FastAPI)**
```bash
cd <path-to-evalmonkey>
cp .env.example .env # add EVAL_MODEL + your LLM API key
uvicorn ui.backend.main:app --reload --port 8080
```

**Terminal 2 — Frontend (Next.js)**
```bash
cd <path-to-evalmonkey>/ui/frontend
npm install    # dependencies are not checked in (node_modules/ is gitignored)
npm run dev
```

Open **http://localhost:3000** in your browser.

---

## Features

| Page | Description |
|---|---|
| **Dashboard** | Production Reliability hero, live runs, recent results grid |
| **New Run** | 3-step wizard: agent URL → benchmark → configure & launch |
| **Live Run** | SSE-streamed real-time sample results with score rings |
| **History** | Recharts trend lines, reliability per scenario, all-runs table |

## Architecture

```
FastAPI backend → SQLite (~/.evalmonkey/ui.db)
↕ REST + SSE
Next.js frontend → http://localhost:3000
```

The `StorageBackend` ABC in `ui/backend/db.py` makes the storage layer swappable — replace `SQLiteBackend` with `PostgresBackend` in a single line.

## Extending Storage
```python
# In ui/backend/db.py — implement this ABC:
class MyBackend(StorageBackend):
    def save_run(self, run: RunRecord) -> None: ...
    # ... 5 other methods

# Then in your app startup:
from ui.backend.db import set_backend
set_backend(MyBackend())
```

## CLI — No Impact
The existing `evalmonkey` CLI continues to work exactly as before. The UI is a completely additive layer — it imports from the same `evalmonkey.*` packages but makes no changes to them.
1 change: 1 addition & 0 deletions ui/backend/__init__.py
@@ -0,0 +1 @@
# EvalMonkey UI Backend