14 changes: 13 additions & 1 deletion .gitignore
@@ -41,4 +41,16 @@ env/
history.json
*.log
demo_run.sh
output/

# UI — Node / Next.js
ui/frontend/node_modules/
ui/frontend/.next/
ui/frontend/.swc/
ui/frontend/next-env.d.ts
ui/frontend/tsconfig.tsbuildinfo
ui/frontend/package-lock.json
ui/frontend/.env.local

# UI — SQLite database (local data only)
*.db
81 changes: 57 additions & 24 deletions README.md
@@ -28,51 +28,84 @@ EvalMonkey natively supports evaluating ANY LLM: **AWS Bedrock**, **Azure**, **G

## 🚀 At a Glance
- **8 Agent Frameworks natively supported**: CrewAI, LangChain, OpenAI Agents, Microsoft AutoGen, AWS Bedrock, Ollama, Strands, and custom HTTP endpoints.
- **19 Standard Benchmarks out-of-the-box**: GSM8K, BIG-Bench Hard, HotpotQA, ToxiGen, MT-Bench, MBPP, and more — all categorised by the agent type they target.
- **23 Chaos Injections ready to run**: 12 client-side payload mutations + 11 server-side middleware injections — all text-based, no GPU or vision dependencies.
- **Automatic Eval Asset Generation**: Poor benchmark scores automatically produce `traces.json`, `evals.json`, and `improvement_prompt.md` — one `cat` command away from Claude Code or Cursor.
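
For example, a low chaos-run score drops those assets next to your results. A minimal loop looks like the sketch below — the `client_prompt_injection` profile name and the `output/` directory follow this repo's defaults, so adjust them to your setup:

```bash
# Run a benchmark with a client-side prompt-injection chaos profile
evalmonkey run-chaos --scenario mmlu --chaos-profile client_prompt_injection

# A poor score writes traces.json, evals.json, and improvement_prompt.md;
# paste the prompt into Claude Code or Cursor to start fixing the agent
cat output/improvement_prompt.md
```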

---

## ⚡️ Quick Start

### Option A — Let Claude Code or Cursor set it up for you (30 seconds)

Open Claude Code, Cursor, or any AI coding assistant and paste this prompt:

```
Set up EvalMonkey in my project so I can benchmark my AI agent.

1. Clone https://github.com/Corbell-AI/evalmonkey into a sibling folder
2. Run: pip install -e . inside that folder
3. Copy .env.example to .env and ask me which LLM provider I want to use as the benchmark judge (OpenAI, Anthropic, Bedrock, or Ollama) — then fill in the correct key
4. Run: evalmonkey init --framework <my_framework> --name "My Agent" --port <my_port>
Use the framework my agent is built with (crewai / langchain / openai / bedrock / autogen / ollama / strands / custom)
5. Show me the generated evalmonkey.yaml and ask me to confirm the agent URL and response path are correct
6. Run a quick smoke test: evalmonkey run-benchmark --scenario gsm8k --sample-agent rag_app --limit 2
to confirm everything is wired up correctly
7. Then run the real benchmark against my agent: evalmonkey run-benchmark --scenario mmlu --limit 5
8. Show me the score and explain what it means
```

> The agent will handle cloning, installing, configuring your `.env`, and running the first benchmark — all without you typing a single command.

---

### Option B — Manual Setup (5 minutes)

**1. Install**
```bash
git clone https://github.com/Corbell-AI/evalmonkey
cd evalmonkey
pip install -e .
```

**2. Configure your LLM key** (used only as the evaluation judge — never for your agent)
```bash
cp .env.example .env
```
Open `.env` and set **one** of these depending on your LLM provider:
```bash
EVAL_MODEL=gpt-4o
OPENAI_API_KEY=sk-... # OpenAI

# — OR —
EVAL_MODEL=anthropic/claude-haiku-4-5
ANTHROPIC_API_KEY=sk-ant-... # Anthropic

# — OR —
EVAL_MODEL=bedrock/anthropic.claude-3-haiku-20240307-v1:0
AWS_ACCESS_KEY_ID=...   # AWS Bedrock

# — OR — (no key needed)
EVAL_MODEL=ollama/llama3   # Local Ollama
```

**3. Smoke test with the built-in sample agent** (no agent of your own needed yet)
```bash
evalmonkey run-benchmark --scenario gsm8k --sample-agent rag_app --limit 3
```
You should see 3 samples run and a score printed. ✅

**4. Point it at your own agent**
```bash
cd /path/to/your/agent/project
evalmonkey init --framework crewai --name "My Agent" --port 8000
# Edit the generated evalmonkey.yaml to set your agent's URL and response format
evalmonkey run-benchmark --scenario mmlu --limit 5
```

> `evalmonkey.yaml` is discovered from the **current working directory** — same convention as `pytest` and `docker-compose`.
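
For reference, here is a minimal sketch of what the generated `evalmonkey.yaml` typically looks like for a CrewAI agent. The field names follow the CrewAI template; the URL, paths, and startup command are placeholders to replace with your own values:

```yaml
agent:
  name: "My Agent"
  framework: crewai
  url: http://localhost:8000/chat   # where your agent listens
  request_key: message              # key the benchmark question is sent under
  response_path: reply              # where the agent's answer is read from

# Optional: EvalMonkey can start the agent, benchmark it, then stop it
agent_command: "python src/agent.py"
agent_startup_wait: 3               # seconds to wait after launch
```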

---


## 🤝 Works With Any Agent — No Code Changes Required
32 changes: 28 additions & 4 deletions evalmonkey/scenarios/standard_benchmarks.py
@@ -101,7 +101,17 @@ def load_standard_benchmark(benchmark_name: str, limit: int = 5) -> List[EvalSce
Automatically downloads datasets and converts them to standard HTTP scenarios!
"""
try:
from datasets import load_dataset
import os
# Prevent PyTorch shared-memory multiprocessing on Mac.
# Even with streaming=True, HuggingFace datasets can invoke torch_shm_manager
# for internal caching — which fails on Mac with "Permission denied".
os.environ.setdefault("TOKENIZERS_PARALLELISM", "false")
os.environ.setdefault("OMP_NUM_THREADS", "1")
os.environ.setdefault("HF_DATASETS_OFFLINE", "0")

from datasets import load_dataset, disable_progress_bar, disable_caching
disable_progress_bar()
disable_caching() # prevents torch_shm from being invoked for cache writes
except ImportError:
raise ImportError("The 'datasets' library is required to run standard benchmarks. Please run 'pip install datasets'.")

@@ -132,7 +142,7 @@ def load_standard_benchmark(benchmark_name: str, limit: int = 5) -> List[EvalSce
elif benchmark_name.lower() == "xlam":
# A standard function calling benchmark
try:
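# trust_remote_code=True allows `datasets` to execute a dataset's own loading script when one is defined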
dataset = load_dataset("Salesforce/xlam-function-calling-60k", split="train", streaming=True, trust_remote_code=True)
for idx, item in enumerate(dataset):
if idx >= limit:
break
@@ -172,20 +182,34 @@ def load_standard_benchmark(benchmark_name: str, limit: int = 5) -> List[EvalSce
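# Each hf_map entry is (HF dataset path, config name, split, question column, answer column)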
path, name, split, q_col, a_col = hf_map[benchmark_name.lower()]
desc = SUPPORTED_BENCHMARKS[benchmark_name.lower()]["description"]
print(f"Loading {benchmark_name} from HuggingFace Datasets ({path})...")
dataset = load_dataset(path, name, split=split, streaming=True, trust_remote_code=True) if name else load_dataset(path, split=split, streaming=True, trust_remote_code=True)
for idx, item in enumerate(dataset):
if idx >= limit:
break

question_text = str(item.get(q_col, "No question"))
expected_answer = str(item.get(a_col, 'Unknown'))
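# Multiple-choice sets (MMLU, HellaSwag) store the gold answer as an index into the options;
# resolve it to the option text below so the judging rubric is self-contained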

if benchmark_name.lower() == "mmlu" and "choices" in item:
question_text += f"\nChoices: {item['choices']}"
try:
ans_idx = int(expected_answer)
expected_answer = f"Option {ans_idx}: {item['choices'][ans_idx]}"
except (ValueError, IndexError):
pass
elif benchmark_name.lower() == "hella-swag" and "endings" in item:
question_text += f"\nOptions:\n0: {item['endings'][0]}\n1: {item['endings'][1]}\n2: {item['endings'][2]}\n3: {item['endings'][3]}"
try:
ans_idx = int(expected_answer)
expected_answer = f"Option {ans_idx}: {item['endings'][ans_idx]}"
except (ValueError, IndexError):
pass

scenarios.append(EvalScenario(
id=f"{benchmark_name}_{idx}",
description=desc,
input_payload={"question": question_text},
expected_behavior_rubric=f"Agent MUST deduce or output this answer: {expected_answer}"
))
else:
print(f"Dataset mappings for {benchmark_name} are currently stubbed.")
57 changes: 57 additions & 0 deletions ui/README.md
@@ -0,0 +1,57 @@

# EvalMonkey UI

A professional web interface for running benchmarks, chaos tests, and tracking agent reliability over time.

## Quick Start

**Terminal 1 — Backend (FastAPI)**
```bash
cd <path-to-evalmonkey>
cp .env.example .env # add EVAL_MODEL + your LLM API key
uvicorn ui.backend.main:app --reload --port 8080
```

**Terminal 2 — Frontend (Next.js)**
```bash
cd <path-to-evalmonkey>/ui/frontend
npm install    # dependencies are not checked in (node_modules/ is gitignored)
npm run dev
```

Open **http://localhost:3000** in your browser.

---

## Features

| Page | Description |
|---|---|
| **Dashboard** | Production Reliability hero, live runs, recent results grid |
| **New Run** | 3-step wizard: agent URL → benchmark → configure & launch |
| **Live Run** | SSE-streamed real-time sample results with score rings |
| **History** | Recharts trend lines, reliability per scenario, all-runs table |

## Architecture

```
FastAPI backend → SQLite (~/.evalmonkey/ui.db)
↕ REST + SSE
Next.js frontend → http://localhost:3000
```

The `StorageBackend` ABC in `ui/backend/db.py` makes the storage layer swappable — replace `SQLiteBackend` with `PostgresBackend` in a single line.

## Extending Storage
```python
# In ui/backend/db.py — implement this ABC:
class MyBackend(StorageBackend):
    def save_run(self, run: RunRecord) -> None: ...
    # ... 5 other methods

# Then in your app startup:
from ui.backend.db import set_backend
set_backend(MyBackend())
```

## CLI — No Impact
The existing `evalmonkey` CLI continues to work exactly as before. The UI is a completely additive layer — it imports from the same `evalmonkey.*` packages but makes no changes to them.
1 change: 1 addition & 0 deletions ui/backend/__init__.py
@@ -0,0 +1 @@
# EvalMonkey UI Backend