Welcome! This guide gets you from "I just cloned the repo" to a working evaluation report — even if you've never used Azure, Python virtual environments, or Microsoft Foundry before.
Microsoft Foundry Model Router is a service that automatically picks the best AI model for each prompt: for example, it sends an easy question to a fast, cheap model and a hard one to a smarter, more expensive model. The promise is "comparable quality at a lower average cost and latency."
This repo is the measurement tool that tests whether that promise holds up for your prompts. It:
- Sends a list of prompts to Model Router (the system being measured) and to a baseline model of your choice (e.g. GPT‑5 directly).
- Records every response, how long it took, how many tokens it used, and what it cost.
- Asks a separate judge model (LLM-as-a-judge) to score answer quality on accuracy, completeness, clarity, and helpfulness — including pairwise A/B comparisons with anti-bias dual ordering.
- Produces an HTML dashboard, Markdown report, CSV, and JSON so you can decide whether Model Router is right for your workload.
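To make "anti-bias dual ordering" from the list above concrete, here is a minimal sketch of the general technique. It is not this repo's implementation: `judge` is a hypothetical callable that returns `"first"` or `"second"` for the answer it prefers, and the rule that a win counts only when both orderings agree is an assumption made for illustration.

```python
from typing import Callable

# Hypothetical judge signature (an assumption, not the repo's API):
# (prompt, first_answer, second_answer) -> "first" or "second"
Judge = Callable[[str, str, str], str]


def dual_order_verdict(judge: Judge, prompt: str, answer_a: str, answer_b: str) -> str:
    """Return "A", "B", or "tie" after judging in both presentation orders."""
    # Pass 1: A is shown first.
    pass_1 = "A" if judge(prompt, answer_a, answer_b) == "first" else "B"
    # Pass 2: B is shown first, so "first" now means B.
    pass_2 = "B" if judge(prompt, answer_b, answer_a) == "first" else "A"
    # Only a verdict that survives the swap counts; disagreement hints at position bias.
    return pass_1 if pass_1 == pass_2 else "tie"


def position_biased_judge(prompt: str, first: str, second: str) -> str:
    """Toy judge that always prefers whatever is shown first."""
    return "first"


def longer_is_better_judge(prompt: str, first: str, second: str) -> str:
    """Toy judge that prefers the longer answer regardless of position."""
    return "first" if len(first) >= len(second) else "second"


if __name__ == "__main__":
    print(dual_order_verdict(position_biased_judge, "q", "short", "a longer answer"))
    print(dual_order_verdict(longer_is_better_judge, "q", "short", "a longer answer"))
```

The position-biased toy judge contradicts itself once the order is swapped, so its preference is recorded as a tie rather than a win for either side.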
You can run it three ways, in increasing order of setup:
| Mode | Needs Azure? | Time to first result | What it's for |
|---|---|---|---|
| Demo (Part 1) | ❌ No | ~30 seconds | Explore every chart and output format with mock data before deciding to invest more time |
| Live local eval (Part 2) | ✅ Yes | ~5–10 minutes | Run real prompts through Model Router + a baseline, scored locally with your own judge model |
| Foundry cloud eval (Part 3) | ✅ Yes (+ Foundry project) | ~10–20 minutes | Submit results to Microsoft Foundry's hosted evaluators for managed, reproducible grading |
Tip: Always start with Part 1. It costs nothing, runs offline, and shows you exactly what the live evaluation will produce.
You need:
- Python 3.9 or newer. Check with `python --version` (or `python3 --version` on macOS/Linux).
- Git, to clone the repo.
- (Parts 2 and 3 only) An Azure subscription with access to the Model Router service and at least one Azure OpenAI deployment to use as a baseline / judge.
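If you would rather check the Python requirement from inside the interpreter, a small sketch like the following works. The helper name `meets_minimum` is made up for this example; only the 3.9 threshold comes from the list above.

```python
import sys

MINIMUM = (3, 9)  # minimum version stated in the requirements above


def meets_minimum(version_info=sys.version_info, minimum=MINIMUM) -> bool:
    """True when the (major, minor) version is at least the minimum."""
    return tuple(version_info[:2]) >= minimum


if __name__ == "__main__":
    found = ".".join(str(part) for part in sys.version_info[:3])
    status = "OK" if meets_minimum() else "too old"
    print(f"Python {found}: {status}")
```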
A virtual environment (venv) is an isolated folder of Python packages so this project's dependencies don't conflict with anything else on your machine. You only do this once per clone.
Windows (PowerShell):

    python -m venv .venv
    .\.venv\Scripts\Activate.ps1

If activation is blocked, run `Set-ExecutionPolicy -Scope CurrentUser -ExecutionPolicy RemoteSigned` once, then try again.

Windows (Command Prompt, cmd.exe):

    python -m venv .venv
    .venv\Scripts\activate.bat

macOS / Linux:

    python3 -m venv .venv
    source .venv/bin/activate

You'll know it worked when your terminal prompt shows `(.venv)` at the start. To leave the environment later, just type `deactivate`.
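If your shell prompt is customised and does not show the `(.venv)` marker, you can also ask Python directly. This sketch relies on standard behaviour of environments created with `python -m venv`: inside one, `sys.prefix` differs from `sys.base_prefix`. The function name is illustrative.

```python
import sys


def in_virtual_environment() -> bool:
    """True when the running interpreter belongs to a venv.

    Inside a venv, sys.prefix points at the environment folder while
    sys.base_prefix still points at the base Python installation.
    """
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    print("venv active:", in_virtual_environment())
    print("interpreter:", sys.executable)
```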
This generates a mock evaluation report with synthetic data so you can see every chart, metric, and output file the tool produces — without spending a single API call.
Windows (PowerShell):

    .\scripts\demo.ps1

Linux / macOS:

    bash scripts/demo.sh

Any OS, manually:

    pip install -e .
    python scripts/generate_sample_report.py --output-dir results/demo

Then open `results/demo/dashboard.html` in your browser.
The script:
- Installed this repo as a Python package (the `pip install -e .` step) so the `src/` modules can be imported.
- Generated 100 fake prompt/response pairs across 8 task categories.
- Ran them through the same reporting pipeline the live eval uses.
- Wrote a complete output set to `results/demo/`.
| File | What it shows |
|---|---|
| `dashboard.html` | Interactive HTML dashboard with all charts; open this first |
| `report.md` | Markdown summary of cost, latency, quality, and model distribution |
| `detailed_results.csv` | Per-prompt breakdown you can open in Excel for further analysis |
| `results.json` | Machine-readable metrics for scripting/CI |
| `chart_*.png` | Individual chart images you can paste into slides |
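For scripting or CI, a reasonable first step is to discover which numeric metrics `results.json` contains. Its schema is not described in this guide, so the sketch below assumes no key names: it flattens any nested JSON into dotted paths and keeps the numeric leaves. The sample document in the demo is made up for illustration.

```python
import json
from typing import Any, Dict


def flatten(node: Any, prefix: str = "") -> Dict[str, Any]:
    """Flatten nested dicts/lists into {"a.b.0.c": value} pairs."""
    if isinstance(node, dict):
        items = node.items()
    elif isinstance(node, list):
        items = enumerate(node)
    else:
        return {prefix: node}
    flat: Dict[str, Any] = {}
    for key, value in items:
        path = f"{prefix}.{key}" if prefix else str(key)
        flat.update(flatten(value, path))
    return flat


def numeric_metrics(json_text: str) -> Dict[str, float]:
    """Keep only numeric leaves (bools excluded) from a JSON document."""
    flat = flatten(json.loads(json_text))
    return {
        path: value
        for path, value in flat.items()
        if isinstance(value, (int, float)) and not isinstance(value, bool)
    }


if __name__ == "__main__":
    # Made-up structure for illustration; the real results.json may differ.
    sample = '{"router": {"latency": {"p50": 1.2}}, "ok": true, "runs": [3, 4]}'
    for path, value in numeric_metrics(sample).items():
        print(path, value)
```

To inspect a real run, pass in the text of `results/demo/results.json`. Once you know which paths matter, a CI job can compare them against thresholds of your choosing.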
Charts included:
- Cost comparison (router vs baseline)
- Latency comparison (mean, p50, p90, p95, p99)
- Latency distribution histogram
- Per-category latency breakdown
- Token usage breakdown
- Model distribution pie (which models did Model Router pick?)
- Pairwise win rates (quality A/B)
- Absolute score comparison
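The latency chart reports a mean plus tail percentiles. This guide does not say how the repo computes them, so the following is a sketch using the nearest-rank definition, one of several common conventions; other tools interpolate between neighbouring values and can give slightly different numbers on small samples.

```python
import math
from statistics import mean
from typing import Dict, Sequence


def percentile(values: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of the data at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(values)
    # Multiply before dividing so whole-number ranks stay exact in floating point.
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def latency_summary(latencies: Sequence[float]) -> Dict[str, float]:
    """Mean plus the percentiles shown on the latency comparison chart."""
    summary = {"mean": mean(latencies)}
    for pct in (50, 90, 95, 99):
        summary[f"p{pct}"] = percentile(latencies, pct)
    return summary


if __name__ == "__main__":
    # Synthetic latencies (1..100), not measurements.
    print(latency_summary(list(range(1, 101))))
```

Tail percentiles such as p95 and p99 are only meaningful with enough samples: with 10 prompts, nearest-rank p95 and p99 both collapse to the single slowest request.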
When the dashboard opens, browse the charts and reports — these are exactly the same artefacts a live run produces, just generated from synthetic numbers.
When you're ready to measure Model Router on your own prompts and Azure resources:

    # Copy the template, then fill in your real values
    cp .env.example .env      # macOS/Linux
    copy .env.example .env    # Windows (in cmd.exe, leave off this trailing comment)

Open `.env` and set the four endpoint URLs and API keys (router, baseline, judge, optional Foundry project). The file is in `.gitignore`, so your secrets won't be committed.
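Before a live run, it can help to confirm that nothing in `.env` was left blank. The sketch below is a deliberately simple `KEY=VALUE` parser, not the loader this repo uses; it ignores features such as an `export` prefix or inline comments. The variable names in the demo are hypothetical, so check `.env.example` for the real ones.

```python
from typing import Dict, List


def parse_env(text: str) -> Dict[str, str]:
    """Parse simple KEY=VALUE lines; skips blank lines and # comments."""
    values: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")  # split at the first "=" only
        values[key.strip()] = value.strip().strip("\"'")
    return values


def unset_keys(text: str) -> List[str]:
    """Return the keys whose value is still empty."""
    return [key for key, value in parse_env(text).items() if not value]


if __name__ == "__main__":
    # Hypothetical variable names, for illustration only.
    sample = "# credentials\nEXAMPLE_ENDPOINT=https://example.invalid/\nEXAMPLE_API_KEY=\n"
    print("still empty:", unset_keys(sample))
```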
Pick a config:
- `configs/quick_test.yaml`: small and fast (~10 prompts); a good first run.
- `configs/default.yaml`: full benchmark (100 prompts).
- `configs/foundry.yaml`: adds Foundry cloud-eval submission.

Pick a dataset:
- `datasets/sample_custom.jsonl`: 10 generic prompts across 8 categories (smoke test).
- `datasets/zava_custom.jsonl`: 25 retail-themed prompts (broader benchmark).
- Or supply your own JSONL; see datasets/README.md and docs/how-to-custom-dataset.md.
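If you supply your own JSONL file, a quick structural check catches most formatting mistakes before any API calls are made. The sketch below only verifies that each non-blank line is a standalone JSON object. It does not check field names, which are documented in datasets/README.md; the `prompt` field in the demo is a placeholder.

```python
import json
from typing import List, Tuple


def check_jsonl(text: str) -> Tuple[int, List[str]]:
    """Return (number of valid records, list of problems) for JSONL text."""
    valid = 0
    problems: List[str] = []
    for number, line in enumerate(text.split("\n"), start=1):
        if not line.strip():
            continue  # tolerate blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError as error:
            problems.append(f"line {number}: invalid JSON ({error.msg})")
            continue
        if not isinstance(record, dict):
            problems.append(f"line {number}: expected a JSON object")
            continue
        valid += 1
    return valid, problems


if __name__ == "__main__":
    # Placeholder records; real datasets follow the schema in datasets/README.md.
    sample = '{"prompt": "hi"}\n\n[1, 2]\n{"prompt": \n'
    count, issues = check_jsonl(sample)
    print(count, "valid record(s)")
    for issue in issues:
        print(issue)
```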
Run the evaluation:

    python scripts/run_eval.py --config configs/quick_test.yaml --dataset datasets/sample_custom.jsonl

Results land in `results/<run-name>/` with the same files as the demo. The run is checkpointed, so if it's interrupted you can re-run the same command and it'll resume from where it stopped.
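To show why re-running the same command can resume safely, here is a sketch of one common checkpointing pattern: an append-only JSONL file keyed by prompt ID. It is an assumption about how such a feature could work, not a description of this repo's checkpoint format; the `id`, `prompt`, and `response` fields and both function names are hypothetical.

```python
import json
import os
import tempfile
from typing import Callable, Dict, Iterable, Set


def completed_ids(checkpoint_path: str) -> Set[str]:
    """Read the IDs already recorded in an append-only JSONL checkpoint."""
    if not os.path.exists(checkpoint_path):
        return set()
    with open(checkpoint_path, encoding="utf-8") as handle:
        return {json.loads(line)["id"] for line in handle if line.strip()}


def run_with_checkpoint(
    prompts: Iterable[Dict[str, str]],
    evaluate: Callable[[str], str],
    checkpoint_path: str,
) -> int:
    """Evaluate prompts not yet in the checkpoint; return how many were run."""
    done = completed_ids(checkpoint_path)
    ran = 0
    with open(checkpoint_path, "a", encoding="utf-8") as handle:
        for item in prompts:
            if item["id"] in done:
                continue  # finished in an earlier run
            result = {"id": item["id"], "response": evaluate(item["prompt"])}
            handle.write(json.dumps(result) + "\n")
            handle.flush()  # hand each result to the OS before moving on
            ran += 1
    return ran


if __name__ == "__main__":
    prompts = [{"id": str(n), "prompt": f"question {n}"} for n in range(5)]
    with tempfile.TemporaryDirectory() as folder:
        path = os.path.join(folder, "checkpoint.jsonl")
        first = run_with_checkpoint(prompts[:2], str.upper, path)  # simulated interruption
        second = run_with_checkpoint(prompts, str.upper, path)     # same list, re-run
        print(first, second)
```

This simple version does not recover from a half-written final line after a hard crash; a production implementation would need to handle that case.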
For the full walkthrough — environment variables, judge model setup, cost tuning, scaling to thousands of prompts — see docs/how-to-run-live-eval.md.
Microsoft Foundry can run managed, reproducible graders on your evaluation data so quality scores aren't tied to your local machine. After a local run completes:

    # 1. Submit your raw results to Foundry for cloud-side grading
    python scripts/run_foundry_eval.py --input-dir results/full-eval
    # 2. Compare local judge scores with Foundry's grader scores
    python scripts/cross_validate.py

See docs/how-to-foundry-eval-sdk.md for Foundry project setup, and docs/faq.md for common errors.
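Comparing two sets of scores usually comes down to asking whether they rank answers the same way and how far apart they are. The sketch below shows one way to measure that with a Pearson correlation and a mean absolute difference over shared prompt IDs. It is not a description of what `scripts/cross_validate.py` computes; the function names, the input shape (prompt ID mapped to score), and the sample scores are assumptions for illustration.

```python
import math
from typing import Dict, Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one side is constant")
    return cov / math.sqrt(var_x * var_y)


def agreement(local: Dict[str, float], cloud: Dict[str, float]) -> Dict[str, float]:
    """Compare scores for the prompt IDs present in both score sets."""
    shared = sorted(set(local) & set(cloud))
    xs = [local[key] for key in shared]
    ys = [cloud[key] for key in shared]
    correlation = pearson(xs, ys)  # raises ValueError if fewer than 2 shared IDs
    mean_abs_diff = sum(abs(x - y) for x, y in zip(xs, ys)) / len(shared)
    return {"n": len(shared), "pearson": correlation, "mean_abs_diff": mean_abs_diff}


if __name__ == "__main__":
    # Made-up scores keyed by hypothetical prompt IDs.
    local_scores = {"p1": 1, "p2": 2, "p3": 3, "p4": 4}
    cloud_scores = {"p1": 2, "p2": 3, "p3": 4, "p5": 9}
    print(agreement(local_scores, cloud_scores))
```

A high correlation with a large mean absolute difference would suggest the two graders rank answers similarly but use the scale differently.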
| If you want to… | Read |
|---|---|
| Understand the methodology and why the metrics are designed this way | docs/methodology.md |
| Use your own prompts | docs/how-to-custom-dataset.md |
| Interpret the dashboard | docs/how-to-interpret-results.md |
| Compare two runs (e.g. before/after a model upgrade) | docs/how-to-compare-runs.md |
| Scale to thousands of prompts | docs/how-to-resume-and-scale.md |
| See the architecture | docs/architecture.md |
| Step through the code | WALKTHROUGH.ipynb |
Stuck? Check docs/faq.md or open an issue.