Architecture

Technical design, component diagram, and extension points for the Model Router Evaluation toolkit.

Audience: This page is for contributors who want to read or extend the source code, not for users running evaluations. If you just want to run an evaluation, the QUICKSTART, how-to-run-live-eval.md, and how-to-interpret-results.md cover everything you need.

Component diagram

┌──────────────────────────────────────────────────────────────────┐
│                        scripts/run_eval.py                       │
│                       (CLI entry point)                          │
└────────────────────────────┬─────────────────────────────────────┘
                             │
                             ▼
┌──────────────────────────────────────────────────────────────────┐
│                        src/runner.py                             │
│                    (Evaluation orchestrator)                     │
│                                                                  │
│  1. Load dataset        ──▶ src/dataset.py                       │
│  2. Load checkpoint     ──▶ checkpoint_eval.jsonl (if --resume)  │
│  3. Run eval prompts    ──▶ src/client.py  ──▶ Azure OpenAI API  │
│  4. Compute metrics     ──▶ src/metrics.py                       │
│  5. Load judge ckpt     ──▶ checkpoint_judge.jsonl (if --resume) │
│  6. Run judge           ──▶ src/judge.py   ──▶ Azure OpenAI API  │
│  7. Generate report     ──▶ src/report.py + src/charts.py        │
│                              + src/dashboard.py                  │
└──────────────────────────────────────────────────────────────────┘

Module Reference

Module	Responsibility
`config.py`	Load YAML config, substitute `${ENV_VAR}` references, validate
`dataset.py`	Load JSONL datasets, validate schema, optional sampling
`client.py`	`EvalClient` — async API calls with semaphore, retry, latency capture
`runner.py`	Orchestrator — checkpoint/resume, signal handling, progress bars
`judge.py`	`Judge` — pairwise (dual-ordering) + absolute scoring, anti-bias
`metrics.py`	Aggregate cost, latency, quality, model distribution
`charts.py`	Generate matplotlib charts (8 chart types)
`dashboard.py`	Generate self-contained HTML dashboard with embedded charts
`report.py`	Generate Markdown, CSV, and JSON outputs
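
To illustrate the `${ENV_VAR}` substitution that `config.py` is described as performing, here is a minimal sketch. The helper name `substitute_env` and the choice to fail on an unset variable are assumptions made for this example, not the toolkit's actual code.

```python
import os
import re

# Matches ${NAME}, where NAME is a conventional environment variable name.
_ENV_REF = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def substitute_env(value, env=None):
    """Recursively replace ${NAME} references in strings, dicts, and lists.

    Hypothetical helper: raises KeyError when a referenced variable is unset,
    so a missing secret fails at load time rather than at the first API call.
    """
    env = os.environ if env is None else env
    if isinstance(value, str):
        return _ENV_REF.sub(lambda m: env[m.group(1)], value)
    if isinstance(value, dict):
        return {k: substitute_env(v, env) for k, v in value.items()}
    if isinstance(value, list):
        return [substitute_env(v, env) for v in value]
    return value  # numbers, booleans, and None pass through unchanged
```

Applied to the parsed YAML tree, a helper like this keeps secrets out of the config file while leaving non-string values untouched.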

Data Flow

JSONL dataset
    │
    ▼
┌─────────┐     ┌─────────────────┐     ┌──────────────────┐
│ Prompts │────▶│ EvalClient      │────▶│ CompletionResult │──┐
│ (list)  │     │ (router+base)   │     │ (per endpoint)   │  │
└─────────┘     └─────────────────┘     └──────────────────┘  │
                                                              │
                ┌─────────────────┐     ┌──────────────┐      │
                │ Judge           │────▶│ JudgeResult  │──┐   │
                │ (4 calls/prompt)│     │ (per prompt) │  │   │
                └─────────────────┘     └──────────────┘  │   │
                                                          │   │
                ┌─────────────────────────────────────────┘   │
                │                                             │
                ▼                                             ▼
         ┌──────────┐                               ┌──────────────┐
         │ Quality  │                               │ Cost/Latency │
         │ Metrics  │                               │ Metrics      │
         └────┬─────┘                               └──────┬───────┘
              │                                            │
              └──────────────┬─────────────────────────────┘
                             │
                             ▼
                    ┌────────────────┐
                    │ EvalMetrics    │
                    │ (combined)     │
                    └────────┬───────┘
                             │
              ┌──────────────┼──────────────┐
              ▼              ▼              ▼
         report.md     dashboard.html   results.json
         results.csv   chart_*.png      raw_results.jsonl

Key Design Decisions

Sequential per-prompt, parallel across prompts

For each prompt, the router is called first and the baseline second, so the two requests for that prompt never compete with each other and their latencies can be compared fairly. Different prompts run concurrently as asyncio tasks, and their results are collected in completion order with asyncio.as_completed().
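
The pattern can be sketched as follows. This is an illustration under stated assumptions, not the code in `runner.py`: `call_endpoint`, `eval_prompt`, and `run_all` are hypothetical names, and the API call is simulated with a short sleep.

```python
import asyncio
import time


async def call_endpoint(name: str, prompt: str) -> dict:
    """Stand-in for a real API call; measures its own latency."""
    start = time.perf_counter()
    await asyncio.sleep(0.01)  # simulated network round trip
    return {"endpoint": name, "prompt": prompt,
            "latency_s": time.perf_counter() - start}


async def eval_prompt(prompt: str) -> list[dict]:
    # Router first, then baseline: the two calls for one prompt never overlap.
    router = await call_endpoint("router", prompt)
    baseline = await call_endpoint("baseline", prompt)
    return [router, baseline]


async def run_all(prompts: list[str]) -> list[list[dict]]:
    tasks = [asyncio.create_task(eval_prompt(p)) for p in prompts]
    results = []
    # as_completed yields in completion order, not submission order.
    for finished in asyncio.as_completed(tasks):
        results.append(await finished)
    return results
```

Because results arrive in completion order, each record carries its prompt so it can be matched back to the dataset afterwards.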

Semaphore-based concurrency

A single asyncio.Semaphore controls how many API calls are in flight simultaneously. This is simpler and more predictable than a worker pool, and integrates naturally with async/await.
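
A minimal sketch of this idea, with the API call again simulated by a sleep. The names `bounded_call` and `run_bounded` and the `stats` dictionary are hypothetical and exist only to make the concurrency limit observable.

```python
import asyncio


async def bounded_call(sem: asyncio.Semaphore, stats: dict, i: int) -> int:
    async with sem:  # waits here once max_concurrency calls are in flight
        stats["in_flight"] += 1
        stats["peak"] = max(stats["peak"], stats["in_flight"])
        await asyncio.sleep(0.01)  # simulated API call
        stats["in_flight"] -= 1
    return i


async def run_bounded(n_calls: int, max_concurrency: int) -> dict:
    sem = asyncio.Semaphore(max_concurrency)
    stats = {"in_flight": 0, "peak": 0}
    # All calls are scheduled at once; the semaphore alone limits concurrency.
    await asyncio.gather(*(bounded_call(sem, stats, i) for i in range(n_calls)))
    return stats
```

Every call is written as an ordinary coroutine and the limit lives in one place, which is the simplicity argument made above against a worker pool.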

Checkpoint as append-only JSONL

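Results are appended one line at a time, each followed by flush(). This makes the checkpoint crash-safe: if the process dies, at most the one in-flight result is lost. Note that flush() only hands data to the operating system, so the same guarantee under power loss would additionally require os.fsync(). The tradeoff is that checkpoint files grow linearly with the number of completed prompts; they are cleaned up on successful completion.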
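
The append-and-resume cycle might look like the sketch below. The function names and the `prompt_id` field are assumptions for illustration, and the `os.fsync()` call is an optional hardening step that the description above does not mention.

```python
import json
import os
from pathlib import Path


def append_result(path: Path, record: dict) -> None:
    """Append one result as a single JSON line and push it out of Python's buffer."""
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
        f.flush()
        # Optional: flush() hands data to the OS; fsync asks the OS to write it to disk.
        os.fsync(f.fileno())


def load_completed(path: Path) -> dict:
    """Return {prompt_id: record}, skipping any line that is not valid JSON."""
    done = {}
    if not path.exists():
        return done
    with path.open(encoding="utf-8") as f:
        for line in f:
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                continue  # e.g. a partial write from an interrupted run
            done[record["prompt_id"]] = record
    return done
```

On resume, prompts whose IDs appear in the loaded dictionary can be skipped, and a truncated final line is ignored rather than aborting the run.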

Signal-based graceful shutdown

SIGINT is caught by a signal handler that sets an asyncio.Event. The runner checks this event between prompts and stops starting new work, which lets in-flight requests complete naturally. This avoids the complexity of task cancellation and ensures that every result written to the checkpoint is complete.
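
A simplified sketch of the mechanism, assuming hypothetical names `run_until_stopped` and `install_sigint_handler`. Prompts are processed one at a time here for clarity, whereas the real runner works on many prompts concurrently.

```python
import asyncio
import signal


async def run_until_stopped(prompts, process, stop: asyncio.Event) -> list:
    """Process prompts in order, checking the stop flag between prompts."""
    results = []
    for prompt in prompts:
        if stop.is_set():
            break  # stop starting new work; nothing in flight is cancelled
        results.append(await process(prompt))
    return results


def install_sigint_handler(stop: asyncio.Event) -> None:
    """Route SIGINT to the event. Must be called from inside a running loop."""
    loop = asyncio.get_running_loop()
    try:
        loop.add_signal_handler(signal.SIGINT, stop.set)
    except NotImplementedError:
        # add_signal_handler is Unix-only; fall back to signal.signal elsewhere.
        signal.signal(signal.SIGINT,
                      lambda signum, frame: loop.call_soon_threadsafe(stop.set))
```

The handler only flips a flag, so the decision about when to stop stays in ordinary control flow and can be tested without sending a real signal.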

Dual-ordering for pairwise judge

Position bias (preferring response A because it appears first) is a known issue with LLM judges. Each comparison is therefore run twice with the order swapped, and a non-tie verdict requires both runs to agree. A judge that simply favors one position produces a tie rather than a spurious win, at the cost of 2x pairwise judge calls.
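
The agreement rule can be sketched as a small pure function. The name `reconcile` and the verdict labels "A", "B", and "tie" are assumptions for this example; the actual `Judge` class may represent verdicts differently.

```python
def reconcile(first_pass: str, second_pass: str) -> str:
    """Combine two position-labelled verdicts into one verdict.

    first_pass:  judge saw (router, baseline) as (A, B)
    second_pass: judge saw (baseline, router) as (A, B)
    Each verdict is "A", "B", or "tie".
    """
    # Map position labels back to the endpoint that occupied that position.
    first = {"A": "router", "B": "baseline", "tie": "tie"}[first_pass]
    second = {"A": "baseline", "B": "router", "tie": "tie"}[second_pass]
    # A winner is declared only when both orderings agree on the same endpoint.
    return first if first == second else "tie"
```

A judge that always answers "A" yields `reconcile("A", "A")`, which maps to different endpoints in the two passes and therefore resolves to a tie.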

Extension Points

Want to...	Where to look
Add a new chart type	`src/charts.py` — add a function, register in `generate_charts()`
Add a new metric	`src/metrics.py` — add to `EndpointMetrics` or `EvalMetrics`
Support a new API provider	`src/client.py` — add a branch in `_build_client()`
Change judge scoring	`configs/judge_prompts/*.yaml` — edit the prompt templates
Add a new report format	`src/report.py` — add alongside `_generate_markdown()` etc.
Change the dashboard layout	`src/dashboard.py` — edit the HTML template

Foundry Cloud Evaluation (Optional)

An optional post-processing layer submits evaluation results to Microsoft Foundry for cloud-based grading.

┌──────────────────────────────────────────────────────────────┐
│  Original Pipeline (unchanged)                                │
│  configs/*.yaml → measure endpoints → raw_results.jsonl       │
└──────────────────────────┬───────────────────────────────────┘
                           │ raw_results.jsonl + results.json
                           ▼
┌──────────────────────────────────────────────────────────────┐
│  src/foundry/ (new, optional)                                 │
│  transformer → client → graders/custom_evaluators →           │
│  runner → report                                              │
│  ↕ Microsoft Foundry cloud (eval runs, portal)                 │
└──────────────────────────────────────────────────────────────┘

Module	Responsibility
`foundry/config.py`	Load `configs/foundry.yaml`, env var substitution
`foundry/transformer.py`	Pair router+baseline per prompt into Foundry JSONL
`foundry/client.py`	`FoundryEvalClient` wrapping `AIProjectClient`
`foundry/graders.py`	Build `score_model` testing criteria from YAML templates
`foundry/custom_evaluators.py`	Register code-based evaluators for cost & latency
`foundry/runner.py`	End-to-end orchestrator
`foundry/report.py`	Markdown + JSON report from Foundry results
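
As a rough illustration of the pairing step attributed to `foundry/transformer.py`, the sketch below groups raw result lines by prompt and emits one combined row per prompt. Every field name (`prompt_id`, `endpoint`, `prompt`, `response`, `query`, `router_response`, `baseline_response`) and the function name `pair_results` are hypothetical; the real `raw_results.jsonl` and Foundry schemas may differ.

```python
import json
from collections import defaultdict


def pair_results(raw_lines):
    """Group raw result lines by prompt and yield one paired JSON row per prompt.

    Assumed input fields (hypothetical): prompt_id, endpoint, prompt, response.
    Prompts missing either endpoint are skipped.
    """
    by_prompt = defaultdict(dict)
    for line in raw_lines:
        rec = json.loads(line)
        by_prompt[rec["prompt_id"]][rec["endpoint"]] = rec
    for prompt_id, recs in by_prompt.items():
        if "router" in recs and "baseline" in recs:
            yield json.dumps({
                "prompt_id": prompt_id,
                "query": recs["router"]["prompt"],
                "router_response": recs["router"]["response"],
                "baseline_response": recs["baseline"]["response"],
            })
```

Because a step like this only reads the files the core pipeline already wrote, it needs no imports from the existing modules.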

Design principle: Zero impact on the core evaluation flow. No Foundry imports in existing code.
