Single-page tool for loading conversation test cases, viewing model outputs, and recording human ratings.
Screenshot: the evaluator web UI used to browse conversations and record human ratings.
## Quickstart
- Download the model conversations from SharePoint: Model Comparisons Sharepoint Folder.
- Requires Python 3.10+ and pip.
- Install dependencies: `pip install flask markdown`
- Start the app:
  - `cd evaluator_app`
  - `python app.py`
- Open the UI at: http://127.0.0.1:5010/
- Enter a tester ID and begin evaluating. Results are saved under `results/tester-<id>`.
- Use the same tester ID across sessions; otherwise results are saved under a different tester folder.
- Upload your results to SharePoint: Model Comparison Results.
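
If you want to confirm the server started correctly before opening a browser, here is a minimal Python check (it assumes the default host and port shown above):

```python
# Quick smoke test: request the evaluator UI root and report the HTTP status.
import urllib.request

with urllib.request.urlopen("http://127.0.0.1:5010/") as resp:
    print("Evaluator UI is up, HTTP status:", resp.status)
```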
## What this project contains
- `evaluator_app/` — Flask UI + API for browsing conversations and saving per-tester ratings.
- `Model Conversations/` — scenario folders containing the conversation markdown files used for evaluation.
- `results/` — persisted per-tester ratings (ignored via `.gitignore`).
## How ratings are stored
- When a tester saves ratings, the app writes to `results/tester-<sanitized-id>/human_scores.json` and `human_scores.csv`.
- Use "Load Test Results" to fetch an existing tester's ratings, or "New Tester" to create a new tester ID.
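
To inspect a tester's saved ratings outside the UI, a short Python sketch like the following works (the paths and file names come from above; the tester ID is hypothetical, and the JSON's internal schema is whatever the app wrote, so it is simply pretty-printed):

```python
# Load and pretty-print one tester's saved ratings from the results folder.
import json
from pathlib import Path

tester_id = "alice"  # hypothetical tester ID
scores_path = Path("results") / f"tester-{tester_id}" / "human_scores.json"

with scores_path.open(encoding="utf-8") as f:
    scores = json.load(f)

print(json.dumps(scores, indent=2))
```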
## Contributing
- Add or update conversation files under `Model Conversations/`; the shared copies live in the Model Comparisons Sharepoint Folder.
- Add a short `README.md` alongside a new conversation folder describing the test site and steps.
