Single-page tool for loading conversation test cases, viewing model outputs, and recording human ratings.
Screenshot: the evaluator web UI used to browse conversations and record human ratings.
## Quickstart
- Download the model conversations from SharePoint: Model Comparisons Sharepoint Folder.
- Requires Python 3.10+ and pip.
- Install dependencies: `pip install flask markdown`
- Start the app:
  - `cd evaluator_app`
  - `python app.py`
- Open the UI at: http://127.0.0.1:5010/
- Enter a tester ID and begin evaluating. Results are saved under `results/tester-<id>`.
- Use the same tester ID across sessions; otherwise results are saved under a different tester folder.
- Upload your results to SharePoint: Model Comparison Results.
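
If you want to confirm the server started correctly before opening a browser, here is a minimal Python check (it assumes the default host and port shown above):

```python
# Quick smoke test: request the evaluator UI root and report the HTTP status.
import urllib.request

with urllib.request.urlopen("http://127.0.0.1:5010/") as resp:
    print("Evaluator UI is up, HTTP status:", resp.status)
```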
## What this project contains
- `evaluator_app/` — Flask UI + API for browsing conversations and saving per-tester ratings.
- `Model Conversations/` — scenario folders containing the conversation markdown files used for evaluation.
- `results/` — persisted per-tester ratings (ignored via `.gitignore`).
## How ratings are stored
- When a tester saves ratings, the app writes to `results/tester-<sanitized-id>/human_scores.json` and `human_scores.csv`.
- Use "Load Test Results" to fetch an existing tester's ratings, or "New Tester" to create a new tester ID.
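
To inspect a tester's saved ratings outside the UI, a short Python sketch like the following works (the paths and file names come from above; the tester ID is hypothetical, and the JSON's internal schema is whatever the app wrote, so it is simply pretty-printed):

```python
# Load and pretty-print one tester's saved ratings from the results folder.
import json
from pathlib import Path

tester_id = "alice"  # hypothetical tester ID
scores_path = Path("results") / f"tester-{tester_id}" / "human_scores.json"

with scores_path.open(encoding="utf-8") as f:
    scores = json.load(f)

print(json.dumps(scores, indent=2))
```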
## Contributing
- Add or update conversation files under `Model Conversations/`; the shared copies live in the Model Comparisons Sharepoint Folder.
- Add a short `README.md` alongside a new conversation folder describing the test site and steps.
