
Model Comparisons

Single-page tool for loading conversation test cases, viewing model outputs, and recording human ratings.

Screenshot: the evaluator web UI used to browse conversations and record human ratings.

Quickstart

  • Download the model conversations from SharePoint: Model Comparisons Sharepoint Folder.
  • Requires Python 3.10+ and pip.
  • Install dependencies:
    • pip install flask markdown
  • Start the app:
    • cd "evaluator_app"
    • python app.py
  • Open the UI at: http://127.0.0.1:5010/ (a quick connectivity check is sketched after this list).
  • Enter a user ID and begin evaluations. Results are saved under results/tester-[id].
  • Use the same tester ID across sessions; otherwise results are saved under a different tester folder.
  • Upload results to SharePoint: Model Comparison Results.
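
If the page does not load, a quick sanity check is to request the root URL from Python. This is a generic connectivity check, not part of the project; it only assumes the address shown in the Quickstart.

```python
# Hypothetical sanity check: confirm the evaluator app is reachable at the
# address from the Quickstart (http://127.0.0.1:5010/). Standard library only.
import urllib.request

URL = "http://127.0.0.1:5010/"

try:
    with urllib.request.urlopen(URL, timeout=5) as resp:
        print(f"Evaluator app is up (HTTP {resp.status})")
except OSError as exc:  # urllib.error.URLError is a subclass of OSError
    print(f"Could not reach {URL}: {exc}")
```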

What this project contains

  • evaluator_app/ — Flask UI + API for browsing conversations and saving per-tester ratings (a minimal sketch follows this list).
  • Model Conversations/ — scenario folders containing the conversation Markdown files used for evaluation.
  • results/ — persisted per-tester ratings (ignored by .gitignore).
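
As a rough illustration of how a Flask UI + API like evaluator_app/ might serve the conversation Markdown files, here is a minimal sketch using the two dependencies named in the Quickstart. The route names, folder layout, and rendering logic are assumptions for illustration only; the real app.py may differ.

```python
# Minimal sketch (assumptions only): a Flask app that lists scenario folders
# under "Model Conversations/" and renders a conversation Markdown file as HTML.
# Route names and file layout are illustrative, not the project's actual API.
from pathlib import Path

import markdown
from flask import Flask, abort

app = Flask(__name__)
CONVERSATIONS_DIR = Path("Model Conversations")  # scenario folders with .md files

@app.route("/scenarios")
def list_scenarios():
    # Return the scenario folder names (Flask serializes dicts to JSON).
    return {"scenarios": sorted(p.name for p in CONVERSATIONS_DIR.iterdir() if p.is_dir())}

@app.route("/scenarios/<scenario>/<name>")
def show_conversation(scenario, name):
    # Render one conversation Markdown file as HTML.
    md_path = CONVERSATIONS_DIR / scenario / f"{name}.md"
    if not md_path.is_file():
        abort(404)
    return markdown.markdown(md_path.read_text(encoding="utf-8"))

if __name__ == "__main__":
    app.run(host="127.0.0.1", port=5010)
```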

How ratings are stored

  • When a tester saves ratings, the app writes them to results/tester-<sanitized-id>/human_scores.json and human_scores.csv (sketched below).
  • Use "Load Test Results" to fetch an existing tester's ratings, or "New Tester" to create a new tester id.

Contributing

  • Add or update conversation files under Model Conversations/ and in the Model Comparisons Sharepoint Folder.
  • Add a short README.md alongside a new conversation folder describing the test site and steps.
