Evaluation results and human review - enhancement #328

These changes support human review of evaluation results and show every test case that was run, rather than just a sample.

**Justification for showing all test cases** 
- Bulk evaluations already use public prompt datasets.
- For user-generated prompts, an AI maker who wanted to see the prompts could find many ways to do so from the code, so showing only sample issues in the evaluation results does not effectively hide them.
- ParakhAI's goal has shifted from 'audits' to 'participatory (i.e., collaborative) evals'. The goal is to find and fix issues, rather than to test and certify an AI maker.

- [ ] For bulk evals with LLM-as-judge: AI-generated insights need to be editable by the evaluator.
- [ ] Display all test cases, each with its corresponding issues listed under it in a collapsible table.
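
As a starting point for discussion, the two tasks above could be modelled roughly as follows. This is only a sketch under assumptions: the `Issue` class, its field names, and `group_issues_by_test_case` are hypothetical and do not reflect the actual ParakhAI schema or code.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Issue:
    """Hypothetical issue record; field names are assumptions."""
    test_case_id: str
    ai_insight: str                           # insight generated by the LLM judge
    evaluator_insight: Optional[str] = None   # evaluator's edit, if any

    @property
    def insight(self) -> str:
        # The evaluator's edit takes precedence; the AI text is kept for reference.
        if self.evaluator_insight is not None:
            return self.evaluator_insight
        return self.ai_insight


def group_issues_by_test_case(
    test_case_ids: List[str], issues: List[Issue]
) -> Dict[str, List[Issue]]:
    """Return every test case, in order, mapped to its (possibly empty) issue list."""
    grouped: Dict[str, List[Issue]] = {tc_id: [] for tc_id in test_case_ids}
    for issue in issues:
        # Issues that reference an unlisted test case are still kept, at the end.
        grouped.setdefault(issue.test_case_id, []).append(issue)
    return grouped
```

Each key of the returned mapping would correspond to one collapsible row, so test cases without issues still appear in the results with an empty list.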

