Evaluation results and human review - enhancement #328

@aparnasr1

Description

Changes are being made to support human review of evaluation results and to show all test cases that were run (rather than just a sample).

Justification for showing all test cases

  • Bulk evaluations already use public prompt datasets.

  • For user-generated prompts, an AI maker who wanted to see the prompts could find many ways to do so from the code. Showing only a sample of issues in the evaluation results is therefore not an effective way to hide user-generated prompts.

  • ParakhAI's goal has shifted from 'audits' to 'participatory (i.e., collaborative) evals'. The goal is to find issues and fix them, rather than to test and certify an AI maker.

  • For bulk evals with an LLM as judge, AI-generated insights need to be editable by the evaluator.

  • All test cases are to be displayed, with their corresponding issues listed under each one in a collapsible table.
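
To make the last two points concrete, here is a minimal sketch of one possible data shape. It is only an illustration: `Issue`, `EvalCase`, `edit`, and `render_rows` are hypothetical names, not existing ParakhAI code, and the real models and UI may look quite different. The idea is that every test case is listed (including ones with no issues), each issue sits under its test case, and an evaluator's edit is stored alongside the AI-generated text rather than overwriting it.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Issue:
    """One issue attached to a test case (hypothetical model)."""
    description: str                   # original text, e.g. from the LLM judge
    edited_text: Optional[str] = None  # evaluator's revision, if any

    def edit(self, new_text: str) -> None:
        # Keep the AI-generated original; store the human edit separately.
        self.edited_text = new_text

    @property
    def display_text(self) -> str:
        # Prefer the evaluator's wording when an edit exists.
        return self.edited_text if self.edited_text is not None else self.description


@dataclass
class EvalCase:
    """One test case with the issues found for it (hypothetical model)."""
    prompt: str
    response: str
    issues: list[Issue] = field(default_factory=list)


def render_rows(cases: list[EvalCase]) -> list[str]:
    """Flatten all cases into rows: one header row per test case,
    followed by its issues as indented (collapsible) child rows."""
    rows: list[str] = []
    for number, case in enumerate(cases, start=1):
        rows.append(f"[{number}] {case.prompt} ({len(case.issues)} issue(s))")
        for issue in case.issues:
            source = "edited" if issue.edited_text is not None else "AI"
            rows.append(f"    - ({source}) {issue.display_text}")
    return rows
```

Keeping `description` and `edited_text` as separate fields is one way to let reviewers see what the judge originally produced; whether that history should be kept is a design decision for this issue.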

Metadata

Labels

enhancement: New feature or request
