These changes support human review of evaluation results and show all test cases that were run (rather than just a sample).
Justification for showing all test cases
Bulk evaluations already use public prompt datasets.
For user-generated prompts, an AI maker who wanted to see the prompts could find many ways to do so from the code. Showing only sample issues in the evaluation results is therefore not an effective way to hide user-generated prompts.
ParakhAI's goal has shifted from 'audits' to 'participatory (i.e. collaborative) evals'. The goal is to find issues and fix them, rather than to test and certify an AI maker.
For bulk evals with an LLM as judge: AI-generated insights need to be editable by the evaluator.
All test cases are displayed, with their corresponding issues listed under them in a collapsible table.
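The two requirements above can be sketched roughly as follows. This is only an illustration in Python, not the code in this PR: the `Issue` class, its fields, and the helpers `group_issues_by_test_case` and `edit_insight` are hypothetical names chosen for the example. It shows every test case getting a row (even those with no issues) and an evaluator overwriting an AI-generated insight.

```python
from dataclasses import dataclass


@dataclass
class Issue:
    # Hypothetical shape of one issue found by the LLM judge.
    test_case_id: str
    insight: str                  # AI-generated text, editable by the evaluator
    edited_by_human: bool = False


def group_issues_by_test_case(test_case_ids, issues):
    """Return {test_case_id: [issues]} with an entry for every test case.

    Test cases without issues map to an empty list, so the results table
    can still show them as a row with nothing to expand.
    """
    grouped = {tc_id: [] for tc_id in test_case_ids}
    for issue in issues:
        # setdefault keeps issues whose test case is missing from the list.
        grouped.setdefault(issue.test_case_id, []).append(issue)
    return grouped


def edit_insight(issue, new_text):
    """Replace the AI-generated insight and record that a human changed it."""
    issue.insight = new_text
    issue.edited_by_human = True
    return issue
```

Each key of the grouped mapping would correspond to one collapsible row, with its list of issues as the expandable content.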