UQLM (Uncertainty Quantification for Language Models) is a Python package for UQ-based LLM hallucination detection.
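Black-box UQ detectors of this kind typically sample several answers to the same prompt and treat low mutual agreement as a hallucination signal. A minimal sketch of that idea, assuming a caller-supplied `generate` function and an illustrative 0.5 threshold (not UQLM's actual API):

```python
# Sketch: sample n answers at nonzero temperature and flag the prompt
# when the answers disagree with each other. `generate` and the 0.5
# threshold are illustrative assumptions, not UQLM's actual API.
from difflib import SequenceMatcher
from statistics import mean

def agreement_score(answers: list[str]) -> float:
    """Mean pairwise string similarity across sampled answers."""
    pairs = [
        SequenceMatcher(None, a, b).ratio()
        for i, a in enumerate(answers)
        for b in answers[i + 1:]
    ]
    return mean(pairs) if pairs else 1.0

def flag_hallucination(generate, prompt: str, n: int = 5,
                       threshold: float = 0.5) -> bool:
    """Low agreement among n sampled answers suggests hallucination."""
    answers = [generate(prompt) for _ in range(n)]
    return agreement_score(answers) < threshold
```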
Hallucinations (Confabulations) Document-Based Benchmark for RAG. Includes human-verified questions and answers.
Build and benchmark deep research.
Ranking LLMs on agentic tasks
Vivaria is METR's tool for running evaluations and conducting agent elicitation research.
A comprehensive knowledge hub curating cutting-edge research papers and developments across 24 specialized domains.
Open multiple AI sites with one click and view their results side by side.
Statistical analysis methods for comparing prompt and model performance in LLM evaluations.
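One common method in this space is a paired bootstrap test over per-example scores from two prompts (or models) on the same items. A minimal sketch, with illustrative names and resample counts rather than any specific library's API:

```python
# Sketch: paired bootstrap test on per-example score differences
# between two prompts (or models) evaluated on the same items.
# Names and the resample count are illustrative.
import random

def paired_bootstrap_p(scores_a: list[float], scores_b: list[float],
                       n_resamples: int = 10_000, seed: int = 0) -> float:
    """Estimate how often A fails to beat B under resampling
    (a one-sided p-value-style quantity; small means A reliably wins)."""
    assert len(scores_a) == len(scores_b), "scores must be paired"
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    losses = sum(
        sum(rng.choices(diffs, k=n)) <= 0  # resample with replacement
        for _ in range(n_resamples)
    )
    return losses / n_resamples
```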
Code scanner to check for issues in prompts and LLM calls
☑️ A curated list of tools, methods & platforms for evaluating AI reliability in real applications
Every Eval Ever is a shared schema and crowdsourced eval database. It defines a standardized metadata format for storing AI evaluation results — from leaderboard scrapes and research papers to local evaluation runs — so that results from different frameworks can be compared, reproduced, and reused.
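For illustration only, a hypothetical record showing the kind of fields such a shared schema might standardize (these names are assumptions, not Every Eval Ever's actual format):

```python
# Hypothetical record illustrating the kind of fields a shared eval
# schema might standardize; these names are assumptions, not
# Every Eval Ever's actual format.
from dataclasses import dataclass

@dataclass
class EvalResult:
    model: str       # e.g. "example-model-v1"
    benchmark: str   # e.g. "example-bench"
    metric: str      # e.g. "accuracy"
    score: float     # metric value for this run
    source: str      # "leaderboard", "paper", or "local run"
    framework: str   # harness that produced the number
    run_date: str    # ISO 8601 date of the evaluation

record = EvalResult(model="example-model-v1", benchmark="example-bench",
                    metric="accuracy", score=0.87, source="local run",
                    framework="example-harness", run_date="2025-01-01")
```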
iFixAi: an open-source diagnostic for AI misalignment, with 32 tests across fabrication, manipulation, deception, unpredictability, and opacity. Provider-agnostic: runs against OpenAI, Anthropic, Bedrock, Azure, Gemini, and more. Produces a letter grade in under 5 minutes and a content-addressed manifest for bit-identical replay. Powered by iMe.
Comprehensive AI Model Evaluation Framework with advanced techniques including Temperature-Controlled Verdict Aggregation via Generalized Power Mean. Support for multiple LLM providers and 15+ evaluation metrics for RAG systems and AI agents.
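The generalized power mean M_p = ((1/n) Σ_i x_i^p)^(1/p) interpolates between the minimum (p → -∞), the arithmetic mean (p = 1), and the maximum (p → +∞), so the exponent acts as a temperature controlling how strict the aggregated verdict is. A minimal sketch under that reading (function and variable names are illustrative, not this framework's API):

```python
# Sketch: aggregate judge verdicts with a generalized power mean,
# where the exponent p plays the "temperature" role. Verdicts must be
# positive when p <= 0. Names are illustrative, not this framework's API.
def power_mean(verdicts: list[float], p: float) -> float:
    """Generalized power mean M_p of judge scores in (0, 1]."""
    n = len(verdicts)
    if p == 0:  # limit case: geometric mean
        prod = 1.0
        for v in verdicts:
            prod *= v
        return prod ** (1 / n)
    return (sum(v ** p for v in verdicts) / n) ** (1 / p)

scores = [0.9, 0.8, 0.3]
print(power_mean(scores, -4.0))  # ~0.39: strict, pulled toward the weakest verdict
print(power_mean(scores, 4.0))   # ~0.77: lenient, pulled toward the strongest verdict
```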
Benchmark evaluating LLMs on their ability to create and resist disinformation. Includes comprehensive testing across major models (Claude, GPT-4, Gemini, Llama, etc.) with standardized evaluation metrics.
Make your OpenClaw agents better, cheaper, and faster.
⚡️ The "1-Minute RAG Audit" — Generate QA datasets & evaluate RAG systems in Colab, Jupyter, or CLI. Privacy-first, async, visual reports.
Evaluation Infrastructure for AI Agents
[ACL 2026] FinMTM: A Multi-Turn Multimodal Benchmark for Financial Reasoning and Agent Evaluation
Example projects integrated with the Future AGI tech stack for easy AI development.
Running UK AISI's Inspect in the Cloud