UQLM (Uncertainty Quantification for Language Models) is a Python package for UQ-based LLM hallucination detection.
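Black-box UQ detectors of this kind typically sample several answers to the same prompt and treat low mutual agreement as a hallucination signal. A minimal sketch of that idea, assuming a caller-supplied `generate` function and an illustrative 0.5 threshold (not UQLM's actual API):

```python
# Sketch: sample n answers at nonzero temperature and flag the prompt
# when the answers disagree with each other. `generate` and the 0.5
# threshold are illustrative assumptions, not UQLM's actual API.
from difflib import SequenceMatcher
from statistics import mean

def agreement_score(answers: list[str]) -> float:
    """Mean pairwise string similarity across sampled answers."""
    pairs = [
        SequenceMatcher(None, a, b).ratio()
        for i, a in enumerate(answers)
        for b in answers[i + 1:]
    ]
    return mean(pairs) if pairs else 1.0

def flag_hallucination(generate, prompt: str, n: int = 5,
                       threshold: float = 0.5) -> bool:
    """Low agreement among n sampled answers suggests hallucination."""
    answers = [generate(prompt) for _ in range(n)]
    return agreement_score(answers) < threshold
```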
Hallucinations (Confabulations) Document-Based Benchmark for RAG. Includes human-verified questions and answers.
Build and benchmark deep research.
Ranking LLMs on agentic tasks
Vivaria is METR's tool for running evaluations and conducting agent elicitation research.
A comprehensive knowledge hub curating cutting-edge research papers and developments across 24 specialized domains.
Open multiple AI sites with one click and view their results side by side.
Statistical analysis methods for comparing prompt and model performance in LLM evaluations.
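One common method in this space is a paired bootstrap test over per-example scores from two prompts (or models) on the same items. A minimal sketch, with illustrative names and resample counts rather than any specific library's API:

```python
# Sketch: paired bootstrap test on per-example score differences
# between two prompts (or models) evaluated on the same items.
# Names and the resample count are illustrative.
import random

def paired_bootstrap_p(scores_a: list[float], scores_b: list[float],
                       n_resamples: int = 10_000, seed: int = 0) -> float:
    """Estimate how often A fails to beat B under resampling
    (a one-sided p-value-style quantity; small means A reliably wins)."""
    assert len(scores_a) == len(scores_b), "scores must be paired"
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    losses = sum(
        sum(rng.choices(diffs, k=n)) <= 0  # resample with replacement
        for _ in range(n_resamples)
    )
    return losses / n_resamples
```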
Code scanner to check for issues in prompts and LLM calls
☑️ A curated list of tools, methods & platforms for evaluating AI reliability in real applications
Every Eval Ever is a shared schema and crowdsourced eval database. It defines a standardized metadata format for storing AI evaluation results — from leaderboard scrapes and research papers to local evaluation runs — so that results from different frameworks can be compared, reproduced, and reused.
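For illustration only, a hypothetical record showing the kind of fields such a shared schema might standardize (these names are assumptions, not Every Eval Ever's actual format):

```python
# Hypothetical record illustrating the kind of fields a shared eval
# schema might standardize; these names are assumptions, not
# Every Eval Ever's actual format.
from dataclasses import dataclass

@dataclass
class EvalResult:
    model: str       # e.g. "example-model-v1"
    benchmark: str   # e.g. "example-bench"
    metric: str      # e.g. "accuracy"
    score: float     # metric value for this run
    source: str      # "leaderboard", "paper", or "local run"
    framework: str   # harness that produced the number
    run_date: str    # ISO 8601 date of the evaluation

record = EvalResult(model="example-model-v1", benchmark="example-bench",
                    metric="accuracy", score=0.87, source="local run",
                    framework="example-harness", run_date="2025-01-01")
```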
iFixAi: an open-source diagnostic for AI misalignment, with 32 tests across fabrication, manipulation, deception, unpredictability, and opacity. Provider-agnostic: runs against OpenAI, Anthropic, Bedrock, Azure, Gemini, and more. Produces a letter grade in under 5 minutes and a content-addressed manifest for bit-identical replay. Powered by iMe.
Comprehensive AI Model Evaluation Framework with advanced techniques including Temperature-Controlled Verdict Aggregation via Generalized Power Mean. Support for multiple LLM providers and 15+ evaluation metrics for RAG systems and AI agents.
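The generalized power mean M_p = ((1/n) Σ_i x_i^p)^(1/p) interpolates between the minimum (p → -∞), the arithmetic mean (p = 1), and the maximum (p → +∞), so the exponent acts as a temperature controlling how strict the aggregated verdict is. A minimal sketch under that reading (function and variable names are illustrative, not this framework's API):

```python
# Sketch: aggregate judge verdicts with a generalized power mean,
# where the exponent p plays the "temperature" role. Verdicts must be
# positive when p <= 0. Names are illustrative, not this framework's API.
def power_mean(verdicts: list[float], p: float) -> float:
    """Generalized power mean M_p of judge scores in (0, 1]."""
    n = len(verdicts)
    if p == 0:  # limit case: geometric mean
        prod = 1.0
        for v in verdicts:
            prod *= v
        return prod ** (1 / n)
    return (sum(v ** p for v in verdicts) / n) ** (1 / p)

scores = [0.9, 0.8, 0.3]
print(power_mean(scores, -4.0))  # ~0.39: strict, pulled toward the weakest verdict
print(power_mean(scores, 4.0))   # ~0.77: lenient, pulled toward the strongest verdict
```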
Benchmark evaluating LLMs on their ability to create and resist disinformation. Includes comprehensive testing across major models (Claude, GPT-4, Gemini, Llama, etc.) with standardized evaluation metrics.
Make your OpenClaw agents better, cheaper, and faster.
⚡️ The "1-Minute RAG Audit" — Generate QA datasets & evaluate RAG systems in Colab, Jupyter, or CLI. Privacy-first, async, visual reports.
Evaluation Infrastructure for AI Agents
[ACL 2026] FinMTM: A Multi-Turn Multimodal Benchmark for Financial Reasoning and Agent Evaluation
Example projects integrated with the Future AGI tech stack for easy AI development.
Running UK AISI's Inspect in the Cloud