61 changes: 47 additions & 14 deletions src/pages/docs/evaluation/concepts/eval-types.mdx
@@ -1,11 +1,13 @@
---
title: "Eval Types: Four Evaluation Methods in Future AGI"
description: "The four evaluation methods in Future AGI: LLM as Judge, Deterministic, Statistical Metric, and LLM as Ranker, and how modality affects which ones apply."
title: "Eval Types"
description: "The five evaluation methods in Future AGI: LLM as Judge, LLM as Ranker, Agent as Judge, Deterministic, and Statistical Metric, and how modality affects which ones apply."
---

## About

Every eval template in Future AGI uses one of four evaluation methods to produce a result. The method determines how the eval computes its output, whether a judge model is required, and what kind of result to expect. Choosing the right type for your use case gives you the right balance of accuracy, speed, and cost.
Every eval template in Future AGI uses one of five evaluation methods to produce a result. The method determines how the eval computes its output, whether a judge model is required, and what kind of result to expect. Choosing the right type for your use case gives you the right balance of accuracy, speed, and cost.

Internally, these five methods collapse to three canonical types used by the API and DB: `llm` (LLM as Judge, LLM as Ranker), `code` (Deterministic and Statistical Metric), and `agent` (Agent as Judge).
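
For illustration, the collapse can be written as a plain lookup table. This restates the mapping above; the dict is illustrative only, not a structure taken from the Future AGI API or DB:

```python
# Restates the method-to-canonical-type mapping described above.
# Illustrative only; not a structure from the Future AGI API or DB.
CANONICAL_TYPE = {
    "LLM as Judge": "llm",
    "LLM as Ranker": "llm",
    "Agent as Judge": "agent",
    "Deterministic": "code",
    "Statistical Metric": "code",
}
```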

---

@@ -35,7 +37,15 @@ Computed directly from the text using code or string logic. No model is called a

**Returns**: pass/fail only. No reason field.

**Examples**: Is JSON, Is Email, Contains Valid Link, No Invalid Links, One Line.
**Examples:**

| Category | Templates |
|---|---|
| Format validation | Is JSON, Is Email, Is Code, Is URL, JSON Schema, JSON Validation |
| Substring checks | Contains, Contains Any, Contains All, Contains None, Starts With, Ends With, Equals |
| Length and shape | Length Greater Than, Length Less Than, Length Between, Word Count In Range, One Line |
| Link validation | Contains Valid Link, No Invalid Links |
| Pattern matching | Regex |
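
As a sketch of what these checks do internally (plain string and code logic, no model call, pass/fail only), the helpers below mirror a few templates from the table. They are hypothetical illustrations, not the platform's implementation:

```python
import json
import re

def is_json(output: str) -> bool:
    """'Is JSON' style check: does the output parse as JSON?"""
    try:
        json.loads(output)
        return True
    except json.JSONDecodeError:
        return False

def contains_all(output: str, required: list[str]) -> bool:
    """'Contains All' style check: every required substring is present."""
    return all(s in output for s in required)

def matches_regex(output: str, pattern: str) -> bool:
    """'Regex' style check: the output matches the given pattern."""
    return re.search(pattern, output) is not None
```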

**Best for:**
- Format validation (valid JSON, email address, URL presence)
@@ -59,11 +69,15 @@ Computes a numeric score using an algorithm applied to the output and a referenc
| Levenshtein Similarity | Character edit distance between output and reference |
| Numeric Similarity | Numerical difference between output and reference |
| Embedding Similarity | Semantic vector similarity between output and reference |
| Fuzzy Match | Approximate string match against an expected answer |
| Ground Truth Match | Whether the output matches a reference ground truth |
| Semantic List Contains | Whether output contains phrases semantically similar to a reference list |
| Recall@K, Precision@K, NDCG@K, MRR, Hit Rate | Retrieval quality for RAG pipelines |
| FID Score | Distribution similarity between sets of real and generated images |
| CLIP Score | Alignment between an image and its text description |

Most statistical metrics require a reference value (a ground-truth answer, a target list, or a relevance label set). Provide it through the eval config when running the eval.
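
As an illustration, Levenshtein Similarity from the table reduces to a textbook dynamic-programming pass over the output and the reference. The sketch below is a generic implementation, not the platform's own:

```python
def levenshtein_similarity(output: str, reference: str) -> float:
    """Return 1 - (edit distance / max length), in the range [0, 1]."""
    if not output and not reference:
        return 1.0
    # Standard row-by-row edit-distance dynamic programming.
    prev = list(range(len(reference) + 1))
    for i, a in enumerate(output, start=1):
        curr = [i]
        for j, b in enumerate(reference, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (a != b),  # substitution
            ))
        prev = curr
    return 1.0 - prev[-1] / max(len(output), len(reference))
```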

**Best for:**
- Benchmarking against a ground-truth reference answer
- RAG retrieval quality (recall, precision, ranking)
@@ -88,9 +102,27 @@ A variant of LLM as Judge where instead of scoring a single response, the model

---

## Agent as Judge

A specialised evaluation agent runs an iterative loop instead of a single LLM call. It can call tools through MCP connectors, look things up on the internet, retrieve from a knowledge base, and reason over multiple turns before returning a verdict. Use this when a single-shot judge cannot decide on its own because the eval needs external evidence or multi-step verification.
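
The control flow is roughly the loop below. It is a hypothetical sketch of the behaviour just described; the injected callables (`judge_step`, `run_tool`) are placeholders, not Future AGI APIs:

```python
from typing import Callable

def agent_as_judge(
    instructions: str,
    judge_step: Callable[[list], dict],    # one judge-model reasoning turn
    run_tool: Callable[[str, dict], str],  # executes a tool or MCP call
    max_turns: int = 5,
) -> dict:
    """Iterate: reason, optionally call a tool, then return a verdict."""
    transcript = [{"role": "user", "content": instructions}]
    for _ in range(max_turns):
        step = judge_step(transcript)
        if step.get("action") == "tool_call":
            # Gather external evidence (web lookup, knowledge base, ...).
            observation = run_tool(step["tool"], step["args"])
            transcript.append({"role": "tool", "content": observation})
        else:
            # Enough evidence gathered; return verdict plus reason.
            return {"result": step["verdict"], "reason": step["reason"]}
    return {"result": "fail", "reason": "no verdict within max_turns"}
```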

**Requires a judge model.** Tool or MCP connectors must be configured for the evaluator to use, and a knowledge base can optionally be attached.

**Returns**: a result (pass/fail, score, or category) and a plain-language **reason** that can cite the tools and sources consulted during the run.

**Examples**: Custom evals authored as agent evaluators, fact verification with web lookup, knowledge-base-grounded compliance checks.

**Best for:**
- Fact verification that requires up-to-date or external information
- Multi-step policy or compliance checks that a single prompt cannot express
- Evals that should ground judgment in a curated knowledge base
- Higher-confidence judgments where accuracy outweighs speed and cost

---

## Modality

In addition to the four types above, evals also vary by the kind of input they accept:
In addition to the five methods above, evals also vary by the kind of input they accept:

| Modality | What it evaluates | Example evals |
|---|---|---|
@@ -105,18 +137,19 @@ Multimodal evals (image, audio, conversation) require a judge model that support

## Quick reference

| Type | Judge model required | Returns reason | No API key possible |
|---|---|---|---|
| LLM as Judge | Yes | Yes | No |
| Deterministic | No | No | Yes |
| Statistical Metric | No (most) | No | Yes (most) |
| LLM as Ranker | Yes | No | No |
| Type | Canonical type | Judge model required | Tools / KB | Returns reason | No API key possible |
|---|---|---|---|---|---|
| LLM as Judge | llm | Yes | No | Yes | No |
| LLM as Ranker | llm | Yes | No | No | No |
| Agent as Judge | agent | Yes | Yes | Yes | No |
| Deterministic | code | No | No | No | Yes |
| Statistical Metric | code | No (most) | No | No | Yes (most) |

---

## Next steps

- [Built-in evals](/docs/evaluation/builtin): Full list with evaluation method and required inputs for each template.
- [Create custom evals](/docs/evaluation/features/custom): Custom evals always use LLM as Judge.
- [Judge models](/docs/evaluation/concepts/judge-models): Choose the right model for LLM as Judge and LLM as Ranker evals.
- [Eval groups](/docs/evaluation): Combine different eval types and run them together in one pass.
- [Create custom evals](/docs/evaluation/features/custom): Custom evals can be authored as LLM as Judge or Agent as Judge.
- [Judge models](/docs/evaluation/concepts/judge-models): Choose the right model for LLM as Judge, LLM as Ranker, and Agent as Judge evals.
- [Eval groups](/docs/evaluation/features/groups): Combine different eval types and run them together in one pass.