Merged
8 changes: 6 additions & 2 deletions build/agents/build-your-agent/evals.mdx
@@ -1,18 +1,18 @@
---
title: 'Evals'
sidebarTitle: 'Evals'
description: 'Test and evaluate your AI Agents with scenario-based evaluations and automated Evaluators'
---

<Info>
**Rollout Status**: Evals is currently being rolled out progressively, starting with Enterprise customers. If you're an Enterprise customer and don't see this feature in your account yet, reach out to your account manager to discuss access.
</Info>

The Evals section is your command center for testing and evaluating AI Agent performance. Located in the **Monitor** tab (next to the Run tab) in the Agent builder, Evals enables you to create Test Suites, define evaluation criteria (Evaluators), run automated evaluations, and monitor ongoing performance, all without manual testing.

![Evals section showing Test Suites, Evaluators, Runs, and Performance](/images/agent/agent-evals.png)

## What you can do with Evals

<CardGroup cols={3}>
<Card title="Conduct Tests" icon="flask-vial">
@@ -28,11 +28,11 @@

---

## Evals sections

The Evals section contains five main sections, accessible from the left sidebar:

- **Test Suites**: Create and manage groups of Test scenarios for your Agent. Each Test Suite can contain multiple scenarios with different prompts and evaluation criteria.
- **Evaluators**: Configure global evaluation criteria that can be applied across any Test Suite or scenario without needing to set them up each time.
- **Runs**: View your evaluation run history and results. See average scores, number of conversations evaluated, progress status, credit spend, and creation dates for all past runs.
- **Publish Checks**: Configure which Test Suites must pass before your Agent can be published. Set a pass threshold and optionally block publishing if evaluations fail.
@@ -117,7 +117,7 @@
6. Click **Create Evaluator**

<Note>
When you run a Test scenario, scenario-level Evaluators are always included automatically. You can also add or remove global Evaluators (from the Evaluators tab) before each run, allowing you to mix standard criteria with scenario-specific evaluation rules.
When you run a Test scenario, scenario-level Evaluators are always included automatically. Global Evaluators are not included by default; you must explicitly select them in the evaluation modal (Run Test Set, Run Scenario, or Evaluate Selected Tasks) before each run.
</Note>

---
@@ -223,7 +223,7 @@
You can select specific Test scenarios within a Test Suite and run only those, or run all scenarios in the Test Suite together. You cannot bulk select and run multiple Test Suites at the same time.

1. Enter a name for the evaluation run (e.g., "Scenario Run - Jan 14, 12:14 PM"). A default name with timestamp is provided.
2. Select which global Evaluators to include in the run; you can add or remove global Evaluators before starting. Scenario-level Evaluators are always included automatically.
2. Scenario-level Evaluators are always included automatically. Global Evaluators are not included by default; to include them, tick the ones you want under the **Additional global checks** section.
3. Click **Run** to begin. The system will simulate conversations with your Agent based on your scenario prompts and evaluate them with your selected Evaluators.
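
The selection rule in the updated step 2 can be sketched as a small model. This is an illustration only, not the product's API: the `Evaluator` class, the `evaluators_for_run` function, and the example Evaluator names are all hypothetical, and the sketch assumes nothing beyond what the step states (scenario-level Evaluators are always applied, global Evaluators only when ticked).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Evaluator:
    """Hypothetical stand-in for an Evaluator; not a real product type."""
    name: str


def evaluators_for_run(scenario_evaluators, global_evaluators, ticked_global_names=()):
    """Return the Evaluators applied to one scenario in a run.

    Scenario-level Evaluators are always included. Global Evaluators are
    opt-in: only those whose names were ticked for this run are added.
    """
    ticked = set(ticked_global_names)
    selected = list(scenario_evaluators)
    selected += [e for e in global_evaluators if e.name in ticked]
    return selected


# Hypothetical example data.
scenario_level = [Evaluator("Asks for order number")]
global_level = [Evaluator("Polite tone"), Evaluator("No hallucinated links")]

# Nothing ticked: only the scenario-level Evaluator applies.
default_run = evaluators_for_run(scenario_level, global_level)

# One global Evaluator ticked for this run.
custom_run = evaluators_for_run(scenario_level, global_level, ["Polite tone"])
```

Under this model, a run started without ticking anything evaluates the scenario against its own Evaluators only, which is the behavior the revised step describes.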

---
@@ -292,7 +292,7 @@

The Performance tab also includes:

- **Data points** for the overall score over time
- **Evaluator breakdown** showing individual scoring per Evaluator
- **Graphs** visualizing Evaluator performance trends
- **List of evaluation runs** with score, name, and the ability to view the full conversation
@@ -355,6 +355,10 @@
You can add as many scenarios as needed to a single Test Suite. Each scenario is evaluated independently and can have its own Evaluators.
</Accordion>

<Accordion title="How many Evaluators can I add to a scenario?">
Each scenario supports up to 10 Evaluators. This applies to scenario-level Evaluators defined within the scenario itself. Global Evaluators added via **Additional global checks** at run time are counted separately.
</Accordion>

<Accordion title="How are credits calculated for evaluations?">
Credits consumed for each scenario are calculated by adding together:
- The Agent task run (the conversation with your Agent)