Binary file added public/images/docs/simulation/search-evals.png
7 changes: 7 additions & 0 deletions src/lib/navigation.ts
@@ -295,6 +295,10 @@ export const tabNavigation: NavTab[] = [
{ title: 'Understanding Evaluation', href: '/docs/evaluation/concepts/understanding-evaluation' },
{ title: 'Eval Types', href: '/docs/evaluation/concepts/eval-types' },
{ title: 'Eval Templates', href: '/docs/evaluation/concepts/eval-templates' },
{ title: 'Output Types', href: '/docs/evaluation/concepts/output-types' },
{ title: 'Data Injection', href: '/docs/evaluation/concepts/data-injection' },
{ title: 'Composite Evals', href: '/docs/evaluation/concepts/composite-evals' },
{ title: 'Versioning', href: '/docs/evaluation/concepts/versioning' },
{ title: 'Judge Models', href: '/docs/evaluation/concepts/judge-models' },
{ title: 'Eval Results', href: '/docs/evaluation/concepts/eval-results' },
]
@@ -305,6 +309,9 @@ export const tabNavigation: NavTab[] = [
{ title: 'Built-in Evals', href: '/docs/evaluation/builtin' },
{ title: 'Evaluate via Platform & SDK', href: '/docs/evaluation/features/evaluate' },
{ title: 'Create Custom Evals', href: '/docs/evaluation/features/custom' },
{ title: 'Test Playground', href: '/docs/evaluation/features/test-playground' },
{ title: 'Ground Truth', href: '/docs/evaluation/features/ground-truth' },
{ title: 'Error Localization', href: '/docs/evaluation/features/error-localization' },
{ title: 'Use Custom Models', href: '/docs/evaluation/features/custom-models' },
{ title: 'Future AGI Models', href: '/docs/evaluation/features/futureagi-models' },
{ title: 'Evaluate CI/CD Pipeline', href: '/docs/evaluation/features/cicd' },
53 changes: 53 additions & 0 deletions src/pages/docs/evaluation/builtin/accuracy.mdx
@@ -0,0 +1,53 @@
---
title: "Accuracy: Built-in Evaluation"
description: "Computes classification accuracy by comparing predicted labels against expected labels. Accepts single values or JSON arrays of labels. Case-insensitive comp..."
---

Computes classification accuracy by comparing predicted labels against expected labels. Accepts single values or JSON arrays of labels. Case-insensitive comparison.

<CodeGroup>

```python Python
result = evaluator.evaluate(
eval_templates="accuracy",
inputs={
"output": "The capital of France is Paris.",
"expected": "Paris"
},
)

print(result.eval_results[0].output)
print(result.eval_results[0].reason)
```

```typescript JS/TS
import { Evaluator } from "@future-agi/ai-evaluation";

const evaluator = new Evaluator();

const result = await evaluator.evaluate(
"accuracy",
{
output: "The capital of France is Paris.",
expected: "Paris"
}
);

console.log(result);
```

</CodeGroup>
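
Because the template also accepts JSON arrays of labels, a whole batch of predictions can be scored in one call. A minimal sketch, assuming the array form mirrors the single-value inputs above (the labels here are hypothetical):

```python
import json

# Hypothetical batch: three predicted labels vs. three expected labels.
predicted = ["cat", "Dog", "bird"]
expected = ["cat", "dog", "fish"]

result = evaluator.evaluate(
    eval_templates="accuracy",
    inputs={
        # JSON-encoded arrays, per "accepts JSON arrays of labels" above.
        "output": json.dumps(predicted),
        "expected": json.dumps(expected),
    },
)

# Comparison is case-insensitive, so "Dog" still matches "dog":
# 2 of 3 labels agree, for a score of roughly 0.67.
print(result.eval_results[0].output)
```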

| **Input** | | | |
| ------ | --------- | ---- | ----------- |
| | **Required Input** | **Type** | **Description** |
| | `output` | `string` | The predicted label, or a JSON array of predicted labels. |
| | `expected` | `string` | The expected label, or a JSON array of expected labels. |

| **Output** | | |
| ------ | ----- | ----------- |
| | **Field** | **Description** |
| | **Result** | Returns a numeric score between 0 and 1, plus a reason explaining the verdict. |
| | **Reason** | A plain-language explanation of the verdict. |

**Tags:** `NLP Metrics`, `Output Validation`
53 changes: 53 additions & 0 deletions src/pages/docs/evaluation/builtin/answer-similarity.mdx
@@ -0,0 +1,53 @@
---
title: "Answer Similarity: Built-in Evaluation"
description: "Evaluates the similarity between the expected and actual responses"
---

Evaluates the similarity between the expected and actual responses.

<CodeGroup>

```python Python
result = evaluator.evaluate(
eval_templates="answer_similarity",
inputs={
"expected_response": "...",
"response": "..."
},
)

print(result.eval_results[0].output)
print(result.eval_results[0].reason)
```

```typescript JS/TS
import { Evaluator } from "@future-agi/ai-evaluation";

const evaluator = new Evaluator();

const result = await evaluator.evaluate(
"answer_similarity",
{
expected_response: "...",
response: "..."
}
);

console.log(result);
```

</CodeGroup>

| **Input** | | | |
| ------ | --------- | ---- | ----------- |
| | **Required Input** | **Type** | **Description** |
| | `expected_response` | `string` | The reference response the output is compared against. |
| | `response` | `string` | The actual response being evaluated. |

| **Output** | | |
| ------ | ----- | ----------- |
| | **Field** | **Description** |
| | **Result** | Returns a numeric score between 0 and 1, plus a reason explaining the verdict. |
| | **Reason** | A plain-language explanation of the verdict. |

**Tags:** `NLP Metrics`, `Output Validation`
50 changes: 50 additions & 0 deletions src/pages/docs/evaluation/builtin/api-call.mdx
@@ -0,0 +1,50 @@
---
title: "Api Call: Built-in Evaluation"
description: "Makes an API call and evaluates the response"
---

Makes an API call and evaluates the response.

<CodeGroup>

```python Python
result = evaluator.evaluate(
eval_templates="api_call",
inputs={
"response": "..."
},
)

print(result.eval_results[0].output)
print(result.eval_results[0].reason)
```

```typescript JS/TS
import { Evaluator } from "@future-agi/ai-evaluation";

const evaluator = new Evaluator();

const result = await evaluator.evaluate(
"api_call",
{
response: "..."
}
);

console.log(result);
```

</CodeGroup>

| **Input** | | | |
| ------ | --------- | ---- | ----------- |
| | **Required Input** | **Type** | **Description** |
| | `response` | `string` | The response to be evaluated by the API call. |

| **Output** | | |
| ------ | ----- | ----------- |
| | **Field** | **Description** |
| | **Result** | Returns `Passed` or `Failed` per row, with a reason explaining the verdict. |
| | **Reason** | A plain-language explanation of the verdict. |

**Tags:** `Code`, `Output Validation`
53 changes: 53 additions & 0 deletions src/pages/docs/evaluation/builtin/balanced-accuracy.mdx
@@ -0,0 +1,53 @@
---
title: "Balanced Accuracy: Built-in Evaluation"
description: "Computes balanced accuracy (average recall per class). Handles imbalanced datasets better than standard accuracy"
---

Computes balanced accuracy (average recall per class). Handles imbalanced datasets better than standard accuracy.

<CodeGroup>

```python Python
result = evaluator.evaluate(
eval_templates="balanced_accuracy",
inputs={
"output": "The capital of France is Paris.",
"expected": "Paris"
},
)

print(result.eval_results[0].output)
print(result.eval_results[0].reason)
```

```typescript JS/TS
import { Evaluator } from "@future-agi/ai-evaluation";

const evaluator = new Evaluator();

const result = await evaluator.evaluate(
"balanced_accuracy",
{
output: "The capital of France is Paris.",
expected: "Paris"
}
);

console.log(result);
```

</CodeGroup>
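
To see why this handles skew better than plain accuracy, here is a plain-Python illustration of the metric itself (average recall per class). It is a conceptual sketch, not the SDK's internal implementation:

```python
from collections import defaultdict

def balanced_accuracy(expected, predicted):
    """Average recall per class: a conceptual sketch, not the SDK internals."""
    hits = defaultdict(int)     # correct predictions per true class
    support = defaultdict(int)  # samples per true class
    for e, p in zip(expected, predicted):
        support[e] += 1
        if e == p:
            hits[e] += 1
    recalls = [hits[c] / support[c] for c in support]
    return sum(recalls) / len(recalls)

# Imbalanced data: four "spam" samples, one "ham" sample.
expected = ["spam", "spam", "spam", "spam", "ham"]
predicted = ["spam", "spam", "spam", "spam", "spam"]

# Plain accuracy is 4/5 = 0.8, but balanced accuracy is
# (recall_spam + recall_ham) / 2 = (1.0 + 0.0) / 2 = 0.5,
# exposing that the minority class is never predicted correctly.
print(balanced_accuracy(expected, predicted))  # 0.5
```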

| **Input** | | | |
| ------ | --------- | ---- | ----------- |
| | **Required Input** | **Type** | **Description** |
| | `output` | `string` | The predicted label(s). |
| | `expected` | `string` | The expected (ground-truth) label(s). |

| **Output** | | |
| ------ | ----- | ----------- |
| | **Field** | **Description** |
| | **Result** | Returns a numeric score between 0 and 1, plus a reason explaining the verdict. |
| | **Reason** | A plain-language explanation of the verdict. |

**Tags:** `NLP Metrics`, `Output Validation`
4 changes: 2 additions & 2 deletions src/pages/docs/evaluation/builtin/bleu.mdx
@@ -42,8 +42,8 @@ console.log(result);
| **Input** | | | |
| ------ | --------- | ---- | ----------- |
| | **Required Input** | **Type** | **Description** |
-| | `reference` | `string` | Model-generated output to be evaluated. |
-| | `hypothesis` | `string` or `List[string]` | One or more reference texts. |
+| | `reference` | `string` | The reference / ground-truth text the output is being compared against. |
+| | `hypothesis` | `string` | The model-generated output being evaluated. |

| **Output** | | |
| ------ | ----- | ----------- |
53 changes: 53 additions & 0 deletions src/pages/docs/evaluation/builtin/character-error-rate.mdx
@@ -0,0 +1,53 @@
---
title: "Character Error Rate: Built-in Evaluation"
description: "Computes Character Error Rate (CER) for ASR/OCR evaluation. CER measures character-level edit distance between reference and hypothesis. Returns 1-CER as sco..."
---

Computes Character Error Rate (CER) for ASR/OCR evaluation. CER measures character-level edit distance between reference and hypothesis. Returns 1-CER as score (higher=better).

<CodeGroup>

```python Python
result = evaluator.evaluate(
eval_templates="character_error_rate",
inputs={
"reference": "The capital of France is Paris.",
"hypothesis": "Paris is the capital of France."
},
)

print(result.eval_results[0].output)
print(result.eval_results[0].reason)
```

```typescript JS/TS
import { Evaluator } from "@future-agi/ai-evaluation";

const evaluator = new Evaluator();

const result = await evaluator.evaluate(
"character_error_rate",
{
reference: "The capital of France is Paris.",
hypothesis: "Paris is the capital of France."
}
);

console.log(result);
```

</CodeGroup>
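
For intuition, CER is the character-level edit (Levenshtein) distance divided by the reference length, and the reported score is 1 - CER. A self-contained sketch of that computation (illustrative only, not the SDK's implementation):

```python
def cer_score(reference: str, hypothesis: str) -> float:
    """Returns 1 - CER, where CER = character edit distance / len(reference)."""
    m, n = len(reference), len(hypothesis)
    # Classic dynamic-programming Levenshtein distance over characters.
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            curr[j] = min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution (or match)
            )
        prev = curr
    cer = prev[n] / m if m else 0.0
    return max(0.0, 1.0 - cer)  # clamp, since CER can exceed 1

print(cer_score("recognize speech", "recognise speech"))  # one edit: ~0.94
```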

| **Input** | | | |
| ------ | --------- | ---- | ----------- |
| | **Required Input** | **Type** | **Description** |
| | `reference` | `string` | The reference / ground-truth text the output is being compared against. |
| | `hypothesis` | `string` | The model-generated output being evaluated. |

| **Output** | | |
| ------ | ----- | ----------- |
| | **Field** | **Description** |
| | **Result** | Returns a numeric score between 0 and 1, plus a reason explaining the verdict. |
| | **Reason** | A plain-language explanation of the verdict. |

**Tags:** `NLP Metrics`, `Audio`
53 changes: 53 additions & 0 deletions src/pages/docs/evaluation/builtin/chrf-score.mdx
@@ -0,0 +1,53 @@
---
title: "Chrf Score: Built-in Evaluation"
description: "Computes ChrF score (character n-gram F-score). More robust than BLEU for morphologically rich languages and short texts. Uses character-level n-grams up to ..."
---

Computes ChrF score (character n-gram F-score). More robust than BLEU for morphologically rich languages and short texts. Uses character-level n-grams up to order 6 with recall-weighted F-score.

<CodeGroup>

```python Python
result = evaluator.evaluate(
eval_templates="chrf_score",
inputs={
"reference": "The capital of France is Paris.",
"hypothesis": "Paris is the capital of France."
},
)

print(result.eval_results[0].output)
print(result.eval_results[0].reason)
```

```typescript JS/TS
import { Evaluator } from "@future-agi/ai-evaluation";

const evaluator = new Evaluator();

const result = await evaluator.evaluate(
"chrf_score",
{
reference: "The capital of France is Paris.",
hypothesis: "Paris is the capital of France."
}
);

console.log(result);
```

</CodeGroup>
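
Conceptually, the score averages an F-score over character n-gram orders 1 through 6, with recall weighted more heavily than precision (β = 2 in the common formulation). A simplified sketch of the idea, not the exact scoring the template uses:

```python
from collections import Counter

def char_ngrams(text: str, n: int) -> Counter:
    """Character n-gram counts (whitespace stripped, as chrF typically does)."""
    s = text.replace(" ", "")
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))

def chrf(reference: str, hypothesis: str, max_order: int = 6, beta: float = 2.0) -> float:
    """Simplified chrF: mean F-beta over character n-gram orders 1..max_order."""
    f_scores = []
    for n in range(1, max_order + 1):
        ref, hyp = char_ngrams(reference, n), char_ngrams(hypothesis, n)
        if not ref or not hyp:
            continue
        overlap = sum((ref & hyp).values())  # clipped n-gram matches
        precision = overlap / sum(hyp.values())
        recall = overlap / sum(ref.values())
        if precision + recall == 0:
            f_scores.append(0.0)
            continue
        b2 = beta ** 2  # beta > 1 weights recall more heavily
        f_scores.append((1 + b2) * precision * recall / (b2 * precision + recall))
    return sum(f_scores) / len(f_scores) if f_scores else 0.0

print(chrf("The capital of France is Paris.", "Paris is the capital of France."))
```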

| **Input** | | | |
| ------ | --------- | ---- | ----------- |
| | **Required Input** | **Type** | **Description** |
| | `reference` | `string` | The reference / ground-truth text the output is being compared against. |
| | `hypothesis` | `string` | The model-generated output being evaluated. |

| **Output** | | |
| ------ | ----- | ----------- |
| | **Field** | **Description** |
| | **Result** | Returns a numeric score between 0 and 1, plus a reason explaining the verdict. |
| | **Reason** | A plain-language explanation of the verdict. |

**Tags:** `NLP Metrics`, `Text`