Commit 81ce9af

rhacs-bot and mtodor authored
chore(evals): Update model evaluations 2026-05-26 (#135)
Co-authored-by: mtodor <3965286+mtodor@users.noreply.github.com>
1 parent 2e2fac4 commit 81ce9af

1 file changed

Lines changed: 15 additions & 15 deletions

docs/model-evaluation.md
@@ -39,27 +39,27 @@ A task passes when **all** its assertions pass **and** the LLM judge approves th
 
 <!-- model:gpt-5-mini start -->
 
-### gpt-5-mini — 2026-04-21
+### gpt-5-mini — 2026-05-26
 
-**Overall: 11/11 tasks passed (100%)**
+**Overall: 10/11 tasks passed (90%)**
 
 #### Task Results
 
 | # | Task | Result | toolsUsed | minCalls | maxCalls | Input Tokens | Output Tokens |
 |---|------|--------|-----------|----------|----------|--------------|---------------|
-| 1 | list-clusters | Pass | Pass | Pass | Pass | 1720 | 634 |
-| 2 | cve-detected-workloads | Pass | Pass | Pass | Pass | 565 | 1900 |
-| 3 | cve-detected-clusters | Pass | Pass | Pass | Pass | 1759 | 1983 |
-| 4 | cve-nonexistent | Pass | Pass | Pass | **Fail** | 2550 | 3087 |
-| 5 | cve-cluster-does-exist | Pass | Pass | Pass | Pass | 539 | 1032 |
-| 6 | cve-cluster-does-not-exist | Pass | **Fail** | Pass | Pass | 504 | 1481 |
-| 7 | cve-clusters-general | Pass | Pass | Pass | Pass | 516 | 1692 |
-| 8 | cve-cluster-list | Pass | Pass | Pass | Pass | 2530 | 3438 |
-| 9 | cve-log4shell | Pass | Pass | Pass | Pass | 2032 | 2593 |
-| 10 | cve-multiple | Pass | Pass | Pass | Pass | 2166 | 2588 |
-| 11 | rhsa-not-supported | Pass | | Pass | Pass | 1674 | 1429 |
-
-**Total input tokens**: 16555 | **Total output tokens**: 21857
+| 1 | cve-detected-clusters | Pass | Pass | Pass | Pass | 1513 | 1506 |
+| 2 | cve-cluster-does-not-exist | Pass | Pass | Pass | Pass | 1496 | 1289 |
+| 3 | cve-cluster-does-exist | Pass | Pass | Pass | Pass | 507 | 1265 |
+| 4 | cve-clusters-general | Pass | Pass | Pass | Pass | 1788 | 2052 |
+| 5 | cve-cluster-list | Pass | Pass | Pass | Pass | 674 | 1682 |
+| 6 | rhsa-not-supported | Pass | | Pass | Pass | 1810 | 3098 |
+| 7 | cve-nonexistent | **Fail** | Pass | Pass | Pass | 561 | 1506 |
+| 8 | cve-detected-workloads | Pass | Pass | Pass | Pass | 539 | 2250 |
+| 9 | cve-multiple | Pass | Pass | Pass | **Fail** | 2234 | 3627 |
+| 10 | cve-log4shell | Pass | Pass | Pass | Pass | 2245 | 3516 |
+| 11 | list-clusters | Pass | Pass | Pass | Pass | 1700 | 607 |
+
+**Total input tokens**: 15067 | **Total output tokens**: 22398
 
 <!-- model:gpt-5-mini end -->
 
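
The summary lines in the updated report (overall pass count and token totals) can be cross-checked against the per-task rows. The following is a minimal sketch, not the tooling that generated the report: the `rows` list is transcribed by hand from the 2026-05-26 table, and `summarize` is a hypothetical helper introduced only for this check. It counts the `Result` column as shown and does not try to re-derive it from the assertion columns.

```python
# Rows transcribed from the updated table:
# (task, result, input_tokens, output_tokens)
rows = [
    ("cve-detected-clusters", "Pass", 1513, 1506),
    ("cve-cluster-does-not-exist", "Pass", 1496, 1289),
    ("cve-cluster-does-exist", "Pass", 507, 1265),
    ("cve-clusters-general", "Pass", 1788, 2052),
    ("cve-cluster-list", "Pass", 674, 1682),
    ("rhsa-not-supported", "Pass", 1810, 3098),
    ("cve-nonexistent", "Fail", 561, 1506),
    ("cve-detected-workloads", "Pass", 539, 2250),
    ("cve-multiple", "Pass", 2234, 3627),
    ("cve-log4shell", "Pass", 2245, 3516),
    ("list-clusters", "Pass", 1700, 607),
]


def summarize(rows):
    """Hypothetical helper: return (passed, total, input_tokens, output_tokens)."""
    passed = sum(1 for _, result, _, _ in rows if result == "Pass")
    total_in = sum(row[2] for row in rows)
    total_out = sum(row[3] for row in rows)
    return passed, len(rows), total_in, total_out


passed, total, total_in, total_out = summarize(rows)

# 10/11 is about 90.9%; compare truncation with rounding.
percent_truncated = passed * 100 // total
percent_rounded = round(passed * 100 / total)

print(f"Overall: {passed}/{total} tasks passed")
print(f"Truncated: {percent_truncated}%  Rounded: {percent_rounded}%")
print(f"Total input tokens: {total_in} | Total output tokens: {total_out}")
```

Under these assumptions the token totals match the report (15067 and 22398). The reported 90% matches the truncated value rather than the rounded one, which suggests, but does not confirm, that the generator truncates the percentage.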
