docs(evaluation): revamp evals documentation for new eval system #648
Draft
KarthikAvinashFI wants to merge 5 commits into dev from
Conversation
Aligns the evaluation docs with the post-revamp platform: three eval types (Agents / LLM-As-A-Judge / Code), three output types (Pass/fail / Scoring / Choices), composite templates, versioning, ground truth, error localization, and updated apply flows for datasets, trace projects (now via Tasks), and simulation.

Concepts (rewritten / new):
- eval-types: 3-type taxonomy matching the create-page tabs
- eval-templates: built-in vs custom, single vs composite, versioning
- eval-results: result formats per output type
- judge-models: Turing models + bring-your-own
- understanding-evaluation: surfaces and how it all fits
- output-types (new): Pass/fail, Scoring (label-based), Choices
- data-injection (new): the six Context options
- composite-evals (new): aggregation functions and child axis (see the sketch after this list)
- versioning (new): Set as Default, Restore Version, pinning

Features (rewritten / new):
- custom: full create flow for all 3 types with field reference
- evaluate: dataset apply flow + SDK
- test-playground (new): four source modes, AI generate
- error-localization (new): toggle, run lifecycle, SDK
- ground-truth (new): upload, mapping, embedding statuses

Surface-specific updates:
- observe/features/evals: rewritten around the Tasks page flow (Basic Info / Evaluations / Filters / Scheduling)
- quickstart/running-evals-in-simulation: aligned with the 4-step Create a Simulation wizard

Eval Groups was removed from the docs, as the feature is no longer exposed in the main UI navigation.

TH-4638
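To make the new taxonomy concrete, here is a minimal TypeScript sketch of the three output types and how a composite template could aggregate child results along the child axis. Every type and function name here is an illustrative assumption, not the platform's actual API.

```ts
// Hypothetical sketch: names are illustrative, not the platform's API.
// Models the three output types (Pass/fail, Scoring, Choices) and how a
// composite template could aggregate child eval results.

type EvalResult =
  | { kind: "pass_fail"; passed: boolean }
  | { kind: "scoring"; score: number; label?: string } // label-based scoring
  | { kind: "choices"; choice: string };

// Normalize a child result to a number so an aggregation function can combine them.
function toNumber(r: EvalResult): number {
  switch (r.kind) {
    case "pass_fail": return r.passed ? 1 : 0;
    case "scoring":   return r.score;
    case "choices":   return Number.NaN; // categorical results need an explicit mapping first
  }
}

type Aggregation = "mean" | "min" | "max";

// Combine child eval results into a single composite score.
function aggregate(children: EvalResult[], fn: Aggregation): number {
  const xs = children.map(toNumber).filter(Number.isFinite);
  if (xs.length === 0) return Number.NaN;
  switch (fn) {
    case "mean": return xs.reduce((a, b) => a + b, 0) / xs.length;
    case "min":  return Math.min(...xs);
    case "max":  return Math.max(...xs);
  }
}

// A composite of one pass/fail child and one scoring child, averaged.
console.log(aggregate(
  [{ kind: "pass_fail", passed: true }, { kind: "scoring", score: 0.8 }],
  "mean",
)); // 0.9
```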
Adds reference pages for built-in evals that were missing documentation (deterministic, statistical, and agent-mode templates). Also fixes the Detect Hallucination input requirement.
Adds rows for the freshly generated reference pages so users can find them from the Built-in Evals catalog.
Summary
Brings the evaluation docs in line with the post-revamp platform. Replaces the old four-method taxonomy (LLM as Judge / Deterministic / Statistical Metric / LLM as Ranker) with what the UI actually shows today: Agents, LLM-As-A-Judge, and Code. Adds new concept and feature pages for things that were undocumented (composite evals, versioning, ground truth, error localization, test playground, data injection, output types in their new label-based form). Rewrites the trace and simulation eval guides around the actual Tasks and Create a Simulation flows.
Linear: TH-4638
What changed
Concepts (under `evaluation/concepts/`)
- Rewritten: `eval-types`, `eval-templates`, `eval-results`, `judge-models`, `understanding-evaluation`.
- New: `output-types`, `data-injection`, `composite-evals`, `versioning`.

Features (under `evaluation/features/`)
- Rewritten: `custom`, `evaluate` (an SDK-style sketch follows this section).
- New: `test-playground`, `error-localization`, `ground-truth`.
- Minor: `custom-models` (added trace projects to the surfaces list).
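As a rough illustration of the dataset apply flow mentioned above, here is a hedged TypeScript sketch. The endpoint path, field names, and the `errorLocalization` flag are all assumptions, not the documented SDK surface.

```ts
// Hypothetical sketch of the dataset apply flow over HTTP. The endpoint path,
// field names, and the errorLocalization flag are assumptions, not the documented SDK.

type ApplyEvalRequest = {
  evalId: string;                        // the eval (or composite) to run
  datasetId: string;                     // dataset to apply it to
  columnMapping: Record<string, string>; // dataset columns -> eval inputs
  errorLocalization?: boolean;           // opt in to the error-localization run mode
};

async function applyEvalToDataset(
  baseUrl: string,
  apiKey: string,
  req: ApplyEvalRequest,
): Promise<{ runId: string }> {
  const res = await fetch(`${baseUrl}/evals/${req.evalId}/apply`, {
    method: "POST",
    headers: { "Content-Type": "application/json", Authorization: `Bearer ${apiKey}` },
    body: JSON.stringify(req),
  });
  if (!res.ok) throw new Error(`apply failed: ${res.status}`);
  return res.json(); // a run id to poll for results
}
```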
Surface-specific eval guides (outside `evaluation/`)
- `observe/features/evals` rewritten around the Tasks flow (Basic Info / Evaluations / Filters / Scheduling) and the Historical data / New incoming data run modes; a hypothetical task shape follows this list.
- `quickstart/running-evals-in-simulation` aligned with the 4-step Create a Simulation wizard (Add simulation details, Choose Scenario(s), Select Evaluations, Summary) and updated mapping fields.
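For a sense of what a configured task covers, here is a hypothetical shape mirroring the four wizard steps; every field name and value is illustrative, not the real payload.

```ts
// Hypothetical shape of a trace-project eval task, mirroring the four wizard steps
// (Basic Info / Evaluations / Filters / Scheduling). Every field name is illustrative.

const task = {
  basicInfo: { name: "Nightly response-quality checks", project: "my-trace-project" },
  evaluations: ["detect-hallucination"], // evals to run on matching traces
  filters: { tags: ["production"] },     // which traces the task applies to
  scheduling: {
    // the two run modes described in the rewritten guide
    runMode: "new_incoming_data" as "historical_data" | "new_incoming_data",
    cron: "0 2 * * *",                   // assumed scheduling knob
  },
};

console.log(JSON.stringify(task, null, 2));
```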
Navigation
- `src/lib/navigation.ts` updated to include the 4 new concept pages and 3 new feature pages in the sidebar (a sketch of the added entries follows).
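The navigation change might look roughly like the following excerpt; the surrounding structure and href format are assumptions, and only the page slugs come from this PR.

```ts
// Hypothetical excerpt of src/lib/navigation.ts. The surrounding structure and
// href format are assumed; only the page slugs come from this PR.

const evaluationNav = {
  concepts: [
    // ...existing entries...
    { title: "Output Types", href: "/docs/evaluation/concepts/output-types" },
    { title: "Data Injection", href: "/docs/evaluation/concepts/data-injection" },
    { title: "Composite Evals", href: "/docs/evaluation/concepts/composite-evals" },
    { title: "Versioning", href: "/docs/evaluation/concepts/versioning" },
  ],
  features: [
    // ...existing entries...
    { title: "Test Playground", href: "/docs/evaluation/features/test-playground" },
    { title: "Error Localization", href: "/docs/evaluation/features/error-localization" },
    { title: "Ground Truth", href: "/docs/evaluation/features/ground-truth" },
  ],
};

export default evaluationNav;
```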
Removed
- `eval-groups.mdx` and all references. The Groups feature is no longer reachable from the main UI navigation (`/dashboard/evaluations` renders `EvalsListView` directly, without the wrapper that has the Groups tab).
Style guide compliance
- Concept pages open with `## About`; no UI walkthrough screenshots in concept pages.
- Missing screenshots are marked with `{/* SCREENSHOT NEEDED: ... */}` MDX comments. Run `grep -rn "SCREENSHOT NEEDED" src/pages/docs/` to list them.
- Internal code names (`agentic_eval/` or `ee/`) do not appear in any doc.
Every concrete claim was cross-checked against the live frontend and backend:
- `EvalCreatePage.jsx`, `ModelSelector.jsx`, `OutputTypeConfig.jsx`, `TestPlayground.jsx`, `CompositeDetailPanel.jsx`, `EvalGroundTruthTab.jsx`, `EvalDetailPage.jsx`.
- `TaskConfigPanel.jsx`, `TaskSchedulingSection.jsx`, `EvalsTasksViewV2.jsx`, `TaskListView.jsx`.
- `CreateRunTestPage.jsx`, `TestEvaluationPage.jsx`, `RunTestsContent.jsx`.
- `DevelopBarRightSection.jsx`, `DevelopEvaluationDrawer.jsx`.
- `config-navigation.jsx`, `routes/sections/dashboard.jsx`, `ConfigNavData.jsx`.
- `model_hub/types.py` and URL routes in `model_hub/urls.py`, `sdk/urls.py`, `tracer/urls.py`.

Test plan
- `pnpm audit-links` — passes (0 broken nav links, 0 broken content links).
- `pnpm build` — passes (all 18 docs render).
- `pnpm dev` — used for local review.
- Replace the `{/* SCREENSHOT NEEDED: ... */}` placeholders with real screenshots before un-drafting.