A reusable evaluation framework for LLM-as-Judge and multi-agent workflows.
structured-evaluation provides standardized types for evaluation reports, enabling:
- LLM-as-Judge assessments with weighted category scores and severity-based findings
- GO/NO-GO summary reports for deterministic checks (CI, tests, validation)
- Multi-agent coordination with DAG-based report aggregation
```bash
go get github.com/agentplexus/structured-evaluation
```

| Package | Description |
|---|---|
| evaluation | EvaluationReport, CategoryScore, Finding, Severity types |
| summary | SummaryReport, TeamSection, TaskResult for GO/NO-GO checks |
| combine | DAG-based report aggregation using Kahn's algorithm |
| render/box | Box-format terminal renderer for summary reports |
| render/detailed | Detailed terminal renderer for evaluation reports |
| schema | JSON Schema generation and embedding |
For subjective quality assessments with detailed findings:

```go
import "github.com/agentplexus/structured-evaluation/evaluation"

report := evaluation.NewEvaluationReport("prd", "document.md")
report.AddCategory(evaluation.NewCategoryScore("problem_definition", 0.20, 8.5, "Clear problem statement"))
report.AddFinding(evaluation.Finding{
	Severity:       evaluation.SeverityMedium,
	Category:       "metrics",
	Title:          "Missing baseline metrics",
	Recommendation: "Add current baseline measurements",
})
report.Finalize("sevaluation check document.md")
```
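The finished report is plain data, so it can be written to disk and fed to the sevaluation CLI shown below. A minimal sketch, assuming the report types marshal with the standard library's encoding/json (the report.json filename is just an example):

```go
import (
	"encoding/json"
	"log"
	"os"
)

// Serialize the finished report so the CLI can render or check it.
data, err := json.MarshalIndent(report, "", "  ")
if err != nil {
	log.Fatal(err)
}
if err := os.WriteFile("report.json", data, 0o644); err != nil {
	log.Fatal(err)
}
```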
import "github.com/agentplexus/structured-evaluation/summary"
report := summary.NewSummaryReport("my-service", "v1.0.0", "Release Validation")
report.AddTeam(summary.TeamSection{
ID: "qa",
Name: "Quality Assurance",
Tasks: []summary.TaskResult{
{ID: "unit-tests", Status: summary.StatusGo, Detail: "Coverage: 92%"},
{ID: "e2e-tests", Status: summary.StatusWarn, Detail: "2 flaky tests"},
},
})
```
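A summary report is typically used to gate a pipeline on its task results. The sketch below is one possible gating policy, written only against the StatusGo and StatusWarn constants used above; treating warnings as non-blocking is an illustrative choice, not something the library prescribes:

```go
import "github.com/agentplexus/structured-evaluation/summary"

// Illustrative gate: StatusWarn is non-blocking; any other non-GO status blocks.
func allTasksGo(teams []summary.TeamSection) bool {
	for _, team := range teams {
		for _, task := range team.Tasks {
			if task.Status != summary.StatusGo && task.Status != summary.StatusWarn {
				return false
			}
		}
	}
	return true
}
```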
Following InfoSec conventions:

| Severity | Icon | Blocking | Description |
|---|---|---|---|
| Critical | 🔴 | Yes | Must fix before approval |
| High | 🔴 | Yes | Must fix before approval |
| Medium | 🟡 | No | Should fix, tracked |
| Low | 🟢 | No | Nice to fix |
| Info | ⚪ | No | Informational only |
Pass criteria combine per-severity finding limits with a minimum overall score:

```go
criteria := evaluation.DefaultPassCriteria()
// MaxCritical: 0, MaxHigh: 0, MaxMedium: -1 (unlimited), MinScore: 7.0

criteria = evaluation.StrictPassCriteria()
// MaxCritical: 0, MaxHigh: 0, MaxMedium: 3, MinScore: 8.0
```
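The check command of the CLI below applies criteria like these for you; purely to illustrate how the fields combine, here is a hand-rolled version. The PassCriteria type name and the SeverityCritical/SeverityHigh constants are assumptions (only SeverityMedium appears above); the field names come from the comments:

```go
// Illustrative only: tally findings per severity and compare against the
// documented criteria fields (MaxCritical, MaxHigh, MaxMedium, MinScore).
func passes(findings []evaluation.Finding, overallScore float64, c evaluation.PassCriteria) bool {
	var critical, high, medium int
	for _, f := range findings {
		switch f.Severity {
		case evaluation.SeverityCritical: // assumed constant
			critical++
		case evaluation.SeverityHigh: // assumed constant
			high++
		case evaluation.SeverityMedium:
			medium++
		}
	}
	if critical > c.MaxCritical || high > c.MaxHigh {
		return false
	}
	if c.MaxMedium >= 0 && medium > c.MaxMedium { // -1 means unlimited
		return false
	}
	return overallScore >= c.MinScore
}
```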
```bash
# Install
go install github.com/agentplexus/structured-evaluation/cmd/sevaluation@latest

# Render reports
sevaluation render report.json --format=detailed
sevaluation render report.json --format=box
sevaluation render report.json --format=json

# Check pass/fail (exit code 0/1)
sevaluation check report.json

# Validate structure
sevaluation validate report.json

# Generate JSON Schema
sevaluation schema generate -o ./schema/
```
import "github.com/agentplexus/structured-evaluation/combine"
results := []combine.AgentResult{
{TeamID: "qa", Tasks: qaTasks},
{TeamID: "security", Tasks: secTasks, DependsOn: []string{"qa"}},
{TeamID: "release", Tasks: relTasks, DependsOn: []string{"qa", "security"}},
}
report := combine.AggregateResults(results, "my-project", "v1.0.0", "Release")
// Teams are topologically sorted: qa β security β releaseSchemas are embedded for runtime validation:
import "github.com/agentplexus/structured-evaluation/schema"
evalSchema := schema.EvaluationSchemaJSON
summarySchema := schema.SummarySchemaJSONDefine explicit scoring criteria for consistent evaluations:
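One way to use the embedded schema at runtime, sketched with the third-party github.com/xeipuuv/gojsonschema package (not part of this module); it assumes the embedded schema is exposed as a string or byte slice:

```go
import (
	"log"
	"os"

	"github.com/xeipuuv/gojsonschema"

	"github.com/agentplexus/structured-evaluation/schema"
)

// Validate a serialized evaluation report against the embedded schema.
data, err := os.ReadFile("report.json")
if err != nil {
	log.Fatal(err)
}

result, err := gojsonschema.Validate(
	gojsonschema.NewStringLoader(string(schema.EvaluationSchemaJSON)),
	gojsonschema.NewBytesLoader(data),
)
if err != nil {
	log.Fatal(err)
}
if !result.Valid() {
	for _, e := range result.Errors() {
		log.Println(e)
	}
}
```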
Define explicit scoring criteria for consistent evaluations:

```go
rubric := evaluation.NewRubric("quality", "Output quality").
	AddRangeAnchor(8, 10, "Excellent", "Near perfect").
	AddRangeAnchor(5, 7.9, "Good", "Acceptable").
	AddRangeAnchor(0, 4.9, "Poor", "Needs work")

// Use default PRD rubric
rubricSet := evaluation.DefaultPRDRubricSet()
```
Track LLM judge configuration for reproducibility:

```go
judge := evaluation.NewJudgeMetadata("claude-3-opus").
	WithProvider("anthropic").
	WithPrompt("prd-eval-v1", "1.0").
	WithTemperature(0.0).
	WithTokenUsage(1500, 800)

report.SetJudge(judge)
```
Compare two outputs instead of absolute scoring:

```go
comparison := evaluation.NewPairwiseComparison(input, outputA, outputB)
comparison.SetWinner(evaluation.WinnerA, "A is more accurate", 0.9)

// Aggregate multiple comparisons
result := evaluation.ComputePairwiseResult(comparisons)
// result.WinRateA, result.OverallWinner
```
Combine evaluations from multiple judges:

```go
result := evaluation.AggregateEvaluations(evaluations, evaluation.AggregationMean)
// Methods: AggregationMean, AggregationMedian, AggregationConservative, AggregationMajority

// result.Agreement - inter-judge agreement (0-1)
// result.Disagreements - categories with significant disagreement
// result.ConsolidatedDecision - final aggregated decision
```
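A short sketch of acting on those fields, for example requiring a minimum level of inter-judge agreement before trusting the consolidated decision (the 0.8 threshold is illustrative, and the formatting assumes Agreement is a float in [0, 1]):

```go
import "log"

// Only trust the consolidated decision when the judges broadly agree.
if result.Agreement < 0.8 {
	log.Printf("low inter-judge agreement (%.2f); disagreements: %v", result.Agreement, result.Disagreements)
}
log.Printf("consolidated decision: %v", result.ConsolidatedDecision)
```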
import "github.com/agentplexus/omniobserve/integrations/sevaluation"
// Export to observability platform
err := sevaluation.Export(ctx, provider, traceID, report)Designed to work with:
Designed to work with:

- github.com/agentplexus/omniobserve - LLM observability (Opik, Phoenix, Langfuse)
- github.com/grokify/structured-requirements - PRD evaluation templates
- github.com/agentplexus/multi-agent-spec - Agent coordination
- github.com/grokify/structured-changelog - Release validation
MIT License - see LICENSE for details.