runMultiShotOptimization is the public adapter for GEPA-style optimization over
variable-length agent conversations.
Use it when the thing you want to improve is not a single model call. Typical targets are agent system prompts, tool descriptions, routing policies, retrieval plans, or app-specific scaffolding that affects an entire task trajectory.
The primitive is intentionally small. Your app owns the domain logic:
seedVariants: prompt/config/tool-policy candidatesrunner: executes one complete task trajectory for one variantscorer: scores the trajectory and emits actionable side informationmutateAdapter: proposes new variants from top and bottom trials
agent-eval owns the release-critical glue:
- stable paired seeds
- search-split prompt evolution
- cost/score Pareto objectives
- failed-run conversion into failed trials
- ASI projection into reflection traces and numeric metrics
- optional paired holdout gating through
HeldOutGate - validated
RunRecordrows for promotion evidence
The return shape separates discovery from promotion:
searchBestVariant: best variant on the optimizer-visible search scenariossearchBestAggregate: aggregate for that search winnerpromotedVariant: variant callers should shippromotedAggregate: aggregate for the promoted variantgate: holdout decision and evidence, ornullwhen no gate ran
If a holdout gate is configured and rejects the search winner,
promotedVariant is the baseline. Do not ship searchBestVariant directly
unless you intentionally run without a holdout gate.
The scorer should return asi rows for concrete failure modes:
{
expectationId: 'used-primary-sources',
message: 'The final answer cited secondary summaries instead of primary sources.',
severity: 'error',
responsibleSurface: 'retrieval-policy',
suggestion: 'Prefer primary-source domains during source-gathering turns.',
}Standard knowledge-related responsible surfaces are:
knowledge-requirementsdata-acquisitionretrieval-policyuser-question-policy
These rows become:
- reflection expectations via
trialTraceFromMultiShotTrial - aggregate metrics like
asi.errorandsurface.retrieval-policy - trace evidence available to downstream reports
This is the main reason to use this primitive instead of reducing each run to a single scalar reward.
For release gates, configure gate. The first seed variant is the baseline and
gate.gate.baselineKey must match its id.
Holdout scenarios must be disjoint from searchScenarioIds. The adapter runs
baseline and candidate with the same (scenarioId, rep) seed, validates every
row with validateRunRecord, then asks HeldOutGate whether to promote.
When gate.searchScenarioIds is omitted, the adapter reuses
searchScenarioIds for the overfit-gap check.
import {
runMultiShotOptimization,
trialTraceFromMultiShotTrial,
type MultiShotVariant,
} from '@tangle-network/agent-eval'
type Payload = { systemPrompt: string }
const baseline: MultiShotVariant<Payload> = {
id: 'baseline',
label: 'baseline',
generation: 0,
payload: { systemPrompt: currentPrompt },
}
const result = await runMultiShotOptimization<Payload>({
runId: `research-agent-${Date.now()}`,
target: 'research-agent-system-prompt',
seedVariants: [baseline],
searchScenarioIds: searchScenarios.map((s) => s.id),
reps: 2,
generations: 4,
populationSize: 4,
scoreConcurrency: 4,
runner: {
async run({ variant, scenarioId, seed }) {
return runYourAgentToCompletion({ scenarioId, seed, prompt: variant.payload.systemPrompt })
},
},
scorer: {
async score({ run }) {
return scoreFullTrajectory(run.trace)
},
},
mutateAdapter: {
async mutate({ parent, bottomTrials, childCount, generation }) {
const traces = bottomTrials.map((t) => trialTraceFromMultiShotTrial(t))
return proposePromptMutations({ parent, traces, childCount, generation })
},
},
})
deploy(result.promotedVariant.payload)