feat: cheap-model cost advisor for the eval role #59
Merged
A user with Hermes set to claude-opus-4-5 silently inherits Opus for the eval + judge roles, where ~100 calls per evolution make a 5x-cheaper sibling like claude-haiku-4-5 the obvious choice. Today they have to read docs/model_resolution.md end-to-end to discover the per-role override pattern.

Adds a small advisor module that consults litellm.model_cost (a 2700+ entry curated catalog shipped with LiteLLM), enumerates same-provider candidates that are strictly cheaper for input tokens and have at least the current context window, and returns the cheapest qualifying option. Lookup tolerates both prefixed (openrouter/openai/gpt-5) and bare (claude-opus-4-5 for Anthropic-direct) catalog key shapes. Returns None for unknown models: Bedrock, Codex, and local-server endpoints aren't in the catalog and are skipped gracefully rather than guessed at. Cross-provider swaps are intentionally excluded; staying within the user's already-configured provider keeps the suggestion paste-ready for --eval-model without a second credential setup.

Renders a Rich panel mirroring the existing operator-facing panel style, surfacing the absolute cost-per-1M for both models, the cost ratio, the context window, and a paste-ready CLI flag, plus a one-line caveat about why the optimizer/reflection roles are intentionally left out.
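The lookup and selection described above could look roughly like the sketch below. It is not the PR's code: `lookup`, `cheapest_alternative`, and the `CATALOG` fixture are hypothetical names, and the prices are illustrative stand-ins. The entry fields (`litellm_provider`, `input_cost_per_token`, `max_input_tokens`) follow the shape of `litellm.model_cost` entries.

```python
from typing import Optional, Tuple

# Hypothetical stand-in for litellm.model_cost; prices are illustrative only.
CATALOG = {
    "claude-opus-4-5": {"litellm_provider": "anthropic",
                        "input_cost_per_token": 5e-6, "max_input_tokens": 200_000},
    "claude-sonnet-4-5": {"litellm_provider": "anthropic",
                          "input_cost_per_token": 3e-6, "max_input_tokens": 200_000},
    "claude-haiku-4-5": {"litellm_provider": "anthropic",
                         "input_cost_per_token": 1e-6, "max_input_tokens": 200_000},
    "openrouter/openai/gpt-5": {"litellm_provider": "openrouter",
                                "input_cost_per_token": 1.25e-6, "max_input_tokens": 272_000},
}


def lookup(model: str, catalog: dict) -> Optional[Tuple[str, dict]]:
    """Try the full key first, then strip leading prefix segments one at a time."""
    key = model
    while True:
        if key in catalog:
            return key, catalog[key]
        if "/" not in key:
            return None  # unknown model: skip rather than guess
        key = key.split("/", 1)[1]


def cheapest_alternative(model: str, catalog: dict) -> Optional[str]:
    """Cheapest same-provider entry that is strictly cheaper with >= context."""
    found = lookup(model, catalog)
    if found is None:
        return None
    current_key, current = found
    current_cost = current.get("input_cost_per_token")
    if not current_cost:
        return None  # no parseable pricing for the current model
    candidates = [
        (entry["input_cost_per_token"], key)
        for key, entry in catalog.items()
        if key != current_key
        and entry.get("litellm_provider") == current.get("litellm_provider")
        and 0 < entry.get("input_cost_per_token", 0) < current_cost
        and entry.get("max_input_tokens", 0) >= current.get("max_input_tokens", 0)
    ]
    return min(candidates)[1] if candidates else None
```

With this fixture, `cheapest_alternative("anthropic/claude-opus-4-5", CATALOG)` resolves the bare key via the strip-prefix fallback and picks the lowest-priced qualifying sibling.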
…suggest
Surfaces a Rich panel after the preflight probe in both evolve_skill and
evolve_tool when the user inherited the eval model from Hermes (i.e.,
--eval-model is unset) and a strictly cheaper same-provider alternative
exists in litellm.model_cost. The panel shows absolute cost-per-1M for
both models, the cost ratio, the context window, and a paste-ready
--eval-model flag — plus a one-line caveat that the optimizer/reflection
roles are intentionally untouched (reasoning quality matters there).
Skip rules:
* --eval-model is explicit -> user already chose; respect it
* --no-cost-suggest -> explicit operator opt-out
* --dry-run -> matches existing preflight skip-on-dry-run
* Resolved model not in the LiteLLM catalog (Bedrock, Codex, local) ->
advisor returns None and no panel renders
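As a rough illustration of the skip rules listed above (the function and parameter names are hypothetical, not taken from the PR), the gate amounts to a short chain of early returns:

```python
from typing import Optional


def should_render_panel(
    eval_model_flag: Optional[str],  # value of --eval-model, None when unset
    no_cost_suggest: bool,           # --no-cost-suggest
    dry_run: bool,                   # --dry-run
    suggestion: Optional[str],       # advisor result, None for off-catalog models
) -> bool:
    """Return True only when none of the four skip rules applies."""
    if eval_model_flag is not None:
        return False  # user already chose; respect it
    if no_cost_suggest:
        return False  # explicit operator opt-out
    if dry_run:
        return False  # matches the preflight skip-on-dry-run
    return suggestion is not None  # advisor returned None -> no panel
```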
The eval-LM resolve is hoisted out of the preflight conditional so the
advisor and the preflight share one ResolvedLM rather than re-walking
config.yaml + auth.json. Adds ~5ms of file I/O on the --no-preflight path
that previously deferred resolution to GEPA setup; trade is worth it for
cleaner control flow.
docs/model_resolution.md "Cost considerations" section gets a third
bullet pointing at the advisor + the --no-cost-suggest escape hatch.
Spike-revealed regression: a user on claude-opus-4-5 was being suggested claude-3-haiku-20240307 (March 2024, gen 3) at 20× cheaper, because pure input-cost-sort wins over the current-gen claude-haiku-4-5 at 5× cheaper. A user pasting that flag would silently downgrade to a 20-month-old gen-3 model with noticeably weaker evaluations, the opposite of the helpful nudge the advisor is supposed to be.

Adds a (major, minor) version parser tolerant of real catalog naming:

  claude-opus-4-5           -> (4, 5)
  claude-opus-4-5-20251101  -> (4, 5)     date suffix ignored
  claude-3-opus-20240229    -> (3, 0)
  claude-4-sonnet-20250514  -> (4, 0)     digit not followed by -minor
  claude-sonnet-4-6         -> (4, 6)
  custom-local-model        -> (None, 0)  unparseable; degrades open

Filter rule: same major version as the current model. Unparseable major is treated as "match all" so custom/local-server names don't silently lose all suggestions.

Sort key adds minor desc as a tiebreaker so that when claude-sonnet-4-6 and claude-4-sonnet-20250514 both score $3/M with 1M context, the canonical newer-named model wins.

Effective change vs. real catalog:

  opus-4-5 -> haiku-4-5   (was: 3-haiku-20240307)    fix
  opus-4-7 -> sonnet-4-6  (was: 4-sonnet-20250514)   cleaner
  opus-4-1 -> haiku-4-5   (was: 3-haiku-20240307)    fix
  3-opus-* -> 3-haiku-*   (unchanged; gen-3 user gets gen-3)

The strict context-window filter is unchanged. opus-4-7 (1M context) users still see only sonnet-4-6 (1M, 1.7× cheaper), not haiku-4-5 (200k, 5× cheaper). Relaxing context would need a "minimum eval-workload context" knob; out of scope here.
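A minimal sketch of a parser that reproduces the six mappings listed in this commit message. The regex, the names `parse_version` and `same_generation`, and the acceptance of a dot separator (as in gpt-5.4-mini) are assumptions for illustration; the PR's actual implementation may differ.

```python
import re
from typing import Optional, Tuple

# A 1-2 digit token after "-" (or at the start), optionally followed by a
# 1-2 digit minor, and then a "-" or the end of the name. Eight-digit date
# suffixes never match because of the length cap plus the lookahead.
_VERSION_RE = re.compile(r"(?:^|-)(\d{1,2})(?:[-.](\d{1,2}))?(?=-|$)")


def parse_version(model_name: str) -> Tuple[Optional[int], int]:
    """Return (major, minor); major is None when nothing parseable is found."""
    match = _VERSION_RE.search(model_name)
    if match is None:
        return (None, 0)
    major = int(match.group(1))
    minor = int(match.group(2)) if match.group(2) else 0
    return (major, minor)


def same_generation(current: str, candidate: str) -> bool:
    """Same-major filter; an unparseable current model matches everything."""
    current_major, _ = parse_version(current)
    if current_major is None:
        return True  # degrade open for custom/local names
    return parse_version(candidate)[0] == current_major
```

Under this sketch, `same_generation("claude-opus-4-5", "claude-3-haiku-20240307")` is false, which is the gen-3 downgrade the filter exists to block.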
…x skip
End-to-end CLI smoke against the real LiteLLM catalog surfaced three
edge cases the v2 cost advisor handled wrong:
1. Bedrock cross-region: a user on bedrock/us.anthropic.claude-sonnet-4-6
was being suggested bedrock/anthropic.claude-sonnet-4-6 (regional-only).
The user picked the us.* cross-region inference profile deliberately
for failover/throughput; suggesting the regional version silently
reverts that choice.
2. OpenRouter cross-vendor: a user on openrouter/anthropic/claude-opus-4
was being suggested openrouter/z-ai/glm-4.7-flash — same litellm
provider (openrouter) but completely different upstream model vendor.
3. Marginal Bedrock pricing: us.anthropic.claude-sonnet-4-6 ($3.30/M)
was being suggested over us.anthropic.claude-sonnet-4-20250514-v1:0
($3.00/M), a date-suffixed older snapshot at 1.1× savings — too small
to be worth interrupting the user for.
Plus a fourth case the smoke confirmed needed handling: Codex users get
a CodexLM via the resolver's lm_factory hook, but the model string is
still openai/gpt-5-codex. The advisor would suggest openai/gpt-5-nano,
which implies a different auth setup (OPENAI_API_KEY) the Codex user
deliberately didn't opt in to.
Three fixes:
* _namespace(catalog_key) extracts a routing-namespace path. Slash-
segmented keys join everything before the last /; dot-segmented keys
keep the leading short-alphabetic prefix segments before the first
digit-bearing segment (model body). Required to match exactly between
current and candidate.
* _MIN_INPUT_COST_RATIO = 1.5 floor: weak suggestions (<1.5x cheaper)
are suppressed. AWS pricing-tier noise (~10%) stays out of the panel;
the 1.7x sonnet-vs-opus and 5x haiku-vs-opus suggestions still fire.
* evolve_skill + evolve_tool gate: only call the advisor when
_preflight_eval.lm_factory is None. CodexLM users see no panel.
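The namespace rule and the savings floor can be sketched as follows. This is an illustration under assumptions, not the PR's code: `namespace` and `worth_suggesting` are hypothetical names, and `_MAX_PREFIX_LEN` is an invented cutoff for what counts as a "short" alphabetic segment. Only `_MIN_INPUT_COST_RATIO = 1.5` comes from the commit message.

```python
_MIN_INPUT_COST_RATIO = 1.5  # floor named in the commit message
_MAX_PREFIX_LEN = 12         # hypothetical cutoff for "short" prefix segments


def namespace(catalog_key: str) -> str:
    """Routing namespace: everything before the last '/', plus leading
    short alphabetic dot-segments that precede the model body."""
    head, _, tail = catalog_key.rpartition("/")
    prefix = []
    # The last dot-segment is always treated as (part of) the model body.
    for segment in tail.split(".")[:-1]:
        if segment.isalpha() and len(segment) <= _MAX_PREFIX_LEN:
            prefix.append(segment)
        else:
            break  # first digit-bearing or hyphenated segment starts the body
    return "/".join(part for part in (head, ".".join(prefix)) if part)


def worth_suggesting(current_key: str, current_cost: float,
                     candidate_key: str, candidate_cost: float) -> bool:
    """Require an exact namespace match and at least a 1.5x input-cost saving."""
    if namespace(current_key) != namespace(candidate_key):
        return False
    if candidate_cost <= 0:
        return False
    return current_cost / candidate_cost >= _MIN_INPUT_COST_RATIO
```

For example, `namespace("bedrock/us.anthropic.claude-sonnet-4-6")` yields `bedrock/us.anthropic`, which differs from `bedrock/anthropic`, so the regional key is never offered to a cross-region user.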
Tests cover all three filters via fixture catalogs that mirror the real
naming patterns (bedrocktest.haiku-4-5 vs us.bedrocktest.haiku-4-5,
openroutertest/anthropic vs openroutertest/zai). Real-catalog smoke after
the fixes:
  user                                    -> suggestion
  --------------------------------------     --------------------------
  anthropic/claude-opus-4-5               -> haiku-4-5 (5x)
  anthropic/claude-opus-4-7               -> sonnet-4-6 (1.7x)
  openai/gpt-5.4-mini                     -> gpt-5-nano (15x)
  openai/gpt-4.1                          -> gpt-4.1-nano (20x)
  openrouter/anthropic/claude-opus-4      -> claude-haiku-4.5 (15x)
  bedrock/us.anthropic.claude-sonnet-4-6  -> (none: below threshold)
  bedrock/anthropic.claude-sonnet-4-6     -> (none: no cheaper)
  bedrock/eu.anthropic.claude-opus-4-7    -> eu.anthropic.sonnet-4 (1.8x)
## Summary

* A user with `model.default: claude-opus-4-5` silently inherits Opus for the eval + judge roles. Opus runs ~100× per evolution; on Opus pricing that's $10–50 per run when `claude-haiku-4-5` (5× cheaper input, same 200k context) would suffice. Today the only fix is reading `docs/model_resolution.md` end-to-end and learning the per-role override pattern.
* `evolution/core/cost_advisor.py` consults `litellm.model_cost` (a 2700+ entry curated catalog shipped with LiteLLM), enumerates same-provider candidates strictly cheaper for input tokens with at least the current context window, and applies a major-version filter, a routing-namespace match, and a minimum-savings threshold.
* In `evolve_skill` and `evolve_tool`, when `--eval-model` is unset, the resolver returned a stock LM (no Codex factory), and an alternative exists, prints a Rich panel showing cost-per-1M for both models, the cost ratio, the context window, and a paste-ready `--eval-model` flag.
* Lookup tolerates both prefixed (`openrouter/openai/gpt-5`) and bare (`claude-opus-4-5` for Anthropic-direct) catalog key shapes. Returns None for off-catalog providers and skips gracefully for models without parseable pricing.
* `--no-cost-suggest` flag suppresses the panel for users who don't want the noise.

## Filters applied to candidate suggestions

* Same provider (`litellm_provider`).
* Same major version: `claude-opus-4-5` shouldn't suggest `claude-3-haiku`.
* Same routing namespace (`_namespace()`).
* Stock LM only (`lm_factory is None`).

## What the advisor does, by current model (real LiteLLM catalog)

| Current model | Suggestion |
|---|---|
| `claude-opus-4-5` (200k) | `claude-haiku-4-5` (5× cheaper, 200k) |
| `claude-opus-4-6` / `4-7` (1M) | `claude-sonnet-4-6` (1.7× cheaper, 1M) |
| `openai/gpt-5.4-mini` (272k) | `openai/gpt-5-nano` (15× cheaper, 272k) |
| `openai/gpt-4.1` (1M) | `openai/gpt-4.1-nano` (20× cheaper, 1M) |
| `openrouter/anthropic/claude-opus-4` | `openrouter/anthropic/claude-haiku-4.5` (15×) |
| `bedrock/us.anthropic.claude-sonnet-4-6` | none; `bedrock/anthropic.claude-sonnet-4-6` (regional) is not suggested |
| Codex (`openai-codex` provider) | none; `lm_factory` set; Codex auth is a different setup |

## Out of scope

## Test plan

* Tests in `tests/core/test_cost_advisor.py` (lookup, version parsing, namespace parsing, generation filter, namespace filter, minimum-savings threshold, panel rendering) + 6 in `tests/skills/test_evolve_skill_cost_suggest.py` (panel firing rules including Codex skip).
* `gpt-5.4-mini` via OpenAI custom endpoint: cost-suggest panel renders correctly with `--eval-model openai/gpt-5-nano` (15× cheaper).
* `--no-cost-suggest` and explicit `--eval-model openai/gpt-5-nano` both correctly suppress the panel; run continues to dataset gen as expected.

## Spike notes
Two spikes shaped this PR:

1. Initial spike: validated `litellm.model_cost` shape and coverage. Direct Anthropic, OpenAI, and OpenRouter models all looked up successfully via strip-prefix fallback.
2. End-to-end CLI smoke (during PR review): surfaced four real bugs in the v2 design. `claude-opus-4-5` was suggesting `claude-3-haiku-20240307` (gen-3 downgrade); `bedrock/us.X` was suggesting `bedrock/X` (cross-region surprise); `openrouter/anthropic/X` was suggesting `openrouter/z-ai/X` (cross-vendor); and Codex users would get suggestions for OpenAI-direct models they couldn't auth to. Fixed all four in successive commits before merge.