feat: cheap-model cost advisor for the eval role by jramos · Pull Request #59 · jramos/agent-self-evolution

jramos · 2026-05-15T04:00:12Z

Summary

A Hermes user with model.default: claude-opus-4-5 silently inherits Opus for the eval + judge roles. Opus runs ~100× per evolution; on Opus pricing that's $10–50 per run when claude-haiku-4-5 (5× cheaper input, same 200k context) would suffice. Today the only fix is reading docs/model_resolution.md end-to-end and learning the per-role override pattern.
New evolution/core/cost_advisor.py consults litellm.model_cost (a 2700+ entry curated catalog shipped with LiteLLM), enumerates same-provider candidates that are strictly cheaper for input tokens and have at least the current context window, then applies a major-version filter, a routing-namespace match, and a minimum-savings threshold.
After preflight in both evolve_skill and evolve_tool, the CLI prints a Rich panel when --eval-model is unset, the resolver returned a stock LM (no Codex factory), and a cheaper alternative exists. The panel shows cost-per-1M for both models, the cost ratio, the context window, and a paste-ready --eval-model flag.
Lookup tolerates both prefixed (openrouter/openai/gpt-5) and bare (claude-opus-4-5 for Anthropic-direct) catalog key shapes. It returns None for off-catalog providers and skips gracefully for models without parseable pricing.
--no-cost-suggest flag suppresses the panel for users who don't want the noise.
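
The tolerant lookup described above can be sketched roughly as follows. This is a minimal illustration under assumptions, not the module's actual code: `lookup_catalog_entry` and `TOY_CATALOG` are hypothetical names, and the real advisor reads `litellm.model_cost` rather than a hand-built dict.

```python
from typing import Optional, Tuple

# Hypothetical stand-in for litellm.model_cost: one prefixed key, one bare key.
TOY_CATALOG = {
    "openrouter/openai/gpt-5": {"litellm_provider": "openrouter"},
    "claude-opus-4-5": {"litellm_provider": "anthropic"},
}


def lookup_catalog_entry(model: str, catalog: dict) -> Optional[Tuple[str, dict]]:
    """Try the full model string first, then strip leading "prefix/" segments.

    Returns (catalog_key, entry), or None when no key shape matches.
    """
    candidate = model
    while True:
        entry = catalog.get(candidate)
        if entry is not None:
            return candidate, entry
        if "/" not in candidate:
            return None  # off-catalog: the caller renders no panel
        # Drop the leftmost segment, e.g. "anthropic/claude-opus-4-5" -> "claude-opus-4-5".
        candidate = candidate.split("/", 1)[1]
```

With this shape, `anthropic/claude-opus-4-5` resolves to the bare `claude-opus-4-5` key, while a local-server name that matches nothing returns None instead of a guess.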

Filters applied to candidate suggestions

| Filter | What it blocks | Why |
| --- | --- | --- |
| Same `litellm_provider` | Cross-provider swaps (Anthropic-direct → OpenAI) | Different credentials, different setup |
| Strictly cheaper input cost | Equal-or-more expensive | No win to surface |
| ≥ current context window | Smaller-context models | Could silently truncate inputs |
| Same major version | Gen-3 downgrades for gen-4 users | `claude-opus-4-5` shouldn't suggest `claude-3-haiku` |
| Same `_namespace()` | Bedrock cross-region/regional swaps; OpenRouter cross-vendor (anthropic→z-ai) | User picked the routing profile deliberately |
| Cost ratio ≥ 1.5× | Marginal AWS pricing-tier savings (~10%) | Below threshold the panel becomes noise |
| `lm_factory is None` | Codex (CodexLM users) | Codex auth ≠ OPENAI_API_KEY auth |
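
A rough sketch of how the pricing-related filters in the table might compose. Everything here is illustrative: `suggest_cheaper` and `TOY_CATALOG` are hypothetical names, the model names and prices are made up, and the major-version and namespace filters are left out for brevity. Only the field names (`litellm_provider`, `input_cost_per_token`, `max_input_tokens`) and the 1.5 floor come from LiteLLM's catalog shape and this PR.

```python
from typing import Optional, Tuple

MIN_INPUT_COST_RATIO = 1.5  # savings floor quoted in this PR

# Hypothetical catalog: invented names and prices, shaped like litellm.model_cost entries.
TOY_CATALOG = {
    "toy-large": {"litellm_provider": "toyco", "input_cost_per_token": 5e-6, "max_input_tokens": 200_000},
    "toy-medium": {"litellm_provider": "toyco", "input_cost_per_token": 4e-6, "max_input_tokens": 200_000},
    "toy-small": {"litellm_provider": "toyco", "input_cost_per_token": 1e-6, "max_input_tokens": 200_000},
    "toy-tiny": {"litellm_provider": "toyco", "input_cost_per_token": 5e-7, "max_input_tokens": 100_000},
    "other-small": {"litellm_provider": "otherco", "input_cost_per_token": 1e-7, "max_input_tokens": 200_000},
}


def suggest_cheaper(current: str, catalog: dict) -> Optional[Tuple[str, float]]:
    """Return (candidate_key, cost_ratio) for the cheapest qualifying model, else None."""
    cur = catalog.get(current)
    if cur is None or not cur.get("input_cost_per_token"):
        return None  # off-catalog, or no parseable pricing
    cur_cost = cur["input_cost_per_token"]
    cur_ctx = cur.get("max_input_tokens") or 0

    best = None  # (key, cost) of the cheapest survivor so far
    for key, entry in catalog.items():
        cost = entry.get("input_cost_per_token")
        if key == current or not cost:
            continue
        if entry.get("litellm_provider") != cur.get("litellm_provider"):
            continue  # cross-provider swap
        if cost >= cur_cost:
            continue  # not strictly cheaper
        if (entry.get("max_input_tokens") or 0) < cur_ctx:
            continue  # smaller context window
        if cur_cost / cost < MIN_INPUT_COST_RATIO:
            continue  # marginal savings, below the floor
        if best is None or cost < best[1]:
            best = (key, cost)

    if best is None:
        return None
    return best[0], cur_cost / best[1]
```

In this toy catalog, `toy-large` gets `toy-small` (about 5× cheaper): `toy-medium` falls below the ratio floor, `toy-tiny` has a smaller context window, and `other-small` belongs to a different provider.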

What the advisor does, by current model (real LiteLLM catalog)

| User on | Suggests | Why |
| --- | --- | --- |
| `claude-opus-4-5` (200k) | `claude-haiku-4-5` (5× cheaper, 200k) | Same major (4); newer-minor tiebreaker |
| `claude-opus-4-6` / `4-7` (1M) | `claude-sonnet-4-6` (1.7× cheaper, 1M) | Sonnet is the only same-major model with 1M context |
| `openai/gpt-5.4-mini` (272k) | `openai/gpt-5-nano` (15× cheaper, 272k) | Same OpenAI gen-5 family |
| `openai/gpt-4.1` (1M) | `openai/gpt-4.1-nano` (20× cheaper, 1M) | Same gpt-4.1 family |
| `openrouter/anthropic/claude-opus-4` | `openrouter/anthropic/claude-haiku-4.5` (15×) | Same OpenRouter sub-vendor (anthropic) |
| `bedrock/us.anthropic.claude-sonnet-4-6` | (no panel) | Only marginal 1.1× savings available; below threshold |
| `bedrock/anthropic.claude-sonnet-4-6` (regional) | (no panel) | No same-namespace cheaper candidate |
| Codex (`openai-codex` provider) | (no panel) | `lm_factory` set; Codex auth is a different setup |
| Bedrock model not in catalog | (no panel) | Graceful skip |

Out of scope

Optimizer/reflection roles: reasoning-quality-sensitive; cheaper-model swaps risk silently degrading evolution outcomes.
Cross-provider, cross-generation, cross-region, cross-vendor suggestions: deliberately excluded — see filter table above.
Auto-applying the suggestion: only surface the panel; never silently change models.
Per-call dollar projections: call counts depend on iterations × valset × judge invocations, none stable at preflight time.
Smaller-context-window suggestions (e.g. haiku-4-5 for opus-4-7 users): would need a separate "minimum context for typical eval workloads" knob.

Test plan

50 new tests total: 45 in tests/core/test_cost_advisor.py (lookup, version parsing, namespace parsing, generation filter, namespace filter, minimum-savings threshold, panel rendering) + 6 in tests/skills/test_evolve_skill_cost_suggest.py (panel firing rules including Codex skip).
Full suite: 897 tests pass (847 baseline + 50 new), no regressions.
End-to-end real CLI smoke against the user's actual Hermes config (gpt-5.4-mini via OpenAI custom endpoint): cost-suggest panel renders correctly with --eval-model openai/gpt-5-nano (15× cheaper).
Suppression smokes: --no-cost-suggest and explicit --eval-model openai/gpt-5-nano both correctly suppress the panel; run continues to dataset gen as expected.
Real-catalog audit across 9 representative model strings (Anthropic direct, OpenAI direct, OpenRouter, Bedrock cross-region, Bedrock regional, off-catalog) confirms each filter behaves correctly.
CI green on Python 3.10/3.11/3.12/3.13.

Spike notes

Two spikes shaped this PR:

Initial spike: validated litellm.model_cost shape and coverage. Direct Anthropic, OpenAI, OpenRouter all looked up successfully via strip-prefix fallback.
End-to-end CLI smoke (during PR review): surfaced four real bugs in the v2 design — claude-opus-4-5 was suggesting claude-3-haiku-20240307 (gen-3 downgrade); bedrock/us.X was suggesting bedrock/X (cross-region surprise); openrouter/anthropic/X was suggesting openrouter/z-ai/X (cross-vendor); and Codex users would get suggestions for OpenAI-direct models they couldn't auth to. Fixed all four in successive commits before merge.

A user with Hermes set to claude-opus-4-5 silently inherits Opus for the eval + judge roles, where ~100 calls per evolution makes a 5x-cheaper sibling like claude-haiku-4-5 the obvious choice. Today they have to read docs/model_resolution.md end-to-end to discover the per-role override pattern.

Adds a small advisor module that consults litellm.model_cost (a 2700+ entry curated catalog shipped with LiteLLM), enumerates same-provider candidates that are strictly cheaper for input tokens with at least the current context window, and returns the cheapest qualifying option. Lookup tolerates both prefixed (openrouter/openai/gpt-5) and bare (claude-opus-4-5 for Anthropic-direct) catalog key shapes. Returns None for unknown models: Bedrock, Codex, and local-server endpoints aren't in the catalog and skip gracefully rather than guess.

Cross-provider swaps are intentionally excluded; staying within the user's already-configured provider keeps the suggestion paste-ready for --eval-model without a second credential setup.

Renders a Rich panel mirroring the existing operator-facing panel style, surfacing the absolute cost-per-1M for both models, the cost ratio, the context window, and a paste-ready CLI flag, plus a one-line caveat about why the optimizer/reflection roles are intentionally left out.

…suggest

Surfaces a Rich panel after the preflight probe in both evolve_skill and evolve_tool when the user inherited the eval model from Hermes (i.e., --eval-model is unset) and a strictly cheaper same-provider alternative exists in litellm.model_cost. The panel shows absolute cost-per-1M for both models, the cost ratio, the context window, and a paste-ready --eval-model flag, plus a one-line caveat that the optimizer/reflection roles are intentionally untouched (reasoning quality matters there).

Skip rules:
* --eval-model is explicit -> user already chose; respect it
* --no-cost-suggest -> explicit operator opt-out
* --dry-run -> matches existing preflight skip-on-dry-run
* Resolved model not in the LiteLLM catalog (Bedrock, Codex, local) -> advisor returns None and no panel renders

The eval-LM resolve is hoisted out of the preflight conditional so the advisor and the preflight share one ResolvedLM rather than re-walking config.yaml + auth.json. Adds ~5ms of file I/O on the --no-preflight path that previously deferred resolution to GEPA setup; the trade is worth it for cleaner control flow.

docs/model_resolution.md "Cost considerations" section gets a third bullet pointing at the advisor + the --no-cost-suggest escape hatch.
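
The skip rules read naturally as a single guard in front of the advisor call. The sketch below is an assumption about the shape of that guard, not the code in evolve_skill or evolve_tool: `should_consult_advisor` and its parameter names are hypothetical. The `lm_factory` check reflects the Codex gate described in the PR summary, and the "not in catalog" rule is not handled here because the advisor itself returns None in that case.

```python
from typing import Callable, Optional


def should_consult_advisor(
    eval_model_flag: Optional[str],
    no_cost_suggest: bool,
    dry_run: bool,
    lm_factory: Optional[Callable] = None,
) -> bool:
    """Return True only when none of the skip rules applies."""
    if eval_model_flag is not None:
        return False  # --eval-model is explicit: the user already chose
    if no_cost_suggest:
        return False  # --no-cost-suggest: explicit operator opt-out
    if dry_run:
        return False  # --dry-run: mirrors the preflight skip
    if lm_factory is not None:
        return False  # custom LM factory (e.g. Codex): different auth setup
    return True
```

Keeping the guard as a pure function of the flags makes each firing rule easy to unit-test without building a real LM.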

Spike-revealed regression: a user on claude-opus-4-5 was being suggested claude-3-haiku-20240307 (March 2024, gen 3) at 20× cheaper, because pure input-cost-sort wins over the current-gen claude-haiku-4-5 at 5× cheaper. A user pasting that flag would silently downgrade to a 20-month-old gen-3 model with noticeably weaker evaluations, the opposite of the helpful nudge the advisor is supposed to be.

Adds a (major, minor) version parser tolerant of real catalog naming:

    claude-opus-4-5           -> (4, 5)
    claude-opus-4-5-20251101  -> (4, 5)     date suffix ignored
    claude-3-opus-20240229    -> (3, 0)
    claude-4-sonnet-20250514  -> (4, 0)     digit not followed by -minor
    claude-sonnet-4-6         -> (4, 6)
    custom-local-model        -> (None, 0)  unparseable; degrades open

Filter rule: same major version as the current model. Unparseable major treated as "match all" so custom/local-server names don't silently lose all suggestions.

Sort key adds minor desc as a tiebreaker so when claude-sonnet-4-6 and claude-4-sonnet-20250514 both score $3/M with 1M context, the canonical newer-named model wins.

Effective change vs. real catalog:

    opus-4-5 -> haiku-4-5   (was: 3-haiku-20240307)   fix
    opus-4-7 -> sonnet-4-6  (was: 4-sonnet-20250514)  cleaner
    opus-4-1 -> haiku-4-5   (was: 3-haiku-20240307)   fix
    3-opus-* -> 3-haiku-*   (unchanged; gen-3 user gets gen-3)

The strict context-window filter is unchanged. opus-4-7 (1M context) users still see only sonnet-4-6 (1M, 1.7× cheaper), not haiku-4-5 (200k, 5× cheaper). Relaxing context would need a "minimum eval-workload context" knob; out of scope here.
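
One way such a parser could work is sketched below. It is an illustration under assumptions rather than the PR's implementation: `parse_version` is a hypothetical name, the token-based approach is a guess, and the handling of dotted versions such as `gpt-5.4-mini` is an extra assumption not stated in the commit message. It reproduces the six mappings listed above.

```python
import re
from typing import Optional, Tuple

# A version token is one or two digits, optionally with a dotted minor ("4", "4.1").
_VERSION_TOKEN = re.compile(r"^(\d{1,2})(?:\.(\d{1,2}))?$")


def parse_version(model: str) -> Tuple[Optional[int], int]:
    """Extract (major, minor) from a model name; (None, 0) when unparseable."""
    name = model.rsplit("/", 1)[-1]  # ignore any provider/routing prefix
    tokens = name.split("-")
    for i, tok in enumerate(tokens):
        match = _VERSION_TOKEN.match(tok)
        if not match:
            continue  # words, and 8-digit date suffixes, are not version tokens
        major = int(match.group(1))
        if match.group(2) is not None:
            return major, int(match.group(2))  # dotted form, e.g. "5.4"
        nxt = tokens[i + 1] if i + 1 < len(tokens) else ""
        if re.fullmatch(r"\d{1,2}", nxt):
            return major, int(nxt)  # hyphenated form, e.g. "4-5"
        return major, 0  # bare major, e.g. "claude-3-opus"
    return None, 0


# Tiebreak from the commit message: equal cost, newer minor wins.
tied = [("claude-4-sonnet-20250514", 3.0), ("claude-sonnet-4-6", 3.0)]
winner = min(tied, key=lambda c: (c[1], -parse_version(c[0])[1]))[0]
```

Sorting on `(cost, -minor)` leaves cost as the primary key, so the minor version only decides between candidates priced identically.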

…x skip

End-to-end CLI smoke against the real LiteLLM catalog surfaced three edge cases the v2 cost advisor handled wrong:

1. Bedrock cross-region: a user on bedrock/us.anthropic.claude-sonnet-4-6 was being suggested bedrock/anthropic.claude-sonnet-4-6 (regional-only). The user picked the us.* cross-region inference profile deliberately for failover/throughput; suggesting the regional version silently reverts that choice.
2. OpenRouter cross-vendor: a user on openrouter/anthropic/claude-opus-4 was being suggested openrouter/z-ai/glm-4.7-flash, which has the same litellm provider (openrouter) but a completely different upstream model vendor.
3. Marginal Bedrock pricing: us.anthropic.claude-sonnet-4-6 ($3.30/M) was being suggested over us.anthropic.claude-sonnet-4-20250514-v1:0 ($3.00/M), a date-suffixed older snapshot at 1.1× savings, too small to be worth interrupting the user for.

Plus a fourth case the smoke confirmed needed handling: Codex users get a CodexLM via the resolver's lm_factory hook, but the model string is still openai/gpt-5-codex. The advisor would suggest openai/gpt-5-nano, which implies a different auth setup (OPENAI_API_KEY) the Codex user deliberately didn't opt in to.

Three fixes:

* _namespace(catalog_key) extracts a routing-namespace path. Slash-segmented keys join everything before the last /; dot-segmented keys keep the leading short-alphabetic prefix segments before the first digit-bearing segment (model body). Required to match exactly between current and candidate.
* _MIN_INPUT_COST_RATIO = 1.5 floor: weak suggestions (<1.5x cheaper) are suppressed. AWS pricing-tier noise (~10%) stays out of the panel; the 1.7x sonnet-vs-opus and 5x haiku-vs-opus suggestions still fire.
* evolve_skill + evolve_tool gate: only call the advisor when _preflight_eval.lm_factory is None. CodexLM users see no panel.

Tests cover all three filters via fixture catalogs that mirror the real naming patterns (bedrocktest.haiku-4-5 vs us.bedrocktest.haiku-4-5, openroutertest/anthropic vs openroutertest/zai).

Real-catalog smoke after the fixes:

    user                                      -> suggestion
    ----------------------------------------     -------------
    anthropic/claude-opus-4-5                 -> haiku-4-5 (5x)
    anthropic/claude-opus-4-7                 -> sonnet-4-6 (1.7x)
    openai/gpt-5.4-mini                       -> gpt-5-nano (15x)
    openai/gpt-4.1                            -> gpt-4.1-nano (20x)
    openrouter/anthropic/claude-opus-4        -> claude-haiku-4.5 (15x)
    bedrock/us.anthropic.claude-sonnet-4-6    -> (none; below threshold)
    bedrock/anthropic.claude-sonnet-4-6       -> (none; no cheaper)
    bedrock/eu.anthropic.claude-opus-4-7      -> eu.anthropic.sonnet-4 (1.8x)
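
The namespace rule is easiest to follow with a concrete sketch. The version below is one reading of the description above, not the PR's `_namespace()`: the function name `namespace`, the way the slash and dot parts are combined, and the interpretation of "short-alphabetic" as "purely alphabetic" are all assumptions.

```python
def namespace(catalog_key: str) -> str:
    """Derive a routing namespace that current and candidate keys must share."""
    # Slash part: everything before the last "/" ("" for bare keys).
    head, _, tail = catalog_key.rpartition("/")

    # Dot part: leading purely alphabetic segments ahead of the model body.
    # The final segment is always the model body, so it is never a prefix.
    prefix = []
    for segment in tail.split(".")[:-1]:
        if not segment.isalpha():
            break  # digit- or hyphen-bearing segment: the model body starts here
        prefix.append(segment)

    parts = [p for p in (head, ".".join(prefix)) if p]
    return "/".join(parts)


def same_namespace(current_key: str, candidate_key: str) -> bool:
    """Exact-match rule between the current model and a candidate."""
    return namespace(current_key) == namespace(candidate_key)
```

Under this reading, `bedrock/us.anthropic.claude-sonnet-4-6` maps to `bedrock/us.anthropic` while the regional `bedrock/anthropic.claude-sonnet-4-6` maps to `bedrock/anthropic`, so the cross-region swap is rejected; `openrouter/anthropic/...` and `openrouter/z-ai/...` differ in the same way.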

jramos added 4 commits May 14, 2026 21:49

jramos merged commit 093e566 into main May 15, 2026
4 checks passed

jramos deleted the feat/cheap-model-cost-advisor branch May 15, 2026 13:44
