feat: cheap-model cost advisor for the eval role#59

Merged
jramos merged 4 commits into main from feat/cheap-model-cost-advisor
May 15, 2026
@jramos jramos commented May 15, 2026

Summary

  • A Hermes user with model.default: claude-opus-4-5 silently inherits Opus for the eval + judge roles. Opus runs ~100× per evolution; on Opus pricing that's $10–50 per run when claude-haiku-4-5 (5× cheaper input, same 200k context) would suffice. Today the only fix is reading docs/model_resolution.md end-to-end and learning the per-role override pattern.
  • New evolution/core/cost_advisor.py consults litellm.model_cost (a 2700+ entry curated catalog shipped with LiteLLM), enumerates same-provider candidates that are strictly cheaper for input tokens and have at least the current context window, and then applies a major-version filter, a routing-namespace match, and a minimum-savings threshold.
  • After preflight in both evolve_skill and evolve_tool, the advisor prints a Rich panel when --eval-model is unset, the resolver returned a stock LM (no Codex factory), and a cheaper alternative exists. The panel shows cost-per-1M for both models, the cost ratio, the context window, and a paste-ready --eval-model flag.
  • Lookup tolerates both prefixed (openrouter/openai/gpt-5) and bare (claude-opus-4-5 for Anthropic-direct) catalog key shapes. It returns None for off-catalog providers and skips gracefully for models without parseable pricing.
  • --no-cost-suggest flag suppresses the panel for users who don't want the noise.
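
The prefixed-or-bare lookup described above can be sketched roughly as follows. This is a minimal illustration, not the code in cost_advisor.py: `lookup_catalog_key` is a hypothetical name, the one-level prefix strip is an assumption about how the fallback works, and `CATALOG` is a small hand-written dict shaped like `litellm.model_cost` with illustrative prices.

```python
from typing import Optional

# Hypothetical miniature catalog shaped like litellm.model_cost.
# Keys and prices are illustrative, not copied from the real catalog.
CATALOG = {
    "claude-opus-4-5": {
        "litellm_provider": "anthropic",
        "input_cost_per_token": 5e-06,
        "max_input_tokens": 200_000,
    },
    "openrouter/openai/gpt-5": {
        "litellm_provider": "openrouter",
        "input_cost_per_token": 1.25e-06,
        "max_input_tokens": 272_000,
    },
    # Entry without pricing: should be skipped rather than guessed at.
    "no-price-model": {"litellm_provider": "anthropic"},
}


def lookup_catalog_key(model: str, catalog: dict) -> Optional[str]:
    """Return the catalog key for `model`, or None when it cannot be priced."""
    # Try the full string first, then the string with its first prefix removed.
    candidates = [model]
    if "/" in model:
        candidates.append(model.split("/", 1)[1])
    for key in candidates:
        entry = catalog.get(key)
        if entry is None:
            continue
        cost = entry.get("input_cost_per_token")
        # Reject missing or non-numeric pricing instead of raising.
        if not isinstance(cost, (int, float)) or cost <= 0:
            return None
        return key
    return None
```

With this shape, `anthropic/claude-opus-4-5` resolves to the bare `claude-opus-4-5` key, while an off-catalog string such as `ollama/llama3` yields None and no panel would render.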

Filters applied to candidate suggestions

| Filter | What it blocks | Why |
| --- | --- | --- |
| Same litellm_provider | Cross-provider swaps (Anthropic-direct → OpenAI) | Different credentials, different setup |
| Strictly cheaper input cost | Equal-or-more expensive | No win to surface |
| ≥ current context window | Smaller-context models | Could silently truncate inputs |
| Same major version | Gen-3 downgrades for gen-4 users | claude-opus-4-5 shouldn't suggest claude-3-haiku |
| Same _namespace() | Bedrock cross-region/regional swaps; OpenRouter cross-vendor (anthropic→z-ai) | User picked the routing profile deliberately |
| Cost ratio ≥ 1.5× | Marginal AWS pricing-tier savings (~10%) | Below threshold the panel becomes noise |
| lm_factory is None | Codex (CodexLM users) | Codex auth ≠ OPENAI_API_KEY auth |
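
The provider, cost, context-window, and ratio rows of the table can be sketched as one predicate. This is only an illustration under stated assumptions: `Entry` and `passes_basic_filters` are hypothetical names, the prices are illustrative, and the major-version, namespace, and lm_factory checks are left out to keep the sketch short.

```python
from dataclasses import dataclass

MIN_INPUT_COST_RATIO = 1.5  # mirrors the 1.5x threshold row in the table


@dataclass(frozen=True)
class Entry:
    """Hypothetical view of one catalog entry (field names are assumptions)."""
    key: str
    provider: str
    input_cost_per_token: float
    max_input_tokens: int


def passes_basic_filters(current: Entry, cand: Entry) -> bool:
    """True when `cand` is worth suggesting in place of `current`."""
    if cand.provider != current.provider:
        return False  # cross-provider swap: different credentials
    if cand.input_cost_per_token <= 0:
        return False  # assumption: treat non-positive pricing as unusable
    if cand.input_cost_per_token >= current.input_cost_per_token:
        return False  # not strictly cheaper
    if cand.max_input_tokens < current.max_input_tokens:
        return False  # smaller context could truncate inputs
    ratio = current.input_cost_per_token / cand.input_cost_per_token
    return ratio >= MIN_INPUT_COST_RATIO  # suppress marginal savings


opus = Entry("claude-opus-4-5", "anthropic", 5e-06, 200_000)
haiku = Entry("claude-haiku-4-5", "anthropic", 1e-06, 200_000)
pricey = Entry("tier-a", "bedrock", 3.3e-06, 1_000_000)
cheaper = Entry("tier-b", "bedrock", 3.0e-06, 1_000_000)
```

Here the 5x saving from `opus` to `haiku` passes, while the roughly 1.1x saving from `pricey` to `cheaper` falls below the threshold.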

What the advisor does, by current model (real LiteLLM catalog)

User on Suggests Why
claude-opus-4-5 (200k) claude-haiku-4-5 (5× cheaper, 200k) Same major (4); newer-minor tiebreaker
claude-opus-4-6 / 4-7 (1M) claude-sonnet-4-6 (1.7× cheaper, 1M) Sonnet is the only same-major model with 1M context
openai/gpt-5.4-mini (272k) openai/gpt-5-nano (15× cheaper, 272k) Same OpenAI gen-5 family
openai/gpt-4.1 (1M) openai/gpt-4.1-nano (20× cheaper, 1M) Same gpt-4.1 family
openrouter/anthropic/claude-opus-4 openrouter/anthropic/claude-haiku-4.5 (15×) Same OpenRouter sub-vendor (anthropic)
bedrock/us.anthropic.claude-sonnet-4-6 (no panel) Only marginal 1.1× savings available; below threshold
bedrock/anthropic.claude-sonnet-4-6 (regional) (no panel) No same-namespace cheaper candidate
Codex (openai-codex provider) (no panel) lm_factory set; Codex auth is a different setup
Bedrock model not in catalog (no panel) Graceful skip

Out of scope

  • Optimizer/reflection roles: reasoning-quality-sensitive; cheaper-model swaps risk silently degrading evolution outcomes.
  • Cross-provider, cross-generation, cross-region, cross-vendor suggestions: deliberately excluded — see filter table above.
  • Auto-applying the suggestion: only surface the panel; never silently change models.
  • Per-call dollar projections: call counts depend on iterations × valset × judge invocations, none stable at preflight time.
  • Smaller-context-window suggestions (e.g. haiku-4-5 for opus-4-7 users): would need a separate "minimum context for typical eval workloads" knob.

Test plan

  • 50 new tests total: 45 in tests/core/test_cost_advisor.py (lookup, version parsing, namespace parsing, generation filter, namespace filter, minimum-savings threshold, panel rendering) + 6 in tests/skills/test_evolve_skill_cost_suggest.py (panel firing rules including Codex skip).
  • Full suite: 897 tests pass (847 baseline + 50 new), no regressions.
  • End-to-end real CLI smoke against the user's actual Hermes config (gpt-5.4-mini via OpenAI custom endpoint): cost-suggest panel renders correctly with --eval-model openai/gpt-5-nano (15× cheaper).
  • Suppression smokes: --no-cost-suggest and explicit --eval-model openai/gpt-5-nano both correctly suppress the panel; run continues to dataset gen as expected.
  • Real-catalog audit across 9 representative model strings (Anthropic direct, OpenAI direct, OpenRouter, Bedrock cross-region, Bedrock regional, off-catalog) confirms each filter behaves correctly.
  • CI green on Python 3.10/3.11/3.12/3.13.

Spike notes

Two spikes shaped this PR:

  1. Initial spike: validated litellm.model_cost shape and coverage. Direct Anthropic, OpenAI, OpenRouter all looked up successfully via strip-prefix fallback.

  2. End-to-end CLI smoke (during PR review): surfaced four real bugs in the v2 design — claude-opus-4-5 was suggesting claude-3-haiku-20240307 (gen-3 downgrade); bedrock/us.X was suggesting bedrock/X (cross-region surprise); openrouter/anthropic/X was suggesting openrouter/z-ai/X (cross-vendor); and Codex users would get suggestions for OpenAI-direct models they couldn't auth to. Fixed all four in successive commits before merge.

jramos added 4 commits May 14, 2026 21:49
A user with Hermes set to claude-opus-4-5 silently inherits Opus for the
eval + judge roles, where ~100 calls per evolution makes a 5x-cheaper
sibling like claude-haiku-4-5 the obvious choice. Today they have to
read docs/model_resolution.md end-to-end to discover the per-role
override pattern.

Adds a small advisor module that consults litellm.model_cost (a 2700+
entry curated catalog shipped with LiteLLM), enumerates same-provider
candidates that are strictly cheaper for input tokens with at least the
current context window, and returns the cheapest qualifying option.
Lookup tolerates both prefixed (openrouter/openai/gpt-5) and bare
(claude-opus-4-5 for Anthropic-direct) catalog key shapes.

Returns None for unknown models — Bedrock, Codex, and local-server
endpoints aren't in the catalog and skip gracefully rather than guess.
Cross-provider swaps are intentionally excluded; staying within the
user's already-configured provider keeps the suggestion paste-ready
for --eval-model without a second credential setup.

Renders a Rich panel mirroring the existing operator-facing panel
style, surfacing the absolute cost-per-1M for both models, the cost
ratio, the context window, and a paste-ready CLI flag — plus a one-
line caveat about why the optimizer/reflection roles are intentionally
left out.
…suggest

Surfaces a Rich panel after the preflight probe in both evolve_skill and
evolve_tool when the user inherited the eval model from Hermes (i.e.,
--eval-model is unset) and a strictly cheaper same-provider alternative
exists in litellm.model_cost. The panel shows absolute cost-per-1M for
both models, the cost ratio, the context window, and a paste-ready
--eval-model flag — plus a one-line caveat that the optimizer/reflection
roles are intentionally untouched (reasoning quality matters there).

Skip rules:
  * --eval-model is explicit  -> user already chose; respect it
  * --no-cost-suggest          -> explicit operator opt-out
  * --dry-run                  -> matches existing preflight skip-on-dry-run
  * Resolved model not in the LiteLLM catalog (Bedrock, Codex, local) ->
    advisor returns None and no panel renders
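
The skip rules above amount to a short gate in front of the advisor call. A minimal sketch, assuming hypothetical parameter names (`eval_model`, `no_cost_suggest`, `dry_run`) and a stand-in `advisor` callable that returns None for off-catalog models; the real CLI wiring may differ:

```python
from typing import Callable, Optional


def maybe_suggest(
    *,
    eval_model: Optional[str],
    no_cost_suggest: bool,
    dry_run: bool,
    resolved_model: str,
    advisor: Callable[[str], Optional[str]],
) -> Optional[str]:
    """Return a suggested model, or None when any skip rule applies."""
    if eval_model is not None:
        return None  # user already chose an eval model; respect it
    if no_cost_suggest:
        return None  # explicit operator opt-out
    if dry_run:
        return None  # matches the preflight skip-on-dry-run behaviour
    # The advisor itself returns None for models it cannot find or price.
    return advisor(resolved_model)


# Stand-in advisor for illustration only: a dict lookup.
fake_advisor = {"claude-opus-4-5": "claude-haiku-4-5"}.get
```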

The eval-LM resolve is hoisted out of the preflight conditional so the
advisor and the preflight share one ResolvedLM rather than re-walking
config.yaml + auth.json. Adds ~5ms of file I/O on the --no-preflight path
that previously deferred resolution to GEPA setup; trade is worth it for
cleaner control flow.

docs/model_resolution.md "Cost considerations" section gets a third
bullet pointing at the advisor + the --no-cost-suggest escape hatch.

Spike-revealed regression: a user on claude-opus-4-5 was being suggested
claude-3-haiku-20240307 (March 2024, gen 3) at 20× cheaper, because pure
input-cost-sort wins over the current-gen claude-haiku-4-5 at 5× cheaper.
A user pasting that flag would silently downgrade to a 20-month-old gen-3
model with noticeably weaker evaluations — the opposite of the helpful
nudge the advisor is supposed to be.

Adds a (major, minor) version parser tolerant of real catalog naming:
  claude-opus-4-5         -> (4, 5)
  claude-opus-4-5-20251101 -> (4, 5)  date suffix ignored
  claude-3-opus-20240229  -> (3, 0)
  claude-4-sonnet-20250514 -> (4, 0)  digit not followed by -minor
  claude-sonnet-4-6       -> (4, 6)
  custom-local-model      -> (None, 0)  unparseable; degrades open
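
One way to get the mappings listed above is a single regular expression. This is a sketch, not the parser in the PR: `parse_version` and the pattern are assumptions, chosen so that a one- or two-digit token is read as the major version, an immediately following one- or two-digit token as the minor, and longer digit runs (date suffixes) are ignored.

```python
import re
from typing import Optional, Tuple

# Major: 1-2 digits not preceded by a word character or dot.
# Minor: optional 1-2 digits after "-" or ".", not followed by more digits
# (so "-20251101" is never mistaken for a minor version).
_VERSION_RE = re.compile(r"(?<![\w.])(\d{1,2})(?:[-.](\d{1,2}))?(?!\d)")


def parse_version(name: str) -> Tuple[Optional[int], int]:
    """Return (major, minor); major is None when nothing parseable is found."""
    match = _VERSION_RE.search(name)
    if match is None:
        return (None, 0)  # unparseable; callers can treat this as "match all"
    major = int(match.group(1))
    minor = int(match.group(2)) if match.group(2) is not None else 0
    return (major, minor)
```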

Filter rule: same major version as the current model. Unparseable major
treated as "match all" so custom/local-server names don't silently lose
all suggestions. Sort key adds minor desc as a tiebreaker so when
claude-sonnet-4-6 and claude-4-sonnet-20250514 both score $3/M with 1M
context, the canonical newer-named model wins.
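
The tiebreaker can be illustrated with a two-part sort key: input cost ascending, then minor version descending. A minimal sketch with hypothetical dict fields and a hypothetical `pick_best` helper; the prices follow the $3/M example in the text:

```python
# Two candidates at the same input price; only the minor version differs.
candidates = [
    {"key": "claude-4-sonnet-20250514", "input_cost_per_token": 3e-06, "minor": 0},
    {"key": "claude-sonnet-4-6", "input_cost_per_token": 3e-06, "minor": 6},
]


def pick_best(cands: list) -> dict:
    """Cheapest input cost wins; on a tie, the higher minor version wins."""
    return min(cands, key=lambda c: (c["input_cost_per_token"], -c["minor"]))


best = pick_best(candidates)
```

Negating the minor keeps everything in one `min` call, so price still dominates and the version only matters when prices are equal.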

Effective change vs. real catalog:
  opus-4-5 -> haiku-4-5      (was: 3-haiku-20240307)  fix
  opus-4-7 -> sonnet-4-6     (was: 4-sonnet-20250514) cleaner
  opus-4-1 -> haiku-4-5      (was: 3-haiku-20240307)  fix
  3-opus-* -> 3-haiku-*      (unchanged — gen-3 user gets gen-3)

The strict context-window filter is unchanged. opus-4-7 (1M context)
users still see only sonnet-4-6 (1M, 1.7× cheaper), not haiku-4-5
(200k, 5× cheaper). Relaxing context would need a "minimum eval-workload
context" knob; out of scope here.
…x skip

End-to-end CLI smoke against the real LiteLLM catalog surfaced three
edge cases the v2 cost advisor handled wrong:

1. Bedrock cross-region: a user on bedrock/us.anthropic.claude-sonnet-4-6
   was being suggested bedrock/anthropic.claude-sonnet-4-6 (regional-only).
   The user picked the us.* cross-region inference profile deliberately
   for failover/throughput; suggesting the regional version silently
   reverts that choice.

2. OpenRouter cross-vendor: a user on openrouter/anthropic/claude-opus-4
   was being suggested openrouter/z-ai/glm-4.7-flash — same litellm
   provider (openrouter) but completely different upstream model vendor.

3. Marginal Bedrock pricing: us.anthropic.claude-sonnet-4-6 ($3.30/M)
   was being suggested over us.anthropic.claude-sonnet-4-20250514-v1:0
   ($3.00/M), a date-suffixed older snapshot at 1.1× savings — too small
   to be worth interrupting the user for.

Plus a fourth case the smoke confirmed needed handling: Codex users get
a CodexLM via the resolver's lm_factory hook, but the model string is
still openai/gpt-5-codex. The advisor would suggest openai/gpt-5-nano,
which implies a different auth setup (OPENAI_API_KEY) the Codex user
deliberately didn't opt in to.

Three fixes:

  * _namespace(catalog_key) extracts a routing-namespace path. Slash-
    segmented keys join everything before the last /; dot-segmented keys
    keep the leading short-alphabetic prefix segments before the first
    digit-bearing segment (model body). Required to match exactly between
    current and candidate.

  * _MIN_INPUT_COST_RATIO = 1.5 floor: weak suggestions (<1.5x cheaper)
    are suppressed. AWS pricing-tier noise (~10%) stays out of the panel;
    the 1.7x sonnet-vs-opus and 5x haiku-vs-opus suggestions still fire.

  * evolve_skill + evolve_tool gate: only call the advisor when
    _preflight_eval.lm_factory is None. CodexLM users see no panel.
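
The namespace rule in the first fix can be read roughly as below. This is a sketch under assumptions, not the PR's `_namespace`: the name `namespace_of` is hypothetical, the result is returned as a (slash part, dot part) tuple for clarity, and since the PR does not state what counts as a "short" segment, purely alphabetic segments are treated as prefix segments.

```python
from typing import Tuple


def namespace_of(catalog_key: str) -> Tuple[str, str]:
    """Return (slash_prefix, dot_prefix) describing the routing namespace."""
    # Slash-segmented keys: everything before the last "/" is namespace.
    slash_prefix, _, tail = catalog_key.rpartition("/")
    # Dot-segmented tails: keep leading alphabetic segments and stop at the
    # first segment containing a digit or hyphen (assumed to be the model body).
    segments = tail.split(".")
    dot_parts = []
    for segment in segments[:-1]:  # the final segment is always model body
        if not segment.isalpha():
            break
        dot_parts.append(segment)
    return (slash_prefix, ".".join(dot_parts))


def same_namespace(current_key: str, candidate_key: str) -> bool:
    """Candidates must match the current model's namespace exactly."""
    return namespace_of(current_key) == namespace_of(candidate_key)
```

Under this reading, `us.anthropic.claude-sonnet-4-6` and `anthropic.claude-sonnet-4-6` land in different namespaces, as do `openrouter/anthropic/...` and `openrouter/z-ai/...` keys.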

Tests cover all three filters via fixture catalogs that mirror the real
naming patterns (bedrocktest.haiku-4-5 vs us.bedrocktest.haiku-4-5,
openroutertest/anthropic vs openroutertest/zai). Real-catalog smoke after
the fixes:

  user                                          -> suggestion
  ---------------------------------------------- -------------
  anthropic/claude-opus-4-5                     -> haiku-4-5 (5x)
  anthropic/claude-opus-4-7                     -> sonnet-4-6 (1.7x)
  openai/gpt-5.4-mini                           -> gpt-5-nano (15x)
  openai/gpt-4.1                                -> gpt-4.1-nano (20x)
  openrouter/anthropic/claude-opus-4            -> claude-haiku-4.5 (15x)
  bedrock/us.anthropic.claude-sonnet-4-6        -> (none — below threshold)
  bedrock/anthropic.claude-sonnet-4-6           -> (none — no cheaper)
  bedrock/eu.anthropic.claude-opus-4-7          -> eu.anthropic.sonnet-4 (1.8x)
@jramos jramos merged commit 093e566 into main May 15, 2026
4 checks passed
@jramos jramos deleted the feat/cheap-model-cost-advisor branch May 15, 2026 13:44