
Feature Request: Automatic fallback to a larger context window model when token limit is exceeded #2256

@schakraborty-staclline

Description

Problem Statement

When an agent's conversation, tool outputs, or retrieved context grows beyond the configured model's maximum token/context window, the SDK currently fails (or truncates) the request rather than gracefully recovering. This is a frequent pain point for:

  • Long-running agents,
  • Agents that perform multi-step tool use with verbose tool outputs,
  • Agents that ingest large documents/RAG chunks, and
  • Agents running in production where input size is hard to predict in advance.

Today users have to either:

  • Pre-emptively pick the largest (and most expensive) model for every request, even when it's unnecessary, or
  • Implement their own custom retry/escalation logic outside the SDK, or
  • Lose the conversation when a ContextWindowExceededError (or equivalent provider error) is raised.

There is no first-class, configurable way to say "if we hit the token limit on model A, automatically retry the request on model B which has a larger context window."

Proposed Solution

Add a built-in fallback mechanism that allows users to configure an ordered list of models, where the SDK automatically escalates to the next model in the list when a token-limit / context-window error is encountered.

Proposed API (illustrative):

from strands import Agent
from strands.models import BedrockModel

primary = BedrockModel(model_id="anthropic.claude-4-5-haiku")    # ~200k ctx, cheap
fallback = BedrockModel(model_id="anthropic.claude-sonnet-4.6")  # ~1M ctx, larger

agent = Agent(
    model=primary,
    fallback_models=[fallback],            # ordered list, tried in sequence
    fallback_on=["context_window_exceeded", "input_too_long"],
    fallback_strategy="on_token_limit",   # or "on_any_error"
)

Expected behavior:

  1. The SDK sends the request to the primary model.
  2. If the provider returns a context-window / token-limit error (e.g. ContextWindowExceededError, HTTP 400 with input too long, Bedrock ValidationException for token count, etc.), the SDK transparently retries the same request against the next model in fallback_models.
  3. Conversation state, tool definitions, system prompt, and message history are preserved across the fallback so the user/agent loop is uninterrupted.
  4. The fallback is logged and surfaced via a callback/event (e.g. on_model_fallback) so users can observe and meter costs.
  5. If all fallback models also fail, the original exception is re-raised.
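The five steps above could be sketched as a small escalation loop. Everything here is illustrative: `FallbackChain`, `invoke`, and the callable-model interface are hypothetical stand-ins, not the actual strands SDK API.

```python
class ContextWindowExceededError(Exception):
    """Normalized token-limit error raised by a provider adapter (step 2)."""

class FallbackChain:
    """Tries each model in order, escalating on context-window errors."""

    def __init__(self, primary, fallbacks, on_fallback=None):
        self.models = [primary, *fallbacks]
        self.on_fallback = on_fallback  # observer hook (step 4), e.g. cost metering

    def invoke(self, messages):
        first_error = None
        for i, model in enumerate(self.models):
            try:
                # Same messages, tools, and system prompt every attempt (step 3).
                return model(messages)
            except ContextWindowExceededError as exc:
                first_error = first_error or exc
                if self.on_fallback and i + 1 < len(self.models):
                    self.on_fallback(self.models[i], self.models[i + 1], exc)
        raise first_error  # step 5: every model failed, re-raise the original
```

Note that the chain deliberately re-raises the *first* exception when everything fails, matching step 5, and only catches the normalized token-limit error so unrelated provider failures propagate immediately.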

Additional configuration knobs to consider:

  • max_fallback_attempts (default: len(fallback_models))
  • Per-model overrides for temperature, max_tokens, etc.
  • A pre-flight token estimator that proactively picks the smallest model whose context window fits the estimated input (optional optimization).
  • Compatibility with streaming responses (fallback should occur before any tokens have been streamed to the caller).
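The pre-flight estimator knob could look something like the sketch below. The 4-characters-per-token heuristic and the `context_window` attribute are assumptions for illustration; a real implementation would use the provider's tokenizer.

```python
def estimate_tokens(text: str) -> int:
    """Crude heuristic: roughly 4 characters per token for English text."""
    return max(1, len(text) // 4)

def pick_model(models, prompt: str, reserve: int = 1024):
    """Return the first (cheapest) model whose window fits the estimate.

    `models` is ordered cheapest-first; `reserve` leaves headroom for the
    model's output tokens.
    """
    needed = estimate_tokens(prompt) + reserve
    for model in models:
        if model.context_window >= needed:
            return model
    return models[-1]  # nothing fits: fall back to the largest window anyway
```

Because the estimate is approximate, this would be an optimization layered on top of the reactive fallback, not a replacement for it.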

Use Case

  1. Long-running conversational agents — A support agent accumulates dozens of turns plus tool outputs. After a few hours it exceeds the 200k context of the cheaper model. Today the agent crashes; with this feature it would seamlessly continue on a 1M-context model.

  2. RAG / document analysis — An agent retrieves a variable number of chunks. Most queries fit in a small model, but occasionally a query pulls in a very large document. Users want to default to the cheap model and only pay for the large model when truly needed.

  3. Multi-step tool use with verbose outputs — Tools like web scrapers, code interpreters, or DB queries can return large payloads. A run that normally fits sometimes blows past the limit; the agent should not die mid-task.

  4. Cost optimization — Teams want to use the cheapest viable model per request rather than always provisioning for worst-case context size. Automatic escalation gives them "pay only when needed" behavior.

  5. Provider/region resilience — The same mechanism could be reused to fall back across providers (e.g. Bedrock → Anthropic API) when a model isn't available in a given region, though the primary motivation here is the token-limit case.

Alternative Solutions

  • Sliding-window / message truncation: The SDK could (or already does) drop oldest messages when the context fills up. This is lossy and hurts agent quality, especially for agents that depend on earlier turns or tool results. It's a useful complement but not a replacement for fallback.
  • Summarization / conversation compaction: Compressing history into a summary before retrying. Higher quality than truncation but adds latency, cost, and a non-trivial failure mode of its own. Could be offered alongside fallback (e.g. fallback_strategy=["summarize", "escalate_model"]).
  • Always use the largest model: Simple, but wastes money and latency on the 95% of requests that fit comfortably in a smaller model.
  • User-implemented try/except wrapper: Works today, but every user re-implements the same logic. Making it first-class in the SDK avoids duplication and ensures correct handling of streaming, tool-use loops, and conversation state.
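For context, the last bullet's hand-rolled wrapper typically looks like the following. Names and the string-matching predicate are illustrative; this is exactly the boilerplate a first-class feature would subsume.

```python
def call_with_escalation(models, prompt, is_token_error):
    """Manually escalate through `models` on token-limit errors only."""
    for i, model in enumerate(models):
        try:
            return model(prompt)
        except Exception as exc:
            # Re-raise unrelated errors immediately, and the final
            # model's error too, since there is nothing left to try.
            if not is_token_error(exc) or i == len(models) - 1:
                raise
```

Every team ends up writing a variant of this, and each variant must separately handle streaming, tool-use loops, and conversation state, which is the duplication the proposal removes.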

Additional Context

Open questions / design considerations:

  • How should fallback interact with structured output / tool-use loops mid-iteration? Ideally the fallback applies to the failing model call only, and the agent loop continues unchanged.
  • Should fallback be configurable globally on the Agent, per-call, or both?
  • Naming of the error types — the SDK should normalize provider-specific token-limit errors into a single ContextWindowExceededError (if not already) so users can target it cleanly in fallback_on.
  • Telemetry: emit a metric/event on every fallback so users can monitor how often it triggers and tune their primary model choice.
  • Documentation: a short cookbook entry showing a Haiku → Sonnet (or Sonnet → 1M-context Sonnet) escalation would make this very approachable.
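The error-normalization point above could be implemented roughly as below. The matching substrings are taken from the provider errors listed under "Related references"; the exact payload shapes each provider returns are assumptions.

```python
class ContextWindowExceededError(Exception):
    """Single normalized error type users can target in fallback_on."""

    def __init__(self, provider: str, original: Exception):
        super().__init__(f"{provider}: context window exceeded")
        self.provider = provider
        self.original = original  # keep the raw error for debugging

# Lowercase substrings that identify a token-limit failure per provider.
_TOKEN_LIMIT_MARKERS = (
    "context_length_exceeded",  # OpenAI
    "input is too long",        # Anthropic invalid_request_error
    "input token count",        # Bedrock ValidationException
)

def normalize_error(provider: str, exc: Exception) -> Exception:
    """Wrap token-limit errors in the normalized type; pass others through."""
    msg = str(exc).lower()
    if any(marker in msg for marker in _TOKEN_LIMIT_MARKERS):
        return ContextWindowExceededError(provider, exc)
    return exc
```

Substring matching is brittle; a production version would likely inspect provider error codes rather than messages, but the normalization boundary itself is the design point.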

Related references:

  • Bedrock ValidationException for input token count
  • Anthropic API invalid_request_error with input is too long
  • OpenAI context_length_exceeded

Happy to contribute a PR if the maintainers are open to this direction — would appreciate guidance on the preferred API shape before implementation.
