Problem Statement
When an agent's conversation, tool outputs, or retrieved context grows beyond the configured model's maximum token/context window, the SDK currently fails (or truncates) the request rather than gracefully recovering. This is a frequent pain point for long-running agents, agents that perform multi-step tool use with verbose tool outputs, agents that ingest large documents/RAG chunks, and agents running in production where input size is hard to predict in advance.
Today users have to either:
- Pre-emptively pick the largest (and most expensive) model for every request, even when it's unnecessary, or
- Implement their own custom retry/escalation logic outside the SDK, or
- Lose the conversation when a ContextWindowExceededError (or equivalent provider error) is raised.
There is no first-class, configurable way to say "if we hit the token limit on model A, automatically retry the request on model B which has a larger context window."
Proposed Solution
Add a built-in fallback mechanism that allows users to configure an ordered list of models, where the SDK automatically escalates to the next model in the list when a token-limit / context-window error is encountered.
Proposed API (illustrative):
from strands import Agent
from strands.models import BedrockModel, AnthropicModel
primary = BedrockModel(model_id="anthropic.claude-4-5-haiku") # ~200k ctx, cheap
fallback = BedrockModel(model_id="anthropic.claude-sonnet-4.6") # ~1M ctx, larger
agent = Agent(
    model=primary,
    fallback_models=[fallback],  # ordered list, tried in sequence
    fallback_on=["context_window_exceeded", "input_too_long"],
    fallback_strategy="on_token_limit",  # or "on_any_error"
)
Expected behavior (a rough sketch of the escalation loop follows this list):
- The SDK sends the request to the primary model.
- If the provider returns a context-window / token-limit error (e.g. ContextWindowExceededError, HTTP 400 with input too long, Bedrock ValidationException for token count, etc.), the SDK transparently retries the same request against the next model in fallback_models.
- Conversation state, tool definitions, system prompt, and message history are preserved across the fallback so the user/agent loop is uninterrupted.
- The fallback is logged and surfaced via a callback/event (e.g. on_model_fallback) so users can observe and meter costs.
- If all fallback models also fail, the original exception is re-raised.
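For concreteness, here is a minimal sketch of what the escalation loop could look like, assuming a normalized ContextWindowExceededError and an illustrative model.converse() call (neither name is necessarily the SDK's real internal API):

class ContextWindowExceededError(Exception):
    """Normalized token-limit error (assumed name; see Additional Context below)."""

def converse_with_fallback(models, messages, on_model_fallback=None):
    """Try each configured model in order; escalate only on context-window errors."""
    first_error = None
    for index, model in enumerate(models):
        try:
            # The same messages, tools, and system prompt are re-sent unchanged.
            return model.converse(messages)
        except ContextWindowExceededError as error:
            first_error = first_error or error
            if on_model_fallback and index + 1 < len(models):
                # Surface the escalation so callers can observe it and meter cost.
                on_model_fallback(model, models[index + 1], error)
    # Every model in the chain hit its limit: re-raise the original failure.
    raise first_error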
Additional configuration knobs to consider:
- max_fallback_attempts (default: len(fallback_models))
- Per-model overrides for temperature, max_tokens, etc.
- A pre-flight token estimator that proactively picks the smallest model whose context window fits the estimated input (optional optimization; see the sketch after this list).
- Compatibility with streaming responses (fallback should occur before any tokens have been streamed to the caller).
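As a rough illustration of the pre-flight idea, assuming models are ordered cheapest-first, that each model is annotated with a context_window size, and that some estimate_tokens() helper is available (none of these are existing SDK APIs):

def pick_model(models, messages, estimate_tokens, headroom=0.9):
    """Pick the first (cheapest) model whose context window fits the estimated input."""
    estimated = estimate_tokens(messages)
    for model in models:  # assumed ordered from smallest/cheapest to largest
        if estimated <= model.context_window * headroom:
            return model
    # Nothing fits even with headroom: return the largest model and let the
    # error-based fallback (or summarization) handle the remainder.
    return models[-1]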
Use Case
- Long-running conversational agents — A support agent accumulates dozens of turns plus tool outputs. After a few hours it exceeds the 200k context of the cheaper model. Today the agent crashes; with this feature it would seamlessly continue on a 1M-context model.
- RAG / document analysis — An agent retrieves a variable number of chunks. Most queries fit in a small model, but occasionally a query pulls in a very large document. Users want to default to the cheap model and only pay for the large model when truly needed.
- Multi-step tool use with verbose outputs — Tools like web scrapers, code interpreters, or DB queries can return large payloads. A run that normally fits sometimes blows past the limit; the agent should not die mid-task.
- Cost optimization — Teams want to use the cheapest viable model per request rather than always provisioning for worst-case context size. Automatic escalation gives them "pay only when needed" behavior.
- Provider/region resilience — The same mechanism could be reused to fall back across providers (e.g. Bedrock → Anthropic API) when a model isn't available in a given region, though the primary motivation here is the token-limit case.
Alternative Solutions
- Sliding-window / message truncation: The SDK could (or already does) drop oldest messages when the context fills up. This is lossy and hurts agent quality, especially for agents that depend on earlier turns or tool results. It's a useful complement but not a replacement for fallback.
- Summarization / conversation compaction: Compressing history into a summary before retrying. Higher quality than truncation but adds latency, cost, and a non-trivial failure mode of its own. Could be offered alongside fallback (e.g. fallback_strategy=["summarize", "escalate_model"]).
- Always use the largest model: Simple, but wastes money and latency on the 95% of requests that fit comfortably in a smaller model.
- User-implemented try/except wrapper: Works today (roughly as sketched below), but every user re-implements the same logic. Making it first-class in the SDK avoids duplication and ensures correct handling of streaming, tool-use loops, and conversation state.
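For reference, the hand-rolled workaround today looks roughly like the sketch below. The exception to catch differs per provider, and rebuilding the Agent discards conversation state, which is exactly the duplication and lossiness a first-class mechanism would avoid:

from strands import Agent

def ask_with_escalation(prompt, models):
    """Hand-rolled escalation: retry the same prompt on progressively larger models."""
    last_error = None
    for model in models:
        agent = Agent(model=model)  # a fresh agent per attempt: prior turns are lost
        try:
            return agent(prompt)
        except Exception as error:  # no normalized token-limit error to target today
            last_error = error
    raise last_error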
Additional Context
Open questions / design considerations:
- How should fallback interact with structured output / tool-use loops mid-iteration? Ideally the fallback applies to the failing model call only, and the agent loop continues unchanged.
- Should fallback be configurable globally on the Agent, per-call, or both?
- Naming of the error types — the SDK should normalize provider-specific token-limit errors into a single ContextWindowExceededError (if not already) so users can target it cleanly in fallback_on (see the sketch after this list).
- Telemetry: emit a metric/event on every fallback so users can monitor how often it triggers and tune their primary model choice.
- Documentation: a short cookbook entry showing a Haiku → Sonnet (or Sonnet → 1M-context Sonnet) escalation would make this very approachable.
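A very rough sketch of that normalization, using simplified substring checks derived from the references below (real matching would have to inspect provider-specific error codes and response bodies rather than message text):

class ContextWindowExceededError(Exception):
    """Single normalized error type, as proposed above."""

TOKEN_LIMIT_SIGNALS = (
    "context_length_exceeded",  # OpenAI
    "input is too long",        # Anthropic invalid_request_error
    "input token count",        # Bedrock ValidationException
)

def normalize_token_limit_error(error: Exception) -> Exception:
    """Map provider-specific token-limit failures onto one exception type."""
    message = str(error).lower()
    if any(signal in message for signal in TOKEN_LIMIT_SIGNALS):
        return ContextWindowExceededError(str(error))
    return error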
Related references:
- Bedrock ValidationException for input token count
- Anthropic API invalid_request_error with input is too long
- OpenAI context_length_exceeded
Happy to contribute a PR if the maintainers are open to this direction — would appreciate guidance on the preferred API shape before implementation.