
Feature Request: Automatic fallback to a larger context window model when token limit is exceeded #2256

@schakraborty-staclline

Description

Problem Statement

When an agent's conversation, tool outputs, or retrieved context grows beyond the configured model's maximum token/context window, the SDK currently fails (or truncates) the request rather than gracefully recovering. This is a frequent pain point for:

  • Long-running agents,
  • Agents that perform multi-step tool use with verbose tool outputs,
  • Agents that ingest large documents/RAG chunks, and
  • Agents running in production where input size is hard to predict in advance.

Today users have to either:

  • Pre-emptively pick the largest (and most expensive) model for every request, even when it's unnecessary, or
  • Implement their own custom retry/escalation logic outside the SDK, or
  • Lose the conversation when a ContextWindowExceededError (or equivalent provider error) is raised.

There is no first-class, configurable way to say "if we hit the token limit on model A, automatically retry the request on model B which has a larger context window."

Proposed Solution

Add a built-in fallback mechanism that allows users to configure an ordered list of models, where the SDK automatically escalates to the next model in the list when a token-limit / context-window error is encountered.

Proposed API (illustrative):

from strands import Agent
from strands.models import BedrockModel

primary = BedrockModel(model_id="anthropic.claude-4-5-haiku")    # ~200k ctx, cheap
fallback = BedrockModel(model_id="anthropic.claude-sonnet-4.6")  # ~1M ctx, larger

agent = Agent(
    model=primary,
    fallback_models=[fallback],            # ordered list, tried in sequence
    fallback_on=["context_window_exceeded", "input_too_long"],
    fallback_strategy="on_token_limit",   # or "on_any_error"
)

Expected behavior:

  1. The SDK sends the request to the primary model.
  2. If the provider returns a context-window / token-limit error (e.g. ContextWindowExceededError, HTTP 400 with input too long, Bedrock ValidationException for token count, etc.), the SDK transparently retries the same request against the next model in fallback_models.
  3. Conversation state, tool definitions, system prompt, and message history are preserved across the fallback so the user/agent loop is uninterrupted.
  4. The fallback is logged and surfaced via a callback/event (e.g. on_model_fallback) so users can observe and meter costs.
  5. If all fallback models also fail, the original exception is re-raised.
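The five steps above could be sketched as a small escalation loop. Everything here is illustrative: `FallbackChain`, `invoke`, and the callable-model interface are hypothetical stand-ins, not the actual strands SDK API.

```python
class ContextWindowExceededError(Exception):
    """Normalized token-limit error raised by a provider adapter (step 2)."""

class FallbackChain:
    """Tries each model in order, escalating on context-window errors."""

    def __init__(self, primary, fallbacks, on_fallback=None):
        self.models = [primary, *fallbacks]
        self.on_fallback = on_fallback  # observer hook (step 4), e.g. cost metering

    def invoke(self, messages):
        first_error = None
        for i, model in enumerate(self.models):
            try:
                # Same messages, tools, and system prompt every attempt (step 3).
                return model(messages)
            except ContextWindowExceededError as exc:
                first_error = first_error or exc
                if self.on_fallback and i + 1 < len(self.models):
                    self.on_fallback(self.models[i], self.models[i + 1], exc)
        raise first_error  # step 5: every model failed, re-raise the original
```

Note that the chain deliberately re-raises the *first* exception when everything fails, matching step 5, and only catches the normalized token-limit error so unrelated provider failures propagate immediately.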

Additional configuration knobs to consider:

  • max_fallback_attempts (default: len(fallback_models))
  • Per-model overrides for temperature, max_tokens, etc.
  • A pre-flight token estimator that proactively picks the smallest model whose context window fits the estimated input (optional optimization).
  • Compatibility with streaming responses (fallback should occur before any tokens have been streamed to the caller).
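The pre-flight estimator knob could look something like the sketch below. The 4-characters-per-token heuristic and the `context_window` attribute are assumptions for illustration; a real implementation would use the provider's tokenizer.

```python
def estimate_tokens(text: str) -> int:
    """Crude heuristic: roughly 4 characters per token for English text."""
    return max(1, len(text) // 4)

def pick_model(models, prompt: str, reserve: int = 1024):
    """Return the first (cheapest) model whose window fits the estimate.

    `models` is ordered cheapest-first; `reserve` leaves headroom for the
    model's output tokens.
    """
    needed = estimate_tokens(prompt) + reserve
    for model in models:
        if model.context_window >= needed:
            return model
    return models[-1]  # nothing fits: fall back to the largest window anyway
```

Because the estimate is approximate, this would be an optimization layered on top of the reactive fallback, not a replacement for it.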

Use Case

  1. Long-running conversational agents — A support agent accumulates dozens of turns plus tool outputs. After a few hours it exceeds the 200k context of the cheaper model. Today the agent crashes; with this feature it would seamlessly continue on a 1M-context model.

  2. RAG / document analysis — An agent retrieves a variable number of chunks. Most queries fit in a small model, but occasionally a query pulls in a very large document. Users want to default to the cheap model and only pay for the large model when truly needed.

  3. Multi-step tool use with verbose outputs — Tools like web scrapers, code interpreters, or DB queries can return large payloads. A run that normally fits sometimes blows past the limit; the agent should not die mid-task.

  4. Cost optimization — Teams want to use the cheapest viable model per request rather than always provisioning for worst-case context size. Automatic escalation gives them "pay only when needed" behavior.

  5. Provider/region resilience — The same mechanism could be reused to fall back across providers (e.g. Bedrock → Anthropic API) when a model isn't available in a given region, though the primary motivation here is the token-limit case.

Alternative Solutions

  • Sliding-window / message truncation: The SDK could (or already does) drop oldest messages when the context fills up. This is lossy and hurts agent quality, especially for agents that depend on earlier turns or tool results. It's a useful complement but not a replacement for fallback.
  • Summarization / conversation compaction: Compressing history into a summary before retrying. Higher quality than truncation but adds latency, cost, and a non-trivial failure mode of its own. Could be offered alongside fallback (e.g. fallback_strategy=["summarize", "escalate_model"]).
  • Always use the largest model: Simple, but wastes money and latency on the 95% of requests that fit comfortably in a smaller model.
  • User-implemented try/except wrapper: Works today, but every user re-implements the same logic. Making it first-class in the SDK avoids duplication and ensures correct handling of streaming, tool-use loops, and conversation state.
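For context, the last bullet's hand-rolled wrapper typically looks like the following. Names and the string-matching predicate are illustrative; this is exactly the boilerplate a first-class feature would subsume.

```python
def call_with_escalation(models, prompt, is_token_error):
    """Manually escalate through `models` on token-limit errors only."""
    for i, model in enumerate(models):
        try:
            return model(prompt)
        except Exception as exc:
            # Re-raise unrelated errors immediately, and the final
            # model's error too, since there is nothing left to try.
            if not is_token_error(exc) or i == len(models) - 1:
                raise
```

Every team ends up writing a variant of this, and each variant must separately handle streaming, tool-use loops, and conversation state, which is the duplication the proposal removes.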

Additional Context

Open questions / design considerations:

  • How should fallback interact with structured output / tool-use loops mid-iteration? Ideally the fallback applies to the failing model call only, and the agent loop continues unchanged.
  • Should fallback be configurable globally on the Agent, per-call, or both?
  • Naming of the error types — the SDK should normalize provider-specific token-limit errors into a single ContextWindowExceededError (if not already) so users can target it cleanly in fallback_on.
  • Telemetry: emit a metric/event on every fallback so users can monitor how often it triggers and tune their primary model choice.
  • Documentation: a short cookbook entry showing a Haiku → Sonnet (or Sonnet → 1M-context Sonnet) escalation would make this very approachable.
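The error-normalization point above could be implemented roughly as below. The matching substrings are taken from the provider errors listed under "Related references"; the exact payload shapes each provider returns are assumptions.

```python
class ContextWindowExceededError(Exception):
    """Single normalized error type users can target in fallback_on."""

    def __init__(self, provider: str, original: Exception):
        super().__init__(f"{provider}: context window exceeded")
        self.provider = provider
        self.original = original  # keep the raw error for debugging

# Lowercase substrings that identify a token-limit failure per provider.
_TOKEN_LIMIT_MARKERS = (
    "context_length_exceeded",  # OpenAI
    "input is too long",        # Anthropic invalid_request_error
    "input token count",        # Bedrock ValidationException
)

def normalize_error(provider: str, exc: Exception) -> Exception:
    """Wrap token-limit errors in the normalized type; pass others through."""
    msg = str(exc).lower()
    if any(marker in msg for marker in _TOKEN_LIMIT_MARKERS):
        return ContextWindowExceededError(provider, exc)
    return exc
```

Substring matching is brittle; a production version would likely inspect provider error codes rather than messages, but the normalization boundary itself is the design point.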

Related references:

  • Bedrock ValidationException for input token count
  • Anthropic API invalid_request_error with input is too long
  • OpenAI context_length_exceeded

Happy to contribute a PR if the maintainers are open to this direction — would appreciate guidance on the preferred API shape before implementation.
