Benchmark context management features and calibrate preset defaults #2225

**Description:**

The context management presets ship with opinionated defaults: pinning thresholds, compression triggers, externalization token limits, and so on. These defaults should be validated against benchmarks before shipping rather than guessed.

## Motivation

Initial research suggests position-based context management strategies are generally inferior to importance-based approaches. Before we commit to specific defaults (e.g., `protected_messages=1`, compression at 80% capacity, externalization at 2500 tokens), we need data showing they actually improve task completion and/or reduce cost without degrading accuracy.

## What to benchmark

Each context management feature that ships with a configurable default should be benchmarked to determine optimal values. This includes externalization thresholds and preview sizes, pinning defaults, compression trigger points, and the impact of agentic tools on task completion vs. additional cost.

For each feature, measure the metrics below with the feature enabled and with it disabled, then sweep across a reasonable range of values to find the best tradeoff.
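
The enabled/disabled comparison and the value sweep could be organized as a small configuration grid. The sketch below is illustrative only: the axis names `compression_trigger` and `externalization_tokens`, the candidate values, and the convention that `None` means "feature disabled" are assumptions, not the actual `ContextManagementConfig` fields.

```python
from itertools import product

# Hypothetical sweep axes. None stands for "feature disabled" (an assumption).
SWEEP_AXES = {
    "protected_messages": [None, 1, 2, 4],
    "compression_trigger": [None, 0.6, 0.7, 0.8, 0.9],   # fraction of context capacity
    "externalization_tokens": [None, 1000, 2500, 5000],
}

# Candidate defaults taken from the values mentioned in the Motivation section.
CANDIDATE_DEFAULTS = {
    "protected_messages": 1,
    "compression_trigger": 0.8,
    "externalization_tokens": 2500,
}


def one_at_a_time(axes, defaults):
    """Vary one parameter at a time, holding the others at the candidate defaults."""
    configs = []
    for name, values in axes.items():
        for value in values:
            cfg = dict(defaults)
            cfg[name] = value
            if cfg not in configs:  # skip the repeated all-defaults config
                configs.append(cfg)
    return configs


def full_grid(axes):
    """Every combination of every axis value (more expensive, shows interactions)."""
    names = list(axes)
    return [dict(zip(names, combo)) for combo in product(*(axes[n] for n in names))]
```

A one-at-a-time sweep keeps the number of benchmark runs small; the full grid is only worth its cost if features are suspected to interact.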

## Metrics

- Task completion rate
- Accuracy / output quality
- Total tokens consumed and cost
- Context overflow failures
- Additional tool-use turns (agentic vs. auto)
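
One possible way to record these metrics per run and aggregate them per configuration is sketched below. The `RunResult` type, its field names, and the `summarize` helper are hypothetical and exist only for illustration; in particular, how accuracy is scored is left open here.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class RunResult:
    """Outcome of one benchmark task under one configuration (hypothetical schema)."""
    completed: bool      # did the task finish successfully?
    accuracy: float      # quality score in [0, 1]; the scoring method is an assumption
    total_tokens: int    # prompt + completion tokens over the whole run
    cost_usd: float      # total cost of the run
    overflowed: bool     # did the run hit a context overflow failure?
    tool_turns: int      # additional tool-use turns taken


def summarize(runs):
    """Aggregate per-run results into the metrics listed above."""
    if not runs:
        raise ValueError("no runs to summarize")
    n = len(runs)
    return {
        "completion_rate": sum(r.completed for r in runs) / n,
        "mean_accuracy": mean(r.accuracy for r in runs),
        "mean_tokens": mean(r.total_tokens for r in runs),
        "total_cost_usd": sum(r.cost_usd for r in runs),
        "overflow_rate": sum(r.overflowed for r in runs) / n,
        "mean_tool_turns": mean(r.tool_turns for r in runs),
    }
```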

## Baselines

- No context management (raw overflow behavior)
- Sliding window only
- Manual plugin wiring (current state)

## Deliverables

1. Reproducible benchmark suite checked into the repo
2. Results summary with recommended defaults for each configurable value
3. Update `_PRESETS` definitions and `ContextManagementConfig` defaults to match findings
4. Document any surprising results. If a feature doesn't help, remove it from the preset
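
To turn results into recommended defaults, one simple rule is to pick the cheapest configuration that does not fall below a baseline on completion rate or accuracy. The sketch below assumes per-configuration summaries are dictionaries with `completion_rate`, `mean_accuracy`, and `total_cost_usd` keys; the `recommend` function, those key names, and the tolerance value are all hypothetical, and other selection rules may fit better.

```python
def recommend(results, baseline, max_accuracy_drop=0.01):
    """Return the name of the cheapest config that does not regress against the baseline.

    results:  dict mapping config name -> summary dict
    baseline: summary dict for the baseline being compared against
    Returns None if every config regresses, which suggests the feature
    should not ship in the preset.
    """
    eligible = {
        name: summary
        for name, summary in results.items()
        if summary["completion_rate"] >= baseline["completion_rate"]
        and summary["mean_accuracy"] >= baseline["mean_accuracy"] - max_accuracy_drop
    }
    if not eligible:
        return None
    return min(eligible, key=lambda name: eligible[name]["total_cost_usd"])
```

A `None` result corresponds to deliverable 4: no swept value helps, so the feature is a candidate for removal from the preset.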
