Skip to content

Benchmark context management features and calibrate preset defaults #2225

@lizradway

Description

@lizradway

Description:

The context management presets ship with opinionated defaults — pinning thresholds, compression triggers, externalization token limits, etc. These defaults need to be validated against benchmarks before shipping, not guessed.

Motivation

Initial research suggests position-based context management strategies are generally inferior to importance-based approaches. Before we commit to specific defaults (e.g., protected_messages=1, compression at 80% capacity, externalization at 2500 tokens), we need data showing they actually improve task completion and/or reduce cost without degrading accuracy.

What to benchmark

Each context management feature that ships with a configurable default should be benchmarked to determine optimal values. This includes externalization thresholds and preview sizes, pinning defaults, compression trigger points, and the impact of agentic tools on task completion vs. additional cost.

For each feature, measure with and without, and sweep across reasonable default values to find the best tradeoff.

Metrics

  • Task completion rate
  • Accuracy / output quality
  • Total tokens consumed and cost
  • Context overflow failures
  • Additional tool-use turns (agentic vs. auto)

Baselines

  • No context management (raw overflow behavior)
  • Sliding window only
  • Manual plugin wiring (current state)

Deliverables

  1. Reproducible benchmark suite checked into the repo
  2. Results summary with recommended defaults for each configurable value
  3. Update _PRESETS definitions and ContextManagementConfig defaults to match findings
  4. Document any surprising results — if a feature doesn't help, remove it from the preset

Want any changes?

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions