Context compression proxy for LLM agents. Reduces token usage 40-70% with zero code changes.
```bash
uv run kompact proxy --port 7878
export ANTHROPIC_BASE_URL=http://localhost:7878
# That's it. Your agent now uses fewer tokens.
```

Evaluated on BFCL (1,431 real API schemas), the standard benchmark for tool-calling agents. End-to-end through Claude via claude-relay, scored with context-bench.
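The zero-code-change setup works because official SDKs resolve their endpoint from the environment; the Anthropic Python SDK reads `ANTHROPIC_BASE_URL` this way. A minimal sketch of that resolution logic (the `resolve_base_url` helper is hypothetical, written here only to illustrate the mechanism):

```python
import os

# Default provider endpoint (the Anthropic API's public URL).
DEFAULT_BASE_URL = "https://api.anthropic.com"

def resolve_base_url(env: dict[str, str]) -> str:
    """Return the proxy URL when exported, else the provider default.

    Hypothetical helper mirroring how an SDK picks its endpoint from
    the environment; with the env var set, every request flows
    through the Kompact proxy with no agent code changes.
    """
    return env.get("ANTHROPIC_BASE_URL", DEFAULT_BASE_URL)

# With the proxy exported, traffic goes through Kompact:
print(resolve_base_url({"ANTHROPIC_BASE_URL": "http://localhost:7878"}))
# Without it, requests go straight to the provider:
print(resolve_base_url({}))
```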
Quality impact vs no compression (positive = better, negative = worse):
| Model | Kompact | Headroom | LLMLingua-2 |
|---|---|---|---|
| Haiku | -2.6% | -3.0% | -23.4% |
| Sonnet | -3.9% | -3.5% | -20.6% |
| Opus | -0.5% | -0.5% | -27.3% |
Kompact and Headroom both preserve quality within ~3% of baseline — negligible degradation. LLMLingua-2 destroys tool schemas regardless of model (−20% to −27%). Headroom was tested with its recommended SmartCrusher production defaults.
Offline compression measured on 12,795 examples across 3 datasets:
| Dataset | Examples | Kompact | Headroom | LLMLingua-2 |
|---|---|---|---|---|
| BFCL (tool schemas) | 1,431 | 55.3% | ~0% | 55.4% |
| Glaive (tool calling) | 3,959 | 56.6% | ~0% | ~50% |
| HotpotQA (prose QA) | 7,405 | 17.9% | ~0% | 49.9% |
Headroom's SmartCrusher doesn't compress JSON — it's designed for prose. LLMLingua-2 compresses aggressively but destroys information (see quality table above).
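The percentages above are token-reduction ratios. A sketch of how such a figure is computed (the `compression_pct` helper is illustrative and uses whitespace tokens; the real benchmarks presumably count tokenizer tokens):

```python
def compression_pct(original: str, compressed: str) -> float:
    """Percent of tokens removed, using crude whitespace tokenization."""
    orig_tokens = len(original.split())
    comp_tokens = len(compressed.split())
    return 100.0 * (1 - comp_tokens / orig_tokens)

original = "The quick brown fox jumps over the lazy dog near the river bank"
compressed = "quick brown fox jumps lazy dog river bank"
print(f"{compression_pct(original, compressed):.1f}%")  # 38.5%
```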
For a team running 1,000 agentic requests/day with ~10K token contexts:
| Model | Without Kompact | With Kompact | Monthly Savings |
|---|---|---|---|
| Sonnet ($3/M) | $900/mo | $405/mo | $495/mo |
| Opus ($15/M) | $4,500/mo | $2,025/mo | $2,475/mo |
| GPT-4o ($2.50/M) | $750/mo | $338/mo | $412/mo |
Savings scale linearly. 10K requests/day = 10x the numbers above.
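The table follows from simple arithmetic. A sketch reproducing the Sonnet row (assuming ~55% compression, matching the BFCL offline figure, and 30-day months; input-token cost only):

```python
def monthly_cost(requests_per_day: int, tokens_per_request: int,
                 price_per_mtok: float, days: int = 30) -> float:
    """Monthly input-token cost in dollars."""
    tokens = requests_per_day * tokens_per_request * days
    return tokens / 1_000_000 * price_per_mtok

baseline = monthly_cost(1_000, 10_000, 3.00)   # Sonnet, no compression
with_kompact = baseline * (1 - 0.55)           # assume ~55% token reduction
print(f"${baseline:.0f}/mo -> ${with_kompact:.0f}/mo")  # $900/mo -> $405/mo
```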
Kompact is a transparent HTTP proxy. It intercepts LLM API requests, compresses the context, then forwards to the provider. No agent code changes needed.
```
        ┌──────────────────────────────────────────────┐
        │            Kompact Proxy (:7878)             │
        │                                              │
Agent ─>│  1. Schema Optimizer (TF-IDF selection)      │─> LLM Provider
        │  2. Content Compressors (TOON, JSON, code)   │
        │  3. Extractive Compress (TF-IDF sentences)   │
        │  4. Observation Masker (history mgmt)        │
        │  5. Cache Aligner (prefix caching)           │
        │                                              │
        └──────────────────────────────────────────────┘
```
8 transforms, each targeting a different content type. The pipeline adapts automatically — short contexts get light compression, long contexts get aggressive optimization. Sub-millisecond overhead.
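The extractive stage can be illustrated with a toy TF-IDF sentence selector — a sketch under simplifying assumptions (naive sentence splitting, raw TF-IDF sums), not Kompact's actual implementation:

```python
import math
import re
from collections import Counter

def tfidf_top_sentences(text: str, keep: int = 2) -> list[str]:
    """Keep the `keep` highest-scoring sentences, in original order.

    Each sentence is scored by the sum of its terms' TF-IDF weights,
    so sentences full of common filler words score near zero.
    Illustrative only -- not Kompact's real extractive compressor.
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    docs = [Counter(re.findall(r"[a-z]+", s.lower())) for s in sentences]
    n = len(docs)
    df = Counter()                      # document frequency per term
    for d in docs:
        df.update(d.keys())

    def score(d: Counter) -> float:
        total = sum(d.values()) or 1
        return sum((c / total) * math.log(n / df[t]) for t, c in d.items())

    ranked = sorted(range(n), key=lambda i: score(docs[i]), reverse=True)[:keep]
    return [sentences[i] for i in sorted(ranked)]
```

Filler-heavy sentences carry terms that appear everywhere, so their IDF (and score) collapses toward zero and they are the first to be dropped.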
```bash
# Offline compression (no LLM calls, measures compression + needle preservation)
uv run python benchmarks/run_dataset_eval.py --dataset bfcl

# End-to-end quality (sends through proxy chain, measures LLM answer quality)
# Requires: claude-relay running on :8084, kompact on :7878
uv run python benchmarks/run_e2e_eval.py --dataset bfcl --model haiku --workers 20
```

See benchmarks/README.md for full methodology.
```bash
uv sync --extra dev
uv run pytest                   # 48 tests
uv run ruff check src/ tests/
```

MIT