Merged
2 changes: 1 addition & 1 deletion README.md
@@ -55,7 +55,7 @@ example sessions before changing models.
Run rate $160.3500/mo — 19% of cycle budget unused.
```

Two analyzers reading the same spans you'd otherwise pay LangSmith to host: structural model-downgrade candidate flagging (never claims quality equivalence — surfaces examples to review) and per-provider monthly budget projection. Works with **any** agent already sending TokenJam data, not just Claude Code.
Two analyzers reading the same spans you'd otherwise pay LangSmith to host: structural downsize candidate flagging (never claims quality equivalence — surfaces examples to review) and per-provider monthly budget projection. Works with **any** agent already sending TokenJam data, not just Claude Code.

Try a tighter budget to see the over-budget renderer:

8 changes: 4 additions & 4 deletions docs/configuration.md
@@ -90,7 +90,7 @@ Set limits via CLI (`tj budget --daily 10`), the REST API (`POST /api/v1/budget`

## Content capture and privacy

By default, `tj` does not capture prompt content, completion content, or tool inputs/outputs — only token counts, model names, tool names, timestamps, and structural metadata. Enable content capture selectively when you need it (for debugging, prompt-bloat analysis, or evaluation):
By default, `tj` does not capture prompt content, completion content, or tool inputs/outputs — only token counts, model names, tool names, timestamps, and structural metadata. Enable content capture selectively when you need it (for debugging, trim analysis, or evaluation):

```toml
[capture]
@@ -108,9 +108,9 @@ The four flags are independent: capture prompts without completions, or tool inp

**What this means for downstream analyzers.**

- `tj optimize --finding cache-efficacy` reads token-count fields and works without content capture.
- `tj optimize --finding prompt-bloat` reads prompt text and requires `capture.prompts = true`.
- `tj optimize --finding cache-recommend` reads prompts and requires `capture.prompts = true`.
- `tj optimize cache` reads token-count fields and works without content capture.
- `tj optimize trim` reads prompt text and requires `capture.prompts = true`.
- `tj optimize cache-recommend` reads prompts and requires `capture.prompts = true`.

The analyzers that need content fail with a clear message ("set `capture.prompts = true` in tj.toml and let the daemon collect a fresh window of data") rather than running on partial data.

4 changes: 2 additions & 2 deletions docs/installation.md
@@ -27,7 +27,7 @@ TokenJam keeps heavyweight ML dependencies, framework adapters, and the MCP serv
| Extra | What it pulls in | Why it's optional |
|---|---|---|
| `tokenjam[mcp]` | `fastmcp` | Only needed for the Claude Code / Codex MCP server (`tj mcp`). Pulled by `tj onboard --claude-code` automatically when invoked through the documented one-liner. |
| `tokenjam[bloat]` | `llmlingua>=0.2`, transitively PyTorch + transformers (~2GB) | The Trim analyzer (`tj optimize --finding prompt-bloat`) scores token significance with LLMLingua-2. Most users don't run it; keeping torch out of the base install means `pip install tokenjam` stays small and fast on machines that don't have a GPU/CPU build of torch already. |
| `tokenjam[bloat]` | `llmlingua>=0.2`, transitively PyTorch + transformers (~2GB) | The Trim analyzer (`tj optimize trim`) scores token significance with LLMLingua-2. Most users don't run it; keeping torch out of the base install means `pip install tokenjam` stays small and fast on machines that don't have a GPU/CPU build of torch already. |
| `tokenjam[langchain]` | `langchain>=0.2` | Convenience pin for `patch_langchain()`; you can also install langchain yourself. |
| `tokenjam[crewai]` | `crewai>=0.28` | Convenience pin for `patch_crewai()`. |
| `tokenjam[autogen]` | `pyautogen>=0.2` | Convenience pin for `patch_autogen()`. |
@@ -44,7 +44,7 @@ pip install "tokenjam[mcp,bloat]"

`tokenjam[bloat]` is the largest extra — LLMLingua-2 transitively pulls in PyTorch and Hugging Face transformers, roughly 2GB on disk. On first run the analyzer downloads a ~110MB BERT-class classifier model under `~/.cache/tokenjam/models/` (override via `TOKENJAM_MODEL_CACHE`); subsequent runs are offline-capable.

If you run `tj optimize --finding prompt-bloat` without the extra installed, the analyzer self-registers and exits with a clear hint pointing at this install command — nothing in the base install crashes from its absence.
If you run `tj optimize trim` without the extra installed, the analyzer self-registers and exits with a clear hint pointing at this install command — nothing in the base install crashes from its absence.
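
The "self-registers and exits with a clear hint" behavior can be illustrated with a small optional-dependency probe. This is a minimal sketch, not TokenJam's actual code: `TrimStatus`, `check_trim`, and the hint wording are hypothetical names chosen for illustration, and only the module name `llmlingua` and the install command come from the docs above.

```python
from __future__ import annotations

import importlib.util
from dataclasses import dataclass

# Hint text is illustrative; only the install command comes from the docs.
INSTALL_HINT = 'Trim needs the optional extra: pip install "tokenjam[bloat]"'


@dataclass(frozen=True)
class TrimStatus:
    """Hypothetical result type: whether Trim can run, plus a hint if not."""
    enabled: bool
    hint: str = ""


def check_trim(module_name: str = "llmlingua") -> TrimStatus:
    """Probe for the optional dependency without importing it."""
    if importlib.util.find_spec(module_name) is None:
        # Dependency absent: report a hint instead of raising ImportError.
        return TrimStatus(enabled=False, hint=INSTALL_HINT)
    return TrimStatus(enabled=True)
```

Probing with `find_spec` avoids importing torch just to learn that it is missing, which keeps the check cheap in the base install.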

See [`docs/optimize/trim.md`](optimize/trim.md) for performance numbers, capture requirements, and what the analyzer actually reports.

8 changes: 4 additions & 4 deletions docs/optimize/cache.md
@@ -1,13 +1,13 @@
# Cache

Product name: **Cache**. Internal/CLI names: `cache-efficacy` and `cache-recommend`. Two related findings under the same product — both surface prompt-caching opportunities; they differ in what they need and what they recommend.
Product name: **Cache**. Internal/CLI names: `cache` and `cache-recommend`. Two related findings under the same product — both surface prompt-caching opportunities; they differ in what they need and what they recommend.

```bash
tj optimize --finding cache-efficacy
tj optimize --finding cache-recommend
tj optimize cache
tj optimize cache-recommend
```

## `cache-efficacy` — measure current caching
## `cache` — measure current caching

Reads aggregate `input_tokens` and `cache_tokens` from spans in the
window. Computes the share of input tokens served from cache per
6 changes: 3 additions & 3 deletions docs/optimize/downsize.md
@@ -1,9 +1,9 @@
# Downsize

Product name: **Downsize**. Internal/CLI name: `model-downgrade`.
Product name: **Downsize**. Internal/CLI name: `downsize`.

```bash
tj optimize --finding model-downgrade
tj optimize downsize
```

Flags sessions whose structural shape — short input (< 5K tokens), short output (< 500 tokens), few tool calls (≤ 5) — matches a class of work where a cheaper model in the same provider family is worth reviewing.
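
The structural test described above can be sketched as a simple predicate. This is an illustration under stated assumptions, not the analyzer's implementation: `SessionShape` and `is_downsize_candidate` are hypothetical names, and "5K" is assumed to mean 5,000 tokens.

```python
from dataclasses import dataclass

# Thresholds taken from the description above; "5K" is assumed to be 5,000.
MAX_INPUT_TOKENS = 5_000
MAX_OUTPUT_TOKENS = 500
MAX_TOOL_CALLS = 5


@dataclass(frozen=True)
class SessionShape:
    """Hypothetical per-session summary of the structural fields."""
    input_tokens: int
    output_tokens: int
    tool_calls: int


def is_downsize_candidate(session: SessionShape) -> bool:
    """True when all three structural limits hold; says nothing about quality."""
    return (
        session.input_tokens < MAX_INPUT_TOKENS
        and session.output_tokens < MAX_OUTPUT_TOKENS
        and session.tool_calls <= MAX_TOOL_CALLS
    )
```

A match only nominates the session for review; it does not claim that a cheaper model would produce equivalent output.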
@@ -37,7 +37,7 @@ JSON output mirrors the same data with top-level `plan` and `pricing_mode` field

## Confidence

`structural`. The model-downgrade finding identifies a structural pattern in the captured data; it does not validate that the cheaper model would produce equivalent output. The mandatory caveat is the honest framing of that limitation.
`structural`. The downsize finding identifies a structural pattern in the captured data; it does not validate that the cheaper model would produce equivalent output. The mandatory caveat is the honest framing of that limitation.

## See also

4 changes: 2 additions & 2 deletions docs/optimize/script.md
@@ -1,9 +1,9 @@
# Script

Product name: **Script**. Internal/CLI name: `workflow-restructure`.
Product name: **Script**. Internal/CLI name: `script`.

```bash
tj optimize --finding workflow-restructure
tj optimize script
```

Flags sessions whose tool-call sequence is structurally identical
14 changes: 7 additions & 7 deletions docs/optimize/trim.md
@@ -1,9 +1,9 @@
# Trim

Product name: **Trim**. Internal/CLI name: `prompt-bloat`.
Product name: **Trim**. Internal/CLI name: `trim`.

```bash
tj optimize --finding prompt-bloat
tj optimize trim
```

Scores token-by-token significance in captured prompts using
@@ -26,7 +26,7 @@ pip install "tokenjam[bloat]"
```

The base `pip install tokenjam` does NOT pull torch. Trim shows up in
`tj optimize --finding` choices regardless, but running it without the
`tj optimize` analyzer choices regardless, but running it without the
extra prints a clear install hint and exits.

## Requirements
@@ -56,10 +56,10 @@ marked.
## HTML report

```bash
tj report --bloat # all agents, 30d window
tj report --bloat my-agent # scope to one agent
tj report --bloat --since 7d # custom window
tj report --bloat --no-open # write file without opening browser
tj report --trim # all agents, 30d window
tj report --trim my-agent # scope to one agent
tj report --trim --since 7d # custom window
tj report --trim --no-open # write file without opening browser
```

Output goes to `~/.cache/tokenjam/reports/trim-<timestamp>.html` and
2 changes: 1 addition & 1 deletion pyproject.toml
@@ -4,7 +4,7 @@ build-backend = "hatchling.build"

[project]
name = "tokenjam"
version = "0.3.0"
version = "0.3.1"
description = "TokenJam — local-first OTel-native observability for Autonomous AI agents"
readme = "README.md"
requires-python = ">=3.10"
2 changes: 1 addition & 1 deletion sdk-ts/package.json
@@ -1,6 +1,6 @@
{
"name": "@tokenjam/sdk",
"version": "0.3.0",
"version": "0.3.1",
"description": "TypeScript SDK for TokenJam — local-first observability for AI agents",
"main": "dist/index.js",
"types": "dist/index.d.ts",
2 changes: 1 addition & 1 deletion tests/integration/test_cli.py
@@ -816,7 +816,7 @@ def test_optimize_budget_projection_from_config(runner, db):
)
db.insert_span(span)

result = _invoke(runner, db, cfg, ["optimize", "--finding", "budget-projection"])
result = _invoke(runner, db, cfg, ["optimize", "budget-projection"])
assert result.exit_code == 0
assert "Budget projection" in result.output
assert "anthropic" in result.output
12 changes: 6 additions & 6 deletions tests/manual-new-release-tests.md
@@ -49,11 +49,11 @@ tj doctor # exit 0 or 1 (warnings ok)

```bash
tj optimize # all analyzers
tj optimize --finding model-downgrade # Downsize
tj optimize --finding cache-efficacy # Cache (efficacy)
tj optimize --finding cache-recommend # Cache (recommend) — surfaces "enable capture.prompts" if not set
tj optimize --finding workflow-restructure # Script — likely no candidates on a fresh DB
tj optimize --finding prompt-bloat # Trim — should print install hint without [bloat] extra
tj optimize downsize # Downsize
tj optimize cache # Cache (efficacy)
tj optimize cache-recommend # Cache (recommend) — surfaces "enable capture.prompts" if not set
tj optimize script # Script — likely no candidates on a fresh DB
tj optimize trim # Trim — should print install hint without [bloat] extra

# Caveat enforcement on the downgrade finding
tj optimize --json | python3 -c \
@@ -64,7 +64,7 @@ tj optimize --json | python3 -c \
"import json,sys;d=json.load(sys.stdin);assert 'plan' in d and 'pricing_mode' in d;print('ok')"
```

**Pass criteria:** every `--finding` runs without crashing. Optional analyzers (`cache-recommend`, `prompt-bloat`) surface clear hints when their prereqs aren't met instead of erroring.
**Pass criteria:** every positional analyzer name runs without crashing. Optional analyzers (`cache-recommend`, `trim`) surface clear hints when their prereqs aren't met instead of erroring.

## 5. Backfill adapters (smoke against committed fixtures)

30 changes: 15 additions & 15 deletions tests/manual-pre-release-testing.md
@@ -106,61 +106,61 @@ tj doctor # exit 0 (or 1 with warnings); no errors

## 6. Cost-optimization analyzers (the four products)

### 6a. Downsize (`--finding model-downgrade`)
### 6a. Downsize

```bash
tj optimize --finding model-downgrade
tj optimize downsize
# [ ] If candidates found: caveat "Candidate-flagging heuristic, not a quality judgment." is present
# [ ] If no candidates: clean "No candidates flagged" message — not a crash
tj optimize --finding model-downgrade --json | python3 -c \
tj optimize downsize --json | python3 -c \
"import json,sys;r=json.load(sys.stdin);d=r.get('downgrade');assert d is None or 'Candidate-flagging heuristic' in d['caveat'];print('ok: caveat enforced')"
```

### 6b. Cache (`--finding cache-efficacy` + `--finding cache-recommend`)
### 6b. Cache

```bash
# cache-efficacy works without content capture
tj optimize --finding cache-efficacy
# cache works without content capture
tj optimize cache
# [ ] Per (provider, model) rows. Anthropic shows numerically-accurate ratios.
# [ ] OpenAI / Gemini rows (if present) carry the best-effort caveat.
# [ ] Rows for Bedrock / LiteLLM / Cohere (if present) show unsupported, not flagged.

# cache-recommend needs capture.prompts. Without it, the analyzer returns a hint.
tj optimize --finding cache-recommend
tj optimize cache-recommend
# [ ] Without capture.prompts: surfaces the "enable capture.prompts" hint, doesn't crash.
# [ ] With capture.prompts and ≥3 calls sharing a long prefix: surfaces breakpoint candidates.
```

To exercise the content-needed branch, set `capture.prompts = true` in `.tj/config.toml`, re-run an example, then re-run `cache-recommend`.
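
The "≥3 calls sharing a long prefix" condition in the checklist above can be approximated as follows. This is a rough sketch, not the analyzer's algorithm: `longest_shared_prefix` is a hypothetical helper, and the 200-character cutoff for "long" is an assumption, since the source does not state the real threshold or whether it is measured in tokens.

```python
import os


def longest_shared_prefix(prompts, min_calls=3, min_chars=200):
    """Return the longest prefix shared by at least `min_calls` prompts.

    Returns None when no prefix of at least `min_chars` characters qualifies.
    Both thresholds are illustrative assumptions.
    """
    ordered = sorted(prompts)
    best = ""
    # In sorted order, the common prefix of a window equals the common
    # prefix of its first and last elements.
    for i in range(len(ordered) - min_calls + 1):
        window_ends = [ordered[i], ordered[i + min_calls - 1]]
        prefix = os.path.commonprefix(window_ends)
        if len(prefix) >= min_chars and len(prefix) > len(best):
            best = prefix
    return best or None
```

Sorting first keeps the scan to one pass over adjacent windows instead of comparing every combination of prompts.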

### 6c. Script (`--finding workflow-restructure`)
### 6c. Script

```bash
tj optimize --finding workflow-restructure
tj optimize script
# [ ] If ≥20 sessions match a single (tool_name, arg_shape) signature: cluster surfaces with
# "review carefully" caveat. With v1's conservative thresholds, most fresh test DBs see no candidates.
# [ ] No crash; "no clusters found" message is acceptable.
```
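
The clustering rule in the checklist above (at least 20 sessions sharing one `(tool_name, arg_shape)` signature) can be sketched like this. It is an illustration only: `arg_shape` is assumed here to be the sorted tuple of argument names, and every function name is hypothetical rather than taken from TokenJam.

```python
from collections import defaultdict

MIN_CLUSTER_SIZE = 20  # threshold quoted in the checklist above


def arg_shape(args):
    """Assumed shape: sorted argument names, ignoring the values."""
    return tuple(sorted(args))


def session_signature(tool_calls):
    """Ordered (tool_name, arg_shape) pairs for one session's tool calls."""
    return tuple((name, arg_shape(args)) for name, args in tool_calls)


def find_clusters(sessions, min_size=MIN_CLUSTER_SIZE):
    """Group session ids by signature and keep groups of at least min_size."""
    groups = defaultdict(list)
    for session_id, tool_calls in sessions.items():
        groups[session_signature(tool_calls)].append(session_id)
    return {sig: ids for sig, ids in groups.items() if len(ids) >= min_size}
```

Because the signature ignores argument values, two sessions that read different files with the same tool sequence land in the same cluster.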

### 6d. Trim (`--finding prompt-bloat`)
### 6d. Trim

Trim requires the optional `tokenjam[bloat]` extra (LLMLingua-2 + torch + transformers, ~2GB).

```bash
# Without the extra installed: self-registers, errors gracefully with install hint.
tj optimize --finding prompt-bloat
tj optimize trim
# [ ] Output points the user at: pip install "tokenjam[bloat]"

# Install the extra and re-run — only do this if you actually want to test Trim end-to-end
# (the 2GB download is real). Skip this on quick passes.
pip3 install -e ".[dev,mcp,bloat]"
# Enable capture.prompts in .tj/config.toml, re-run examples to populate content, then:
tj optimize --finding prompt-bloat
tj optimize trim
# [ ] First run downloads the ~110MB BERT model under ~/.cache/tokenjam/models/.
# [ ] Subsequent runs are offline.

# HTML report renderer
tj report --bloat --no-open
tj report --trim --no-open
# [ ] Writes to ~/.cache/tokenjam/reports/trim-<timestamp>.html
# [ ] HTML contains a caveat block + per-prompt sections
```
@@ -400,8 +400,8 @@ python3 examples/single_provider/anthropic_agent.py

tj status && tj traces && tj cost --since 1h
tj optimize # all analyzers
tj optimize --finding model-downgrade
tj optimize --finding cache-efficacy
tj optimize downsize
tj optimize cache
tj backfill langfuse --source-file tests/fixtures/langfuse_real_response.json
tj backfill helicone --source-file tests/fixtures/helicone_real_response.json
tj backfill otlp --source-file tests/fixtures/otlp_sample.json
8 changes: 4 additions & 4 deletions tests/unit/test_cache_efficacy.py
@@ -1,4 +1,4 @@
"""Unit tests for the cache-efficacy analyzer."""
"""Unit tests for the cache analyzer."""
from __future__ import annotations

from datetime import datetime, timedelta, timezone
@@ -140,10 +140,10 @@ def test_run_integrates_with_build_report(db, config):
until = datetime(2026, 5, 30, tzinfo=timezone.utc)
report = build_report(
db=db, config=config, since=since, until=until,
findings=["cache-efficacy"],
findings=["cache"],
)
assert "cache-efficacy" in report.findings
finding = report.findings["cache-efficacy"]
assert "cache" in report.findings
finding = report.findings["cache"]
assert finding.confidence == "structural"
assert len(finding.rows) == 1
assert len(finding.flagged) == 1
18 changes: 9 additions & 9 deletions tests/unit/test_prompt_bloat.py
@@ -1,5 +1,5 @@
"""
Unit tests for the prompt-bloat (Trim) analyzer.
Unit tests for the trim (Trim) analyzer.

The LLMLingua-2 model is mocked across these tests so CI doesn't have to
download ~110MB and run an actual BERT classifier. The mock returns
@@ -83,8 +83,8 @@ def test_disabled_without_capture_prompts(db):
since = datetime(2026, 5, 1, tzinfo=timezone.utc)
until = datetime(2026, 5, 30, tzinfo=timezone.utc)
report = build_report(db=db, config=config, since=since, until=until,
findings=["prompt-bloat"])
finding = report.findings["prompt-bloat"]
findings=["trim"])
finding = report.findings["trim"]
assert isinstance(finding, PromptBloatFinding)
assert finding.enabled is False
assert "capture" in finding.hint.lower()
@@ -98,8 +98,8 @@ def test_disabled_when_llmlingua_missing(db, monkeypatch):
since = datetime(2026, 5, 1, tzinfo=timezone.utc)
until = datetime(2026, 5, 30, tzinfo=timezone.utc)
report = build_report(db=db, config=config, since=since, until=until,
findings=["prompt-bloat"])
finding = report.findings["prompt-bloat"]
findings=["trim"])
finding = report.findings["trim"]
assert finding.enabled is False
assert "tokenjam[bloat]" in finding.hint

@@ -153,8 +153,8 @@ def test_scores_prompts_and_finds_bloat(db, monkeypatch):
since = datetime(2026, 5, 1, tzinfo=timezone.utc)
until = datetime(2026, 5, 30, tzinfo=timezone.utc)
report = build_report(db=db, config=config, since=since, until=until,
findings=["prompt-bloat"])
finding = report.findings["prompt-bloat"]
findings=["trim"])
finding = report.findings["trim"]
assert finding.enabled is True
assert finding.prompts_scored == 3
# Each scored prompt produces one BloatPrompt entry, up to the 10 cap.
@@ -173,8 +173,8 @@ def test_skips_short_prompts(db, monkeypatch):
since = datetime(2026, 5, 1, tzinfo=timezone.utc)
until = datetime(2026, 5, 30, tzinfo=timezone.utc)
report = build_report(db=db, config=config, since=since, until=until,
findings=["prompt-bloat"])
finding = report.findings["prompt-bloat"]
findings=["trim"])
finding = report.findings["trim"]
assert finding.prompts_scored == 0
assert finding.prompts_skipped == 5
assert finding.per_prompt == []