
Add per-task and per-model token breakdown to end-of-session summary #5

Open · samkeen wants to merge 2 commits into main from claude/add-per-model-breakdown-0RBxr

Conversation

samkeen (Contributor) commented May 3, 2026

Summary

This PR adds visibility into token spending by introducing a per-task and per-model breakdown in the end-of-session summary. The breakdown is computed by replaying events.jsonl rather than relying on in-memory state, ensuring it survives crashes and --resume cycles.

Key Changes

  • Instrumented all three model call sites to emit model_call events with a kind field:

    • _run_task() (worker): kind="worker"
    • _judge_task() (judge): kind="judge"
    • _self_improve() (self-improvement): kind="self_improve"
    • Each event now includes the model field for attribution across providers
  • Added a _token_breakdowns() function that replays events.jsonl to aggregate tokens (see the sketch after this list):

    • Returns a (by_model, by_task) tuple
    • by_model: maps model name to total tokens
    • by_task: maps task ID to a dict with total and per-kind breakdowns
    • Gracefully handles missing files, malformed JSON lines, and legacy events without kind/model fields (defaults to "unknown"/"worker")
    • Skips zero-token events to avoid polluting the breakdown
  • Enhanced the end-of-session summary to display the breakdowns (also illustrated in the sketch after this list):

    • Per-task section sorted by token usage (descending), showing total and split by kind
    • Per-model section sorted by token usage (descending)
    • Helps identify which tasks consumed the most budget and which models were used
  • Updated visualization (render.py) to display model call metadata:

    • Shows kind in badge labels (e.g., "judge (iter 3)" vs "iter 1")
    • Includes model name in the meta strip when available
    • Handles legacy events without kind/model gracefully
  • Documentation updates (USAGE.md, deep-dives.md, session.py):

    • Documented the new per-task and per-model breakdown feature
    • Clarified that all three sites now emit model_call events
    • Explained the replay-based approach and its benefits
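
For illustration, a minimal sketch of the replay-based aggregation and the sorted summary display, assuming each line of events.jsonl is a standalone JSON object with `event` and `task_id` fields; those names and the helper's exact signature are assumptions, while `model_call`, `kind`, `model`, and the `prompt_tokens + eval_tokens` sum come from the PR description. This is a sketch of the approach, not the PR's exact implementation:

```python
import json
from collections import defaultdict
from pathlib import Path


def _token_breakdowns(events_path: Path) -> tuple[dict, dict]:
    """Replay events.jsonl and aggregate token usage by model and by task."""
    by_model: dict[str, int] = defaultdict(int)
    by_task: dict[str, dict] = defaultdict(lambda: {"total": 0})

    if not events_path.exists():
        return {}, {}  # missing file: the summary simply shows no breakdown

    for line in events_path.read_text().splitlines():
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # tolerate malformed lines
        if event.get("event") != "model_call":
            continue

        tokens = event.get("prompt_tokens", 0) + event.get("eval_tokens", 0)
        if tokens == 0:
            continue  # skip zero-token events

        # Legacy events predate the kind/model fields; bucket them defensively.
        model = event.get("model") or "unknown"
        kind = event.get("kind") or "worker"
        task_id = event.get("task_id") or "unknown"

        by_model[model] += tokens
        entry = by_task[task_id]
        entry["total"] += tokens
        entry[kind] = entry.get(kind, 0) + tokens

    return dict(by_model), dict(by_task)


# End-of-session summary: both sections sorted by token usage, descending.
by_model, by_task = _token_breakdowns(Path("events.jsonl"))
for task_id, counts in sorted(by_task.items(), key=lambda kv: kv[1]["total"], reverse=True):
    per_kind = ", ".join(f"{k}={v}" for k, v in counts.items() if k != "total")
    print(f"{task_id}: {counts['total']} tokens ({per_kind})")
for model, total in sorted(by_model.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{model}: {total} tokens")
```

The replay runs only once at summary time, so the defensive per-line parsing costs little and keeps the breakdown usable even when a crash or --resume cycle has discarded in-memory counters.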

Implementation Details

  • The breakdown computation is defensive: it tolerates missing fields, malformed JSON, and zero-token events
  • Legacy events (pre-breakdown) are bucketed under "unknown" model and "worker" kind, maintaining backward compatibility
  • The iter field is only emitted for worker and judge calls (not self-improve), as self-improvement doesn't have a meaningful iteration number
  • Token counts are computed as prompt_tokens + eval_tokens, matching the existing add_tokens() pattern (see the call-site sketch below)
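
For concreteness, a hedged sketch of what each instrumented call site might log. The helper name, its signature, and the event/task_id field names are illustrative assumptions; only the model_call event, the kind and model fields, the iter-only-for-worker/judge behavior, and the prompt_tokens + eval_tokens convention are taken from the description above:

```python
import json
from pathlib import Path

EVENTS_PATH = Path("events.jsonl")  # assumed location of the event log


def log_model_call(task_id: str, kind: str, model: str,
                   prompt_tokens: int, eval_tokens: int,
                   iteration: int | None = None) -> None:
    """Append one model_call event per API call.

    kind is "worker", "judge", or "self_improve"; iteration is passed only
    for worker and judge calls, since self-improvement has no meaningful
    iteration number.
    """
    event = {
        "event": "model_call",
        "task_id": task_id,
        "kind": kind,
        "model": model,
        "prompt_tokens": prompt_tokens,
        "eval_tokens": eval_tokens,
    }
    if iteration is not None:
        event["iter"] = iteration
    with EVENTS_PATH.open("a") as fh:
        fh.write(json.dumps(event) + "\n")


# e.g. from the judge site: log_model_call("task-7", "judge", "claude-sonnet-4", 1200, 350, iteration=3)
```

Emitting the same event shape from all three sites is what lets a single replay of events.jsonl attribute spend by task, by kind, and by model without any extra bookkeeping.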

https://claude.ai/code/session_01DYjQrDCX3FaFBppR9NisKZ

claude added 2 commits May 3, 2026 15:06
Judge and self_improve sites now emit `model_call` events too — symmetric with
the worker site so a single events.jsonl replay aggregates spend by task and by
model. The end-of-session summary prints both, with per-task split by kind
(worker / judge / self_improve) when relevant. Closes the "no per-model
breakdown" gap called out in deep-dives.md §2.

https://claude.ai/code/session_01DYjQrDCX3FaFBppR9NisKZ
Adds a "Where the tokens went" subsection with a sample summary block, and
points the "Costs are real" caveat at it. Section 4 is where users learn what
they see when a run ends — the breakdown is the main user-facing artifact of
the change in the previous commit.

https://claude.ai/code/session_01DYjQrDCX3FaFBppR9NisKZ