# Add per-task and per-model token breakdown to end-of-session summary #5
Judge and self_improve sites now emit `model_call` events too — symmetric with the worker site so a single events.jsonl replay aggregates spend by task and by model. The end-of-session summary prints both, with per-task split by kind (worker / judge / self_improve) when relevant. Closes the "no per-model breakdown" gap called out in deep-dives.md §2. https://claude.ai/code/session_01DYjQrDCX3FaFBppR9NisKZ
Adds a "Where the tokens went" subsection with a sample summary block, and points the "Costs are real" caveat at it. Section 4 is where users learn what they see when a run ends — the breakdown is the main user-facing artifact of the change in the previous commit. https://claude.ai/code/session_01DYjQrDCX3FaFBppR9NisKZ
## Summary
This PR adds visibility into token spending by introducing a per-task and per-model breakdown in the end-of-session summary. The breakdown is computed by replaying `events.jsonl` rather than relying on in-memory state, ensuring it survives crashes and `--resume` cycles.

## Key Changes
- Instrumented all three model call sites to emit `model_call` events with a `kind` field (see the emission sketch after this list):
  - `_run_task()` (worker): `kind="worker"`
  - `_judge_task()` (judge): `kind="judge"`
  - `_self_improve()` (self-improvement): `kind="self_improve"`
  - Each event also carries a `model` field for attribution across providers
- Added a `_token_breakdowns()` function that replays `events.jsonl` to aggregate tokens (see the replay sketch after this list):
  - Returns a `(by_model, by_task)` tuple
  - `by_model`: maps model name to total tokens
  - `by_task`: maps task ID to a dict with `total` and per-kind breakdowns
  - Handles events missing `kind`/`model` fields (defaults to `"unknown"`/`"worker"`)
- Enhanced the end-of-session summary to display both breakdowns
- Updated the visualization (`render.py`) to display model call metadata:
  - Includes the `kind` in badge labels (e.g., "judge (iter 3)" vs. "iter 1")
  - Handles missing `kind`/`model` gracefully
- Documentation updates (USAGE.md, deep-dives.md, session.py):
  - Document the new `model_call` events
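For concreteness, here is a minimal sketch of what emitting such an event could look like. The `model_call` event name, the `kind`/`model`/`iter`/`prompt_tokens`/`eval_tokens` fields, and the `events.jsonl` file come from this PR's description; the helper name `emit_model_call`, the `task`/`ts` fields, and the file location are assumptions for illustration.

```python
import json
import time
from pathlib import Path

EVENTS_PATH = Path("events.jsonl")  # file name from the PR; its location is an assumption

def emit_model_call(task_id: str, kind: str, model: str,
                    prompt_tokens: int, eval_tokens: int,
                    iter_num: int | None = None) -> None:
    """Append one model_call event to the log (hypothetical helper)."""
    event = {
        "event": "model_call",
        "task": task_id,          # task attribution (field name assumed)
        "kind": kind,             # "worker" | "judge" | "self_improve"
        "model": model,           # model name, for attribution across providers
        "prompt_tokens": prompt_tokens,
        "eval_tokens": eval_tokens,
        "ts": time.time(),        # timestamp (assumed field)
    }
    # Per the PR, `iter` is only emitted for worker and judge calls;
    # self-improvement has no meaningful iteration number.
    if iter_num is not None:
        event["iter"] = iter_num
    with EVENTS_PATH.open("a", encoding="utf-8") as f:
        f.write(json.dumps(event) + "\n")

# e.g. at the judge call site, after the model responds:
# emit_model_call("task-2", "judge", "model-a", 1200, 150, iter_num=3)
```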
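And a sketch of the replay-side aggregation, under the same assumed schema. It mirrors what the PR describes for `_token_breakdowns()` (replay `events.jsonl`, sum `prompt_tokens + eval_tokens`, default missing `kind`/`model`); the exact signature, event layout, and which default maps to which field are assumptions.

```python
import json
from collections import defaultdict
from pathlib import Path

def token_breakdowns(events_path: Path = Path("events.jsonl")) -> tuple[dict, dict]:
    """Replay events.jsonl and return (by_model, by_task) token aggregates."""
    by_model: defaultdict[str, int] = defaultdict(int)
    by_task: dict[str, dict] = {}
    with events_path.open(encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            ev = json.loads(line)
            if ev.get("event") != "model_call":
                continue
            # Total per call is prompt_tokens + eval_tokens, per the PR.
            total = ev.get("prompt_tokens", 0) + ev.get("eval_tokens", 0)
            # The PR lists "unknown"/"worker" defaults for missing kind/model;
            # which default maps to which field is an assumption here.
            kind = ev.get("kind", "worker")
            model = ev.get("model", "unknown")
            task = ev.get("task", "unknown")
            by_model[model] += total
            entry = by_task.setdefault(task, {"total": 0})
            entry["total"] += total
            entry[kind] = entry.get(kind, 0) + total  # per-kind breakdown
    return dict(by_model), by_task

# by_model, by_task = token_breakdowns()
```

Computing this by replay rather than from in-memory counters is what lets the breakdown survive crashes and `--resume` cycles, as the Summary notes.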
## Implementation Details

- The `iter` field is only emitted for worker and judge calls (not self-improve), since self-improvement has no meaningful iteration number
- Per-call token totals are computed as `prompt_tokens + eval_tokens`, matching the existing `add_tokens()` pattern

https://claude.ai/code/session_01DYjQrDCX3FaFBppR9NisKZ