
Add per-task and per-model token breakdown to end-of-session summary #5

Open · samkeen wants to merge 2 commits into main from claude/add-per-model-breakdown-0RBxr

Conversation

samkeen (Contributor) commented May 3, 2026

Summary

This PR adds visibility into token spending by introducing a per-task and per-model breakdown in the end-of-session summary. The breakdown is computed by replaying events.jsonl rather than relying on in-memory state, ensuring it survives crashes and --resume cycles.

Key Changes

  • Instrumented all three model call sites to emit model_call events with a kind field:

    • _run_task() (worker): kind="worker"
    • _judge_task() (judge): kind="judge"
    • _self_improve() (self-improvement): kind="self_improve"
    • Each event now includes the model field for attribution across providers
  • Added a _token_breakdowns() function that replays events.jsonl to aggregate tokens (see the sketch after this list):

    • Returns a (by_model, by_task) tuple
    • by_model: maps model name to total tokens
    • by_task: maps task ID to a dict with total and per-kind breakdowns
    • Gracefully handles missing files, malformed JSON lines, and legacy events without kind/model fields (defaults to "unknown"/"worker")
    • Skips zero-token events to avoid polluting the breakdown
  • Enhanced the end-of-session summary to display the breakdowns (also illustrated in the sketch after this list):

    • Per-task section sorted by token usage (descending), showing total and split by kind
    • Per-model section sorted by token usage (descending)
    • Helps identify which tasks consumed the most budget and which models were used
  • Updated visualization (render.py) to display model call metadata:

    • Shows kind in badge labels (e.g., "judge (iter 3)" vs "iter 1")
    • Includes model name in the meta strip when available
    • Handles legacy events without kind/model gracefully
  • Documentation updates (USAGE.md, deep-dives.md, session.py):

    • Documented the new per-task and per-model breakdown feature
    • Clarified that all three sites now emit model_call events
    • Explained the replay-based approach and its benefits
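
For illustration, a minimal sketch of the replay-based aggregation and the sorted summary display, assuming each line of events.jsonl is a standalone JSON object with `event` and `task_id` fields; those names and the helper's exact signature are assumptions, while `model_call`, `kind`, `model`, and the `prompt_tokens + eval_tokens` sum come from the PR description. This is a sketch of the approach, not the PR's exact implementation:

```python
import json
from collections import defaultdict
from pathlib import Path


def _token_breakdowns(events_path: Path) -> tuple[dict, dict]:
    """Replay events.jsonl and aggregate token usage by model and by task."""
    by_model: dict[str, int] = defaultdict(int)
    by_task: dict[str, dict] = defaultdict(lambda: {"total": 0})

    if not events_path.exists():
        return {}, {}  # missing file: the summary simply shows no breakdown

    for line in events_path.read_text().splitlines():
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # tolerate malformed lines
        if event.get("event") != "model_call":
            continue

        tokens = event.get("prompt_tokens", 0) + event.get("eval_tokens", 0)
        if tokens == 0:
            continue  # skip zero-token events

        # Legacy events predate the kind/model fields; bucket them defensively.
        model = event.get("model") or "unknown"
        kind = event.get("kind") or "worker"
        task_id = event.get("task_id") or "unknown"

        by_model[model] += tokens
        entry = by_task[task_id]
        entry["total"] += tokens
        entry[kind] = entry.get(kind, 0) + tokens

    return dict(by_model), dict(by_task)


# End-of-session summary: both sections sorted by token usage, descending.
by_model, by_task = _token_breakdowns(Path("events.jsonl"))
for task_id, counts in sorted(by_task.items(), key=lambda kv: kv[1]["total"], reverse=True):
    per_kind = ", ".join(f"{k}={v}" for k, v in counts.items() if k != "total")
    print(f"{task_id}: {counts['total']} tokens ({per_kind})")
for model, total in sorted(by_model.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{model}: {total} tokens")
```

The replay runs only once at summary time, so the defensive per-line parsing costs little and keeps the breakdown usable even when a crash or --resume cycle has discarded in-memory counters.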

Implementation Details

  • The breakdown computation is defensive: it tolerates missing fields, malformed JSON, and zero-token events
  • Legacy events (pre-breakdown) are bucketed under "unknown" model and "worker" kind, maintaining backward compatibility
  • The iter field is only emitted for worker and judge calls (not self-improve), as self-improvement doesn't have a meaningful iteration number
  • Token counts are computed as prompt_tokens + eval_tokens, matching the existing add_tokens() pattern (see the call-site sketch below)
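
For concreteness, a hedged sketch of what each instrumented call site might log. The helper name, its signature, and the event/task_id field names are illustrative assumptions; only the model_call event, the kind and model fields, the iter-only-for-worker/judge behavior, and the prompt_tokens + eval_tokens convention are taken from the description above:

```python
import json
from pathlib import Path

EVENTS_PATH = Path("events.jsonl")  # assumed location of the event log


def log_model_call(task_id: str, kind: str, model: str,
                   prompt_tokens: int, eval_tokens: int,
                   iteration: int | None = None) -> None:
    """Append one model_call event per API call.

    kind is "worker", "judge", or "self_improve"; iteration is passed only
    for worker and judge calls, since self-improvement has no meaningful
    iteration number.
    """
    event = {
        "event": "model_call",
        "task_id": task_id,
        "kind": kind,
        "model": model,
        "prompt_tokens": prompt_tokens,
        "eval_tokens": eval_tokens,
    }
    if iteration is not None:
        event["iter"] = iteration
    with EVENTS_PATH.open("a") as fh:
        fh.write(json.dumps(event) + "\n")


# e.g. from the judge site: log_model_call("task-7", "judge", "claude-sonnet-4", 1200, 350, iteration=3)
```

Emitting the same event shape from all three sites is what lets a single replay of events.jsonl attribute spend by task, by kind, and by model without any extra bookkeeping.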

https://claude.ai/code/session_01DYjQrDCX3FaFBppR9NisKZ

claude added 2 commits May 3, 2026 15:06
Judge and self_improve sites now emit `model_call` events too — symmetric with
the worker site so a single events.jsonl replay aggregates spend by task and by
model. The end-of-session summary prints both, with per-task split by kind
(worker / judge / self_improve) when relevant. Closes the "no per-model
breakdown" gap called out in deep-dives.md §2.

https://claude.ai/code/session_01DYjQrDCX3FaFBppR9NisKZ
Adds a "Where the tokens went" subsection with a sample summary block, and
points the "Costs are real" caveat at it. Section 4 is where users learn what
they see when a run ends — the breakdown is the main user-facing artifact of
the change in the previous commit.

https://claude.ai/code/session_01DYjQrDCX3FaFBppR9NisKZ