copilot-cli provider: token_usage missing output tokens for target agent #683

@christso

Problem

The copilot-cli provider reports incomplete token_usage for the target agent being evaluated. Output tokens are hardcoded to 0, so eval results show misleading token counts:

eval_run:
  duration_ms: 24613
  token_usage:
    input: 698   # actually cumulative context window, not true input tokens
    output: 239  # this comes from grader only — target output is 0

The token_usage.output in eval_run is entirely from the LLM grader. The target agent's output tokens are missing.

Root Cause

File: packages/core/src/evaluation/providers/copilot-cli.ts:176-190

The provider only reads the ACP usage_update session notification, which provides cumulative context window metrics ({ size, used, cost? }), not per-turn token breakdowns:

if (sessionUpdate === 'usage_update') {
  if (tokenUsage) {
    tokenUsage = { input: update.used, output: tokenUsage.output };
  } else {
    tokenUsage = { input: update.used, output: 0 };  // ← output hardcoded to 0
  }
}

The ACP spec defines a PromptResponse type with granular per-turn token fields (input_tokens, output_tokens, thought_tokens, cached_read_tokens, cached_write_tokens), but the copilot-cli provider never reads it. It reads only the session-level usage_update, whose single used count represents total context window occupancy.
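To make the failure mode concrete, here is a minimal standalone sketch of what the handler above does over a session. The `UsageUpdate` and `TokenUsage` shapes, the `applyUsageUpdate` name, and the sample numbers are assumptions for illustration, not the provider's actual types:

```typescript
// Assumed shape of the usage_update payload, based on { size, used, cost? } above.
interface UsageUpdate {
  size: number;   // context window capacity
  used: number;   // cumulative context window occupancy
  cost?: unknown; // type not specified in this issue
}

interface TokenUsage {
  input: number;
  output: number;
}

// Mirrors the current handler: each notification overwrites `input` with the
// cumulative `used` value, and `output` is never set to anything but 0.
function applyUsageUpdate(prev: TokenUsage | undefined, update: UsageUpdate): TokenUsage {
  return prev
    ? { input: update.used, output: prev.output }
    : { input: update.used, output: 0 };
}

// Three hypothetical notifications from a single session.
const updates: UsageUpdate[] = [
  { size: 128000, used: 310 },
  { size: 128000, used: 542 },
  { size: 128000, used: 698 },
];

const usage = updates.reduce<TokenUsage | undefined>(applyUsageUpdate, undefined);
console.log(usage); // { input: 698, output: 0 }
```

No matter how many notifications arrive, `output` stays at 0 and `input` ends up as the last `used` value.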

Why other providers work

| Provider | Token source | Input | Output | Status |
| copilot-cli | ACP usage_update.used | Cumulative context (mislabeled as input) | Hardcoded 0 | Broken |
| copilot-sdk | SDK assistant.usage event | Correct | Correct | Working, but doesn't support long-running processes |
| claude-cli | result event .usage | Correct | Correct | Working |
| ai-sdk (OpenAI/Anthropic) | API response .usage | Correct | Correct | Working |

Why copilot-sdk is not a workaround

The copilot-sdk provider properly extracts both inputTokens and outputTokens from assistant.usage events. However, copilot-sdk does not support long-running agent processes (e.g., multi-turn coding tasks that run for minutes), making it unsuitable as a drop-in replacement for eval workloads that need copilot-cli.

Upstream issue

This is partially blocked by the ACP protocol's usage_update event not exposing per-turn token breakdowns. An upstream issue exists:

The ACP spec's PromptResponse type does include input_tokens/output_tokens, but it's unclear whether copilot-cli currently emits it or whether @agentclientprotocol/sdk surfaces it to client code.

Proposed fix

Option A: Extract from PromptResponse (if available)

If the ACP SDK surfaces PromptResponse events with per-turn token breakdowns, handle them alongside usage_update:

if (sessionUpdate === 'prompt_response' || event.type === 'PromptResponse') {
  tokenUsage = {
    input: (tokenUsage?.input ?? 0) + (update.input_tokens ?? 0),
    output: (tokenUsage?.output ?? 0) + (update.output_tokens ?? 0),
    reasoning: update.thought_tokens,
    cached: update.cached_read_tokens,
  };
}
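A slightly fuller sketch of the same idea, written as a pure function so it can be unit-tested without a live session. The field names come from the PromptResponse fields listed above; the `PromptTurnUsage` type, `accumulateTurn`, and `addOptional` are hypothetical names, and whether copilot-cli delivers these fields at all is still unconfirmed. Unlike the snippet above, this version sums reasoning and cached counts across turns instead of keeping only the last turn's values:

```typescript
// Assumed per-turn payload; every field is optional because availability is unknown.
interface PromptTurnUsage {
  input_tokens?: number;
  output_tokens?: number;
  thought_tokens?: number;
  cached_read_tokens?: number;
}

interface TokenUsage {
  input: number;
  output: number;
  reasoning?: number;
  cached?: number;
}

// Adds two optional counts; stays undefined if neither side reported a value.
function addOptional(a: number | undefined, b: number | undefined): number | undefined {
  if (a === undefined && b === undefined) return undefined;
  return (a ?? 0) + (b ?? 0);
}

// Folds one turn's usage into the running total.
function accumulateTurn(prev: TokenUsage | undefined, turn: PromptTurnUsage): TokenUsage {
  return {
    input: (prev?.input ?? 0) + (turn.input_tokens ?? 0),
    output: (prev?.output ?? 0) + (turn.output_tokens ?? 0),
    reasoning: addOptional(prev?.reasoning, turn.thought_tokens),
    cached: addOptional(prev?.cached, turn.cached_read_tokens),
  };
}

// Two hypothetical turns.
const turns: PromptTurnUsage[] = [
  { input_tokens: 120, output_tokens: 40, thought_tokens: 10 },
  { input_tokens: 200, output_tokens: 75 },
];

const total = turns.reduce<TokenUsage | undefined>(accumulateTurn, undefined);
console.log(total);
```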

Option B: Estimate from message content (workaround)

If PromptResponse is not available from copilot-cli, estimate input/output split from observable data:

  1. Track total characters sent to the agent (prompt + tool results) as proxy for input
  2. Track total characters in agent_message_chunk events as proxy for output
  3. Use the usage_update.used total and prorate it based on the input/output character ratio

// Rough estimation: split usage_update.used proportionally
const totalChars = inputChars + outputChars;
if (totalChars > 0) {
  tokenUsage = {
    input: Math.round(update.used * (inputChars / totalChars)),
    output: Math.round(update.used * (outputChars / totalChars)),
  };
}

This is an approximation (char-to-token ratio varies), but better than output: 0.
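For reference, the three steps could be wrapped in a small helper along these lines. This is only a sketch: `CharRatioEstimator`, its method names, and the `estimated` flag are hypothetical, and `String.length` counts UTF-16 code units, which is itself a rough proxy. The output share is computed as the remainder so the two parts always add up to `used`, which rounding both sides independently does not guarantee:

```typescript
interface EstimatedTokenUsage {
  input: number;
  output: number;
  estimated: true; // marks the split as an approximation, per the non-goals
}

// Hypothetical helper: tracks character counts on both sides of the
// conversation and splits a cumulative `used` figure proportionally.
class CharRatioEstimator {
  private inputChars = 0;
  private outputChars = 0;

  // Call for prompts and tool results sent to the agent.
  recordInput(text: string): void {
    this.inputChars += text.length;
  }

  // Call for each agent_message_chunk received.
  recordOutput(text: string): void {
    this.outputChars += text.length;
  }

  // Returns undefined when nothing has been observed yet.
  estimate(used: number): EstimatedTokenUsage | undefined {
    const totalChars = this.inputChars + this.outputChars;
    if (totalChars === 0) return undefined;
    const input = Math.round(used * (this.inputChars / totalChars));
    // Remainder goes to output so input + output === used.
    return { input, output: used - input, estimated: true };
  }
}

// Hypothetical session: 200 characters sent, 50 characters received.
const estimator = new CharRatioEstimator();
estimator.recordInput("a".repeat(200));
estimator.recordOutput("b".repeat(50));

const estimate = estimator.estimate(698);
console.log(estimate);
```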

Option C: Wait for upstream

If copilot-cli#1152 lands with proper per-turn token reporting, update the provider to consume the new fields.

Acceptance signals

  • copilot-cli provider reports non-zero output tokens
  • token_usage.input reflects actual input tokens (not cumulative context window)
  • Eval results for copilot-cli targets show meaningful token breakdowns comparable to other providers

Non-goals

  • Changing the copilot-sdk provider
  • Modifying the ACP protocol itself
  • Achieving exact parity with API-based providers (estimation is acceptable if flagged)
