Description
Problem
The copilot-cli provider reports incomplete token_usage for the target agent being evaluated. Output tokens are hardcoded to 0, so eval results show misleading token counts:

    eval_run:
      duration_ms: 24613
      token_usage:
        input: 698 # actually cumulative context window, not true input tokens
        output: 239 # this comes from grader only; target output is 0

The token_usage.output in eval_run is entirely from the LLM grader. The target agent's output tokens are missing.
Root Cause
File: packages/core/src/evaluation/providers/copilot-cli.ts:176-190
The provider only reads the ACP usage_update session notification, which provides cumulative context window metrics ({ size, used, cost? }), not per-turn token breakdowns:

    if (sessionUpdate === 'usage_update') {
      if (tokenUsage) {
        tokenUsage = { input: update.used, output: tokenUsage.output };
      } else {
        tokenUsage = { input: update.used, output: 0 }; // ← output hardcoded to 0
      }
    }

The ACP spec defines a PromptResponse type with granular per-turn token fields (input_tokens, output_tokens, thought_tokens, cached_read_tokens, cached_write_tokens), but the copilot-cli provider never reads it. It only reads the session-level usage_update, which has a single used count representing total context window occupancy.
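To make the difference concrete, here is a minimal sketch of the two payload shapes and what each one lets the provider compute. The field names come from the description above; the type names, the `fromUsageUpdate` and `addTurn` helpers, and the sample numbers are hypothetical and not taken from the provider or the ACP SDK.

```typescript
// Session-level notification payload, per the description above.
interface UsageUpdate {
  size: number;
  used: number; // cumulative context window occupancy
  cost?: unknown; // shape not specified in this issue
}

// Per-turn fields listed for PromptResponse; all optional in this sketch.
interface TurnUsage {
  input_tokens?: number;
  output_tokens?: number;
  thought_tokens?: number;
  cached_read_tokens?: number;
  cached_write_tokens?: number;
}

// Hypothetical internal shape; the provider's real type may differ.
interface TokenUsage {
  input: number;
  output: number;
}

// Mirrors the current behavior: `used` is stored as input, output never changes.
function fromUsageUpdate(update: UsageUpdate, previous?: TokenUsage): TokenUsage {
  return { input: update.used, output: previous?.output ?? 0 };
}

// What per-turn data would allow: a running total of both sides.
function addTurn(total: TokenUsage, turn: TurnUsage): TokenUsage {
  return {
    input: total.input + (turn.input_tokens ?? 0),
    output: total.output + (turn.output_tokens ?? 0),
  };
}

// Illustrative values only.
const current = fromUsageUpdate({ size: 128000, used: 698 });
const turns: TurnUsage[] = [
  { input_tokens: 120, output_tokens: 40 },
  { input_tokens: 200, output_tokens: 75 },
];
const perTurn = turns.reduce<TokenUsage>(addTurn, { input: 0, output: 0 });

console.log(current, perTurn);
```

With only `usage_update` there is no field that could populate `output`, so the first path can never report a non-zero value for the target agent.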
Why other providers work
| Provider | Token source | Input | Output | Status |
|---|---|---|---|---|
| copilot-cli | ACP usage_update.used | Cumulative context (mislabeled as input) | Hardcoded 0 | Broken |
| copilot-sdk | SDK assistant.usage event | Correct | Correct | Working, but doesn't support long-running processes |
| claude-cli | result event .usage | Correct | Correct | Working |
| ai-sdk (OpenAI/Anthropic) | API response .usage | Correct | Correct | Working |
Why copilot-sdk is not a workaround
The copilot-sdk provider properly extracts both inputTokens and outputTokens from assistant.usage events. However, copilot-sdk does not support long-running agent processes (e.g., multi-turn coding tasks that run for minutes), making it unsuitable as a drop-in replacement for eval workloads that need copilot-cli.
Upstream issue
This is partially blocked by the ACP protocol's usage_update event not exposing per-turn token breakdowns. An upstream issue exists: github/copilot-cli#1152.
The ACP spec's PromptResponse type does include input_tokens/output_tokens, but it's unclear whether copilot-cli currently emits this event or if the @agentclientprotocol/sdk surfaces it to client code.
Proposed fix
Option A: Extract from PromptResponse (if available)
If the ACP SDK surfaces PromptResponse events with per-turn token breakdowns, handle them alongside usage_update:

    if (sessionUpdate === 'prompt_response' || event.type === 'PromptResponse') {
      tokenUsage = {
        input: (tokenUsage?.input ?? 0) + (update.input_tokens ?? 0),
        output: (tokenUsage?.output ?? 0) + (update.output_tokens ?? 0),
        reasoning: update.thought_tokens,
        cached: update.cached_read_tokens,
      };
    }

Option B: Estimate from message content (workaround)
If PromptResponse is not available from copilot-cli, estimate input/output split from observable data:
- Track total characters sent to the agent (prompt + tool results) as a proxy for input
- Track total characters in agent_message_chunk events as a proxy for output
- Take the usage_update.used total and prorate it based on the input/output character ratio

The proportional split would look roughly like this:

    // Rough estimation: split usage_update.used proportionally
    const totalChars = inputChars + outputChars;
    if (totalChars > 0) {
      tokenUsage = {
        input: Math.round(update.used * (inputChars / totalChars)),
        output: Math.round(update.used * (outputChars / totalChars)),
      };
    }

This is an approximation (char-to-token ratio varies), but better than output: 0.
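The same split can be written as a pure function that is easy to unit test. This is a sketch under stated assumptions: `estimateTokenSplit`, the `EstimatedTokenUsage` type, and its `estimated` flag are hypothetical names, and the flag is one possible way to mark the result as an estimate, as the non-goals allow.

```typescript
// Hypothetical result shape; the provider's real token usage type may differ.
interface EstimatedTokenUsage {
  input: number;
  output: number;
  estimated: true; // marks the split as an approximation, not reported data
}

// Splits a cumulative `used` count by the observed character ratio.
// Returns undefined when there is no character data to base the ratio on.
function estimateTokenSplit(
  used: number,
  inputChars: number,
  outputChars: number,
): EstimatedTokenUsage | undefined {
  const totalChars = inputChars + outputChars;
  if (totalChars <= 0 || used < 0) return undefined;
  const input = Math.round(used * (inputChars / totalChars));
  // Derive output by subtraction so input + output always equals `used`.
  return { input, output: used - input, estimated: true };
}

// Illustrative values: 698 tokens in use, 3000 chars sent, 1000 chars received.
const split = estimateTokenSplit(698, 3000, 1000);
const empty = estimateTokenSplit(698, 0, 0);

console.log(split, empty);
```

Rounding both sides independently can leave the two parts off by one from `used`; computing output by subtraction avoids that drift.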
Option C: Wait for upstream
If copilot-cli#1152 lands with proper per-turn token reporting, update the provider to consume the new fields.
Acceptance signals
- copilot-cli provider reports non-zero output tokens
- token_usage.input reflects actual input tokens (not cumulative context window)
- Eval results for copilot-cli targets show meaningful token breakdowns comparable to other providers
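The first signal could be covered by a small check in a provider test. This is only a sketch: `checkTokenUsage` and the `TokenUsage` shape are hypothetical, and the check cannot tell whether input is a true per-turn count or a cumulative one, so the second signal still needs a comparison against a known prompt.

```typescript
// Hypothetical shape of the usage object a provider returns.
interface TokenUsage {
  input: number;
  output: number;
}

// Returns a list of problems; an empty list means the basic signals hold.
function checkTokenUsage(usage: TokenUsage | undefined): string[] {
  if (!usage) return ["token_usage is missing"];
  const problems: string[] = [];
  if (!(usage.input > 0)) problems.push("input tokens are zero");
  if (!(usage.output > 0)) problems.push("output tokens are zero");
  return problems;
}

// The broken case from this issue: output hardcoded to 0.
const broken = checkTokenUsage({ input: 698, output: 0 });
// Illustrative healthy case.
const healthy = checkTokenUsage({ input: 320, output: 115 });

console.log(broken, healthy);
```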
Non-goals
- Changing the copilot-sdk provider
- Modifying the ACP protocol itself
- Achieving exact parity with API-based providers (estimation is acceptable if flagged)
Related
- Upstream: github/copilot-cli#1152
- ACP spec session usage: https://agentclientprotocol.com/rfds/session-usage.md