copilot-cli provider: token_usage missing output tokens for target agent

## Problem

The `copilot-cli` provider reports incomplete `token_usage` for the target agent being evaluated. Output tokens are **hardcoded to 0**, so eval results show misleading token counts:

```yaml
eval_run:
  duration_ms: 24613
  token_usage:
    input: 698   # actually cumulative context window, not true input tokens
    output: 239  # this comes from grader only — target output is 0
```

The `token_usage.output` in `eval_run` is entirely from the LLM grader. The target agent's output tokens are missing.

## Root Cause

**File:** `packages/core/src/evaluation/providers/copilot-cli.ts:176-190`

The provider only reads the ACP `usage_update` session notification, which provides cumulative context window metrics (`{ size, used, cost? }`), not per-turn token breakdowns:

```typescript
if (sessionUpdate === 'usage_update') {
  if (tokenUsage) {
    tokenUsage = { input: update.used, output: tokenUsage.output };
  } else {
    tokenUsage = { input: update.used, output: 0 };  // ← output hardcoded to 0
  }
}
```

The ACP spec defines a **`PromptResponse`** type with granular per-turn token fields (`input_tokens`, `output_tokens`, `thought_tokens`, `cached_read_tokens`, `cached_write_tokens`), but the copilot-cli provider never reads it. The provider reads only the session-level `usage_update`, whose single `used` count represents total context window occupancy rather than tokens consumed or produced in a turn.
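To make the failure mode concrete, here is a minimal sketch that reduces the handler above to a pure function and replays two `usage_update` notifications through it. The `UsageUpdate` interface, the `applyUsageUpdate` helper, and the sample numbers are illustrative assumptions based only on the `{ size, used, cost? }` shape quoted in this issue; they are not the provider's or the SDK's actual types.

```typescript
// Hypothetical shape, inferred only from the fields quoted in this issue.
interface UsageUpdate {
  size: number;   // assumed: context window capacity
  used: number;   // cumulative context window occupancy
  cost?: unknown; // shape not described in this issue
}

interface TokenUsage {
  input: number;
  output: number;
}

// Mirrors the provider's current usage_update handling as a pure function.
function applyUsageUpdate(
  prev: TokenUsage | undefined,
  update: UsageUpdate,
): TokenUsage {
  // `used` overwrites `input`; `output` is only ever carried forward.
  return { input: update.used, output: prev?.output ?? 0 };
}

// Replay two notifications from a single (made-up) session.
const updates: UsageUpdate[] = [
  { size: 128000, used: 412 },
  { size: 128000, used: 698 },
];

let usage: TokenUsage | undefined;
for (const update of updates) {
  usage = applyUsageUpdate(usage, update);
}

console.log(usage);
```

However many notifications arrive, nothing in this path can raise `output` above 0, and `input` always equals the most recent `used` value.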

## Why other providers work

| Provider | Token source | Input | Output | Status |
|----------|-------------|-------|--------|--------|
| `copilot-cli` | ACP `usage_update.used` | Cumulative context (mislabeled as input) | Hardcoded 0 | **Broken** |
| `copilot-sdk` | SDK `assistant.usage` event | Correct | Correct | Working — but doesn't support long-running processes |
| `claude-cli` | `result` event `.usage` | Correct | Correct | Working |
| `ai-sdk` (OpenAI/Anthropic) | API response `.usage` | Correct | Correct | Working |

## Why copilot-sdk is not a workaround

The `copilot-sdk` provider properly extracts both `inputTokens` and `outputTokens` from `assistant.usage` events. However, **copilot-sdk does not support long-running agent processes** (e.g., multi-turn coding tasks that run for minutes), making it unsuitable as a drop-in replacement for eval workloads that need copilot-cli.

## Upstream issue

This is partially blocked because ACP's `usage_update` event does not expose per-turn token breakdowns. An upstream issue exists:

- [github/copilot-cli#1152 — More Verbose Token Information](https://github.com/github/copilot-cli/issues/1152)

The ACP spec's `PromptResponse` type _does_ include `input_tokens`/`output_tokens`, but it is unclear whether copilot-cli currently emits it, or whether `@agentclientprotocol/sdk` surfaces it to client code.

## Proposed fix

### Option A: Extract from `PromptResponse` (if available)

If the ACP SDK surfaces `PromptResponse` events with per-turn token breakdowns, handle them alongside `usage_update`:

```typescript
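// The discriminator below is a guess; the real name depends on how the SDK surfaces PromptResponse.
if (sessionUpdate === 'prompt_response' || event.type === 'PromptResponse') {
  // Fields are per-turn, so every counter is accumulated across turns.
  tokenUsage = {
    input: (tokenUsage?.input ?? 0) + (update.input_tokens ?? 0),
    output: (tokenUsage?.output ?? 0) + (update.output_tokens ?? 0),
    reasoning: (tokenUsage?.reasoning ?? 0) + (update.thought_tokens ?? 0),
    cached: (tokenUsage?.cached ?? 0) + (update.cached_read_tokens ?? 0),
  };
}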
```
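For reference, the same idea can be written as a standalone reducer that is testable without a live session. This is a sketch under assumptions: `PromptTurnUsage`, `addOptional`, and `accumulateTurn` are hypothetical names, and the payload fields are taken from this issue's description rather than from a verified SDK type.

```typescript
// Hypothetical per-turn payload; field names come from this issue only.
interface PromptTurnUsage {
  input_tokens?: number;
  output_tokens?: number;
  thought_tokens?: number;
  cached_read_tokens?: number;
}

interface TokenUsage {
  input: number;
  output: number;
  reasoning?: number;
  cached?: number;
}

// Adds an optional per-turn count to an optional running total.
// Returns undefined when neither side has reported the field.
function addOptional(total?: number, turn?: number): number | undefined {
  if (total === undefined && turn === undefined) return undefined;
  return (total ?? 0) + (turn ?? 0);
}

// Folds one turn's counts into the running session total.
function accumulateTurn(
  prev: TokenUsage | undefined,
  turn: PromptTurnUsage,
): TokenUsage {
  return {
    input: (prev?.input ?? 0) + (turn.input_tokens ?? 0),
    output: (prev?.output ?? 0) + (turn.output_tokens ?? 0),
    reasoning: addOptional(prev?.reasoning, turn.thought_tokens),
    cached: addOptional(prev?.cached, turn.cached_read_tokens),
  };
}

// Two made-up turns; the second reports no thought tokens.
const turns: PromptTurnUsage[] = [
  { input_tokens: 500, output_tokens: 120, thought_tokens: 40 },
  { input_tokens: 650, output_tokens: 90 },
];

const sessionTotal = turns.reduce<TokenUsage | undefined>(accumulateTurn, undefined);
console.log(sessionTotal);
```

Keeping the optional counters `undefined` until some turn reports them lets downstream consumers tell "not reported" apart from "reported as zero". If this handler runs alongside the existing `usage_update` handler, the two would also need to agree on which one owns `input`.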

### Option B: Estimate from message content (workaround)

If `PromptResponse` is not available from copilot-cli, estimate input/output split from observable data:

1. Track total characters sent to the agent (prompt + tool results) as a proxy for input
2. Track total characters in `agent_message_chunk` events as a proxy for output
3. Split the `usage_update.used` total pro rata by the input/output character ratio

```typescript
// Rough estimation: split usage_update.used proportionally by character counts
const totalChars = inputChars + outputChars;
if (totalChars > 0) {
  const input = Math.round(update.used * (inputChars / totalChars));
  tokenUsage = {
    input,
    output: update.used - input, // remainder, so input + output === update.used
  };
}
```

This is an approximation (the character-to-token ratio varies with content), but it is better than reporting `output: 0`.
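The three tracking steps could be packaged roughly as follows. This is a sketch, not the provider's implementation: the `UsageEstimator` class, its method names, and the `estimated` flag are hypothetical. The flag is there because estimated values are only acceptable if they are marked as such.

```typescript
interface TokenUsage {
  input: number;
  output: number;
  estimated?: boolean; // hypothetical marker for approximated values
}

// Hypothetical helper that tracks character counts and splits a
// context-window total between input and output.
class UsageEstimator {
  private inputChars = 0;
  private outputChars = 0;

  // Call for every prompt and tool result sent to the agent.
  recordInput(text: string): void {
    this.inputChars += text.length;
  }

  // Call for every agent_message_chunk received from the agent.
  recordOutput(text: string): void {
    this.outputChars += text.length;
  }

  // Splits `used` by the observed character ratio.
  // Returns undefined when nothing has been observed yet.
  estimate(used: number): TokenUsage | undefined {
    const totalChars = this.inputChars + this.outputChars;
    if (totalChars === 0) return undefined;
    const input = Math.round(used * (this.inputChars / totalChars));
    // Output takes the remainder so the two parts always sum to `used`.
    return { input, output: used - input, estimated: true };
  }
}

// Example with made-up sizes: 300 input characters, 100 output characters.
const estimator = new UsageEstimator();
estimator.recordInput("a".repeat(300));
estimator.recordOutput("b".repeat(100));

const split = estimator.estimate(800);
console.log(split);
```

Note that `used` may also include context the client never sees (for example, a system prompt), so the estimated input can overstate what the client actually sent.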

### Option C: Wait for upstream

If copilot-cli#1152 lands with proper per-turn token reporting, update the provider to consume the new fields.

## Acceptance signals

- [ ] `copilot-cli` provider reports non-zero `output` tokens
- [ ] `token_usage.input` reflects actual input tokens (not cumulative context window)
- [ ] Eval results for copilot-cli targets show meaningful token breakdowns comparable to other providers

## Non-goals

- Changing the copilot-sdk provider
- Modifying the ACP protocol itself
- Achieving exact parity with API-based providers (estimation is acceptable if flagged)

## Related

- Upstream: [github/copilot-cli#1152](https://github.com/github/copilot-cli/issues/1152)
- ACP spec session usage: https://agentclientprotocol.com/rfds/session-usage.md
