Skip to content

Add OpenAI Agents auto-instrumentation with real e2e coverage#1891

Open
Stephen Belanger (Qard) wants to merge 3 commits intomainfrom
openai-agents-auto-instrumentation
Open

Add OpenAI Agents auto-instrumentation with real e2e coverage#1891
Stephen Belanger (Qard) wants to merge 3 commits intomainfrom
openai-agents-auto-instrumentation

Conversation

@Qard
Copy link
Copy Markdown
Contributor

@Qard Stephen Belanger (Qard) commented Apr 22, 2026

Summary

  • Add auto-instrumentation support for @openai/agents-core tracing lifecycle methods and wire it into the default Braintrust hook/plugin stack
  • Map OpenAI Agents traces and spans into Braintrust spans, including agent, tool, handoff, guardrail, and model activity
  • Add a real e2e scenario that runs @openai/agents through --import braintrust/hook.mjs and verifies captured spans against the live API path
  • Fix ESM loader fallback behavior so transformed .mjs modules are still treated as ESM when Node omits module format metadata

Testing

  • Added focused unit coverage for the new OpenAI Agents instrumentation config and trace processor
  • Added an e2e test that exercises the real OpenAI Agents SDK with OPENAI_API_KEY and validates the emitted Braintrust spans
  • Verified formatting, TypeScript, and the targeted Vitest suites pass locally

Fixes #1888

Comment thread js/src/auto-instrumentations/configs/openai-agents.test.ts
Comment thread js/src/auto-instrumentations/configs/openai-agents.ts
@lforst
Copy link
Copy Markdown
Member

Also: Can we add the e2e sceanario to the ci summary?

@Qard
Copy link
Copy Markdown
Contributor Author

e2e failures look unrelated.

@Qard Stephen Belanger (Qard) force-pushed the openai-agents-auto-instrumentation branch from 550ffb0 to 5009964 Compare May 6, 2026 20:13
Stephen Belanger (Qard) added a commit that referenced this pull request May 6, 2026
## Summary

Adds an inbound provider HTTP capture/replay layer (cassettes) to the
e2e test suite so hermetic CI runs replay recorded traffic instead of
hitting live provider APIs. Built on **MSW** (already in the workspace
as a dev dep for `integrations/langchain-js` and `integrations/otel-js`)
plus ~600 LoC of cassette-format/matcher/recorder glue.

E2E scenarios previously hit live provider APIs on every CI run.
Flakiness sources: rate limits, transient 5xx, model-output drift
breaking exact-string snapshot fields. With this layer:
- Replay mode (CI default) hits no network. Subprocess MSW interceptor
returns canned responses from `__cassettes__/<variantKey>.json`.
- Recording is local-only via
`BRAINTRUST_E2E_CASSETTE_MODE=record-missing`. CI never records.
- The existing outbound `mock-braintrust-server.ts` and `__snapshots__/`
are untouched. Cassettes are the inbound mirror: provider→SDK, where
snapshots are SDK→Braintrust.

**Status:** ~506 tests passing in hermetic mode across 25 scenarios, 2
of 3 consecutive runs deterministic (the third had an unrelated
`turbopack-auto-instrumentation` Next.js compile timeout flake — not
introduced by this PR).

## Architecture

| Layer | Direction | Captured by | Stored in |
|---|---|---|---|
| Existing | SDK → Braintrust | `mock-braintrust-server.ts` (parent
process HTTP server) | `__snapshots__/*.json` |
| **New** | Provider → SDK | `cassette/preload.mjs` boots an MSW
`setupServer()` in subprocess | `__cassettes__/<variantKey>.json` |

- `e2e/helpers/cassette/preload.mjs` — loaded into each scenario
subprocess via `node --import=<preload>`. Boots MSW synchronously and
intercepts provider HTTP traffic.
- Per-key call counter for retries; reuses single entry on every match
if only one entry exists for a key.
- Recorder server runs in the parent vitest process; subprocess POSTs
captured wire to it; parent merges entries across variant runs and
writes once at scenario end.
- `cassetteTagsFor(import.meta.url, variantKey)` auto-tags scenarios
with `hermetic` based on cassette file presence — opt-in is by
committing the cassette.

## Cassette modes (`BRAINTRUST_E2E_CASSETTE_MODE`)

- `replay` (default in CI): match or throw `CassetteMissError`.
- `record`: overwrite cassette fresh.
- `record-missing`: match if possible, else live + record. Standard
re-record loop.
- `passthrough`: bypass cassettes entirely (local debugging).

## Recording safeguards

- **Skip-list filters transient/auth failures** so they don't poison the
cassette: 400/401/403/429/5xx are not persisted. Caught one real bug
here — google-genai cassettes had been recorded with a bad key,
capturing `400 API_KEY_INVALID` responses. The 400 case was added after.
- **Automatic retry-with-backoff for 429/5xx** during recording (2
attempts, 5s/10s, abort-aware to respect SDK timeouts).
- **Volatile-header scrub** strips `authorization`, `x-api-key`,
`api-key`, `x-goog-api-key`, `cohere-api-key`, cookies, request IDs,
rate-limit windows, `content-encoding`, etc. before persisting. (Caught
a near-miss on this PR — the initial commit leaked `x-api-key` for
Anthropic; volatile-header set has been broadened and a scrub run
removed leaked values.)
- **Headers forwarding fix** — `new Headers(request.headers)` silently
drops most headers when the source is an MSW-intercepted request
(Authorization included). The forwarder copies via `forEach` instead.
This one bug was responsible for the bulk of the recording failures
during initial migration (every Mistral 401, plenty of others).

## Scenarios with complete cassettes (hermetic green)

- `anthropic-instrumentation` (6 variants)
- `openai-instrumentation` (3 variants)
- `claude-agent-sdk-instrumentation`
- `openrouter-instrumentation` (2 variants)
- `ai-sdk-instrumentation` (4 variants)
- `ai-sdk-otel-export` (2 variants)
- `groq-instrumentation` (2 variants)
- `huggingface-instrumentation` (3 variants)
- `openrouter-agent-instrumentation`
- `wrap-langchain-js-traces`
- `cassette-replay` (meta-scenario validating record→replay loop
end-to-end)
- `cohere-instrumentation` v7-14-0 (1 of 5 variants — see below)

## Scenarios still missing cassettes (auto-skipped in hermetic mode)

These auto-skip cleanly because `cassetteTagsFor` only applies the
`hermetic` tag when the cassette file is present. CI does not fail on
them today; they need a follow-up record run with working credentials.

### `mistral-instrumentation` — needs re-record after rebase
- This branch was originally based on PR #1891 (OpenAI Agents
auto-instrumentation). After being rebased onto `main`, the existing
mistral cassettes no longer match: main extended the mistral scenario
with new thinking/reasoning model coverage (`NATIVE_REASONING_MODEL`,
`ADJUSTABLE_REASONING_MODEL`) that wasn't in the older scenario shape
the cassettes were recorded against.
- Re-recording also hit aggressive provider rate limits on the new
reasoning models — the existing cassette layer's retry-with-backoff
(5s/10s, 2 attempts) is not sufficient for that endpoint, so
re-recording stalled mid-suite. Solving this likely requires either a
longer per-call throttle (mirroring what `cohere-instrumentation`
already does — `COHERE_RECORD_THROTTLE_MS = 60_000`) or running mistral
variants serially with longer waits between calls.
- Cassettes for the older scenario shape were dropped to keep the suite
green; mistral now auto-skips in hermetic mode until the follow-up
re-record lands.
- **Action needed:** add a record-time throttle to mistral's
`scenario.impl.mjs` and run `BRAINTRUST_E2E_CASSETTE_MODE=record-missing
pnpm --filter=@braintrust/js-e2e-tests vitest run
scenarios/mistral-instrumentation`.

### `google-genai-instrumentation` — Gemini quota exhausted
- Quota on the recording key is fully consumed (`RESOURCE_EXHAUSTED
429`).
- Initial cassettes had been recorded back when the key was invalid and
captured `400 API_KEY_INVALID` responses; those were detected and
deleted in this PR. The skip-list now rejects 400 to prevent recurrence.
- **Action needed:** rotate to a working Gemini key (or wait for quota
reset) and run `BRAINTRUST_E2E_CASSETTE_MODE=record-missing pnpm
--filter=@braintrust/js-e2e-tests vitest run
scenarios/google-genai-instrumentation`.

### `cohere-instrumentation` — per-MONTH quota exhausted (4 of 5
variants)
- Cohere account is past the per-MONTH request limit on the chat models.
Quote from API: `"You are past the per-month request limit for this
model, please wait and try again later."` This is monthly, not daily —
recovers on the next billing cycle.
- The `v7-14-0` variant is fully recorded (chat + chat-stream + embed +
rerank) and replays green. The 4 remaining variants (`v7-20-0`,
`v7-21-0`, `v7` default, `v8`) auto-skip until re-recorded.
- The scenario impl now includes a 60s throttle between calls during
recording (`COHERE_RECORD_THROTTLE_MS`) to land each call in a fresh
budget window once quota is restored — but the throttle can't help with
monthly exhaustion.
- **Action needed:** wait for the next billing cycle (or upgrade Cohere
plan), then run the same record command for the cohere scenarios.

### `google-adk-instrumentation` — model-behavior drift, unrelated to
cassette layer
- ADK does not have any `__cassettes__/` files in this PR, so it
auto-skips in hermetic mode. There is pre-existing snapshot drift
unrelated to this PR which should be triaged independently.

## Risks / things to watch

- **Cassette files are large and committed.**
`claude-agent-sdk-instrumentation` and `ai-sdk-instrumentation` have the
largest cassettes (long transcripts). This is intentional — diff-ability
matters for review, and we want byte-identical replay.
- **Re-record workflow is documented** in `e2e/README.md` ("Cassettes"
section) and `.agents/skills/e2e-tests/SKILL.md`.
- **Day-one scope excluded** Deno scenarios, `nextjs-instrumentation`,
`turbopack-auto-instrumentation`, and OTEL-only scenarios. Those need
separate preload mechanisms; they continue running as before (or are
already hermetic-ish via different machinery).

## Test plan

- [x] `pnpm --filter=@braintrust/js-e2e-tests exec vitest run
--tags-filter=hermetic` — green (506 passed, 396 skipped, 0 failed)
- [x] Same command run consecutively for determinism — green twice in a
row (one intermediate run hit an unrelated turbopack flake)
- [x] `pnpm run formatting` clean
- [x] `pnpm run lint` — 0 errors
- [x] No leaked credentials in committed cassettes (`x-api-key`,
`api-key`, `x-goog-api-key`, `authorization`, etc. all stripped)
- [ ] CI hermetic suite green
- [ ] Re-record blocked scenarios (mistral, google-genai, remaining
cohere variants) once credentials/quota are available — follow-up PR

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Co-authored-by: Stephen Belanger <stephenbelanger@s-belanger.localdomain>
@Qard Stephen Belanger (Qard) force-pushed the openai-agents-auto-instrumentation branch from a9e89d1 to aa5533f Compare May 7, 2026 01:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add automatic instrumentation for the openai agents sdk

2 participants