Feat/braintrust tracing #1333
    )
    return result

harness.run = traced
Context leak when harness run raises unexpectedly
Low Severity
If the original run() raises after setup_state completes (e.g., asyncio.CancelledError during concurrent rollouts via asyncio.gather), _ctx.pop(id(result)) is never reached because result is never assigned. The _ctx entry added by _patch_setup leaks indefinitely. In long-running processes with occasional cancellations, stale entries accumulate in the module-level _ctx dict, and if Python reuses an id() value, a new rollout could briefly inherit a stale _Turns object before _patch_setup overwrites it.
Reviewed by Cursor Bugbot for commit 312af93.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
There are 2 total unresolved issues (including 1 from previous review).
Reviewed by Cursor Bugbot for commit a3ff5c0.
_patch_run(env.harness, bt)
_patch_setup(env.harness, bt)
_patch_submit(env.harness.runtime, bt)
_patch_tool(env.harness.runtime, bt)
Runtime recreation silently loses tracing patches
Medium Severity
instrument() patches submit_model_request and call_tool on a specific Runtime instance via env.harness.runtime. However, Harness replaces self.runtime with a brand-new Runtime object whenever add_metric, add_reward, add_toolset, add_stop, add_setup, add_update, or add_cleanup is called (each triggers self.runtime = self.resolve_runtime()). If any of these are called after instrumentation, the runtime patches are silently lost and model request / tool call tracing stops working.
Reviewed by Cursor Bugbot for commit a3ff5c0.
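One way to keep the runtime patches alive across recreation is to wrap `resolve_runtime` as well, so every newly built runtime is instrumented before it is assigned. This is a minimal sketch under stated assumptions, not the PR's implementation: `FakeRuntime`, `FakeHarness`, `patch_tool`, the `_traced` marker, and the `calls` list are hypothetical, and only `call_tool` is patched for brevity.

```python
class FakeRuntime:
    """Stand-in for the real Runtime."""

    def call_tool(self, name):
        return f"ran {name}"


class FakeHarness:
    """Stand-in that mimics the runtime replacement described above."""

    def __init__(self):
        self.runtime = self.resolve_runtime()

    def resolve_runtime(self):
        return FakeRuntime()

    def add_toolset(self, toolset):
        # Mirrors the reported behavior: a brand-new runtime replaces the old one.
        self.runtime = self.resolve_runtime()


calls = []  # records traced tool names, in place of real spans


def patch_tool(runtime):
    if getattr(runtime, "_traced", False):
        return  # already instrumented; avoid double wrapping
    original = runtime.call_tool

    def traced(name):
        calls.append(name)
        return original(name)

    runtime.call_tool = traced
    runtime._traced = True


def instrument(harness):
    patch_tool(harness.runtime)
    original_resolve = harness.resolve_runtime

    def resolve_and_patch(*args, **kwargs):
        runtime = original_resolve(*args, **kwargs)
        patch_tool(runtime)  # re-apply tracing to every new runtime
        return runtime

    harness.resolve_runtime = resolve_and_patch


if __name__ == "__main__":
    harness = FakeHarness()
    instrument(harness)
    harness.add_toolset(None)  # replaces harness.runtime
    harness.runtime.call_tool("search")
    print(calls)
```

An alternative with the same effect would be to patch the methods on the runtime class rather than on an instance; the wrapper approach is shown only because it keeps the change scoped to the instrumented harness.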


Description
Type of Change
Testing
uv run pytest locally.
Checklist
Additional Notes
Note
Medium Risk
Patches core Harness/Runtime methods and auto-enables via env var, which could subtly affect rollout execution order, error propagation, or span lifecycle if assumptions about state/trajectory differ across environments.
Overview
Adds a new Braintrust tracing integration (verifiers.integrations.braintrust) that instruments v1 rollouts by patching harness.run, harness.setup_state, runtime.submit_model_request, and runtime.call_tool to emit a rollout span with per-turn model_request / tool_call:<name> child spans, token metrics, timings, and rollout scores/metadata.
Enables automatic tracing in verifiers/v1/env.py when VF_BRAINTRUST_PROJECT is set, and documents setup/manual usage in integrations/README.md.
Introduces comprehensive tests with a mocked braintrust module to validate span hierarchy, logged scores/metadata, timing/token metrics, grouping via traced_group, and error/edge-case span cleanup behavior.
Reviewed by Cursor Bugbot for commit a3ff5c0. Bugbot is set up for automated code reviews on this repo.
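The env-var gating mentioned in the overview can be pictured roughly as follows. Only the variable name VF_BRAINTRUST_PROJECT comes from the description; `maybe_instrument`, its parameters, and the `instrument(env, project)` call shape are hypothetical and may not match the real code in verifiers/v1/env.py.

```python
import os


def maybe_instrument(env, instrument, environ=os.environ):
    """Call instrument(env, project) only when VF_BRAINTRUST_PROJECT is set.

    Returns True if tracing was enabled, False otherwise.
    The instrument callable and its signature are assumptions for illustration.
    """
    project = environ.get("VF_BRAINTRUST_PROJECT")
    if not project:
        return False  # unset or empty: leave the environment untouched
    instrument(env, project)
    return True
```

Passing `environ` explicitly keeps the gate easy to test without mutating the real process environment.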