Feat/braintrust tracing #1333

Open
jessicaifeiyali wants to merge 4 commits into main from feat/braintrust-tracing
@jessicaifeiyali jessicaifeiyali commented May 11, 2026

Description

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update
  • Test improvement

Testing

  • All existing tests pass when running uv run pytest locally.
  • New tests have been added to cover the changes

Checklist

  • My code follows the style guidelines of this project as outlined in AGENTS.md
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • Any dependent changes have been merged and published

Additional Notes


Note

Medium Risk
Patches core Harness/Runtime methods and auto-enables via env var, which could subtly affect rollout execution order, error propagation, or span lifecycle if assumptions about state/trajectory differ across environments.

Overview
Adds a new Braintrust tracing integration (verifiers.integrations.braintrust) that instruments v1 rollouts by patching harness.run, harness.setup_state, runtime.submit_model_request, and runtime.call_tool to emit a rollout span with per-turn model_request/tool_call:<name> child spans, token metrics, timings, and rollout scores/metadata.
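
The patching approach described above can be illustrated with a minimal sketch. It does not use the real Braintrust SDK or the real verifiers classes: `RecordingTracer`, `FakeRuntime`, and `patch_method` are hypothetical stand-ins, and only the method names `submit_model_request` and `call_tool` and the span names `model_request` and `tool_call:<name>` come from the overview.

```python
import asyncio
import functools
import time


class RecordingTracer:
    """Hypothetical stand-in for a tracing backend; keeps finished spans in memory."""

    def __init__(self):
        self.spans = []

    def record(self, name, duration):
        self.spans.append({"name": name, "duration": duration})


def patch_method(obj, attr, name_fn, tracer):
    """Replace the async method obj.attr with a wrapper that records one span per call."""
    original = getattr(obj, attr)

    @functools.wraps(original)
    async def traced(*args, **kwargs):
        start = time.perf_counter()
        try:
            return await original(*args, **kwargs)
        finally:
            # Record the span even if the wrapped call raises.
            tracer.record(name_fn(*args, **kwargs), time.perf_counter() - start)

    setattr(obj, attr, traced)


class FakeRuntime:
    """Hypothetical runtime exposing the two method names from the overview."""

    async def submit_model_request(self, prompt):
        return f"response to {prompt}"

    async def call_tool(self, name, **kwargs):
        return {"tool": name, "args": kwargs}


def demo():
    tracer = RecordingTracer()
    runtime = FakeRuntime()
    patch_method(runtime, "submit_model_request", lambda *a, **k: "model_request", tracer)
    patch_method(runtime, "call_tool", lambda name, **k: f"tool_call:{name}", tracer)

    async def rollout():
        await runtime.submit_model_request("hi")
        await runtime.call_tool("search", query="x")

    asyncio.run(rollout())
    return [span["name"] for span in tracer.spans]
```

Because the wrapper is set as an attribute on one instance, it only affects that object, which is relevant to the runtime-recreation issue raised later in the review.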

Enables automatic tracing in verifiers/v1/env.py when VF_BRAINTRUST_PROJECT is set, and documents setup/manual usage in integrations/README.md.
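
As a rough illustration of the environment-variable gate, the sketch below enables instrumentation only when `VF_BRAINTRUST_PROJECT` is set. The helper name `maybe_instrument`, the `instrument(env, project)` signature, and the choice to treat an empty value as unset are all assumptions made for this example, not the PR's actual code.

```python
import os


def maybe_instrument(env, instrument, environ=os.environ):
    """Call instrument(env, project) only if VF_BRAINTRUST_PROJECT is set.

    Returns True when instrumentation was applied, False otherwise.
    """
    project = environ.get("VF_BRAINTRUST_PROJECT")
    if not project:
        # Unset or empty: leave the environment untouched (assumed behavior).
        return False
    instrument(env, project)
    return True
```

Passing `environ` as a parameter keeps the gate testable without mutating the process environment.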

Introduces comprehensive tests with a mocked braintrust module to validate span hierarchy, logged scores/metadata, timing/token metrics, grouping via traced_group, and error/edge-case span cleanup behavior.

Reviewed by Cursor Bugbot for commit a3ff5c0. Bugbot is set up for automated code reviews on this repo.

Comment thread verifiers/v1/env.py
)
return result

harness.run = traced

Context leak when harness run raises unexpectedly

Low Severity

If the original run() raises after setup_state completes (e.g., asyncio.CancelledError during concurrent rollouts via asyncio.gather), _ctx.pop(id(result)) is never reached because result is never assigned. The _ctx entry added by _patch_setup leaks indefinitely. In long-running processes with occasional cancellations, stale entries accumulate in the module-level _ctx dict, and if Python reuses an id() value, a new rollout could briefly inherit a stale _Turns object before _patch_setup overwrites it.

Additional Locations (1)

Reviewed by Cursor Bugbot for commit 312af93.

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

There are 2 total unresolved issues (including 1 from previous review).

Reviewed by Cursor Bugbot for commit a3ff5c0.

_patch_run(env.harness, bt)
_patch_setup(env.harness, bt)
_patch_submit(env.harness.runtime, bt)
_patch_tool(env.harness.runtime, bt)

Runtime recreation silently loses tracing patches

Medium Severity

instrument() patches submit_model_request and call_tool on a specific Runtime instance via env.harness.runtime. However, Harness replaces self.runtime with a brand-new Runtime object whenever add_metric, add_reward, add_toolset, add_stop, add_setup, add_update, or add_cleanup is called (each triggers self.runtime = self.resolve_runtime()). If any of these are called after instrumentation, the runtime patches are silently lost and model request / tool call tracing stops working.
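
One way to keep the patches alive across runtime recreation is to also wrap `resolve_runtime`, so that every newly built runtime is patched before the harness stores it. The sketch below models that idea with hypothetical `FakeHarness` and `FakeRuntime` classes and synchronous methods for brevity; only the names `resolve_runtime`, `add_toolset`, `call_tool`, and `instrument` are taken from the review comment, and their real signatures may differ.

```python
class FakeRuntime:
    """Hypothetical runtime; the real call_tool is likely async."""

    def call_tool(self, name):
        return f"ran {name}"


class FakeHarness:
    """Hypothetical harness that rebuilds its runtime, as described in the review."""

    def __init__(self):
        self.runtime = self.resolve_runtime()

    def resolve_runtime(self):
        return FakeRuntime()

    def add_toolset(self, toolset):
        # Mirrors the reported behavior: the runtime object is replaced.
        self.runtime = self.resolve_runtime()


def patch_runtime(runtime, calls):
    """Wrap call_tool on one runtime instance and return that instance."""
    original = runtime.call_tool

    def traced(name):
        calls.append(name)  # stand-in for emitting a tool_call span
        return original(name)

    runtime.call_tool = traced
    return runtime


def instrument(harness, calls):
    """Patch the current runtime and every runtime created afterwards."""
    patch_runtime(harness.runtime, calls)
    original_resolve = harness.resolve_runtime

    def resolve_and_patch(*args, **kwargs):
        return patch_runtime(original_resolve(*args, **kwargs), calls)

    harness.resolve_runtime = resolve_and_patch
```

For example, after `instrument(harness, calls)` followed by `harness.add_toolset(...)`, a call to `harness.runtime.call_tool("search")` still appends to `calls`. Patching at the class level would be an alternative, at the cost of affecting every harness in the process.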

Reviewed by Cursor Bugbot for commit a3ff5c0.
