Feat/braintrust tracing by jessicaifeiyali · Pull Request #1333 · PrimeIntellect-ai/verifiers

jessicaifeiyali · 2026-05-11T03:34:30Z

Description

Type of Change

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Documentation update
Test improvement

Testing

All existing tests pass when running uv run pytest locally.
New tests have been added to cover the changes

Checklist

My code follows the style guidelines of this project as outlined in AGENTS.md
I have performed a self-review of my own code
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
My changes generate no new warnings
Any dependent changes have been merged and published

Additional Notes

Note

Medium Risk
Patches core Harness/Runtime methods and auto-enables via env var, which could subtly affect rollout execution order, error propagation, or span lifecycle if assumptions about state/trajectory differ across environments.

Overview
Adds a new Braintrust tracing integration (verifiers.integrations.braintrust) that instruments v1 rollouts by patching harness.run, harness.setup_state, runtime.submit_model_request, and runtime.call_tool to emit a rollout span with per-turn model_request/tool_call:<name> child spans, token metrics, timings, and rollout scores/metadata.

Enables automatic tracing in verifiers/v1/env.py when VF_BRAINTRUST_PROJECT is set, and documents setup/manual usage in integrations/README.md.

Introduces comprehensive tests with a mocked braintrust module to validate span hierarchy, logged scores/metadata, timing/token metrics, grouping via traced_group, and error/edge-case span cleanup behavior.

^{Reviewed by Cursor Bugbot for commit a3ff5c0. Bugbot is set up for automated code reviews on this repo. Configure here.}

cursor · 2026-05-11T03:43:19Z

+            )
+        return result
+
+    harness.run = traced


Context leak when harness run raises unexpectedly

Low Severity

If the original run() raises after setup_state completes (e.g., asyncio.CancelledError during concurrent rollouts via asyncio.gather), _ctx.pop(id(result)) is never reached because result is never assigned. The _ctx entry added by _patch_setup leaks indefinitely. In long-running processes with occasional cancellations, stale entries accumulate in the module-level _ctx dict, and if Python reuses an id() value, a new rollout could briefly inherit a stale _Turns object before _patch_setup overwrites it.

Additional Locations (1)

verifiers/integrations/braintrust.py#L69-L70

^{Reviewed by Cursor Bugbot for commit 312af93. Configure here.}

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

There are 2 total unresolved issues (including 1 from previous review).

^{❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.}

^{Reviewed by Cursor Bugbot for commit a3ff5c0. Configure here.}

cursor · 2026-05-11T04:12:06Z

+    _patch_run(env.harness, bt)
+    _patch_setup(env.harness, bt)
+    _patch_submit(env.harness.runtime, bt)
+    _patch_tool(env.harness.runtime, bt)


Runtime recreation silently loses tracing patches

Medium Severity

instrument() patches submit_model_request and call_tool on a specific Runtime instance via env.harness.runtime. However, Harness replaces self.runtime with a brand-new Runtime object whenever add_metric, add_reward, add_toolset, add_stop, add_setup, add_update, or add_cleanup is called (each triggers self.runtime = self.resolve_runtime()). If any of these are called after instrumentation, the runtime patches are silently lost and model request / tool call tracing stops working.

^{Reviewed by Cursor Bugbot for commit a3ff5c0. Configure here.}

jessicaifeiyali added 3 commits May 10, 2026 02:32

feat: add Braintrust tracing integration for v1 environments

b13cf9e

log prompt and completion on model_request spans

c6c7c74

log prompt and completion on model_request spans (prompt > get prompt)

312af93

cursor Bot reviewed May 11, 2026

View reviewed changes

add README for Braintrust integration

a3ff5c0

cursor Bot reviewed May 11, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Feat/braintrust tracing#1333

Feat/braintrust tracing#1333
jessicaifeiyali wants to merge 4 commits into
mainfrom
feat/braintrust-tracing

jessicaifeiyali commented May 11, 2026 •

edited by cursor Bot

Loading

Uh oh!

Uh oh!

cursor Bot May 11, 2026

Uh oh!

cursor Bot left a comment

Uh oh!

cursor Bot May 11, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

jessicaifeiyali commented May 11, 2026 • edited by cursor Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Type of Change

Testing

Checklist

Additional Notes

Uh oh!

Uh oh!

cursor Bot May 11, 2026

Choose a reason for hiding this comment

Context leak when harness run raises unexpectedly

Uh oh!

cursor Bot left a comment

Choose a reason for hiding this comment

Uh oh!

cursor Bot May 11, 2026

Choose a reason for hiding this comment

Runtime recreation silently loses tracing patches

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

jessicaifeiyali commented May 11, 2026 •

edited by cursor Bot

Loading