Add Braintrust tracing as experimental drop-in environment variants by cdreetz · Pull Request #1325 · PrimeIntellect-ai/verifiers

cdreetz · 2026-05-09T23:13:13Z

Summary

Adds opt-in Braintrust tracing to verifiers as experimental drop-in replacements for the core environment classes, living entirely under verifiers/envs/experimental/braintrust_tracing/. The original core files (environment.py, multiturn_env.py, tool_env.py, stateful_tool_env.py) are unchanged from main — no tracing code touches production paths.

When BRAINTRUST_API_KEY is set, every rollout emits a trace with child spans for setup, turns, model requests, tool calls, and scoring. When unset, all tracing is a no-op with near-zero overhead.
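The gating described above can be pictured with a minimal sketch. It is not the PR's code: `get_logger` and the module-level names are hypothetical, and only `BRAINTRUST_API_KEY`, `VF_BRAINTRUST_PROJECT`, and the "verifiers" default come from this description. The sketch assumes the optional `braintrust` package exposes `init_logger(project=...)`, and it swallows any failure so the caller simply gets `None`.

```python
import os
import threading

_lock = threading.Lock()
_logger = None            # cached logger, created at most once
_init_attempted = False   # avoids retrying a failed init on every call


def enabled() -> bool:
    """Tracing is on only when an API key is present in the environment."""
    return bool(os.environ.get("BRAINTRUST_API_KEY"))


def get_logger():
    """Hypothetical helper: return a cached logger, or None when tracing is off."""
    global _logger, _init_attempted
    if not enabled():
        return None  # cheap early exit: no import, no lock
    with _lock:
        if not _init_attempted:
            _init_attempted = True
            try:
                import braintrust  # optional dependency, imported lazily

                _logger = braintrust.init_logger(
                    project=os.environ.get("VF_BRAINTRUST_PROJECT", "verifiers")
                )
            except Exception:
                _logger = None  # swallow errors: tracing must never kill an eval
    return _logger
```

With the key unset, `get_logger()` returns `None` without importing `braintrust`, which is what keeps the disabled path close to free.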

Usage

# Instead of:
from verifiers.envs.stateful_tool_env import StatefulToolEnv

# Use the tracing variant:
from verifiers.envs.experimental.braintrust_tracing.stateful_tool_env import StatefulToolEnv

pip install 'verifiers[braintrust]'
export BRAINTRUST_API_KEY="sk-..."
export VF_BRAINTRUST_PROJECT="my-project"  # optional, defaults to "verifiers"

Structure

verifiers/envs/experimental/braintrust_tracing/
├── __init__.py                # re-exports all four classes
├── braintrust_tracing.py      # self-contained tracing layer
├── environment.py             # Environment with tracing
├── multiturn_env.py           # MultiTurnEnv with tracing
├── tool_env.py                # ToolEnv with tracing
└── stateful_tool_env.py       # StatefulToolEnv with tracing

Inheritance chain within the experimental folder: Environment → MultiTurnEnv → ToolEnv → StatefulToolEnv. Each class inherits from its experimental sibling (not from the original vf.* classes), so the full tracing instrumentation propagates through the hierarchy.

`braintrust_tracing.py` module

A self-contained tracing layer with:

Lazy singleton logger initialization (thread-safe, import-safe)
All errors swallowed so tracing never kills an eval
_safe() / _safe_response() serialization helpers for Pydantic models and arbitrary objects
contextvars.ContextVar for passing the rollout span from _run_rollout_state → rollout() across the await boundary (concurrent-rollout safe)
Run-level tags: each generate() call auto-generates a unique tag (run-<epoch>-<short_uuid>) attached to every root span (rollout, group, generate), enabling filter/group by eval run in Braintrust. Tags stored in a ContextVar for concurrent safety. Can also be set manually via set_run_tags(['my-tag']).
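The reason a ContextVar is safe for concurrent rollouts can be shown with a small standalone sketch. The names `_current_span`, `run_rollout_state`, and `rollout` below are stand-ins, not the PR's functions. Each coroutine passed to `asyncio.gather` runs in its own task with a copy of the context, so a value set in one rollout is not visible in another.

```python
import asyncio
import contextvars
from typing import Optional

# Hypothetical name: holds the span for the rollout running in the current task.
_current_span: contextvars.ContextVar[Optional[str]] = contextvars.ContextVar(
    "_current_span", default=None
)


async def rollout() -> Optional[str]:
    await asyncio.sleep(0)  # yield so the three rollouts interleave
    return _current_span.get()  # reads the value set by this task's caller


async def run_rollout_state(span_name: str) -> Optional[str]:
    _current_span.set(span_name)  # visible only inside this task's context
    return await rollout()


async def main() -> list:
    # gather() wraps each coroutine in a task with its own context copy.
    return await asyncio.gather(
        *(run_rollout_state(f"rollout-{i}") for i in range(3))
    )


seen = asyncio.run(main())
print(seen)  # ['rollout-0', 'rollout-1', 'rollout-2']
```

An instance attribute shared by all rollouts would instead be overwritten by whichever rollout ran last before the read.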

Instrumented spans

environment.py — get_model_response (LLM spans with token counts), _run_rollout_state (rollout + scoring spans), _run_group_states (group spans with per-rollout child spans), generate (run tag lifecycle + flush on exit)
multiturn_env.py — rollout() setup spans, per-turn spans (with model/env duration breakdown), timeout logging
tool_env.py — call_tool() tool spans with args/result/duration; env_response() passes state= to call_tool for span parenting
stateful_tool_env.py — env_response() passes state=state to call_tool, enabling tool tracing for all StatefulToolEnv subclasses

`pyproject.toml` changes

braintrust added as optional dep group [braintrust]
[[tool.ty.overrides]] added for verifiers/envs/experimental/braintrust_tracing/** to suppress unresolved-import (braintrust is optional, not installed in CI)

Review & Testing Checklist

Experimental files will drift from main: The four environment files under experimental/braintrust_tracing/ (environment.py, multiturn_env.py, tool_env.py, stateful_tool_env.py) are full copies of their core counterparts with tracing added; braintrust_tracing.py is new. Any future changes to the core environment.py, multiturn_env.py, tool_env.py, or stateful_tool_env.py must be manually mirrored to the experimental copies. Verify this maintenance burden is acceptable vs. a mixin/decorator approach.
Import chain correctness: Each experimental class inherits from its experimental sibling (e.g., ToolEnv(MultiTurnEnv) where MultiTurnEnv is the local tracing variant). But decorators like @vf.stop, @vf.cleanup and types like vf.State, vf.Error still come from the original verifiers package. Verify these cross-references work correctly at runtime.
call_tool public API change: call_tool(**kwargs) reads kwargs.get("state") for span parenting, and env_response() passes state=state. Any third-party subclass that overrides call_tool and calls super().call_tool(...) without forwarding state will lose tool span nesting. Verify for downstream environments (e.g., MiniBrowseEnv).
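The subclass hazard in the call_tool item can be illustrated with stand-in classes. This is a sketch under assumptions: `TracingToolEnv` is not the real experimental `ToolEnv`, the positional parameters of `call_tool` are guessed, and the "span" key on state is hypothetical. Only the `kwargs.get("state")` lookup is taken from the description above.

```python
import asyncio


class TracingToolEnv:
    """Stand-in for the experimental ToolEnv; not the real class."""

    async def call_tool(self, tool_name, tool_args, tool_call_id, **kwargs):
        state = kwargs.get("state")  # the parent span is found through state
        parent = state.get("span") if state else None
        return {"tool": tool_name, "parent_span": parent}


class ForwardingEnv(TracingToolEnv):
    async def call_tool(self, tool_name, tool_args, tool_call_id, **kwargs):
        # Forwarding **kwargs keeps state, so the tool span stays nested.
        return await super().call_tool(tool_name, tool_args, tool_call_id, **kwargs)


class DroppingEnv(TracingToolEnv):
    async def call_tool(self, tool_name, tool_args, tool_call_id, **kwargs):
        # state is silently dropped here, so the span loses its parent.
        return await super().call_tool(tool_name, tool_args, tool_call_id)


state = {"span": "turn_0"}
kept = asyncio.run(ForwardingEnv().call_tool("search", {}, "id-1", state=state))
lost = asyncio.run(DroppingEnv().call_tool("search", {}, "id-1", state=state))
print(kept["parent_span"], lost["parent_span"])  # turn_0 None
```

Both subclasses still return a tool result, which is why this kind of regression would show up only as flattened traces, not as a failing eval.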
End-to-end verification: Install with pip install -e '.[braintrust]', set BRAINTRUST_API_KEY, import StatefulToolEnv from the experimental path, and run an eval. Confirm nested spans (rollout → setup → turns → model/tool calls) appear in Braintrust with correct run-level tags.

Notes

All pre-commit checks (ruff, ruff-format, ty) pass locally.
This is a parallel/alternative to the Datadog telemetry in "How hard is it to integrate non-LLM environment feedbacks into this framework?" (#9), targeting a different observability backend.
Traces previously confirmed correct in Braintrust for grouped 5×3 evals.

Note

Medium Risk
Adds a large new experimental environment stack plus an optional braintrust dependency; while isolated from existing paths, it introduces new concurrency/contextvar-based tracing logic and substantial duplicated code that could drift from core behavior.

Overview
Adds an opt-in experimental Braintrust tracing implementation under verifiers/envs/experimental/braintrust_tracing/, providing drop-in Environment/MultiTurnEnv/ToolEnv/StatefulToolEnv variants that emit nested spans for rollout setup, per-turn execution, model requests (with token/duration metrics), tool calls, and scoring.

Introduces a self-contained braintrust_tracing.py layer with lazy logger init gated by BRAINTRUST_API_KEY, run-level tagging via ContextVar, best-effort serialization, and defensive no-op/error-swallowing behavior; the experimental environment.py wires these hooks into generate(), per-rollout/group execution, and request/tool paths.

Updates packaging to add a braintrust optional extra (verifiers[braintrust]), adds ty overrides to ignore optional imports in the experimental folder, and refreshes uv.lock to include braintrust and its transitive dependencies.

Reviewed by Cursor Bugbot for commit 5558e02. Bugbot is set up for automated code reviews on this repo.

This commit enhances the evaluation logging and TUI to include:

1. Tool definitions: Store available tools in rollout state and save to results
   - ToolEnv now adds tool definitions (OAI schema) to state during init_state
   - Results.jsonl includes tools field when tools are present
2. Judge rubric data: Store structured judge information for TUI display
   - JudgeRubric now stores judge_data with prompt, response, inputs, and model
   - Results.jsonl includes judge_data field when judge rubric is used
3. TUI enhancements: New keybinds to view tools and judge data
   - Press 't' to view tools available to agent during a rollout
   - Press 'j' to view judge rubric results and inputs
   - Both open modal windows with formatted, scrollable content

These additions make debugging and analysis much easier by providing full visibility into the agent's tool environment and LLM judge evaluations.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

…QkhMCLqyri6H5Nh8XJn Add tools and judge rubric data to saved runs and TUI

…ort-011CUQkhMCLqyri6H5Nh8XJn Revert "Add tools and judge rubric data to saved runs and TUI"

- New verifiers/braintrust_tracing.py module with nested span support
- Instrument environment.py: rollout lifecycle, model requests, scoring, groups
- Instrument multiturn_env.py: setup_state, turn loop, timeouts
- Instrument tool_env.py: tool call tracking with duration and errors
- Add braintrust as optional dependency group [braintrust]

Span hierarchy per rollout: rollout (task) -> setup_state (task) -> turn_N (task) -> model_request (llm) + tool_call (tool) -> scoring (score)

Activation: set BRAINTRUST_API_KEY env var. No-op when unset.

Co-Authored-By: Christian Reetz <cdreetz@gmail.com>

- Fix critical span nesting: rollout span is now stashed on env instance via _bt_pending_rollout_span so multiturn_env.rollout() can attach it to state immediately after init_state, before any child spans are created
- Add ty override for braintrust_tracing.py to suppress unresolved-import (braintrust is an optional dependency, not installed in CI)
- Remove stale type: ignore comments that ty doesn't recognize

Co-Authored-By: Christian Reetz <cdreetz@gmail.com>

… passing Replace self._bt_pending_rollout_span with a contextvars.ContextVar so concurrent rollouts don't overwrite each other's spans. Each coroutine gets its own copy of the context variable, making it safe for the concurrent rollout pattern in generate(). Co-Authored-By: Christian Reetz <cdreetz@gmail.com>

Co-Authored-By: Christian Reetz <cdreetz@gmail.com>

…racing Co-Authored-By: Christian Reetz <cdreetz@gmail.com>

_run_group_states calls self.rollout() directly (bypassing _run_rollout_state), so no rollout spans were created in grouped mode. Wrap each rollout in a _traced_rollout helper that creates a child span under the group root, sets the ContextVar, and finalizes the span with completion data. Co-Authored-By: Christian Reetz <cdreetz@gmail.com>

…l after scoring

1. get_model_response: catch BaseException (incl. CancelledError) so response is never referenced unbound in the finally block.
2. _run_group_states: defer rollout_completed calls until after score_group so reward values are available in the spans.

Co-Authored-By: Christian Reetz <cdreetz@gmail.com>

Each generate() call auto-generates a unique tag (run-<epoch>-<short_uuid>) that is attached to every root span (rollout, group, generate). This lets users filter and group traces by eval run in the Braintrust UI. Tags can also be set manually via _bt.set_run_tags(['my-tag']). Co-Authored-By: Christian Reetz <cdreetz@gmail.com>

1. _run_tags is now a ContextVar so concurrent generate() calls each get their own isolated tag set instead of overwriting a shared global.
2. set_run_tags() is only called when _bt.enabled() is True, avoiding unnecessary UUID generation and log noise when Braintrust is off.

Co-Authored-By: Christian Reetz <cdreetz@gmail.com>

If client.get_response() raises and repr(exc) also raises, response would be unbound. Adding response=None and an 'is not None' guard prevents a NameError from masking the original exception. Co-Authored-By: Christian Reetz <cdreetz@gmail.com>

Only one except clause fires per try block, so error_msg is always empty when BaseException (non-Exception) handler runs. Co-Authored-By: Christian Reetz <cdreetz@gmail.com>

…cing/ Revert original environment.py, multiturn_env.py, tool_env.py, and stateful_tool_env.py back to main. All braintrust-instrumented variants now live under verifiers/envs/experimental/braintrust_tracing/ as drop-in replacements. Usage: from verifiers.envs.experimental.braintrust_tracing.stateful_tool_env import StatefulToolEnv Co-Authored-By: Christian Reetz <cdreetz@gmail.com>

Pre-allocate lists and assign by index instead of appending, so span attribution is correct regardless of task scheduling order. Co-Authored-By: Christian Reetz <cdreetz@gmail.com>
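A rough illustration of the index-based pattern this commit describes, using stand-in names (`traced_rollout` here is not the PR's helper, and the span values are plain strings). Each task writes into its own pre-allocated slot, so attribution does not depend on which task finishes first.

```python
import asyncio
import random


async def traced_rollout(index: int, spans: list, results: list) -> None:
    # Sleep a random amount so tasks complete in an arbitrary order.
    await asyncio.sleep(random.random() / 100)
    spans[index] = f"span-{index}"      # slot i always belongs to input i
    results[index] = f"result-{index}"


async def run_group(n: int):
    # Pre-allocate one slot per input instead of appending on completion.
    spans = [None for _ in range(n)]
    results = [None for _ in range(n)]
    await asyncio.gather(*(traced_rollout(i, spans, results) for i in range(n)))
    return spans, results


spans, results = asyncio.run(run_group(5))
print(list(zip(spans, results)))
```

Appending inside each task would instead order both lists by completion time, which could pair a span with the wrong input when the lists are later read by position.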

Use list[object | None] so ty accepts [None] * len(group_inputs). Co-Authored-By: Christian Reetz <cdreetz@gmail.com>

Co-Authored-By: Christian Reetz <cdreetz@gmail.com>

cursor · 2026-05-09T23:20:33Z

+# each get their own tag set without overwriting each other.
+_run_tags: contextvars.ContextVar[list[str]] = contextvars.ContextVar(
+    "_run_tags", default=[]
+)


Mutable default in ContextVar creates shared state risk

Low Severity

The _run_tags ContextVar uses default=[], which is a single mutable list object shared across all contexts that haven't explicitly called .set(). While current callers always copy the returned list before use, any future code that does _run_tags.get().append(...) would silently corrupt the shared default, affecting all contexts. The safe pattern is to use a sentinel or immutable default (e.g., default=()) and convert to a list where needed.
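One way to apply the suggestion, sketched under assumptions: `get_run_tags` is a hypothetical accessor, and `set_run_tags` is reduced to its storage step. The default is an immutable tuple, and callers always receive a fresh list.

```python
import contextvars

# Immutable default: nothing can mutate the value shared by contexts that never call set().
_run_tags: contextvars.ContextVar[tuple[str, ...]] = contextvars.ContextVar(
    "_run_tags", default=()
)


def set_run_tags(tags) -> None:
    """Store a snapshot of the tags for the current context."""
    _run_tags.set(tuple(tags))


def get_run_tags() -> list[str]:
    """Hypothetical accessor: return a new list each time, safe to mutate."""
    return list(_run_tags.get())


tags = get_run_tags()
tags.append("oops")        # mutates only the caller's copy
print(get_run_tags())      # [] since the shared default is untouched
```

With `default=[]`, the equivalent `_run_tags.get().append("oops")` would have changed the default seen by every context.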

Reviewed by Cursor Bugbot for commit b0460f6.

cursor · 2026-05-09T23:20:33Z

+            error=repr(state["error"])[:500] if state.get("error") else "",
+            input_tokens=float(usage.get("input_tokens", 0)),
+            output_tokens=float(usage.get("output_tokens", 0)),
+        )


Rollout span leaks when rollout or scoring raises

Medium Severity

In _run_rollout_state, the calls to self.rollout(), self.rubric.score_rollout(), and self.rubric.cleanup() are not wrapped in a try/finally that ensures _bt.rollout_completed() is called. If any of these raise an exception, the Braintrust rollout span is never closed, leading to leaked open spans. The same pattern applies in _run_group_states where _traced_rollout doesn't protect against exceptions from self.rollout().
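The fix this comment points toward is a try/finally around the rollout and scoring calls. A minimal sketch follows, with `FakeSpan` standing in for a real span and a simplified `run_rollout_state` signature; neither is the PR's code.

```python
import asyncio


class FakeSpan:
    """Stand-in for a tracing span; records whether it was closed."""

    def __init__(self) -> None:
        self.closed = False
        self.error = ""

    def end(self, error: str = "") -> None:
        self.closed = True
        self.error = error


async def failing_rollout():
    raise RuntimeError("model request failed")


async def score(state):
    return None


async def run_rollout_state(rollout, score_fn, span):
    error = ""
    try:
        state = await rollout()
        await score_fn(state)
        return state
    except BaseException as exc:  # includes asyncio.CancelledError
        error = repr(exc)[:500]
        raise  # never swallow the eval's own failure
    finally:
        span.end(error=error)  # runs on success, error, and cancellation


span = FakeSpan()
try:
    asyncio.run(run_rollout_state(failing_rollout, score, span))
except RuntimeError:
    pass
print(span.closed, span.error)
```

The exception still propagates to the caller; the only change is that the span is finalized on every exit path.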

Reviewed by Cursor Bugbot for commit b0460f6.

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

There are 3 total unresolved issues (including 2 from previous reviews).


Reviewed by Cursor Bugbot for commit 5558e02.

cursor · 2026-05-09T23:27:54Z

+            await self.rubric.dummy_score_group(group_states)
+        end_scoring = time.time()
+        for state in group_states:
+            state["timing"].scoring.end = end_scoring


Group scoring path missing Braintrust scoring child spans

Low Severity

In _run_rollout_state, scoring is wrapped with scoring_started/scoring_completed spans, but _run_group_states performs group scoring without any corresponding Braintrust scoring spans. Since group scoring is the default path (used when independent_scoring=False), most evaluations will be missing the "scoring" child span in their traces, making the tracing hierarchy inconsistent.
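One possible shape for closing this gap, offered only as a sketch: a context manager that emits the same started/completed pair around the group scoring call. `scoring_span`, `score_group`, and the `events` list are illustrative stand-ins. The real `scoring_started`/`scoring_completed` helpers and their arguments are not shown in this PR page.

```python
import asyncio
from contextlib import contextmanager

events: list[tuple[str, str]] = []  # stand-in for emitted span events


@contextmanager
def scoring_span(parent: str):
    """Hypothetical wrapper that pairs a start event with a completion event."""
    events.append(("scoring_started", parent))
    try:
        yield
    finally:
        # Emitted even if scoring raises, so the span is never left open.
        events.append(("scoring_completed", parent))


async def score_group(states: list[dict]) -> None:
    for state in states:
        state["reward"] = 1.0  # placeholder scoring


async def run_group_states(states: list[dict]) -> list[dict]:
    with scoring_span("group"):
        await score_group(states)
    return states


states = asyncio.run(run_group_states([{}, {}]))
print(events)
```

Whether the scoring span should hang off the group root or be repeated under each rollout span is a design choice the review comment leaves open.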

Additional Locations (1)

verifiers/envs/experimental/braintrust_tracing/environment.py#L739-L752

Reviewed by Cursor Bugbot for commit 5558e02.

claude and others added 24 commits October 23, 2025 20:29

Merge pull request #1 from cdreetz/claude/add-save-flag-support-011CU…

0cbd7ff

…QkhMCLqyri6H5Nh8XJn Add tools and judge rubric data to saved runs and TUI

Revert "Add tools and judge rubric data to saved runs and TUI"

1f0e9fc

Merge pull request #2 from cdreetz/revert-1-claude/add-save-flag-supp…

77640ad

…ort-011CUQkhMCLqyri6H5Nh8XJn Revert "Add tools and judge rubric data to saved runs and TUI"

Merge branch 'PrimeIntellect-ai:main' into main

8967698

Merge branch 'PrimeIntellect-ai:main' into main

ff4d830

Merge branch 'PrimeIntellect-ai:main' into main

edf76b2

Merge branch 'PrimeIntellect-ai:main' into main

1837bf7

Merge branch 'PrimeIntellect-ai:main' into main

32c9991

Update uv.lock for braintrust dependency

16648a1

Co-Authored-By: Christian Reetz <cdreetz@gmail.com>

Pass state=state in StatefulToolEnv.env_response call_tool for tool t…

f45c2de

…racing Co-Authored-By: Christian Reetz <cdreetz@gmail.com>

Remove dead if-guard in except BaseException handler

b0460f6

Only one except clause fires per try block, so error_msg is always empty when BaseException (non-Exception) handler runs. Co-Authored-By: Christian Reetz <cdreetz@gmail.com>

Use index-based span tracking in _run_group_states

c2abc54

Pre-allocate lists and assign by index instead of appending, so span attribution is correct regardless of task scheduling order. Co-Authored-By: Christian Reetz <cdreetz@gmail.com>

Fix type annotation for bt_rollout_spans list

472ea2a

Use list[object | None] so ty accepts [None] * len(group_inputs). Co-Authored-By: Christian Reetz <cdreetz@gmail.com>

Fix ty: use list comprehension for bt_rollout_spans init

5558e02

Co-Authored-By: Christian Reetz <cdreetz@gmail.com>

cursor Bot reviewed May 9, 2026


cdreetz mentioned this pull request May 9, 2026

Add monkey-patching integration for Braintrust tracing #1326

Open

4 tasks

cdreetz wants to merge 24 commits into main from cdreetz/braintrust-tracing
