Add Braintrust tracing as experimental drop-in environment variants #1325

Open
cdreetz wants to merge 24 commits into main from cdreetz/braintrust-tracing

Conversation

@cdreetz
Collaborator

@cdreetz cdreetz commented May 9, 2026

Summary

Adds opt-in Braintrust tracing to verifiers as experimental drop-in replacements for the core environment classes, living entirely under verifiers/envs/experimental/braintrust_tracing/. The original core files (environment.py, multiturn_env.py, tool_env.py, stateful_tool_env.py) are unchanged from main — no tracing code touches production paths.

When BRAINTRUST_API_KEY is set, every rollout emits a trace with child spans for setup, turns, model requests, tool calls, and scoring. When unset, all tracing is a no-op with near-zero overhead.
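
The gating described above could look roughly like the following. This is a minimal sketch, not the PR's code: `_make_logger`, `get_logger`, `swallow_errors`, and `log_event` are hypothetical names, and the dict-based logger is a stand-in for whatever the Braintrust SDK actually returns.

```python
import functools
import os
import threading

_logger = None          # cached logger instance (hypothetical stand-in)
_init_done = False      # True once initialization has been attempted
_init_lock = threading.Lock()


def _make_logger():
    """Hypothetical stand-in for the real Braintrust logger construction."""
    return {"events": []}


def get_logger():
    """Return the shared logger, or None when BRAINTRUST_API_KEY is unset."""
    global _logger, _init_done
    if _init_done:                      # fast path: no lock after first call
        return _logger
    with _init_lock:
        if not _init_done:              # re-check under the lock
            if os.environ.get("BRAINTRUST_API_KEY"):
                try:
                    _logger = _make_logger()
                except Exception:
                    _logger = None      # a failed init degrades to a no-op
            _init_done = True
    return _logger


def swallow_errors(fn):
    """Decorator: a tracing helper returns None instead of raising."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        try:
            return fn(*args, **kwargs)
        except Exception:
            return None
    return wrapper


@swallow_errors
def log_event(name, **fields):
    logger = get_logger()
    if logger is None:                  # tracing disabled: do nothing
        return None
    logger["events"].append({"name": name, **fields})
    return name
```

With the key unset, the per-call cost in this sketch is one flag check and one `None` comparison, which is the kind of overhead the "near-zero" claim refers to.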

Usage

# Instead of:
from verifiers.envs.stateful_tool_env import StatefulToolEnv

# Use the tracing variant:
from verifiers.envs.experimental.braintrust_tracing.stateful_tool_env import StatefulToolEnv

pip install 'verifiers[braintrust]'
export BRAINTRUST_API_KEY="sk-..."
export VF_BRAINTRUST_PROJECT="my-project"  # optional, defaults to "verifiers"

Structure

verifiers/envs/experimental/braintrust_tracing/
├── __init__.py                # re-exports all four classes
├── braintrust_tracing.py      # self-contained tracing layer
├── environment.py             # Environment with tracing
├── multiturn_env.py           # MultiTurnEnv with tracing
├── tool_env.py                # ToolEnv with tracing
└── stateful_tool_env.py       # StatefulToolEnv with tracing

Inheritance chain within the experimental folder: Environment → MultiTurnEnv → ToolEnv → StatefulToolEnv. Each class inherits from its experimental sibling (not from the original vf.* classes), so the full tracing instrumentation propagates through the hierarchy.
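
To illustrate why each class has to inherit from its experimental sibling, here is a toy model with stand-in classes. All class names and the `spans` list are hypothetical; the real classes carry far more logic. A traced leaf class built on a core parent silently loses the instrumentation that lives higher up the chain.

```python
# Core hierarchy: stand-ins for the untouched verifiers classes.
class CoreEnvironment:
    def get_model_response(self):
        return "response"


class CoreMultiTurnEnv(CoreEnvironment):
    def rollout(self):
        return [self.get_model_response()]


# Experimental copies: same shape, with tracing added at each level.
class TracedEnvironment:
    def __init__(self):
        self.spans = []

    def get_model_response(self):
        self.spans.append("model_request")
        return "response"


class TracedMultiTurnEnv(TracedEnvironment):
    def rollout(self):
        self.spans.append("turn")
        return [self.get_model_response()]


class TracedToolEnv(TracedMultiTurnEnv):
    """Inherits from the traced sibling, so parent spans are emitted."""


# Counter-example: a leaf class that sits on top of the core parent.
class MixedToolEnv(CoreMultiTurnEnv):
    def __init__(self):
        self.spans = []


traced = TracedToolEnv()
traced.rollout()          # records both the turn and model_request spans

mixed = MixedToolEnv()
mixed.rollout()           # runs the core code path; no spans are recorded
```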

braintrust_tracing.py module

A self-contained tracing layer with:

  • Lazy singleton logger initialization (thread-safe, import-safe)
  • All errors swallowed so tracing never kills an eval
  • _safe() / _safe_response() serialization helpers for Pydantic models and arbitrary objects
  • contextvars.ContextVar for passing the rollout span from _run_rollout_state to rollout() across the await boundary (concurrent-rollout safe)
  • Run-level tags: each generate() call auto-generates a unique tag (run-<epoch>-<short_uuid>) attached to every root span (rollout, group, generate), enabling filter/group by eval run in Braintrust. Tags stored in a ContextVar for concurrent safety. Can also be set manually via set_run_tags(['my-tag']).

Instrumented spans

  • environment.py: get_model_response (LLM spans with token counts), _run_rollout_state (rollout + scoring spans), _run_group_states (group spans with per-rollout child spans), generate (run tag lifecycle + flush on exit)
  • multiturn_env.py: rollout() setup spans, per-turn spans (with model/env duration breakdown), timeout logging
  • tool_env.py: call_tool() tool spans with args/result/duration; env_response() passes state= to call_tool for span parenting
  • stateful_tool_env.py: env_response() passes state=state to call_tool, enabling tool tracing for all StatefulToolEnv subclasses

pyproject.toml changes

  • braintrust added as optional dep group [braintrust]
  • [[tool.ty.overrides]] added for verifiers/envs/experimental/braintrust_tracing/** to suppress unresolved-import (braintrust is optional, not installed in CI)

Review & Testing Checklist

  • Experimental files will drift from main: The four environment files under experimental/braintrust_tracing/ are full copies of their core counterparts with tracing added. Any future changes to the core environment.py, multiturn_env.py, tool_env.py, or stateful_tool_env.py must be manually mirrored to the experimental copies. Verify this maintenance burden is acceptable vs. a mixin/decorator approach.
  • Import chain correctness: Each experimental class inherits from its experimental sibling (e.g., ToolEnv(MultiTurnEnv) where MultiTurnEnv is the local tracing variant). But decorators like @vf.stop, @vf.cleanup and types like vf.State, vf.Error still come from the original verifiers package. Verify these cross-references work correctly at runtime.
  • call_tool public API change: call_tool(**kwargs) reads kwargs.get("state") for span parenting, and env_response() passes state=state. Any third-party subclass that overrides call_tool and calls super().call_tool(...) without forwarding state will lose tool span nesting. Verify for downstream environments (e.g., MiniBrowseEnv).
  • End-to-end verification: Install with pip install -e '.[braintrust]', set BRAINTRUST_API_KEY, import StatefulToolEnv from the experimental path, and run an eval. Confirm nested spans (rollout → setup → turns → model/tool calls) appear in Braintrust with correct run-level tags.
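
The call_tool item in the checklist above can be reproduced with a toy model. This is a sketch under assumptions: the `call_tool(name, args, **kwargs)` signature is simplified, and `tool_spans` plus the `state["span"]` key are hypothetical stand-ins for the real span parenting.

```python
import asyncio


class ToolEnv:
    """Stand-in for the traced ToolEnv; records which parent span was used."""

    def __init__(self):
        self.tool_spans = []

    async def call_tool(self, name, args, **kwargs):
        state = kwargs.get("state")     # optional, used only for span parenting
        parent = state["span"] if state else None
        self.tool_spans.append((name, parent))
        return f"{name} ok"

    async def env_response(self, state):
        return await self.call_tool("search", {"q": "x"}, state=state)


class DropsState(ToolEnv):
    async def call_tool(self, name, args, **kwargs):
        # state is not forwarded, so the tool span has no parent
        return await super().call_tool(name, args)


class ForwardsState(ToolEnv):
    async def call_tool(self, name, args, **kwargs):
        # forwarding **kwargs keeps the tool span nested under the turn
        return await super().call_tool(name, args, **kwargs)
```

In both subclasses the tool call itself still succeeds; only the nesting differs, which is why this regression would be easy to miss without checking the trace view.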

Notes


Note

Medium Risk
Adds a large new experimental environment stack plus an optional braintrust dependency; while isolated from existing paths, it introduces new concurrency/contextvar-based tracing logic and substantial duplicated code that could drift from core behavior.

Overview
Adds an opt-in experimental Braintrust tracing implementation under verifiers/envs/experimental/braintrust_tracing/, providing drop-in Environment/MultiTurnEnv/ToolEnv/StatefulToolEnv variants that emit nested spans for rollout setup, per-turn execution, model requests (with token/duration metrics), tool calls, and scoring.

Introduces a self-contained braintrust_tracing.py layer with lazy logger init gated by BRAINTRUST_API_KEY, run-level tagging via ContextVar, best-effort serialization, and defensive no-op/error-swallowing behavior; the experimental environment.py wires these hooks into generate(), per-rollout/group execution, and request/tool paths.

Updates packaging to add a braintrust optional extra (verifiers[braintrust]), adds ty overrides to ignore optional imports in the experimental folder, and refreshes uv.lock to include braintrust and its transitive dependencies.

Reviewed by Cursor Bugbot for commit 5558e02. Bugbot is set up for automated code reviews on this repo. Configure here.

claude and others added 24 commits October 23, 2025 20:29
This commit enhances the evaluation logging and TUI to include:

1. Tool definitions: Store available tools in rollout state and save to results
   - ToolEnv now adds tool definitions (OAI schema) to state during init_state
   - Results.jsonl includes tools field when tools are present

2. Judge rubric data: Store structured judge information for TUI display
   - JudgeRubric now stores judge_data with prompt, response, inputs, and model
   - Results.jsonl includes judge_data field when judge rubric is used

3. TUI enhancements: New keybinds to view tools and judge data
   - Press 't' to view tools available to agent during a rollout
   - Press 'j' to view judge rubric results and inputs
   - Both open modal windows with formatted, scrollable content

These additions make debugging and analysis much easier by providing
full visibility into the agent's tool environment and LLM judge evaluations.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
…QkhMCLqyri6H5Nh8XJn

Add tools and judge rubric data to saved runs and TUI
…ort-011CUQkhMCLqyri6H5Nh8XJn

Revert "Add tools and judge rubric data to saved runs and TUI"
- New verifiers/braintrust_tracing.py module with nested span support
- Instrument environment.py: rollout lifecycle, model requests, scoring, groups
- Instrument multiturn_env.py: setup_state, turn loop, timeouts
- Instrument tool_env.py: tool call tracking with duration and errors
- Add braintrust as optional dependency group [braintrust]

Span hierarchy per rollout:
  rollout (task) -> setup_state (task) -> turn_N (task) ->
    model_request (llm) + tool_call (tool) -> scoring (score)

Activation: set BRAINTRUST_API_KEY env var. No-op when unset.
Co-Authored-By: Christian Reetz <cdreetz@gmail.com>
- Fix critical span nesting: rollout span is now stashed on env instance
  via _bt_pending_rollout_span so multiturn_env.rollout() can attach it
  to state immediately after init_state, before any child spans are created
- Add ty override for braintrust_tracing.py to suppress unresolved-import
  (braintrust is an optional dependency, not installed in CI)
- Remove stale type: ignore comments that ty doesn't recognize

Co-Authored-By: Christian Reetz <cdreetz@gmail.com>
… passing

Replace self._bt_pending_rollout_span with a contextvars.ContextVar so
concurrent rollouts don't overwrite each other's spans.  Each coroutine
gets its own copy of the context variable, making it safe for the
concurrent rollout pattern in generate().

Co-Authored-By: Christian Reetz <cdreetz@gmail.com>
Co-Authored-By: Christian Reetz <cdreetz@gmail.com>
…racing

Co-Authored-By: Christian Reetz <cdreetz@gmail.com>
_run_group_states calls self.rollout() directly (bypassing _run_rollout_state),
so no rollout spans were created in grouped mode.  Wrap each rollout in a
_traced_rollout helper that creates a child span under the group root,
sets the ContextVar, and finalizes the span with completion data.

Co-Authored-By: Christian Reetz <cdreetz@gmail.com>
…l after scoring

1. get_model_response: catch BaseException (incl. CancelledError) so
   response is never referenced unbound in the finally block.
2. _run_group_states: defer rollout_completed calls until after
   score_group so reward values are available in the spans.

Co-Authored-By: Christian Reetz <cdreetz@gmail.com>
Each generate() call auto-generates a unique tag (run-<epoch>-<short_uuid>)
that is attached to every root span (rollout, group, generate).  This lets
users filter and group traces by eval run in the Braintrust UI.

Tags can also be set manually via _bt.set_run_tags(['my-tag']).

Co-Authored-By: Christian Reetz <cdreetz@gmail.com>
1. _run_tags is now a ContextVar so concurrent generate() calls each
   get their own isolated tag set instead of overwriting a shared global.
2. set_run_tags() is only called when _bt.enabled() is True, avoiding
   unnecessary UUID generation and log noise when Braintrust is off.

Co-Authored-By: Christian Reetz <cdreetz@gmail.com>
If client.get_response() raises and repr(exc) also raises, response
would be unbound. Adding response=None and an 'is not None' guard
prevents a NameError from masking the original exception.

Co-Authored-By: Christian Reetz <cdreetz@gmail.com>
Only one except clause fires per try block, so error_msg is always
empty when BaseException (non-Exception) handler runs.

Co-Authored-By: Christian Reetz <cdreetz@gmail.com>
…cing/

Revert original environment.py, multiturn_env.py, tool_env.py, and
stateful_tool_env.py back to main. All braintrust-instrumented variants
now live under verifiers/envs/experimental/braintrust_tracing/ as
drop-in replacements.

Usage: from verifiers.envs.experimental.braintrust_tracing.stateful_tool_env import StatefulToolEnv
Co-Authored-By: Christian Reetz <cdreetz@gmail.com>
Pre-allocate lists and assign by index instead of appending, so span
attribution is correct regardless of task scheduling order.

Co-Authored-By: Christian Reetz <cdreetz@gmail.com>
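
The index-based collection described in the commit above can be sketched like this. The function names and the use of sleep delays to force out-of-order completion are illustrative assumptions, not the PR's code.

```python
import asyncio


async def traced_rollout(index, delay, appended, by_index):
    await asyncio.sleep(delay)
    appended.append(index)       # position depends on completion time
    by_index[index] = index      # slot is fixed by input position


async def run_group(delays):
    appended: list[int] = []
    by_index: list[object | None] = [None] * len(delays)
    await asyncio.gather(
        *(traced_rollout(i, d, appended, by_index) for i, d in enumerate(delays))
    )
    return appended, by_index


appended, by_index = asyncio.run(run_group([0.03, 0.01, 0.02]))
```

Here `by_index` always lines up with the inputs, while `appended` reflects whichever task happened to finish first.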
Use list[object | None] so ty accepts [None] * len(group_inputs).

Co-Authored-By: Christian Reetz <cdreetz@gmail.com>
Co-Authored-By: Christian Reetz <cdreetz@gmail.com>
# each get their own tag set without overwriting each other.
_run_tags: contextvars.ContextVar[list[str]] = contextvars.ContextVar(
    "_run_tags", default=[]
)

Mutable default in ContextVar creates shared state risk

Low Severity

The _run_tags ContextVar uses default=[], which is a single mutable list object shared across all contexts that haven't explicitly called .set(). While current callers always copy the returned list before use, any future code that does _run_tags.get().append(...) would silently corrupt the shared default, affecting all contexts. The safe pattern is to use a sentinel or immutable default (e.g., default=()) and convert to a list where needed.
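
A small demonstration of the risk and of the suggested fix, using only the standard library. The variable and function names are hypothetical and this is not the PR's code.

```python
import contextvars

# Mutable default: every context without an explicit .set() sees the same list.
unsafe_tags = contextvars.ContextVar("unsafe_tags", default=[])

# Immutable default: there is nothing to mutate in place.
safe_tags = contextvars.ContextVar("safe_tags", default=())


def demonstrate_leak():
    """Append inside a fresh context, then read from the calling context."""
    fresh = contextvars.Context()
    fresh.run(lambda: unsafe_tags.get().append("leaked"))
    return unsafe_tags.get()     # the shared default list was mutated


def get_tags():
    """Always hand callers their own list."""
    return list(safe_tags.get())


leaked = demonstrate_leak()
mine = get_tags()
mine.append("local-only")        # does not affect the ContextVar default
```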

Reviewed by Cursor Bugbot for commit b0460f6. Configure here.

Comment thread verifiers/envs/environment.py Outdated
error=repr(state["error"])[:500] if state.get("error") else "",
input_tokens=float(usage.get("input_tokens", 0)),
output_tokens=float(usage.get("output_tokens", 0)),
)

Rollout span leaks when rollout or scoring raises

Medium Severity

In _run_rollout_state, the calls to self.rollout(), self.rubric.score_rollout(), and self.rubric.cleanup() are not wrapped in a try/finally that ensures _bt.rollout_completed() is called. If any of these raise an exception, the Braintrust rollout span is never closed, leading to leaked open spans. The same pattern applies in _run_group_states where _traced_rollout doesn't protect against exceptions from self.rollout().
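
One way to address this is a try/finally around the rollout and scoring calls. This is a hedged sketch: the `Span` class and the simplified `run_rollout_state` signature are hypothetical stand-ins, not the Braintrust SDK or the PR's code.

```python
import asyncio


class Span:
    """Hypothetical stand-in for a tracing span."""

    def __init__(self, name):
        self.name = name
        self.closed = False
        self.error = ""

    def end(self, error=""):
        self.error = error
        self.closed = True


async def run_rollout_state(rollout, score, span):
    error = ""
    try:
        state = await rollout()
        await score(state)
        return state
    except BaseException as exc:   # also covers asyncio.CancelledError
        error = repr(exc)[:500]
        raise                      # never hide the original failure
    finally:
        span.end(error=error)      # runs on success, failure, and cancel
```

The same wrapper shape would apply to a per-rollout helper in the group path, so that a failing rollout still closes its child span.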

Reviewed by Cursor Bugbot for commit b0460f6. Configure here.

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

There are 3 total unresolved issues (including 2 from previous reviews).

Reviewed by Cursor Bugbot for commit 5558e02. Configure here.

await self.rubric.dummy_score_group(group_states)
end_scoring = time.time()
for state in group_states:
    state["timing"].scoring.end = end_scoring

Group scoring path missing Braintrust scoring child spans

Low Severity

In _run_rollout_state, scoring is wrapped with scoring_started/scoring_completed spans, but _run_group_states performs group scoring without any corresponding Braintrust scoring spans. Since group scoring is the default path (used when independent_scoring=False), most evaluations will be missing the "scoring" child span in their traces, making the tracing hierarchy inconsistent.

Reviewed by Cursor Bugbot for commit 5558e02. Configure here.

2 participants