Add Braintrust tracing as experimental drop-in environment variants#1325
This commit enhances the evaluation logging and TUI to include:
1. Tool definitions: Store available tools in rollout state and save to results
- ToolEnv now adds tool definitions (OAI schema) to state during init_state
- Results.jsonl includes tools field when tools are present
2. Judge rubric data: Store structured judge information for TUI display
- JudgeRubric now stores judge_data with prompt, response, inputs, and model
- Results.jsonl includes judge_data field when judge rubric is used
3. TUI enhancements: New keybinds to view tools and judge data
- Press 't' to view tools available to agent during a rollout
- Press 'j' to view judge rubric results and inputs
- Both open modal windows with formatted, scrollable content
These additions make debugging and analysis much easier by providing full visibility into the agent's tool environment and LLM judge evaluations.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
…QkhMCLqyri6H5Nh8XJn Add tools and judge rubric data to saved runs and TUI
…ort-011CUQkhMCLqyri6H5Nh8XJn Revert "Add tools and judge rubric data to saved runs and TUI"
- New verifiers/braintrust_tracing.py module with nested span support
- Instrument environment.py: rollout lifecycle, model requests, scoring, groups
- Instrument multiturn_env.py: setup_state, turn loop, timeouts
- Instrument tool_env.py: tool call tracking with duration and errors
- Add braintrust as optional dependency group [braintrust]
Span hierarchy per rollout:
rollout (task) -> setup_state (task) -> turn_N (task) ->
model_request (llm) + tool_call (tool) -> scoring (score)
Activation: set BRAINTRUST_API_KEY env var. No-op when unset.
Co-Authored-By: Christian Reetz <cdreetz@gmail.com>
- Fix critical span nesting: rollout span is now stashed on env instance via _bt_pending_rollout_span so multiturn_env.rollout() can attach it to state immediately after init_state, before any child spans are created
- Add ty override for braintrust_tracing.py to suppress unresolved-import (braintrust is an optional dependency, not installed in CI)
- Remove stale type: ignore comments that ty doesn't recognize
Co-Authored-By: Christian Reetz <cdreetz@gmail.com>
… passing
Replace self._bt_pending_rollout_span with a contextvars.ContextVar so concurrent rollouts don't overwrite each other's spans. Each coroutine gets its own copy of the context variable, making it safe for the concurrent rollout pattern in generate().
Co-Authored-By: Christian Reetz <cdreetz@gmail.com>
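The isolation this commit relies on can be seen in a small standalone sketch. It assumes the rollouts are driven by asyncio.gather(), which wraps each coroutine in a Task that runs in a copy of the current context. The names pending_span and rollout are illustrative stand-ins, not the PR's actual identifiers.

```python
import asyncio
import contextvars
from typing import Optional

# Hypothetical stand-in for the pending rollout span variable in the PR.
pending_span: contextvars.ContextVar[Optional[str]] = contextvars.ContextVar(
    "pending_span", default=None
)


async def rollout(name: str) -> tuple[str, Optional[str]]:
    # This set() is visible only inside the current task's context copy.
    pending_span.set(f"span-{name}")
    # Yield to the event loop so the other rollouts run in between.
    await asyncio.sleep(0)
    return name, pending_span.get()


async def main() -> list:
    # gather() schedules each coroutine as its own Task with its own context.
    return await asyncio.gather(*(rollout(n) for n in ("a", "b", "c")))


results = asyncio.run(main())
```

Each rollout reads back the span it set, even though all three interleave on one event loop, and the module-level default is left untouched.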
Co-Authored-By: Christian Reetz <cdreetz@gmail.com>
…racing
Co-Authored-By: Christian Reetz <cdreetz@gmail.com>
_run_group_states calls self.rollout() directly (bypassing _run_rollout_state), so no rollout spans were created in grouped mode. Wrap each rollout in a _traced_rollout helper that creates a child span under the group root, sets the ContextVar, and finalizes the span with completion data. Co-Authored-By: Christian Reetz <cdreetz@gmail.com>
…l after scoring
1. get_model_response: catch BaseException (incl. CancelledError) so response is never referenced unbound in the finally block.
2. _run_group_states: defer rollout_completed calls until after score_group so reward values are available in the spans.
Co-Authored-By: Christian Reetz <cdreetz@gmail.com>
Each generate() call auto-generates a unique tag (run-<epoch>-<short_uuid>) that is attached to every root span (rollout, group, generate). This lets users filter and group traces by eval run in the Braintrust UI. Tags can also be set manually via _bt.set_run_tags(['my-tag']). Co-Authored-By: Christian Reetz <cdreetz@gmail.com>
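For reference, a tag of the shape run-<epoch>-<short_uuid> could be produced as follows. This is only a sketch: the helper name make_run_tag and the 8-character UUID prefix are assumptions, since the commit message does not state how short the UUID is.

```python
import time
import uuid


def make_run_tag() -> str:
    """Build a tag shaped like run-<epoch>-<short_uuid> (hypothetical helper)."""
    epoch = int(time.time())            # whole seconds since the Unix epoch
    short_uuid = uuid.uuid4().hex[:8]   # assumed length: first 8 hex characters
    return f"run-{epoch}-{short_uuid}"


tag = make_run_tag()
```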
1. _run_tags is now a ContextVar so concurrent generate() calls each get their own isolated tag set instead of overwriting a shared global.
2. set_run_tags() is only called when _bt.enabled() is True, avoiding unnecessary UUID generation and log noise when Braintrust is off.
Co-Authored-By: Christian Reetz <cdreetz@gmail.com>
If client.get_response() raises and repr(exc) also raises, response would be unbound. Adding response=None and an 'is not None' guard prevents a NameError from masking the original exception. Co-Authored-By: Christian Reetz <cdreetz@gmail.com>
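The guard described here follows a general pattern: bind the name before the try block so the finally block can always read it. The sketch below is synchronous and uses hypothetical names (fetch, get_response, log) to keep it short; it is not the PR's get_model_response.

```python
def fetch(get_response, log):
    response = None  # bound up front, so the finally block never hits a NameError
    try:
        response = get_response()
        return response
    finally:
        # Only record a response when the call actually produced one.
        if response is not None:
            log.append(response)


def failing():
    raise RuntimeError("upstream failure")


log = []
caught = None
try:
    fetch(failing, log)
except RuntimeError as exc:
    caught = exc  # the original exception surfaces, not a NameError

ok = fetch(lambda: "ok", log)
```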
Only one except clause fires per try block, so error_msg is always empty when BaseException (non-Exception) handler runs. Co-Authored-By: Christian Reetz <cdreetz@gmail.com>
…cing/
Revert original environment.py, multiturn_env.py, tool_env.py, and stateful_tool_env.py back to main. All braintrust-instrumented variants now live under verifiers/envs/experimental/braintrust_tracing/ as drop-in replacements.
Usage: from verifiers.envs.experimental.braintrust_tracing.stateful_tool_env import StatefulToolEnv
Co-Authored-By: Christian Reetz <cdreetz@gmail.com>
Pre-allocate lists and assign by index instead of appending, so span attribution is correct regardless of task scheduling order. Co-Authored-By: Christian Reetz <cdreetz@gmail.com>
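A minimal sketch of the index-assignment pattern, assuming each rollout task writes into a shared list. The names run_one and run_group are illustrative, and the delays are chosen so the tasks finish out of input order.

```python
import asyncio
from typing import Optional


async def run_one(i: int, delay: float, out: list) -> None:
    await asyncio.sleep(delay)
    # The slot is fixed by the input index, not by which task finishes first.
    out[i] = f"state-{i}"


async def run_group(delays: list) -> list:
    out: list[Optional[str]] = [None] * len(delays)  # pre-allocated slots
    await asyncio.gather(*(run_one(i, d, out) for i, d in enumerate(delays)))
    return out


# Task 0 finishes last, task 1 first; the output order still matches the input.
states = asyncio.run(run_group([0.03, 0.0, 0.01]))
```

With append() inside run_one, the same run would yield the states in completion order instead.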
Use list[object | None] so ty accepts [None] * len(group_inputs). Co-Authored-By: Christian Reetz <cdreetz@gmail.com>
Co-Authored-By: Christian Reetz <cdreetz@gmail.com>
# each get their own tag set without overwriting each other.
_run_tags: contextvars.ContextVar[list[str]] = contextvars.ContextVar(
    "_run_tags", default=[]
)
Mutable default in ContextVar creates shared state risk
Low Severity
The _run_tags ContextVar uses default=[], which is a single mutable list object shared across all contexts that haven't explicitly called .set(). While current callers always copy the returned list before use, any future code that does _run_tags.get().append(...) would silently corrupt the shared default, affecting all contexts. The safe pattern is to use a sentinel or immutable default (e.g., default=()) and convert to a list where needed.
Reviewed by Cursor Bugbot for commit b0460f6.
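The immutable-default pattern suggested in this comment might look like the sketch below. The names run_tags, set_run_tags, and get_run_tags are re-declared here for illustration and are not copied from the PR.

```python
import contextvars

# A tuple default cannot be mutated in place, so contexts that never call
# set() cannot corrupt each other through the shared default object.
run_tags: contextvars.ContextVar[tuple[str, ...]] = contextvars.ContextVar(
    "run_tags", default=()
)


def set_run_tags(tags) -> None:
    run_tags.set(tuple(tags))


def get_run_tags() -> list:
    # Hand callers a fresh list each time; mutating it affects nothing else.
    return list(run_tags.get())


scratch = get_run_tags()
scratch.append("oops")  # touches only the caller's copy

ctx = contextvars.copy_context()
ctx.run(set_run_tags, ["my-tag"])  # set inside a separate context
```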
    error=repr(state["error"])[:500] if state.get("error") else "",
    input_tokens=float(usage.get("input_tokens", 0)),
    output_tokens=float(usage.get("output_tokens", 0)),
)
Rollout span leaks when rollout or scoring raises
Medium Severity
In _run_rollout_state, the calls to self.rollout(), self.rubric.score_rollout(), and self.rubric.cleanup() are not wrapped in a try/finally that ensures _bt.rollout_completed() is called. If any of these raise an exception, the Braintrust rollout span is never closed, leading to leaked open spans. The same pattern applies in _run_group_states where _traced_rollout doesn't protect against exceptions from self.rollout().
Reviewed by Cursor Bugbot for commit b0460f6.
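One way to guarantee the span is closed is to move the completion call into a finally block. The sketch below uses a stand-in FakeSpan class rather than the Braintrust SDK, and run_traced is a hypothetical wrapper, so it shows the control flow only.

```python
import asyncio


class FakeSpan:
    """Stand-in for a tracing span; not the Braintrust API."""

    def __init__(self) -> None:
        self.closed = False
        self.error = ""

    def close(self, error: str = "") -> None:
        self.closed = True
        self.error = error


async def run_traced(span: FakeSpan, rollout):
    error = ""
    try:
        return await rollout()
    except BaseException as exc:  # includes asyncio.CancelledError
        error = repr(exc)[:500]   # same truncation as the snippet above
        raise
    finally:
        span.close(error=error)   # runs on success, failure, and cancellation


async def failing_rollout():
    raise ValueError("tool crashed")


span = FakeSpan()
try:
    asyncio.run(run_traced(span, failing_rollout))
except ValueError:
    pass  # the original exception still propagates to the caller
```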
Cursor Bugbot has reviewed your changes and found 1 potential issue.
There are 3 total unresolved issues (including 2 from previous reviews).
Reviewed by Cursor Bugbot for commit 5558e02.
await self.rubric.dummy_score_group(group_states)
end_scoring = time.time()
for state in group_states:
    state["timing"].scoring.end = end_scoring
Group scoring path missing Braintrust scoring child spans
Low Severity
In _run_rollout_state, scoring is wrapped with scoring_started/scoring_completed spans, but _run_group_states performs group scoring without any corresponding Braintrust scoring spans. Since group scoring is the default path (used when independent_scoring=False), most evaluations will be missing the "scoring" child span in their traces, making the tracing hierarchy inconsistent.
Reviewed by Cursor Bugbot for commit 5558e02.


Summary
Adds opt-in Braintrust tracing to verifiers as experimental drop-in replacements for the core environment classes, living entirely under verifiers/envs/experimental/braintrust_tracing/. The original core files (environment.py, multiturn_env.py, tool_env.py, stateful_tool_env.py) are unchanged from main, so no tracing code touches production paths.
When BRAINTRUST_API_KEY is set, every rollout emits a trace with child spans for setup, turns, model requests, tool calls, and scoring. When unset, all tracing is a no-op with near-zero overhead.
Usage
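A minimal usage sketch, using the import path given in the commit that moved the instrumented classes into the experimental folder. The ImportError fallback is an addition so the snippet also runs where this branch of verifiers is not installed.

```python
import os

# Tracing is gated on this variable; per the description, it is a no-op when unset.
tracing_enabled = bool(os.environ.get("BRAINTRUST_API_KEY"))

try:
    # Drop-in replacement for the core StatefulToolEnv.
    from verifiers.envs.experimental.braintrust_tracing.stateful_tool_env import (
        StatefulToolEnv,
    )
except ImportError:
    StatefulToolEnv = None  # verifiers (or this experimental module) is unavailable
```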
Structure
Inheritance chain within the experimental folder: Environment → MultiTurnEnv → ToolEnv → StatefulToolEnv. Each class inherits from its experimental sibling (not from the original vf.* classes), so the full tracing instrumentation propagates through the hierarchy.
braintrust_tracing.py module
A self-contained tracing layer with:
- _safe() / _safe_response() serialization helpers for Pydantic models and arbitrary objects
- contextvars.ContextVar for passing the rollout span from _run_rollout_state → rollout() across the await boundary (concurrent-rollout safe)
- Each generate() call auto-generates a unique tag (run-<epoch>-<short_uuid>) attached to every root span (rollout, group, generate), enabling filter/group by eval run in Braintrust. Tags are stored in a ContextVar for concurrent safety and can also be set manually via set_run_tags(['my-tag']).
Instrumented spans
- environment.py: get_model_response (LLM spans with token counts), _run_rollout_state (rollout + scoring spans), _run_group_states (group spans with per-rollout child spans), generate (run tag lifecycle + flush on exit)
- multiturn_env.py: rollout() setup spans, per-turn spans (with model/env duration breakdown), timeout logging
- tool_env.py: call_tool() tool spans with args/result/duration; env_response() passes state= to call_tool for span parenting
- stateful_tool_env.py: env_response() passes state=state to call_tool, enabling tool tracing for all StatefulToolEnv subclasses
pyproject.toml changes
- braintrust added as optional dep group [braintrust]
- [[tool.ty.overrides]] added for verifiers/envs/experimental/braintrust_tracing/** to suppress unresolved-import (braintrust is optional, not installed in CI)
Review & Testing Checklist
- The files in experimental/braintrust_tracing/ are full copies of their core counterparts with tracing added. Any future changes to the core environment.py, multiturn_env.py, tool_env.py, or stateful_tool_env.py must be manually mirrored to the experimental copies. Verify this maintenance burden is acceptable vs. a mixin/decorator approach.
- Each class inherits from its experimental sibling (e.g., ToolEnv(MultiTurnEnv), where MultiTurnEnv is the local tracing variant). But decorators like @vf.stop, @vf.cleanup and types like vf.State, vf.Error still come from the original verifiers package. Verify these cross-references work correctly at runtime.
- call_tool public API change: call_tool(**kwargs) reads kwargs.get("state") for span parenting, and env_response() passes state=state. Any third-party subclass that overrides call_tool and calls super().call_tool(...) without forwarding state will lose tool span nesting. Verify for downstream environments (e.g., MiniBrowseEnv).
- pip install -e '.[braintrust]', set BRAINTRUST_API_KEY, import StatefulToolEnv from the experimental path, and run an eval. Confirm nested spans (rollout → setup → turns → model/tool calls) appear in Braintrust with correct run-level tags.
Notes
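The call_tool item in the checklist can be illustrated with a small stand-in hierarchy. The base class below only mimics the described behavior (reading kwargs.get("state") to find a parent span). Its positional parameters and the "span" key are assumptions, not the real signature.

```python
import asyncio


class TracingToolEnv:
    """Hypothetical stand-in for the experimental ToolEnv."""

    async def call_tool(self, tool_name, tool_args, tool_call_id, **kwargs):
        state = kwargs.get("state")
        parent = state.get("span") if state else None
        return {"tool": tool_name, "parent_span": parent}


class ForwardingEnv(TracingToolEnv):
    async def call_tool(self, tool_name, tool_args, tool_call_id, **kwargs):
        # Passing **kwargs through keeps state, and so the span parent.
        return await super().call_tool(tool_name, tool_args, tool_call_id, **kwargs)


class DroppingEnv(TracingToolEnv):
    async def call_tool(self, tool_name, tool_args, tool_call_id, **kwargs):
        # state is silently dropped here, so the tool span loses its parent.
        return await super().call_tool(tool_name, tool_args, tool_call_id)


state = {"span": "turn_0"}
kept = asyncio.run(ForwardingEnv().call_tool("search", {}, "call_1", state=state))
lost = asyncio.run(DroppingEnv().call_tool("search", {}, "call_1", state=state))
```

Neither override raises, which is why a subclass that drops state would be easy to miss without checking the resulting trace.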
Note
Medium Risk
Adds a large new experimental environment stack plus an optional braintrust dependency; while isolated from existing paths, it introduces new concurrency/contextvar-based tracing logic and substantial duplicated code that could drift from core behavior.
Overview
Adds an opt-in experimental Braintrust tracing implementation under verifiers/envs/experimental/braintrust_tracing/, providing drop-in Environment/MultiTurnEnv/ToolEnv/StatefulToolEnv variants that emit nested spans for rollout setup, per-turn execution, model requests (with token/duration metrics), tool calls, and scoring.
Introduces a self-contained braintrust_tracing.py layer with lazy logger init gated by BRAINTRUST_API_KEY, run-level tagging via ContextVar, best-effort serialization, and defensive no-op/error-swallowing behavior; the experimental environment.py wires these hooks into generate(), per-rollout/group execution, and request/tool paths.
Updates packaging to add a braintrust optional extra (verifiers[braintrust]), adds ty overrides to ignore optional imports in the experimental folder, and refreshes uv.lock to include braintrust and its transitive dependencies.
Reviewed by Cursor Bugbot for commit 5558e02. Bugbot is set up for automated code reviews on this repo.