Split composable harnesses and tasksets into packages #1235
    for prefix in prefixes:
        if key.startswith(prefix) and key != "agent_workdir":
            harness_config.setdefault(key.removeprefix(prefix), value)
            break
Prefix stripping retains original prefixed keys in config
Medium Severity
The prefix-stripping loop in build_harness_from_config uses setdefault to add the unprefixed key but never removes the original prefixed key from harness_config. When the factory doesn't accept **kwargs, the prefixed key is silently filtered out at lines 95-99. But if it does accept **kwargs, both claude_reasoning_effort and reasoning_effort are passed, which could cause unexpected behavior or TypeError in factories that don't expect the prefixed form.
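One possible fix is sketched below. It is not the PR's code: the helper name `normalize_prefixed_keys` and the sample prefixes are hypothetical, and it assumes the loop can work on a copy of the config. Each prefixed key is popped once its unprefixed form has been recorded, so only one spelling can reach a factory that accepts `**kwargs`.

```python
def normalize_prefixed_keys(harness_config: dict, prefixes: tuple) -> dict:
    """Return a copy where prefixed keys are replaced by their unprefixed form.

    Hypothetical helper for illustration; the prefixes are assumptions.
    """
    normalized = dict(harness_config)
    for key, value in harness_config.items():
        for prefix in prefixes:
            if key.startswith(prefix) and key != "agent_workdir":
                # Drop the prefixed spelling so it cannot leak into **kwargs.
                normalized.pop(key)
                # An explicitly set unprefixed key still wins, as with setdefault.
                normalized.setdefault(key.removeprefix(prefix), value)
                break
    return normalized


config = {"claude_reasoning_effort": "high", "agent_workdir": "/work", "model": "m"}
result = normalize_prefixed_keys(config, ("claude_", "agent_"))
```

Here `agent_workdir` is kept untouched, matching the guard in the reviewed loop, and `claude_reasoning_effort` survives only as `reasoning_effort`.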
Reviewed by Cursor Bugbot for commit 73de671.
73de671 to 855d3b9
Cursor Bugbot has reviewed your changes and found 3 potential issues.
There are 4 total unresolved issues (including 1 from previous review).
Reviewed by Cursor Bugbot for commit 855d3b9.
    if char == '"':
        in_string = not in_string
        continue
    if in_string:
JSON parser skips characters after backslashes outside strings
Medium Severity
extract_json_content treats \ as an escape character globally, not just inside JSON strings. When in_string is False, a \ before {, }, or " in non-JSON model output still sets escape_next, causing the following character to be skipped. This can prevent the parser from finding the JSON object (e.g., model output containing \{ in LaTeX/regex) or mistrack string boundaries (e.g., \" outside strings skips a quote delimiter). The escape_next logic on lines 448–450 needs to be gated on in_string being True.
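The gating described above could look roughly like this. This is a minimal sketch, not the implementation of extract_json_content: the function name `find_json_object` is hypothetical, and it additionally assumes that quotes outside any object can be ignored. Backslash handling lives entirely inside the `in_string` branch.

```python
import json
from typing import Optional


def find_json_object(text: str) -> Optional[str]:
    """Return the first balanced {...} span that parses as JSON, else None."""
    start = 0
    depth = 0
    in_string = False
    escape_next = False
    for index, char in enumerate(text):
        if in_string:
            # Escapes only have meaning inside a JSON string.
            if escape_next:
                escape_next = False
            elif char == "\\":
                escape_next = True
            elif char == '"':
                in_string = False
            continue
        if char == '"' and depth > 0:
            in_string = True
        elif char == "{":
            if depth == 0:
                start = index
            depth += 1
        elif char == "}" and depth > 0:
            depth -= 1
            if depth == 0:
                candidate = text[start : index + 1]
                try:
                    json.loads(candidate)
                    return candidate
                except json.JSONDecodeError:
                    continue  # not JSON (e.g. LaTeX braces); keep scanning
    return None


sample = r'Use \{x\} in LaTeX, then {"a": "b\"}c"}'
found = find_json_object(sample)
```

In the sample, the LaTeX-style `\{x\}` is scanned and rejected as non-JSON, while the escaped quote inside the string value no longer ends the string early.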
Reviewed by Cursor Bugbot for commit 855d3b9.
    class SWEBenchProEnv(NamedComposableEnv, taskset=SWEBenchProTaskSet):
        """SWE-bench Pro environment composed from tasksets + harnesses."""
SWEBenchProEnv omits explicit default harness config
Low Severity
SWEBenchProEnv does not pass default_harness_config to NamedComposableEnv, unlike TerminalBench2Env which explicitly sets default_harness_config={"agent": "openclaw"}. SWEBenchProEnv only works because build_harness_from_config in config.py has a hardcoded "openclaw" fallback. The README claims the default is {agent = "openclaw"}, but the class doesn't declare it. If the hardcoded fallback ever changes, SWEBenchProEnv would silently break while TerminalBench2Env would not.
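To illustrate what an explicit declaration could look like, here is a sketch built on stand-in classes. `NamedEnvStandIn` and `SWEBenchProTaskSetStandIn` are hypothetical placeholders, and passing `default_harness_config` as a class keyword is an assumption about how NamedComposableEnv receives it, not something confirmed by the diff.

```python
class NamedEnvStandIn:
    """Hypothetical stand-in for NamedComposableEnv; not the PR's implementation."""

    taskset = None
    default_harness_config: dict = {}

    def __init_subclass__(cls, taskset=None, default_harness_config=None, **kwargs):
        super().__init_subclass__(**kwargs)
        cls.taskset = taskset
        # Copy so subclasses never share one mutable default.
        cls.default_harness_config = dict(default_harness_config or {})


class SWEBenchProTaskSetStandIn:
    """Placeholder taskset used only to make the sketch runnable."""


class SWEBenchProEnvSketch(
    NamedEnvStandIn,
    taskset=SWEBenchProTaskSetStandIn,
    default_harness_config={"agent": "openclaw"},
):
    """Declares its default agent instead of relying on a fallback elsewhere."""
```

With the default stated on the class, a later change to the fallback in build_harness_from_config would not alter this environment's behavior.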
Reviewed by Cursor Bugbot for commit 855d3b9.
    def load_environment(**env_args) -> SWEBenchProEnv:
        return SWEBenchProEnv(**env_args)
Missing documentation updates for composable environment packages
Low Severity
This PR adds significant core user-facing functionality — ComposableEnv, NamedComposableEnv, the tasksets package, and the harnesses package — but does not update any documentation under docs/. At minimum, docs/environments.md (which covers environment patterns and the experimental section) needs entries for the composable environment workflow, and docs/faqs.md could note the new package split. The review rules require documentation updates when core user-facing functionality is added.
Triggered by project rule: BugBot Instructions
Reviewed by Cursor Bugbot for commit 855d3b9.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 73de67144f
    {shlex.quote(mini_binary)} \\
        --model "$OPENAI_MODEL" \\
        --model-class {shlex.quote(model_class)} \\
        --task "$MINI_SWE_AGENT_TASK" \\
        --output {shlex.quote(trajectory_path)} \\
Reintroduce a hard timeout around mini agent execution
The mini-SWE-agent run command now launches mini directly without the previous timeout --kill-after=... guard, so when the agent exceeds the rollout timeout, CliAgentEnv.wait_for_completion marks the rollout timed out but does not terminate the background process. In timed-out runs this leaves the agent mutating the checkout while scoring runs (and keeps consuming sandbox resources), which can make rewards nondeterministic and corrupt experiment results.
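As a rough sketch of how such a guard could be reintroduced, the helper below prefixes an already-built run command with GNU coreutils `timeout`. The name `with_hard_timeout`, the 10-second grace period, and the `sh -c` wrapping are assumptions for illustration, and the sketch presumes `timeout` is available in the sandbox image.

```python
import shlex


def with_hard_timeout(run_command: str, timeout_seconds: int, grace_seconds: int = 10) -> str:
    """Wrap a shell command so it is terminated inside the sandbox on timeout.

    Hypothetical helper: sends SIGTERM after timeout_seconds, then SIGKILL
    after a further grace_seconds if the process is still alive.
    """
    guard = f"timeout --kill-after={grace_seconds}s {timeout_seconds}s"
    # Quote the whole command for `sh -c`; variables such as $OPENAI_MODEL
    # are expanded by the inner shell, so they must be exported.
    return f"{guard} sh -c {shlex.quote(run_command)}"


wrapped = with_hard_timeout('mini --model "$OPENAI_MODEL"', timeout_seconds=600)
```

Because the guard runs inside the sandbox, the agent process is stopped there even if the environment only marks the rollout as timed out.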
    python3 -c 'import urllib.request; urllib.request.urlretrieve("https://astral.sh/uv/install.sh", "'"$UV_INSTALLER"'")'
    UV_UNMANAGED_INSTALL={quoted_uv_bin_dir} sh "$UV_INSTALLER" --quiet
Verify the uv installer before executing it
The install script downloads https://astral.sh/uv/install.sh and executes it immediately without any pinned hash or signature verification. This introduces a supply-chain integrity gap (and non-reproducible installs) compared with the rest of the harness installers that validate artifacts before execution; a changed or tampered installer would run arbitrary shell code in every rollout sandbox.
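A checksum gate of the kind suggested here might be sketched as follows. The helper name `verify_sha256` is hypothetical, and the expected digest would have to be pinned by the maintainers. A pinned hash only stays valid if the download URL is also pinned to a specific release.

```python
import hashlib
import hmac
from pathlib import Path


def verify_sha256(path: Path, expected_hex: str) -> None:
    """Raise ValueError unless the file's SHA-256 matches the pinned digest."""
    digest = hashlib.sha256(path.read_bytes()).hexdigest()
    # compare_digest avoids leaking how many leading characters matched.
    if not hmac.compare_digest(digest, expected_hex.lower()):
        raise ValueError(f"checksum mismatch for {path}: got {digest}")
```

The installer would be executed only after this call returns without raising.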


Summary
This PR splits the composable benchmark stack out of verifiers.envs.experimental and into two first-class workspace packages:
- packages/tasksets: task collections, sandbox specs, runtime specs, Harbor-backed task loading, SWE-family tasksets, SWE-bench Pro, Terminal-Bench 2, CP, Lean, and Math.
- packages/harnesses: agent/harness definitions, the generic harness factory, Codex, Claude Code, OpenCode, OpenClaw, mini-SWE-agent, Pi Mono, Terminus 2, and RLM harness support.
It also adds thin composable environments for SWE-bench Pro and Terminal-Bench 2, wires the new packages into the root composable extra, and removes only the old experimental composable package copy. The legacy HarborEnv-based example envs stay in place.
Why
The old structure had a few awkward couplings:
- The composable pieces lived in verifiers.envs.experimental, so downstream envs had to import implementation details from an experimental namespace.
This PR makes the reusable pieces explicit: tasksets describe the work and runtime requirements; harnesses describe the agent side; ComposableEnv joins them.
Package Split
tasksets now owns the task-facing contracts:
- TaskSet / SandboxTaskSet
- SandboxSpec
- TaskRuntimeSpec
harnesses now owns the agent-facing contracts:
- Harness
- build_harness_from_config
- make_native_harness and make_configurable_harness
The root pyproject.toml adds a composable extra and registers both packages as uv workspace members/sources, so local env packages can depend on verifiers[composable], harnesses, and tasksets without reaching into experimental internals.
Generic Harness Config
The composable env path now resolves harnesses through harnesses.build_harness_from_config instead of hard-coding SWE-bench Pro harness branches.
Existing flat TOML style continues to map into env args/harness args through the same normalization path, but this PR does not add eval config files. New harnesses can also be provided with a generic factory path, so composable envs do not need custom code for each scaffold.
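To make the generic factory idea concrete, here is a small sketch of resolving a factory from a path and calling it with only the arguments it accepts. It is not how harnesses.build_harness_from_config is implemented: the `module:attribute` path format and the names `resolve_factory`, `call_with_supported_kwargs`, and `make_demo_harness` are all assumptions for illustration.

```python
import importlib
import inspect
from typing import Any, Callable


def resolve_factory(path: str) -> Callable[..., Any]:
    """Import a callable from an assumed 'module:attribute' path."""
    module_name, _, attribute = path.partition(":")
    return getattr(importlib.import_module(module_name), attribute)


def call_with_supported_kwargs(factory: Callable[..., Any], config: dict) -> Any:
    """Call factory with config, dropping keys it cannot accept."""
    params = inspect.signature(factory).parameters
    accepts_var_kwargs = any(
        p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()
    )
    if accepts_var_kwargs:
        return factory(**config)
    return factory(**{k: v for k, v in config.items() if k in params})


def make_demo_harness(agent: str, reasoning_effort: str = "medium") -> dict:
    """Hypothetical factory; a real one would return a Harness."""
    return {"agent": agent, "reasoning_effort": reasoning_effort}


harness = call_with_supported_kwargs(
    make_demo_harness, {"agent": "openclaw", "unrelated_env_arg": 5}
)
```

Filtering by signature keeps env-level arguments from reaching factories that do not declare them, while factories with `**kwargs` receive the full config.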
Tool Channel
Tasksets can now expose tools through a first-class TaskTools channel. The current transport is MCP because that is what the CLI harnesses already understand.
The lifecycle is per rollout:
- ComposableEnv prepares the taskset's tools before sandbox upload/install.
- The harness receives them through Harness.with_tools(...).
This keeps the taskset as the owner of task-specific tool details and the harness as the owner of how those tools are registered with a given agent. OpenCode still supports its disabled-tools list, so existing OpenCode TOMLs keep their current controls.
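The per-rollout hand-off can be pictured with the sketch below. `ToolsSketch` and `HarnessSketch` are hypothetical stand-ins whose fields are assumptions; they are not the TaskTools or Harness classes from the PR. The point is only that `with_tools` returns a rollout-specific copy and leaves the shared harness untouched.

```python
from dataclasses import dataclass, field, replace
from typing import Optional


@dataclass(frozen=True)
class ToolsSketch:
    """Stand-in for a task's tools: MCP server commands plus env vars."""

    mcp_servers: dict
    env: dict = field(default_factory=dict)


@dataclass(frozen=True)
class HarnessSketch:
    """Stand-in harness; the real Harness has more fields."""

    name: str
    tools: Optional[ToolsSketch] = None

    def with_tools(self, tools: ToolsSketch) -> "HarnessSketch":
        # Return a copy so the base harness can be reused across rollouts.
        return replace(self, tools=tools)


base = HarnessSketch(name="opencode")
task_tools = ToolsSketch(
    mcp_servers={"search": ["python", "-m", "search_server"]},
    env={"SEARCH_TOKEN": "placeholder"},
)
rollout_harness = base.with_tools(task_tools)
```

How each harness then registers the servers with its agent is left to the harness, which is the ownership split the description calls out.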
Skill Channel
Skills are split out from tools as their own first-class TaskSkills channel rather than being folded into a generic task-tools object.
A taskset can provide a source skill directory plus the sandbox destination directory. ComposableEnv handles the upload mapping and the harness receives the resulting skills path through Harness.with_skills(...). Harnesses that know how to consume skill directories can put that path into their runtime config; harnesses that do not use skills simply ignore the channel.
This mirrors the tool design but avoids pretending that skills and MCP servers have the same lifecycle. Tools are usually server specs plus env vars; skills are files that need upload/resolution into the sandbox before the agent starts.
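An upload mapping of this kind could be computed roughly as follows. This is a sketch under assumptions: the function name `skill_upload_mapping`, the directory layout, and the sandbox path are hypothetical, and the actual ComposableEnv logic may differ.

```python
import tempfile
from pathlib import Path, PurePosixPath


def skill_upload_mapping(source_dir: Path, sandbox_dir: str) -> dict:
    """Map every local skill file to its destination path in the sandbox."""
    dest_root = PurePosixPath(sandbox_dir)
    return {
        path: str(dest_root / path.relative_to(source_dir).as_posix())
        for path in sorted(source_dir.rglob("*"))
        if path.is_file()
    }


# Demo with a throwaway skill directory.
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "skills"
    (source / "lint").mkdir(parents=True)
    (source / "lint" / "SKILL.md").write_text("Run the linter before committing.\n")
    mapping = skill_upload_mapping(source, "/sandbox/skills")
    destinations = sorted(mapping.values())
```

The harness would then be given only the destination root (here `/sandbox/skills`), which matches the description of skills as files resolved into the sandbox before the agent starts.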
Environment Changes
SWE-bench Pro and Terminal-Bench 2 are now intentionally thin:
- SWEBenchProEnv is NamedComposableEnv over SWEBenchProTaskSet.
- TerminalBench2Env is NamedComposableEnv over TerminalBench2TaskSet.
The verbose environment constructors are gone. Pydantic-backed env-arg normalization maps TOML/CLI args into taskset args, harness config, and ComposableEnv args directly.
CliAgentEnv also gets small extension points for dynamic run commands, start commands, env vars, and agent timeout handling so composable harnesses can vary behavior without subclassing the whole env.
Cleanup
This removes the old experimental composable package copy:
- verifiers/envs/experimental/composable
The legacy HarborEnv implementation and existing HarborEnv example envs are intentionally preserved:
- verifiers/envs/experimental/harbor_env
- environments/opencode_harbor
- environments/terminus_harbor
- environments/hello_mcp_harbor
Shared checkout/file-lock helpers moved from verifiers.envs.experimental.utils to verifiers.utils.
Testing
Ran locally:
Pre-push also ran ruff and ty successfully.