Split composable harnesses and tasksets into packages #1235

Open
xeophon wants to merge 1 commit into main from codex/composable-packages-pr

xeophon (Member) commented Apr 23, 2026

Summary

This PR splits the composable benchmark stack out of verifiers.envs.experimental and into two first-class workspace packages:

  • packages/tasksets: task collections, sandbox specs, runtime specs, Harbor-backed task loading, SWE-family tasksets, SWE-bench Pro, Terminal-Bench 2, CP, Lean, and Math.
  • packages/harnesses: agent/harness definitions, the generic harness factory, Codex, Claude Code, OpenCode, OpenClaw, mini-SWE-agent, Pi Mono, Terminus 2, and RLM harness support.

It also adds thin composable environments for SWE-bench Pro and Terminal-Bench 2, wires the new packages into the root composable extra, and removes only the old experimental composable package copy. The legacy HarborEnv-based example envs stay in place.

Why

The old structure had a few awkward couplings:

  • Tasksets, harnesses, and env lifecycle code all lived under verifiers.envs.experimental, so downstream envs had to import implementation details from an experimental namespace.
  • SWE-bench Pro had environment-specific harness construction logic, which meant every new harness wanted another special case.
  • Tasksets could influence prompts and sandbox setup, but they could not cleanly expose tools or skills for harnesses that support MCP-style or skill-directory transports.
  • Harbor-specific behavior was duplicated in the composable tasksets path instead of being reused as taskset primitives.

This PR makes the reusable pieces explicit: tasksets describe the work and runtime requirements; harnesses describe the agent side; ComposableEnv joins them.

Package Split

tasksets now owns the task-facing contracts:

  • TaskSet / SandboxTaskSet
  • SandboxSpec
  • TaskRuntimeSpec
  • task upload directories
  • task-provided environment variables
  • task-provided tools and skills
  • Harbor task parsing and Terminal-Bench 2 loading
  • SWE tasksets, including the mainline SWE-family tasksets that had been added back under experimental

harnesses now owns the agent-facing contracts:

  • Harness
  • generic build_harness_from_config
  • make_native_harness and make_configurable_harness
  • harness-specific environment variables, upload mappings, install scripts, run commands, metrics paths, and post-install hooks
  • MCP-capable harness registration for Codex, Claude Code, OpenCode, and OpenClaw

The root pyproject.toml adds a composable extra and registers both packages as uv workspace members/sources, so local env packages can depend on verifiers[composable], harnesses, and tasksets without reaching into experimental internals.

Generic Harness Config

The composable env path now resolves harnesses through harnesses.build_harness_from_config instead of hard-coding SWE-bench Pro harness branches.

Existing flat TOML style continues to map into env args/harness args through the same normalization path, but this PR does not add eval config files. New harnesses can also be provided with a generic factory path, so composable envs do not need custom code for each scaffold.
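As a rough illustration of the generic factory path, the sketch below resolves a harness from a flat config dict through a registry. Every name here (`HARNESS_FACTORIES`, `register_harness`, `build_harness_sketch`, and the dict-shaped "harness") is hypothetical and stands in for whatever `harnesses.build_harness_from_config` actually does; only the `"openclaw"` fallback is taken from the review discussion on this PR.

```python
from typing import Any, Callable

# Hypothetical registry; the real harnesses package may be organized differently.
HARNESS_FACTORIES: dict[str, Callable[..., dict[str, Any]]] = {}


def register_harness(name: str):
    """Register a factory under an agent name."""
    def decorator(factory: Callable[..., dict[str, Any]]):
        HARNESS_FACTORIES[name] = factory
        return factory
    return decorator


@register_harness("openclaw")
def make_openclaw(model: str = "default-model", **options: Any) -> dict[str, Any]:
    # A plain dict stands in for a real Harness object.
    return {"agent": "openclaw", "model": model, **options}


@register_harness("codex")
def make_codex(reasoning_effort: str = "medium") -> dict[str, Any]:
    return {"agent": "codex", "reasoning_effort": reasoning_effort}


def build_harness_sketch(config: dict[str, Any]) -> dict[str, Any]:
    """Pick a factory by the 'agent' key and pass the remaining keys to it."""
    options = dict(config)  # copy so the caller's config is not mutated
    agent = options.pop("agent", "openclaw")
    try:
        factory = HARNESS_FACTORIES[agent]
    except KeyError:
        raise ValueError(f"unknown harness {agent!r}") from None
    return factory(**options)
```

With this shape, adding a scaffold means registering one more factory; the environment code that calls `build_harness_sketch` does not change.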

Tool Channel

Tasksets can now expose tools through a first-class TaskTools channel. The current transport is MCP because that is what the CLI harnesses already understand.

The lifecycle is per rollout:

  1. The taskset inspects the task state and returns the MCP server specs/env vars it needs.
  2. ComposableEnv prepares those tools before sandbox upload/install.
  3. The harness receives the task tools through Harness.with_tools(...).
  4. MCP-aware harnesses fold those servers into their native config/command shape.

This keeps the taskset as the owner of task-specific tool details and the harness as the owner of how those tools are registered with a given agent. OpenCode still supports its disabled-tools list, so existing OpenCode TOMLs keep their current controls.
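The four lifecycle steps can be sketched roughly as follows. The classes below (`McpServerSpec`, `TaskToolsSketch`, `HarnessSketch`) and the `repo-search-mcp` command are hypothetical stand-ins, not the types added in this PR, and the `mcpServers` output shape is only one plausible way a harness might fold servers into its native config.

```python
from __future__ import annotations

from dataclasses import dataclass, field, replace


@dataclass(frozen=True)
class McpServerSpec:
    name: str
    command: list[str]


@dataclass(frozen=True)
class TaskToolsSketch:
    """What a taskset hands over: server specs plus env vars (hypothetical shape)."""
    servers: tuple[McpServerSpec, ...] = ()
    env: dict[str, str] = field(default_factory=dict)


def task_tools_for(state: dict) -> TaskToolsSketch:
    # Step 1: the taskset inspects per-rollout state and decides what it needs.
    repo = state["repo"]
    server = McpServerSpec("repo-search", ["repo-search-mcp", "--repo", repo])
    return TaskToolsSketch(servers=(server,), env={"REPO_NAME": repo})


@dataclass(frozen=True)
class HarnessSketch:
    run_command: str
    tools: TaskToolsSketch | None = None

    def with_tools(self, tools: TaskToolsSketch) -> HarnessSketch:
        # Step 3: return a copy carrying the task tools; the original is untouched.
        return replace(self, tools=tools)

    def mcp_config(self) -> dict:
        # Step 4: fold the servers into this harness's own config shape.
        if self.tools is None:
            return {}
        return {
            "mcpServers": {
                s.name: {"command": s.command[0], "args": s.command[1:]}
                for s in self.tools.servers
            }
        }
```

Returning a new harness from `with_tools` rather than mutating in place keeps per-rollout tool state from leaking between rollouts that share a base harness.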

Skill Channel

Skills are split out from tools as their own first-class TaskSkills channel rather than being folded into a generic task-tools object.

A taskset can provide a source skill directory plus the sandbox destination directory. ComposableEnv handles the upload mapping and the harness receives the resulting skills path through Harness.with_skills(...). Harnesses that know how to consume skill directories can put that path into their runtime config; harnesses that do not use skills simply ignore the channel.

This mirrors the tool design but avoids pretending that skills and MCP servers have the same lifecycle. Tools are usually server specs plus env vars; skills are files that need upload/resolution into the sandbox before the agent starts.
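A minimal sketch of the upload-mapping half of this channel, assuming a skill directory is just a tree of files. `TaskSkillsSketch` and `skill_upload_mapping` are hypothetical names chosen for illustration; the real `TaskSkills` type and the mapping logic in ComposableEnv may differ.

```python
from __future__ import annotations

from dataclasses import dataclass
from pathlib import Path, PurePosixPath


@dataclass(frozen=True)
class TaskSkillsSketch:
    """Local source directory plus the sandbox directory it should land in."""
    source_dir: Path
    sandbox_dir: PurePosixPath


def skill_upload_mapping(skills: TaskSkillsSketch) -> dict[Path, PurePosixPath]:
    """Map every local skill file to its destination path inside the sandbox."""
    mapping: dict[Path, PurePosixPath] = {}
    for path in sorted(skills.source_dir.rglob("*")):
        if path.is_file():
            relative = path.relative_to(skills.source_dir)
            # Sandbox paths are POSIX regardless of the host OS.
            mapping[path] = skills.sandbox_dir / relative.as_posix()
    return mapping
```

The harness would then only need the resolved `sandbox_dir`, which is what distinguishes this channel from tools: the work happens at upload time, before the agent starts.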

Environment Changes

SWE-bench Pro and Terminal-Bench 2 are now intentionally thin:

  • SWEBenchProEnv is NamedComposableEnv over SWEBenchProTaskSet.
  • TerminalBench2Env is NamedComposableEnv over TerminalBench2TaskSet.

The verbose environment constructors are gone. Pydantic-backed env-arg normalization maps TOML/CLI args into taskset args, harness config, and ComposableEnv args directly.
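For readers unfamiliar with class keyword arguments, the pattern behind `class SWEBenchProEnv(NamedComposableEnv, taskset=...)` can be sketched with `__init_subclass__`. The classes below are simplified, hypothetical stand-ins (note the `Sketch` suffixes) and do not reproduce the real constructor or the Pydantic-backed normalization.

```python
from __future__ import annotations

from typing import Any, ClassVar


class TaskSetSketch:
    def __init__(self, **args: Any) -> None:
        self.args = args


class NamedComposableEnvSketch:
    taskset_cls: ClassVar[type[TaskSetSketch]]
    default_harness_config: ClassVar[dict[str, Any]] = {}

    def __init_subclass__(
        cls,
        *,
        taskset: type[TaskSetSketch],
        default_harness_config: dict[str, Any] | None = None,
        **kwargs: Any,
    ) -> None:
        # Class keywords from the `class` statement arrive here.
        super().__init_subclass__(**kwargs)
        cls.taskset_cls = taskset
        cls.default_harness_config = dict(default_harness_config or {})

    def __init__(self, harness: dict[str, Any] | None = None, **taskset_args: Any) -> None:
        self.taskset = self.taskset_cls(**taskset_args)
        # Explicit harness settings override the class-level defaults.
        self.harness_config = {**self.default_harness_config, **(harness or {})}


class TerminalBench2TaskSetSketch(TaskSetSketch):
    pass


class TerminalBench2EnvSketch(
    NamedComposableEnvSketch,
    taskset=TerminalBench2TaskSetSketch,
    default_harness_config={"agent": "openclaw"},
):
    """Thin environment: everything is declared in the class statement."""
```

Declaring the default harness in the class statement keeps each thin environment self-describing instead of relying on a fallback elsewhere.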

CliAgentEnv also gets small extension points for dynamic run commands, start commands, env vars, and agent timeout handling so composable harnesses can vary behavior without subclassing the whole env.

Cleanup

This removes the old experimental composable package copy:

  • verifiers/envs/experimental/composable

The legacy HarborEnv implementation and existing HarborEnv example envs are intentionally preserved:

  • verifiers/envs/experimental/harbor_env
  • environments/opencode_harbor
  • environments/terminus_harbor
  • environments/hello_mcp_harbor

Shared checkout/file-lock helpers moved from verifiers.envs.experimental.utils to verifiers.utils.

Testing

Ran locally:

uv run ruff format && uv run ruff check --fix
uv run ty check packages/harnesses packages/tasksets verifiers/envs/composable_env.py verifiers/envs/composable_skills.py verifiers/envs/composable_tools.py verifiers/envs/experimental/cli_agent_env.py verifiers/envs/experimental/harbor_env
uv run pytest tests/test_harnesses_package.py tests/test_composable_env.py tests/test_tasksets_package.py tests/test_rlm_composable_env.py tests/test_cli_agent_env.py tests/test_harbor_env_mcp.py tests/test_opencode_harbor.py tests/test_envs.py -m 'not slow'

Pre-push also ran ruff and ty successfully.

for prefix in prefixes:
    if key.startswith(prefix) and key != "agent_workdir":
        harness_config.setdefault(key.removeprefix(prefix), value)
        break

Prefix stripping retains original prefixed keys in config

Medium Severity

The prefix-stripping loop in build_harness_from_config uses setdefault to add the unprefixed key but never removes the original prefixed key from harness_config. When the factory doesn't accept **kwargs, the prefixed key is silently filtered out at lines 95-99. But if it does accept **kwargs, both claude_reasoning_effort and reasoning_effort are passed, which could cause unexpected behavior or TypeError in factories that don't expect the prefixed form.
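One possible shape for a fix, as a sketch only: build a new dict so that prefixed keys never survive, while an explicitly set unprefixed key still wins (matching the `setdefault` behavior above). The helper name and the example prefixes are hypothetical; the surrounding code in `build_harness_from_config` is not shown in this review.

```python
from __future__ import annotations


def _matching_prefix(key: str, prefixes: tuple[str, ...]) -> str | None:
    """Return the first prefix that applies to key, or None."""
    if key == "agent_workdir":  # same exclusion as the reviewed loop
        return None
    return next((p for p in prefixes if key.startswith(p)), None)


def strip_harness_prefixes(config: dict, prefixes: tuple[str, ...]) -> dict:
    """Return a copy of config with prefixed keys replaced by their bare names."""
    result = {}
    # First pass: keep keys that carry no prefix, so explicit values take priority.
    for key, value in config.items():
        if _matching_prefix(key, prefixes) is None:
            result[key] = value
    # Second pass: add stripped names only where no explicit value exists.
    for key, value in config.items():
        prefix = _matching_prefix(key, prefixes)
        if prefix is not None:
            result.setdefault(key.removeprefix(prefix), value)
    return result
```

For example, `strip_harness_prefixes({"claude_reasoning_effort": "high"}, ("claude_",))` yields only `reasoning_effort`, so a factory accepting `**kwargs` never sees the prefixed form.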

Reviewed by Cursor Bugbot for commit 73de671.

xeophon force-pushed the codex/composable-packages-pr branch from 73de671 to 855d3b9 on April 23, 2026 18:37

cursor (bot) left a comment

Cursor Bugbot has reviewed your changes and found 3 potential issues.

There are 4 total unresolved issues (including 1 from previous review).

Reviewed by Cursor Bugbot for commit 855d3b9.

if char == '"':
    in_string = not in_string
    continue
if in_string:

JSON parser skips characters after backslashes outside strings

Medium Severity

extract_json_content treats \ as an escape character globally, not just inside JSON strings. When in_string is False, a \ before {, }, or " in non-JSON model output still sets escape_next, causing the following character to be skipped. This can prevent the parser from finding the JSON object (e.g., model output containing \{ in LaTeX/regex) or mistrack string boundaries (e.g., \" outside strings skips a quote delimiter). The escape_next logic on lines 448–450 needs to be gated on in_string being True.
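To illustrate the suggested gating, here is a standalone sketch of a brace-matching extractor that honors backslash escapes only while inside a string. It is not the actual `extract_json_content`; the function name and overall structure are assumptions made for the example.

```python
from __future__ import annotations


def extract_first_json_object(text: str) -> str | None:
    """Return the first balanced {...} span, ignoring braces inside strings."""
    start: int | None = None
    depth = 0
    in_string = False
    escape_next = False
    for i, ch in enumerate(text):
        if start is None:
            # Before the object starts, backslashes and quotes mean nothing.
            if ch == "{":
                start, depth = i, 1
            continue
        if in_string:
            # Escapes are only meaningful inside a JSON string.
            if escape_next:
                escape_next = False
            elif ch == "\\":
                escape_next = True
            elif ch == '"':
                in_string = False
            continue
        if ch == '"':
            in_string = True
        elif ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return text[start : i + 1]
    return None
```

With the input `r'regex \{"key": "va\"l}ue"}'`, the backslash before `{` is ignored because no string is open, while the `\"` and `}` inside the value are correctly treated as string content.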

Reviewed by Cursor Bugbot for commit 855d3b9.



class SWEBenchProEnv(NamedComposableEnv, taskset=SWEBenchProTaskSet):
    """SWE-bench Pro environment composed from tasksets + harnesses."""

SWEBenchProEnv omits explicit default harness config

Low Severity

SWEBenchProEnv does not pass default_harness_config to NamedComposableEnv, unlike TerminalBench2Env which explicitly sets default_harness_config={"agent": "openclaw"}. SWEBenchProEnv only works because build_harness_from_config in config.py has a hardcoded "openclaw" fallback. The README claims the default is {agent = "openclaw"}, but the class doesn't declare it. If the hardcoded fallback ever changes, SWEBenchProEnv would silently break while TerminalBench2Env would not.

Reviewed by Cursor Bugbot for commit 855d3b9.



def load_environment(**env_args) -> SWEBenchProEnv:
    return SWEBenchProEnv(**env_args)

Missing documentation updates for composable environment packages

Low Severity

This PR adds significant core user-facing functionality — ComposableEnv, NamedComposableEnv, the tasksets package, and the harnesses package — but does not update any documentation under docs/. At minimum, docs/environments.md (which covers environment patterns and the experimental section) needs entries for the composable environment workflow, and docs/faqs.md could note the new package split. The review rules require documentation updates when core user-facing functionality is added.

Triggered by project rule: BugBot Instructions

Reviewed by Cursor Bugbot for commit 855d3b9.

chatgpt-codex-connector (bot) left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 73de67144f

Comment on lines +158 to 162
{shlex.quote(mini_binary)} \\
--model "$OPENAI_MODEL" \\
--model-class {shlex.quote(model_class)} \\
--task "$MINI_SWE_AGENT_TASK" \\
--output {shlex.quote(trajectory_path)} \\
P1: Reintroduce a hard timeout around mini agent execution

The mini-SWE-agent run command now launches mini directly without the previous timeout --kill-after=... guard, so when the agent exceeds the rollout timeout, CliAgentEnv.wait_for_completion marks the rollout timed out but does not terminate the background process. In timed-out runs this leaves the agent mutating the checkout while scoring runs (and keeps consuming sandbox resources), which can make rewards nondeterministic and corrupt experiment results.
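A sketch of how such a guard could be reintroduced when the run command is assembled, assuming GNU coreutils `timeout` is available in the sandbox image. The helper name and default grace period are hypothetical; the previous implementation's exact flags are elided in this comment and are not reproduced here.

```python
import shlex


def with_hard_timeout(command: str, timeout_seconds: int, kill_after_seconds: int = 10) -> str:
    """Wrap a shell command so it is terminated at the deadline.

    `timeout` sends SIGTERM after `timeout_seconds`, then SIGKILL if the
    process is still alive `kill_after_seconds` later.
    """
    if timeout_seconds <= 0 or kill_after_seconds <= 0:
        raise ValueError("timeouts must be positive")
    return (
        f"timeout --kill-after={kill_after_seconds}s {timeout_seconds}s "
        f"bash -c {shlex.quote(command)}"
    )
```

For instance, `with_hard_timeout("echo hi", 600)` produces `timeout --kill-after=10s 600s bash -c 'echo hi'`. Running the agent under `bash -c` keeps pipelines and environment expansion inside the guarded process.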

Comment on lines +70 to +71
python3 -c 'import urllib.request; urllib.request.urlretrieve("https://astral.sh/uv/install.sh", "'"$UV_INSTALLER"'")'
UV_UNMANAGED_INSTALL={quoted_uv_bin_dir} sh "$UV_INSTALLER" --quiet
P2: Verify the uv installer before executing it

The install script downloads https://astral.sh/uv/install.sh and executes it immediately without any pinned hash or signature verification. This introduces a supply-chain integrity gap (and non-reproducible installs) compared with the rest of the harness installers that validate artifacts before execution; a changed or tampered installer would run arbitrary shell code in every rollout sandbox.
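A minimal sketch of the check being asked for, assuming the project pins a SHA-256 digest for a specific installer version. The helper name is hypothetical, and no real digest for the uv installer is given here; the pinned value would have to come from a release the maintainers have vetted.

```python
import hashlib
import hmac


def verify_sha256(payload: bytes, expected_hex: str) -> None:
    """Raise ValueError unless payload hashes to the pinned SHA-256 digest."""
    actual = hashlib.sha256(payload).hexdigest()
    # compare_digest avoids leaking how many leading characters matched.
    if not hmac.compare_digest(actual, expected_hex.lower()):
        raise ValueError(f"installer digest mismatch: got {actual}")
```

The installer bytes would be downloaded first, passed through `verify_sha256`, and only then written to disk and executed, so a changed or tampered script fails the rollout setup instead of running.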
