Split composable harnesses and tasksets into packages by xeophon · Pull Request #1235 · PrimeIntellect-ai/verifiers

xeophon · 2026-04-23T18:33:02Z

Summary

This PR splits the composable benchmark stack out of verifiers.envs.experimental and into two first-class workspace packages:

packages/tasksets: task collections, sandbox specs, runtime specs, Harbor-backed task loading, SWE-family tasksets, SWE-bench Pro, Terminal-Bench 2, CP, Lean, and Math.
packages/harnesses: agent/harness definitions, the generic harness factory, Codex, Claude Code, OpenCode, OpenClaw, mini-SWE-agent, Pi Mono, Terminus 2, and RLM harness support.

It also adds thin composable environments for SWE-bench Pro and Terminal-Bench 2, wires the new packages into the root composable extra, and removes only the old experimental composable package copy. The legacy HarborEnv-based example envs stay in place.

Why

The old structure had a few awkward couplings:

Tasksets, harnesses, and env lifecycle code all lived under verifiers.envs.experimental, so downstream envs had to import implementation details from an experimental namespace.
SWE-bench Pro had environment-specific harness construction logic, so every new harness required another special case.
Tasksets could influence prompts and sandbox setup, but they could not cleanly expose tools or skills for harnesses that support MCP-style or skill-directory transports.
Harbor-specific behavior was duplicated in the composable tasksets path instead of being reused as taskset primitives.

This PR makes the reusable pieces explicit: tasksets describe the work and runtime requirements; harnesses describe the agent side; ComposableEnv joins them.

Package Split

tasksets now owns the task-facing contracts:

TaskSet / SandboxTaskSet
SandboxSpec
TaskRuntimeSpec
task upload directories
task-provided environment variables
task-provided tools and skills
Harbor task parsing and Terminal-Bench 2 loading
SWE tasksets, including the mainline SWE-family tasksets that had been added back under experimental

harnesses now owns the agent-facing contracts:

Harness
generic build_harness_from_config
make_native_harness and make_configurable_harness
harness-specific environment variables, upload mappings, install scripts, run commands, metrics paths, and post-install hooks
MCP-capable harness registration for Codex, Claude Code, OpenCode, and OpenClaw

The root pyproject.toml adds a composable extra and registers both packages as uv workspace members/sources, so local env packages can depend on verifiers[composable], harnesses, and tasksets without reaching into experimental internals.

Generic Harness Config

The composable env path now resolves harnesses through harnesses.build_harness_from_config instead of hard-coding SWE-bench Pro harness branches.

The existing flat TOML style still maps into env args and harness args through the same normalization path; this PR does not add eval config files. New harnesses can also be supplied through a generic factory path, so composable envs need no custom code for each scaffold.
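As a rough illustration of config-driven harness resolution, the sketch below looks up a factory by agent name and drops options the factory cannot accept. The registry, the factory functions, and `build_harness_sketch` are hypothetical names for this example; they are not the actual `harnesses.build_harness_from_config` implementation, and the "openclaw" fallback is taken from the review discussion further down.

```python
import inspect
from typing import Any, Callable

# Hypothetical registry; the real package may register harnesses differently.
HARNESS_FACTORIES: dict[str, Callable[..., dict[str, Any]]] = {}


def register(name: str):
    """Register a harness factory under an agent name."""
    def decorator(factory):
        HARNESS_FACTORIES[name] = factory
        return factory
    return decorator


@register("openclaw")
def make_openclaw(model: str = "default-model", reasoning_effort: str = "medium"):
    # Stand-in for a harness object: a plain dict keeps the sketch small.
    return {"agent": "openclaw", "model": model, "reasoning_effort": reasoning_effort}


@register("codex")
def make_codex(**options):
    return {"agent": "codex", **options}


def build_harness_sketch(config: dict[str, Any]) -> dict[str, Any]:
    """Resolve a harness from a flat config dict."""
    options = dict(config)
    agent = options.pop("agent", "openclaw")  # assumed default agent
    factory = HARNESS_FACTORIES[agent]
    params = inspect.signature(factory).parameters
    accepts_kwargs = any(
        p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()
    )
    if not accepts_kwargs:
        # Drop options the factory does not declare.
        options = {k: v for k, v in options.items() if k in params}
    return factory(**options)
```

With a registry like this, adding a scaffold means registering one more factory rather than adding a branch to each environment.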

Tool Channel

Tasksets can now expose tools through a first-class TaskTools channel. The current transport is MCP because that is what the CLI harnesses already understand.

The lifecycle is per rollout:

The taskset inspects the task state and returns the MCP server specs/env vars it needs.
ComposableEnv prepares those tools before sandbox upload/install.
The harness receives the task tools through Harness.with_tools(...).
MCP-aware harnesses fold those servers into their native config/command shape.

This keeps the taskset as the owner of task-specific tool details and the harness as the owner of how those tools are registered with a given agent. OpenCode still supports its disabled-tools list, so existing OpenCode TOMLs keep their current controls.
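The per-rollout flow above can be sketched as follows. Only `Harness.with_tools(...)` is named in this PR; the dataclasses, their fields, `tools_for_task`, and the `repo-search-mcp` server are hypothetical stand-ins chosen for illustration, so the real signatures may differ.

```python
from dataclasses import dataclass, field, replace
from typing import Any


@dataclass(frozen=True)
class McpServerSpec:
    """Hypothetical description of one MCP server a task needs."""
    name: str
    command: tuple[str, ...]


@dataclass(frozen=True)
class TaskToolsSketch:
    """Hypothetical stand-in for the TaskTools channel payload."""
    servers: tuple[McpServerSpec, ...] = ()
    env: dict[str, str] = field(default_factory=dict)


@dataclass(frozen=True)
class HarnessSketch:
    run_command: str
    env: dict[str, str] = field(default_factory=dict)
    mcp_servers: dict[str, dict[str, Any]] = field(default_factory=dict)

    def with_tools(self, tools: TaskToolsSketch) -> "HarnessSketch":
        # The harness decides how servers are registered with its agent.
        servers = {s.name: {"command": list(s.command)} for s in tools.servers}
        return replace(
            self,
            env={**self.env, **tools.env},
            mcp_servers={**self.mcp_servers, **servers},
        )


def tools_for_task(state: dict[str, Any]) -> TaskToolsSketch:
    # The taskset inspects task state and returns what this rollout needs.
    repo = state["repo"]
    return TaskToolsSketch(
        servers=(McpServerSpec("repo-search", ("repo-search-mcp", "--root", repo)),),
        env={"REPO_ROOT": repo},
    )


base = HarnessSketch(run_command="agent run")
rollout_harness = base.with_tools(tools_for_task({"repo": "/workspace/repo"}))
```

Returning a new harness instead of mutating the base one keeps per-rollout tool state from leaking between rollouts, which is one plausible reason for a `with_tools` style API.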

Skill Channel

Skills are split out from tools as their own first-class TaskSkills channel rather than being folded into a generic task-tools object.

A taskset can provide a source skill directory plus the sandbox destination directory. ComposableEnv handles the upload mapping and the harness receives the resulting skills path through Harness.with_skills(...). Harnesses that know how to consume skill directories can put that path into their runtime config; harnesses that do not use skills simply ignore the channel.

This mirrors the tool design but avoids pretending that skills and MCP servers have the same lifecycle. Tools are usually server specs plus env vars; skills are files that need upload/resolution into the sandbox before the agent starts.
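A minimal sketch of the skill channel, assuming a taskset hands over a source directory and a sandbox destination. `Harness.with_skills(...)` is the only name taken from this PR; `TaskSkillsSketch`, the two harness classes, `prepare_skills`, and the paths are hypothetical.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class TaskSkillsSketch:
    """Hypothetical stand-in for the TaskSkills channel payload."""
    source_dir: str   # directory on the host that holds the skill files
    sandbox_dir: str  # where the files should land inside the sandbox


@dataclass(frozen=True)
class SkillAwareHarness:
    run_command: str
    skills_path: Optional[str] = None

    def with_skills(self, path: str) -> "SkillAwareHarness":
        # A harness that consumes skills records the path in its runtime config.
        return replace(self, skills_path=path)


@dataclass(frozen=True)
class PlainHarness:
    run_command: str

    def with_skills(self, path: str) -> "PlainHarness":
        # A harness without skill support ignores the channel.
        return self


def prepare_skills(harness, skills: TaskSkillsSketch):
    """Return the harness for this rollout plus the upload mapping."""
    uploads = {skills.source_dir: skills.sandbox_dir}
    return harness.with_skills(skills.sandbox_dir), uploads


skills = TaskSkillsSketch("tasks/demo/skills", "/sandbox/skills")
aware, uploads = prepare_skills(SkillAwareHarness("agent run"), skills)
plain, _ = prepare_skills(PlainHarness("agent run"), skills)
```

The upload mapping is computed by the environment side in this sketch, so the files exist in the sandbox before the agent starts, whether or not the harness uses them.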

Environment Changes

SWE-bench Pro and Terminal-Bench 2 are now intentionally thin:

SWEBenchProEnv is NamedComposableEnv over SWEBenchProTaskSet.
TerminalBench2Env is NamedComposableEnv over TerminalBench2TaskSet.

The verbose environment constructors are gone. Pydantic-backed env-arg normalization maps TOML/CLI args into taskset args, harness config, and ComposableEnv args directly.
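The class-keyword style used by these thin environments can be reproduced with `__init_subclass__`. The sketch below is an assumption about how such a base class might work, not the actual `NamedComposableEnv`; `NamedComposableEnvSketch`, `DemoTaskSet`, and `DemoEnv` are hypothetical names.

```python
from typing import Any, Optional


class NamedComposableEnvSketch:
    """Hypothetical base class that binds a taskset via class keywords."""

    taskset_cls: Optional[type] = None
    default_harness_config: dict[str, Any] = {}

    def __init_subclass__(cls, taskset=None, default_harness_config=None, **kwargs):
        super().__init_subclass__(**kwargs)
        if taskset is not None:
            cls.taskset_cls = taskset
        if default_harness_config is not None:
            # Copy so subclasses never share one mutable default.
            cls.default_harness_config = dict(default_harness_config)

    def __init__(self, harness: Optional[dict[str, Any]] = None, **taskset_args):
        if self.taskset_cls is None:
            raise TypeError("subclass must be declared with taskset=...")
        self.taskset = self.taskset_cls(**taskset_args)
        # User-supplied harness options override the declared defaults.
        self.harness_config = {**self.default_harness_config, **(harness or {})}


class DemoTaskSet:
    def __init__(self, split: str = "test"):
        self.split = split


class DemoEnv(
    NamedComposableEnvSketch,
    taskset=DemoTaskSet,
    default_harness_config={"agent": "openclaw"},
):
    """Thin environment: everything is declared on the class line."""


env = DemoEnv(harness={"model": "some-model"}, split="dev")
```

Declaring the default harness on the class line keeps the default visible next to the environment rather than relying on a fallback elsewhere.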

CliAgentEnv also gets small extension points for dynamic run commands, start commands, env vars, and agent timeout handling so composable harnesses can vary behavior without subclassing the whole env.

Cleanup

This removes the old experimental composable package copy:

verifiers/envs/experimental/composable

The legacy HarborEnv implementation and existing HarborEnv example envs are intentionally preserved:

verifiers/envs/experimental/harbor_env
environments/opencode_harbor
environments/terminus_harbor
environments/hello_mcp_harbor

Shared checkout/file-lock helpers moved from verifiers.envs.experimental.utils to verifiers.utils.

Testing

Ran locally:

uv run ruff format && uv run ruff check --fix
uv run ty check packages/harnesses packages/tasksets verifiers/envs/composable_env.py verifiers/envs/composable_skills.py verifiers/envs/composable_tools.py verifiers/envs/experimental/cli_agent_env.py verifiers/envs/experimental/harbor_env
uv run pytest tests/test_harnesses_package.py tests/test_composable_env.py tests/test_tasksets_package.py tests/test_rlm_composable_env.py tests/test_cli_agent_env.py tests/test_harbor_env_mcp.py tests/test_opencode_harbor.py tests/test_envs.py -m 'not slow'

Pre-push also ran ruff and ty successfully.

cursor · 2026-04-23T18:34:23Z

+        for prefix in prefixes:
+            if key.startswith(prefix) and key != "agent_workdir":
+                harness_config.setdefault(key.removeprefix(prefix), value)
+                break


Prefix stripping retains original prefixed keys in config

Medium Severity

The prefix-stripping loop in build_harness_from_config uses setdefault to add the unprefixed key but never removes the original prefixed key from harness_config. When the factory doesn't accept **kwargs, the prefixed key is silently filtered out at lines 95-99. But if it does accept **kwargs, both claude_reasoning_effort and reasoning_effort are passed, which could cause unexpected behavior or TypeError in factories that don't expect the prefixed form.
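One way to address this is to build a new dict in which each prefixed key is replaced by its unprefixed form, so the prefixed key never reaches the factory. This is a sketch of the idea, not a patch against the PR: `normalize_prefixed_keys` and its `keep` parameter are hypothetical, and it preserves the `setdefault` semantics where an explicit unprefixed key wins.

```python
from typing import Any


def normalize_prefixed_keys(
    config: dict[str, Any],
    prefixes: tuple[str, ...],
    keep: tuple[str, ...] = ("agent_workdir",),
) -> dict[str, Any]:
    """Return a copy of config with prefixed keys replaced by unprefixed ones."""
    normalized: dict[str, Any] = {}
    stripped: dict[str, Any] = {}
    for key, value in config.items():
        prefix = next((p for p in prefixes if key.startswith(p)), None)
        if prefix is None or key in keep:
            normalized[key] = value  # pass through untouched
        else:
            stripped.setdefault(key.removeprefix(prefix), value)
    for key, value in stripped.items():
        # An explicitly provided unprefixed key takes precedence.
        normalized.setdefault(key, value)
    return normalized
```

For example, `normalize_prefixed_keys({"claude_reasoning_effort": "high"}, ("claude_",))` yields only `reasoning_effort`, so a factory that accepts `**kwargs` no longer sees both spellings.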

Reviewed by Cursor Bugbot for commit 73de671.

cursor

Cursor Bugbot has reviewed your changes and found 3 potential issues.

There are 4 total unresolved issues (including 1 from previous review).

Reviewed by Cursor Bugbot for commit 855d3b9.

cursor · 2026-04-23T18:41:36Z

+        if char == '"':
+            in_string = not in_string
+            continue
+        if in_string:


JSON parser skips characters after backslashes outside strings

Medium Severity

extract_json_content treats \ as an escape character globally, not just inside JSON strings. When in_string is False, a \ before {, }, or " in non-JSON model output still sets escape_next, causing the following character to be skipped. This can prevent the parser from finding the JSON object (e.g., model output containing \{ in LaTeX/regex) or mistrack string boundaries (e.g., \" outside strings skips a quote delimiter). The escape_next logic on lines 448–450 needs to be gated on in_string being True.
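The gating described here can be sketched as a small scanner in which backslash handling only happens while inside a string. This is an illustrative rewrite under stated assumptions, not the PR's `extract_json_content`: the function name `extract_json_object` is hypothetical, and validating each balanced candidate with `json.loads` is an extra design choice so that fragments such as `\{x\}` are skipped.

```python
import json
from typing import Optional


def extract_json_object(text: str) -> Optional[str]:
    """Return the first balanced {...} span in text that parses as JSON."""
    start = 0
    depth = 0
    in_string = False
    escape_next = False
    for index, char in enumerate(text):
        if in_string:
            # Escapes are only meaningful inside a JSON string.
            if escape_next:
                escape_next = False
            elif char == "\\":
                escape_next = True
            elif char == '"':
                in_string = False
            continue
        if depth == 0:
            # Outside any object: ignore everything except an opening brace.
            if char == "{":
                start, depth = index, 1
            continue
        if char == '"':
            in_string = True
        elif char == "{":
            depth += 1
        elif char == "}":
            depth -= 1
            if depth == 0:
                candidate = text[start:index + 1]
                try:
                    json.loads(candidate)
                except json.JSONDecodeError:
                    continue  # not JSON (e.g. LaTeX braces); keep scanning
                return candidate
    return None
```

Because backslashes outside strings are treated as ordinary characters, model output that mentions `\{` or `\"` before the JSON payload no longer hides the opening brace or a string delimiter.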

Reviewed by Cursor Bugbot for commit 855d3b9.

cursor · 2026-04-23T18:41:36Z

+
+
+class SWEBenchProEnv(NamedComposableEnv, taskset=SWEBenchProTaskSet):
+    """SWE-bench Pro environment composed from tasksets + harnesses."""


SWEBenchProEnv omits explicit default harness config

Low Severity

SWEBenchProEnv does not pass default_harness_config to NamedComposableEnv, unlike TerminalBench2Env which explicitly sets default_harness_config={"agent": "openclaw"}. SWEBenchProEnv only works because build_harness_from_config in config.py has a hardcoded "openclaw" fallback. The README claims the default is {agent = "openclaw"}, but the class doesn't declare it. If the hardcoded fallback ever changes, SWEBenchProEnv would silently break while TerminalBench2Env would not.

Additional Locations (1)

environments/terminal_bench_2/terminal_bench_2.py#L4-L9

Reviewed by Cursor Bugbot for commit 855d3b9.

cursor · 2026-04-23T18:41:36Z

+
+
+def load_environment(**env_args) -> SWEBenchProEnv:
+    return SWEBenchProEnv(**env_args)


Missing documentation updates for composable environment packages

Low Severity

This PR adds significant core user-facing functionality — ComposableEnv, NamedComposableEnv, the tasksets package, and the harnesses package — but does not update any documentation under docs/. At minimum, docs/environments.md (which covers environment patterns and the experimental section) needs entries for the composable environment workflow, and docs/faqs.md could note the new package split. The review rules require documentation updates when core user-facing functionality is added.

Triggered by project rule: BugBot Instructions

Reviewed by Cursor Bugbot for commit 855d3b9.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 73de67144f

chatgpt-codex-connector · 2026-04-23T18:43:57Z

+{shlex.quote(mini_binary)} \\
  --model "$OPENAI_MODEL" \\
+  --model-class {shlex.quote(model_class)} \\
  --task "$MINI_SWE_AGENT_TASK" \\
  --output {shlex.quote(trajectory_path)} \\


Reintroduce a hard timeout around mini agent execution

The mini-SWE-agent run command now launches mini directly without the previous timeout --kill-after=... guard, so when the agent exceeds the rollout timeout, CliAgentEnv.wait_for_completion marks the rollout timed out but does not terminate the background process. In timed-out runs this leaves the agent mutating the checkout while scoring runs (and keeps consuming sandbox resources), which can make rewards nondeterministic and corrupt experiment results.
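A hard guard of the kind described could be restored by wrapping the agent command before it is launched. The helper below is a hypothetical sketch, not code from the PR; it assumes GNU coreutils `timeout` and `sh` are available in the sandbox image, and the default grace period is an arbitrary choice.

```python
import shlex


def wrap_with_timeout(
    command: str, timeout_seconds: int, kill_after_seconds: int = 10
) -> str:
    """Wrap a shell command so it is terminated after timeout_seconds.

    `timeout` sends SIGTERM at the deadline and SIGKILL kill_after_seconds
    later if the process is still running. The command runs through `sh -c`
    so shell variables such as "$OPENAI_MODEL" are still expanded at run time.
    """
    return (
        f"timeout --kill-after={kill_after_seconds}s {timeout_seconds}s "
        f"sh -c {shlex.quote(command)}"
    )


wrapped = wrap_with_timeout('mini --model "$OPENAI_MODEL"', 600)
```

Quoting the whole command once with `shlex.quote` keeps the inner quoting intact while leaving variable expansion to the inner shell.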

chatgpt-codex-connector · 2026-04-23T18:43:57Z

+python3 -c 'import urllib.request; urllib.request.urlretrieve("https://astral.sh/uv/install.sh", "'"$UV_INSTALLER"'")'
+UV_UNMANAGED_INSTALL={quoted_uv_bin_dir} sh "$UV_INSTALLER" --quiet


Verify the uv installer before executing it

The install script downloads https://astral.sh/uv/install.sh and executes it immediately without any pinned hash or signature verification. This introduces a supply-chain integrity gap (and non-reproducible installs) compared with the rest of the harness installers that validate artifacts before execution; a changed or tampered installer would run arbitrary shell code in every rollout sandbox.
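A pinned-hash check of the kind suggested might look like the sketch below. `verify_sha256` is a hypothetical helper, and the pinned digest here is computed from stand-in bytes purely for demonstration; a real installer hash would have to be recorded from a trusted copy of the script.

```python
import hashlib
import hmac


def verify_sha256(payload: bytes, expected_hex: str) -> None:
    """Raise ValueError unless payload matches the pinned SHA-256 digest."""
    actual = hashlib.sha256(payload).hexdigest()
    if not hmac.compare_digest(actual, expected_hex.lower()):
        raise ValueError(f"installer digest mismatch: got {actual}")


# Stand-in installer contents; not the real uv install script.
installer = b"#!/bin/sh\necho installing\n"
pinned = hashlib.sha256(installer).hexdigest()

verify_sha256(installer, pinned)  # passes silently

try:
    verify_sha256(installer + b"rm -rf /\n", pinned)
    tampered_detected = False
except ValueError:
    tampered_detected = True
```

The check would run after the download and before the script is executed, so a changed installer fails the rollout setup instead of running in the sandbox.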


xeophon force-pushed the codex/composable-packages-pr branch from 73de671 to 855d3b9 on April 23, 2026, 18:37.

xeophon wants to merge 1 commit into main from codex/composable-packages-pr
