20 changes: 20 additions & 0 deletions benchmarks/evoclaw/README.md
@@ -0,0 +1,20 @@
# EvoClaw

This benchmark entrypoint runs OpenHands against EvoClaw repositories through the
standard OpenHands benchmarks SDK path:

1. discover EvoClaw repo directories from `--data-root`,
2. build/start an OpenHands agent-server workspace from each EvoClaw base image,
3. upload the EvoClaw task queue and SRS files into the workspace,
4. run `Agent`/`Conversation` with the normal fake-user evaluation loop,
5. emit the resulting git patch and conversation trajectory.

```bash
uv run evoclaw-infer .llm_config/example.json \
  --data-root /path/to/EvoClaw-data \
  --repos navidrome \
  --n-limit 1
```
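
As a rough illustration of how the discovery step might interact with the `--data-root`, `--repos`, and `--n-limit` flags, here is a minimal sketch. The helper name `discover_repos` and the assumption that every immediate subdirectory of the data root is one EvoClaw repo are hypothetical; the actual entrypoint may use a different layout or filtering order.

```python
from pathlib import Path
from typing import Iterable, List, Optional


def discover_repos(
    data_root: Path,
    repos: Optional[Iterable[str]] = None,
    n_limit: Optional[int] = None,
) -> List[Path]:
    """Hypothetical helper: list candidate repo directories under data_root."""
    # Assumption: each immediate subdirectory of data_root is one EvoClaw repo.
    candidates = sorted(p for p in data_root.iterdir() if p.is_dir())

    # Keep only the requested repo names, if a filter was given.
    if repos is not None:
        wanted = set(repos)
        candidates = [p for p in candidates if p.name in wanted]

    # Truncate after filtering so the limit counts selected repos only.
    if n_limit is not None:
        candidates = candidates[:n_limit]
    return candidates
```

Under these assumptions, `discover_repos(Path("/path/to/EvoClaw-data"), repos=["navidrome"], n_limit=1)` would correspond to the command above.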

This is currently an inference harness. It intentionally does not reimplement
EvoClaw's milestone DAG grader inside this repo.
1 change: 1 addition & 0 deletions benchmarks/evoclaw/__init__.py
@@ -0,0 +1 @@
"""EvoClaw benchmark integration."""
14 changes: 14 additions & 0 deletions benchmarks/evoclaw/config.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,14 @@
"""Defaults for EvoClaw inference."""

INFER_DEFAULTS = {
    "dataset": "evoclaw",
    "split": "test",
    "max_iterations": 3000,
    "instance_timeout": 18000,
    "num_workers": 1,
    "n_critic_runs": 1,
    "workspace": "docker",
    "enable_condenser": True,
    "condenser_max_size": 100,
    "condenser_keep_first": 4,
}
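
One way defaults like these could be combined with command-line overrides is sketched below. The helper `resolve_config` is hypothetical and is not taken from this change; it only illustrates overlaying explicitly provided values on a copy of `INFER_DEFAULTS`, treating `None` as "flag not given".

```python
from typing import Any, Dict, Optional

# Copied from config.py so the sketch is self-contained.
INFER_DEFAULTS = {
    "dataset": "evoclaw",
    "split": "test",
    "max_iterations": 3000,
    "instance_timeout": 18000,
    "num_workers": 1,
    "n_critic_runs": 1,
    "workspace": "docker",
    "enable_condenser": True,
    "condenser_max_size": 100,
    "condenser_keep_first": 4,
}


def resolve_config(overrides: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    """Hypothetical helper: overlay explicit overrides on INFER_DEFAULTS."""
    config = dict(INFER_DEFAULTS)  # copy so the module-level defaults stay untouched
    for key, value in (overrides or {}).items():
        if key not in INFER_DEFAULTS:
            raise KeyError(f"unknown option: {key}")
        if value is not None:  # None is treated as "not provided"
            config[key] = value
    return config
```

Rejecting unknown keys is a design choice of this sketch: it surfaces misspelled option names early instead of silently ignoring them.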
9 changes: 9 additions & 0 deletions benchmarks/evoclaw/prompts/default.j2
Original file line number Diff line number Diff line change
@@ -0,0 +1,9 @@
We need to modify the repository in /testbed to complete the EvoClaw task queue.

Task queue:
{{ task_queue_path }}

Requirements files are available under:
{{ srs_dir }}

For each listed milestone, read its SRS file, implement the requested behavior in /testbed, and run the relevant tests when practical. If all listed milestones are complete, use the finish tool.