[codex] Add EvoClaw benchmark inference #705

Draft

xingyaoww wants to merge 6 commits into main from codex/evoclaw-benchmark

Conversation

@xingyaoww
Contributor

Summary

  • add an EvoClaw inference entrypoint that follows the OpenHands benchmarks Evaluation flow
  • discover EvoClaw repo directories from `--data-root` and launch a `DockerDevWorkspace` from each EvoClaw base image
  • upload task queue/SRS materials into the agent-server workspace and run the standard OpenHands SDK Agent/Conversation loop
  • emit git patches and conversation trajectories through the existing benchmark output writer (the whole flow is sketched after this list)
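
For orientation, here is a minimal sketch of the flow those bullets describe. It is not the code in this PR: the discovery helper, the `metadata.json` schema, and the per-instance steps are illustrative assumptions, and the real `DockerDevWorkspace`/`Agent`/`Conversation` calls are only named in comments because their signatures are not shown here.

```python
"""Minimal sketch of the EvoClaw inference flow (illustrative only).

Assumptions not taken from this PR's diff: the helper names, the
metadata.json schema, and the placeholder comments standing in for
the real OpenHands SDK calls (DockerDevWorkspace, Agent, Conversation).
"""
import argparse
import json
from pathlib import Path


def discover_instances(data_root: Path) -> list[dict]:
    """Walk --data-root for EvoClaw repo directories.

    Assumes each instance directory carries a metadata.json that,
    among other things, names its EvoClaw base image.
    """
    instances = []
    for metadata_path in sorted(data_root.glob("*/metadata.json")):
        metadata = json.loads(metadata_path.read_text())
        metadata["repo_dir"] = str(metadata_path.parent)
        instances.append(metadata)
    return instances


def run_one(instance: dict) -> None:
    """Per-instance flow, mirroring the summary bullets."""
    # 1. Launch a DockerDevWorkspace from the instance's base image.
    # 2. Upload the task queue / SRS materials into the workspace.
    # 3. Run the standard OpenHands SDK Agent/Conversation loop.
    # 4. Hand the git patch + trajectory to the benchmark output writer.
    print(f"would run inference for {instance['repo_dir']}")


def main() -> None:
    parser = argparse.ArgumentParser()
    parser.add_argument("--data-root", type=Path, required=True)
    args = parser.parse_args()
    for instance in discover_instances(args.data_root):
        run_one(instance)


if __name__ == "__main__":
    main()
```

Under those assumptions, an invocation would look like `python -m benchmarks.evoclaw.run_infer --data-root /path/to/evoclaw-data`, modulo whatever model and output flags the real entrypoint accepts.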

Motivation

This keeps the OpenHands agent implementation on the OpenHands benchmarks side instead of injecting an EvoClaw-owned SDK runner into an existing container. EvoClaw can be exercised through the same workspace-plus-agent-server model already used by the benchmark suite.

Notes

  • This is an inference harness; it does not reimplement EvoClaw's DAG grader in this PR.
  • The local environment does not currently include an EvoClaw-data checkout with `metadata.json` instances, so validation here is static and entrypoint-level only.

Validation

  • `uv run --no-sync python -m py_compile benchmarks/evoclaw/run_infer.py benchmarks/evoclaw/config.py`
  • `UV_CACHE_DIR=/mnt/data/evocloud/.uv-cache uv run --no-project --with ruff ruff check benchmarks/evoclaw/run_infer.py benchmarks/evoclaw/config.py`
  • `git diff --check`
