[codex] Add EvoClaw benchmark inference #705

Draft

xingyaoww wants to merge 6 commits into main from codex/evoclaw-benchmark

Conversation

@xingyaoww
Contributor

Summary

  • add an EvoClaw inference entrypoint that follows the OpenHands benchmarks Evaluation flow
  • discover EvoClaw repo directories from `--data-root` and launch a `DockerDevWorkspace` from each EvoClaw base image
  • upload task queue/SRS materials into the agent-server workspace and run the standard OpenHands SDK Agent/Conversation loop
  • emit git patches and conversation trajectories through the existing benchmark output writer (the whole flow is sketched after this list)
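
For orientation, here is a minimal sketch of the flow those bullets describe. It is not the code in this PR: the discovery helper, the `metadata.json` schema, and the per-instance steps are illustrative assumptions, and the real `DockerDevWorkspace`/`Agent`/`Conversation` calls are only named in comments because their signatures are not shown here.

```python
"""Minimal sketch of the EvoClaw inference flow (illustrative only).

Assumptions not taken from this PR's diff: the helper names, the
metadata.json schema, and the placeholder comments standing in for
the real OpenHands SDK calls (DockerDevWorkspace, Agent, Conversation).
"""
import argparse
import json
from pathlib import Path


def discover_instances(data_root: Path) -> list[dict]:
    """Walk --data-root for EvoClaw repo directories.

    Assumes each instance directory carries a metadata.json that,
    among other things, names its EvoClaw base image.
    """
    instances = []
    for metadata_path in sorted(data_root.glob("*/metadata.json")):
        metadata = json.loads(metadata_path.read_text())
        metadata["repo_dir"] = str(metadata_path.parent)
        instances.append(metadata)
    return instances


def run_one(instance: dict) -> None:
    """Per-instance flow, mirroring the summary bullets."""
    # 1. Launch a DockerDevWorkspace from the instance's base image.
    # 2. Upload the task queue / SRS materials into the workspace.
    # 3. Run the standard OpenHands SDK Agent/Conversation loop.
    # 4. Hand the git patch + trajectory to the benchmark output writer.
    print(f"would run inference for {instance['repo_dir']}")


def main() -> None:
    parser = argparse.ArgumentParser()
    parser.add_argument("--data-root", type=Path, required=True)
    args = parser.parse_args()
    for instance in discover_instances(args.data_root):
        run_one(instance)


if __name__ == "__main__":
    main()
```

Under those assumptions, an invocation would look like `python -m benchmarks.evoclaw.run_infer --data-root /path/to/evoclaw-data`, modulo whatever model and output flags the real entrypoint accepts.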

Motivation

This keeps the OpenHands agent implementation on the OpenHands benchmarks side instead of injecting an EvoClaw-owned SDK runner into an existing container. EvoClaw can be exercised through the same workspace-plus-agent-server model already used by the benchmark suite.

Notes

  • This is an inference harness; it does not reimplement EvoClaw's DAG grader in this PR.
  • The local environment does not currently include an EvoClaw-data checkout with `metadata.json` instances, so validation here is static and entrypoint-level only.

Validation

  • `uv run --no-sync python -m py_compile benchmarks/evoclaw/run_infer.py benchmarks/evoclaw/config.py`
  • `UV_CACHE_DIR=/mnt/data/evocloud/.uv-cache uv run --no-project --with ruff ruff check benchmarks/evoclaw/run_infer.py benchmarks/evoclaw/config.py`
  • `git diff --check`
