Commit 90a6448

sjarmak and claude committed
feat: task selection, suite weights, and discovery for unified CSB benchmark
- configs/selected_csb_tasks.json: 275 tasks across 9 merged suites
- scripts/generate_csb_selection.py: generates selection from benchmarks/csb/
- scripts/promoted_verifier.py: added csb_* suite weights (kept legacy keys for compat)
- Updated script registry and agent navigation

Usage: --selection-file configs/selected_csb_tasks.json

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1 parent a6351e5 commit 90a6448

68 files changed (+63506, -2 lines)


configs/selected_csb_tasks.json

Lines changed: 18294 additions & 0 deletions
Large diffs are not rendered by default.

docs/OPENHANDS.md

Lines changed: 89 additions & 0 deletions
@@ -0,0 +1,89 @@
# OpenHands Harness

Run CodeScaleBench tasks using the OpenHands agent instead of Claude Code.

## Prerequisites

- Harbor CLI installed (`uv tool install harbor`)
- Docker running locally or `HARBOR_ENV=daytona` for cloud execution
- `.env.local` at project root with credentials (see Auth Setup below)

## Model Configuration

The `MODEL` env var accepts any LiteLLM-format string (`provider/model-name`):

| Model | `MODEL` value | Short name |
|-------|---------------|------------|
| Opus 4.6 (default) | `anthropic/claude-opus-4-6` | `opus46` |
| Sonnet 4.6 | `anthropic/claude-sonnet-4-6` | `sonnet46` |
| Sonnet 4.5 | `anthropic/claude-sonnet-4-5-20250929` | `sonnet45` |
| Haiku 4.5 | `anthropic/claude-haiku-4-5-20251001` | `haiku45` |
| GPT-4o | `openai/gpt-4o` | `gpt4o` |
| Codex | `openai/gpt-5.3-codex` | `gpt53codex` |

The short name determines the run directory name (e.g. `runs/staging/openhands_sonnet46_20260306_120000/`).
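
The relationship between the `MODEL` value, the short name, and the run directory can be sketched in Python. This is only an illustration under assumptions: `SHORT_NAMES`, `DEFAULT_MODEL`, and `run_dir_name` are hypothetical names rather than the launcher's own, the dictionary holds a subset of the table, and the `openhands_<short>_<YYYYMMDD_HHMMSS>` pattern is inferred from the example path.

```python
from datetime import datetime
from typing import Optional

# Subset of the model table; these names are illustrative, not the launcher's own.
SHORT_NAMES = {
    "anthropic/claude-opus-4-6": "opus46",
    "anthropic/claude-sonnet-4-6": "sonnet46",
    "anthropic/claude-haiku-4-5-20251001": "haiku45",
    "openai/gpt-4o": "gpt4o",
    "openai/gpt-5.3-codex": "gpt53codex",
}

DEFAULT_MODEL = "anthropic/claude-opus-4-6"


def run_dir_name(model: str = DEFAULT_MODEL, now: Optional[datetime] = None) -> str:
    """Build a staging run directory path from a MODEL value and a timestamp."""
    short = SHORT_NAMES[model]  # raises KeyError for models missing from the mapping
    stamp = (now or datetime.now()).strftime("%Y%m%d_%H%M%S")
    # Assumed pattern, inferred from the example path in the text.
    return f"runs/staging/openhands_{short}_{stamp}"
```

Passing the timestamp explicitly keeps the function deterministic, which makes the naming rule easy to check in isolation.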
25+
26+
## Auth Setup
27+
28+
### Anthropic Models (OAuth Subscription)
29+
30+
The project uses Claude Max subscription tokens, not API keys. The agent reads the OAuth access token from `~/.claude/.credentials.json` and injects it into `ANTHROPIC_API_KEY` so Harbor's resolver can find it.
31+
32+
Ensure tokens are fresh before launching:
33+
```bash
34+
source configs/_common.sh
35+
load_credentials
36+
ensure_fresh_token_all
37+
```
38+
39+
If `ANTHROPIC_API_KEY` is explicitly set in `.env.local`, it takes precedence over OAuth.
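
The precedence rule can be illustrated with a short Python sketch. It is not the agent's actual code: `resolve_anthropic_key` is a hypothetical helper, and the JSON layout (`claudeAiOauth.accessToken`) is an assumption about the credentials file that should be checked against a real `~/.claude/.credentials.json`.

```python
import json
from pathlib import Path
from typing import Mapping, Optional


def resolve_anthropic_key(
    env: Mapping[str, str],
    credentials_path: Path = Path.home() / ".claude" / ".credentials.json",
) -> Optional[str]:
    """Return an explicit ANTHROPIC_API_KEY if set, else the OAuth access token."""
    explicit = env.get("ANTHROPIC_API_KEY")
    if explicit:
        return explicit  # an explicitly configured key takes precedence over OAuth

    try:
        data = json.loads(credentials_path.read_text())
    except (OSError, json.JSONDecodeError):
        return None  # no readable credentials file

    # Assumed layout: {"claudeAiOauth": {"accessToken": "..."}}
    oauth = data.get("claudeAiOauth") if isinstance(data, dict) else None
    if not isinstance(oauth, dict):
        return None
    return oauth.get("accessToken") or None
```

The sketch takes the environment as a mapping rather than reading `os.environ` directly, so the same logic can be exercised against values loaded from `.env.local`.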
40+
41+
### OpenAI Models
42+
43+
Set `OPENAI_API_KEY` in `.env.local`. For Codex models, you can also use `CODEX_API_KEY`.
44+
45+
## Example Commands
46+
47+
```bash
48+
# Full 2-config run (baseline + MCP) with Sonnet 4.6
49+
MODEL=anthropic/claude-sonnet-4-6 ./configs/openhands_2config.sh
50+
51+
# Baseline-only with Opus 4.6 (default model)
52+
./configs/openhands_2config.sh --baseline-only
53+
54+
# Single task
55+
./configs/openhands_2config.sh --benchmark csb_sdlc_fix --task my-task-001
56+
57+
# Override parallelism
58+
./configs/openhands_2config.sh --parallel 4
59+
60+
# GPT-4o run
61+
MODEL=openai/gpt-4o ./configs/openhands_2config.sh --baseline-only
62+
```
63+
64+
## Run Directory Structure
65+
66+
```
67+
runs/staging/openhands_sonnet46_20260306_120000/
68+
baseline-local-direct/
69+
task-name__abcd1234/
70+
result.json
71+
task-name.log
72+
mcp-remote-direct/
73+
task-name__abcd1234/
74+
result.json
75+
task-name.log
76+
```
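
A run directory laid out this way can be walked with a few lines of Python. The sketch below assumes that each `result.json` sits two levels below the run directory (`<config>/<task>/result.json`); `collect_results` is a hypothetical helper, and nothing is assumed about the contents of `result.json`.

```python
from collections import defaultdict
from pathlib import Path
from typing import Dict, List


def collect_results(run_dir: Path) -> Dict[str, List[Path]]:
    """Group result.json paths by config directory (e.g. baseline-local-direct)."""
    results: Dict[str, List[Path]] = defaultdict(list)
    # Assumed layout: <run_dir>/<config>/<task>__<id>/result.json
    for path in sorted(run_dir.glob("*/*/result.json")):
        config = path.parent.parent.name
        results[config].append(path)
    return dict(results)
```

Grouping by config makes it straightforward to compare the baseline and MCP runs of the same task side by side.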
77+
78+
## Architecture
79+
80+
- OpenHands runs **inside the Docker container** (installed by Harbor's template), not on the host
81+
- `agents/harnesses/openhands/agent.py` extends Harbor's built-in `OpenHands` agent + `BaselineHarnessMixin`
82+
- `BaselineHarnessMixin` (`agents/harnesses/base.py`) handles instruction preparation, MCP configuration, and container env propagation
83+
- The 2-config launcher (`configs/openhands_2config.sh`) runs baseline (no MCP) then MCP-Full (Sourcegraph)
84+
85+
## Known Limitations
86+
87+
- Codex models require the `openai/` prefix for LiteLLM; the agent adds this automatically
88+
- OAuth tokens expire after ~8 hours; long runs should call `ensure_fresh_token_all` between batches
89+
- OpenHands agent does not support `CLAUDE_CODE_OAUTH_TOKEN` — it uses `LLM_API_KEY` for all providers
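
The automatic `openai/` prefixing mentioned in the first limitation might look roughly like the sketch below. This is a guess at the behaviour, not the agent's implementation: `normalize_model` is a hypothetical name, and the rule used here (prefix only bare names containing `codex`) is an assumption.

```python
def normalize_model(model: str) -> str:
    """Add the LiteLLM `openai/` provider prefix to bare Codex model names."""
    # Assumption: a name without a provider segment that mentions "codex"
    # is an OpenAI Codex model and needs the prefix.
    if "/" not in model and "codex" in model.lower():
        return f"openai/{model}"
    return model  # already prefixed, or not a Codex model
```

Values that already carry a provider, such as `anthropic/claude-opus-4-6`, pass through unchanged.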
