AGENTS.md
Lines changed: 18 additions & 16 deletions
@@ -10,14 +10,17 @@ full operations manual.
 - Before commit/push, run `python3 scripts/repo_health.py` (or `--quick` for docs/config-only changes).
 - Prefer a **remote execution environment** (e.g., Daytona) for large benchmark runs; use local Docker only when a task’s image or registry is incompatible with your cloud environment. See `docs/DAYTONA.md`.
 - Set **parallelism based on your own account and model limits**. Avoid exceeding documented concurrency or rate caps for your environment or provider.
+- Before launching any benchmark batch, check account readiness with `python3 scripts/check_infra.py` or `python3 scripts/account_health.py status`. Do not assume OAuth accounts are usable just because credentials exist.

-## Beads Prerequisite
+## Beads Prerequisite and Usage
 - Keep the Beads CLI (`bd`, alias `beads`) up to date before running agent workflows that rely on task graphs.
 - Timing fields (`started_at`, `finished_at`) live at the **top level** of `result.json`, not nested under `timing`.
 - `trajectory.json` is generated by Harbor's `_convert_events_to_trajectory()` post-processing, NOT by Claude Code CLI directly.
 - SWE-bench `test.sh` redirects stdout to a temp file -- Harbor never sees the parser's `START_TEST_OUTPUT`/`END_TEST_OUTPUT` markers via its normal capture.
-- Token usage data lives in `trajectory.json` per-step metrics with tool attribution. `TranscriptParser` only parses plain text transcripts and ignores trajectory.json.
-- Harbor task contract requires writing to `/logs/verifier/reward.txt`. MCP integration happens at the agent runner level, not the individual task level.
+- Token usage data lives in `trajectory.json`; plain transcript parsers do not see it.
 - Trial directory names are truncated with hash suffixes (e.g., `c_api_graphql_expert_079_archite__pm9xcPn`). The real task name lives in `config.json` at `task.path`.
 - LoCoBench task IDs contain multi-word fields (e.g., `game_engine`, `cross_file_refactoring`). Use the 3-digit task number as a positional anchor for parsing instead of rigid regexes that assume single-word fields.
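The positional-anchor advice for LoCoBench task IDs can be sketched as follows. This is a minimal illustration, not the repository's parser: the function name `split_task_id` is hypothetical, and it assumes only that the ID contains exactly one underscore-delimited 3-digit number separating the leading fields from the trailing ones.

```python
import re

# Assumption: the task number is the only underscore-delimited
# run of exactly three digits in the ID.
_TASK_NUMBER = re.compile(r"_(\d{3})_")


def split_task_id(task_id: str) -> tuple[str, str, str]:
    """Split an ID into (fields before number, number, fields after number).

    Multi-word fields such as `game_engine` stay intact because the
    split happens only at the 3-digit anchor, never at every underscore.
    """
    match = _TASK_NUMBER.search(task_id)
    if match is None:
        raise ValueError(f"no 3-digit task number found in {task_id!r}")
    prefix = task_id[: match.start()]
    suffix = task_id[match.end():]
    return prefix, match.group(1), suffix


# Hypothetical ID built from the field examples in the notes above.
print(split_task_id("python_game_engine_expert_042_cross_file_refactoring"))
# -> ('python_game_engine_expert', '042', 'cross_file_refactoring')
```

The same split works on a truncated trial directory name, but the trailing part then includes the hash suffix, so `config.json` remains the place to read the real task name.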

-### Gitignore
-- Unanchored `.gitignore` patterns like `dirname/` match at **any directory level**. Use `/dirname/` to anchor to root only. (e.g., `10figure/` inadvertently blocked `benchmarks/10figure/` from being committed.)
-
 ### Git / Auth
 - `gh auth refresh` without `-s <scope>` is a no-op for adding scopes. Must use `gh auth refresh -h github.com -s write:packages` explicitly.
 - Environment variables must be **explicitly exported** for Harbor subprocesses. Use `set -a` before sourcing `.env.local`.
+- Account readiness is tracked in `runs/state/account_health.json`. Launchers source `configs/_common.sh`, filter out unsafe accounts before launch, and record recent runtime rate-limit observations there for operator context.
 - GitHub push protection blocks synthetic/fake API keys in test data. Use `git reset --soft origin/main` to squash intermediate commits that contained fake credentials.
 - Shallow clones (`--depth 1`) fail on push to GitHub with `remote: fatal: did not receive expected object`. Always use full clones for repos that will be pushed.
 - Some repos use `master` as default branch. Detect with `git symbolic-ref refs/remotes/origin/HEAD` and remap to `main` if needed.
 - `with open(log) as f: subprocess.Popen(stdout=f)` closes the file handle immediately after `Popen()` returns. Use `open()` without context manager for long-running subprocesses.
 - macOS ships Bash 3.2 which lacks associative arrays (`declare -A`). Use pipe-delimited string arrays with `IFS='|' read -r` for compatibility.
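The default-branch detection mentioned in the Git / Auth notes could look roughly like this in Python. It is a sketch under assumptions: the helper names `parse_default_branch` and `detect_default_branch` are hypothetical, and the code expects `origin/HEAD` to be set in the clone (the `git` call fails otherwise).

```python
import subprocess

_ORIGIN_PREFIX = "refs/remotes/origin/"


def parse_default_branch(symbolic_ref_output: str) -> str:
    """Extract the branch name from `git symbolic-ref` output."""
    ref = symbolic_ref_output.strip()
    if not ref.startswith(_ORIGIN_PREFIX):
        raise ValueError(f"unexpected ref: {ref!r}")
    return ref[len(_ORIGIN_PREFIX):]


def detect_default_branch(repo_dir: str) -> str:
    """Ask git which branch origin/HEAD points at (e.g. `master` or `main`)."""
    result = subprocess.run(
        ["git", "symbolic-ref", "refs/remotes/origin/HEAD"],
        cwd=repo_dir,
        check=True,           # raise CalledProcessError if origin/HEAD is unset
        capture_output=True,
        text=True,
    )
    return parse_default_branch(result.stdout)
```

Keeping the parsing separate from the subprocess call makes the string handling testable without a real repository.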

-### Dashboard / Streamlit
-- Streamlit widget keys in loops must include an index or unique ID to avoid `DuplicateElementKey` errors (e.g., `key=f"nav_{idx}_{page}"` not `key=f"nav_{page}"`).
-- `st.session_state` cannot be modified after widget instantiation. Use `on_click` callback pattern that sets state before widget rerender.
-- Sidebar config below navigation menu is invisible without scrolling. Put critical UI controls in the main content area using `st.columns()`.
-- Always check actual dataclass field names before writing view code. Common mismatches: `agent_results` vs `agent_metrics`, `anomalies` vs `total_anomalies`, dict access vs object attributes.
-- Process handles stored in `st.session_state` are lost on browser refresh. For long-running background processes, use file-based persistent tracking (e.g., `.dashboard_runs/` JSON files) instead.
-- Prefer `st.dataframe` over `st.columns()` with buttons for tabular data -- column layouts squash buttons at narrow viewports.
-- Metric precision matters: use 4+ decimal places for reward/duration comparisons. Rounding to 2 decimals silently loses information needed for meaningful comparison.
-
 ### LLM Judge
 - Always include "Respond with valid JSON only (escape all quotes and special characters)" in judge prompts. Unescaped quotes in LLM-generated JSON break parsing.
 - Judge should use task-type-aware evaluation: different rubrics for code implementation, architectural understanding, and bug fix tasks.
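One way the two LLM Judge notes might fit together is sketched below. Everything except the quoted JSON-only instruction is an assumption for illustration: the rubric wording, the task-type keys, and the function names `build_judge_prompt` and `parse_judge_reply` are hypothetical and not taken from the repository.

```python
import json
from typing import Optional

JSON_ONLY_RULE = (
    "Respond with valid JSON only (escape all quotes and special characters)"
)

# Hypothetical rubric text, one entry per task type named in the notes.
RUBRICS = {
    "code_implementation": "Judge correctness and completeness of the code.",
    "architectural_understanding": "Judge accuracy of the architectural analysis.",
    "bug_fix": "Judge whether the root cause is fixed without regressions.",
}


def build_judge_prompt(task_type: str, submission: str) -> str:
    """Pick the rubric for the task type and always append the JSON-only rule."""
    if task_type not in RUBRICS:
        raise ValueError(f"unknown task type: {task_type!r}")
    return f"{RUBRICS[task_type]}\n\nSubmission:\n{submission}\n\n{JSON_ONLY_RULE}"


def parse_judge_reply(reply: str) -> Optional[dict]:
    """Return the parsed JSON object, or None if the reply is not a valid one."""
    try:
        parsed = json.loads(reply)
    except json.JSONDecodeError:
        return None  # e.g. unescaped quotes inside a string value
    return parsed if isinstance(parsed, dict) else None
```

Returning `None` instead of raising lets the caller decide whether to retry the judge call or record the trial as unscored.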
CLAUDE.md
Lines changed: 18 additions & 16 deletions
@@ -10,14 +10,17 @@ full operations manual.
 - Before commit/push, run `python3 scripts/repo_health.py` (or `--quick` for docs/config-only changes).
 - Prefer a **remote execution environment** (e.g., Daytona) for large benchmark runs; use local Docker only when a task’s image or registry is incompatible with your cloud environment. See `docs/DAYTONA.md`.
 - Set **parallelism based on your own account and model limits**. Avoid exceeding documented concurrency or rate caps for your environment or provider.
+- Before launching any benchmark batch, check account readiness with `python3 scripts/check_infra.py` or `python3 scripts/account_health.py status`. Do not assume OAuth accounts are usable just because credentials exist.

-## Beads Prerequisite
+## Beads Prerequisite and Usage
 - Keep the Beads CLI (`bd`, alias `beads`) up to date before running agent workflows that rely on task graphs.
 - Timing fields (`started_at`, `finished_at`) live at the **top level** of `result.json`, not nested under `timing`.
 - `trajectory.json` is generated by Harbor's `_convert_events_to_trajectory()` post-processing, NOT by Claude Code CLI directly.
 - SWE-bench `test.sh` redirects stdout to a temp file -- Harbor never sees the parser's `START_TEST_OUTPUT`/`END_TEST_OUTPUT` markers via its normal capture.
-- Token usage data lives in `trajectory.json` per-step metrics with tool attribution. `TranscriptParser` only parses plain text transcripts and ignores trajectory.json.
-- Harbor task contract requires writing to `/logs/verifier/reward.txt`. MCP integration happens at the agent runner level, not the individual task level.
+- Token usage data lives in `trajectory.json`; plain transcript parsers do not see it.
 - Trial directory names are truncated with hash suffixes (e.g., `c_api_graphql_expert_079_archite__pm9xcPn`). The real task name lives in `config.json` at `task.path`.
 - LoCoBench task IDs contain multi-word fields (e.g., `game_engine`, `cross_file_refactoring`). Use the 3-digit task number as a positional anchor for parsing instead of rigid regexes that assume single-word fields.

-### Gitignore
-- Unanchored `.gitignore` patterns like `dirname/` match at **any directory level**. Use `/dirname/` to anchor to root only. (e.g., `10figure/` inadvertently blocked `benchmarks/10figure/` from being committed.)
-
 ### Git / Auth
 - `gh auth refresh` without `-s <scope>` is a no-op for adding scopes. Must use `gh auth refresh -h github.com -s write:packages` explicitly.
 - Environment variables must be **explicitly exported** for Harbor subprocesses. Use `set -a` before sourcing `.env.local`.
+- Account readiness is tracked in `runs/state/account_health.json`. Launchers source `configs/_common.sh`, filter out unsafe accounts before launch, and record recent runtime rate-limit observations there for operator context.
 - GitHub push protection blocks synthetic/fake API keys in test data. Use `git reset --soft origin/main` to squash intermediate commits that contained fake credentials.
 - Shallow clones (`--depth 1`) fail on push to GitHub with `remote: fatal: did not receive expected object`. Always use full clones for repos that will be pushed.
 - Some repos use `master` as default branch. Detect with `git symbolic-ref refs/remotes/origin/HEAD` and remap to `main` if needed.
 - `with open(log) as f: subprocess.Popen(stdout=f)` closes the file handle immediately after `Popen()` returns. Use `open()` without context manager for long-running subprocesses.
 - macOS ships Bash 3.2 which lacks associative arrays (`declare -A`). Use pipe-delimited string arrays with `IFS='|' read -r` for compatibility.

-### Dashboard / Streamlit
-- Streamlit widget keys in loops must include an index or unique ID to avoid `DuplicateElementKey` errors (e.g., `key=f"nav_{idx}_{page}"` not `key=f"nav_{page}"`).
-- `st.session_state` cannot be modified after widget instantiation. Use `on_click` callback pattern that sets state before widget rerender.
-- Sidebar config below navigation menu is invisible without scrolling. Put critical UI controls in the main content area using `st.columns()`.
-- Always check actual dataclass field names before writing view code. Common mismatches: `agent_results` vs `agent_metrics`, `anomalies` vs `total_anomalies`, dict access vs object attributes.
-- Process handles stored in `st.session_state` are lost on browser refresh. For long-running background processes, use file-based persistent tracking (e.g., `.dashboard_runs/` JSON files) instead.
-- Prefer `st.dataframe` over `st.columns()` with buttons for tabular data -- column layouts squash buttons at narrow viewports.
-- Metric precision matters: use 4+ decimal places for reward/duration comparisons. Rounding to 2 decimals silently loses information needed for meaningful comparison.
-
 ### LLM Judge
 - Always include "Respond with valid JSON only (escape all quotes and special characters)" in judge prompts. Unescaped quotes in LLM-generated JSON break parsing.
 - Judge should use task-type-aware evaluation: different rubrics for code implementation, architectural understanding, and bug fix tasks.