This file is the root entrypoint for AI agents working in this repository. Keep it small. Use it to route to the right workflow and local guide, not as the full operations manual.
- All work happens on `main` by default. If you use feature branches, keep them small, short-lived, and easy to fast-forward back into `main`.
- Every `harbor run` must be gated by interactive confirmation.
- Before commit/push, run `python3 scripts/repo_health.py` (or `--quick` for docs/config-only changes).
- Prefer a remote execution environment (e.g., Daytona) for large benchmark runs; use local Docker only when a task’s image or registry is incompatible with your cloud environment. See `docs/DAYTONA.md`.
- Set parallelism based on your own account and model limits. Avoid exceeding documented concurrency or rate caps for your environment or provider.
- Before launching any benchmark batch, check account readiness with `python3 scripts/check_infra.py` or `python3 scripts/account_health.py status`. Do not assume OAuth accounts are usable just because credentials exist.
- Keep the Beads CLI (`bd`, alias `beads`) up to date before running agent workflows that rely on task graphs.
- Install or update with the official installer: `curl -fsSL https://raw.githubusercontent.com/steveyegge/beads/main/scripts/install.sh | bash`
- Verify install/version with `bd --version` (or `beads --version`).
- Do not use `bd edit`; use non-interactive `bd create/update/close --json` or stdin-based `--description=-`.
- Typical flow: `bd ready --json`, `bd create ... --json`, `bd update <id> --claim`, `bd close <id> --reason "Done"`.
- Default load order: this file + one relevant skill + one relevant doc.
- Do not open broad catalogs (`docs/TASK_CATALOG.md`, large script lists, full reports) unless required.
- Prefer directory-local `AGENTS.md` / `CLAUDE.md` when working under `scripts/`, `configs/`, `tasks/`, or `docs/`.
- Launch or rerun benchmarks: `docs/DAYTONA.md` (Daytona, preferred) or `docs/START_HERE_BY_TASK.md`
- Monitor / status: `docs/START_HERE_BY_TASK.md` -> "Monitor Active Runs"
- Triage failures: `docs/START_HERE_BY_TASK.md` -> "Triage Failed Tasks"
- Compare configs / MCP impact / IR: `docs/START_HERE_BY_TASK.md` -> "Analyze Results"
- Repo policy / health gate: `docs/REPO_HEALTH.md`, `docs/ops/WORKFLOWS.md`
- Script discovery: `docs/ops/SCRIPT_INDEX.md`
- `scripts/AGENTS.md` - script categories, safe usage, one-off handling
- `configs/AGENTS.md` - run launcher wrappers and confirmation gate policy
- `docs/AGENTS.md` - documentation IA and canonical vs archive guidance
- Compact after exploration, before multi-file edits.
- Compact after launching a benchmark batch.
- Compact after completing a triage batch or report generation pass.
- When handing work to a new session, use the generic `/handoff` skill to generate an inline copy/paste handoff prompt.
- Do not create a markdown handoff file unless the user explicitly asks for one.
- Use `docs/ops/HANDOFF_TEMPLATE.md` as a checklist for what the handoff should include.
- Track remaining follow-up in issues or beads.
- Run `python3 scripts/repo_health.py` (or `--quick` for docs/config-only changes).
- Update issue/task status.
- Run `git pull --rebase && git push && git status` and confirm `main` is up to date with `origin/main`.
- Clean up and hand off using `/handoff` plus `docs/ops/HANDOFF_TEMPLATE.md`.
- Work is not complete until push succeeds.
- `docs/START_HERE_BY_TASK.md` - task-based read order
- `docs/ops/WORKFLOWS.md` - operational workflow summaries
- `docs/ops/TROUBLESHOOTING.md` - escalation and common failure routing
- `docs/ops/SCRIPT_INDEX.md` - generated script registry index
- `docs/reference/README.md` - stable specs and reference docs
- `docs/explanations/README.md` - rationale and context docs
- NEVER edit root `CLAUDE.md` or `AGENTS.md` directly. Edit canonical sources under `docs/ops/` and regenerate. Direct edits cause `agent_guides_drift` failures in `repo_health.py`.
- After removing directories from the repo, also clean references from `scripts/sync_agent_guides.py` (`LOCAL_SOURCES`) and `scripts/docs_consistency_check.py` (`LOCAL_AGENT_TARGET_DIRS`).
- Daytona builds images from Dockerfiles at sandbox creation time (`Image.from_dockerfile()`). Dockerfile fixes pushed to `main` take effect on the next run -- no manual image rebuild needed. Exception: pre-built GHCR base images must be rebuilt separately.
- Harbor+Daytona (`harbor run --environment-type daytona`) is the recommended production approach. The standalone `scripts/daytona_runner.py` is for quick validation only.
- Use `BASELINE_MCP_TYPE` env var to control MCP configuration: `none`, `sourcegraph`, `deepsearch`.
- Daytona SDK (`daytona_sdk`) over CLI for sandbox interaction -- the CLI is interactive-only for SSH.
- GHCR packages default to private for personal accounts and visibility cannot be changed via API. Use the GitHub web UI or push to an org.
- `uv tool install` segfaults on ARM64/QEMU emulation. Use `pip install` instead, or switch to Daytona (native x86_64).
- Build-push-clean pattern when building Docker images with limited disk (~45GB): build one image, push, then clean locally before the next.
- Colons in agent names (e.g., `module:ClassName`) break Docker volume mounts. Sanitize paths: replace `:` with `__`.
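
The colon-sanitizing rule above can be sketched in a few lines. This is only an illustration: the helper name `sanitize_agent_name` is hypothetical and not taken from the repository.

```python
def sanitize_agent_name(agent_name: str) -> str:
    """Return a mount-safe form of an agent name by replacing ':' with '__'.

    Hypothetical helper for illustration; the repository may implement
    this differently.
    """
    return agent_name.replace(":", "__")


print(sanitize_agent_name("module:ClassName"))  # module__ClassName
```

Apply the sanitized name anywhere the agent name becomes part of a host path that is later mounted as a Docker volume.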
- `.mcp.json` must be placed at `$CLAUDE_CONFIG_DIR` (typically `/logs/agent/sessions/`), not `/app/` or `/root/`.
- Claude Code requires the `--mcp-config` CLI flag to load MCP config -- it does not auto-detect.
- Inject MCP usage instructions into the task prompt. Agents won't use MCP tools just because they're available.
- Set `NODE_TLS_REJECT_UNAUTHORIZED=0` for Node.js SSL in Docker containers (curl working does not mean Node.js fetch will work).
- Sourcegraph MCP uses stdio transport (`npx @sourcegraph/cody --stdio`), NOT HTTP. HTTP 405 = correct endpoint, wrong protocol.
- Sourcegraph skills show empty in headless mode. Embed skill prompt content in CLAUDE.md directly.
- Sourcegraph env vars: `SOURCEGRAPH_URL` and `SOURCEGRAPH_ACCESS_TOKEN` (NOT `_ENDPOINT` or `_TOKEN`).
- Timing fields (`started_at`, `finished_at`) live at the top level of `result.json`, not nested under `timing`.
- `trajectory.json` is generated by Harbor's `_convert_events_to_trajectory()` post-processing, NOT by Claude Code CLI directly.
- SWE-bench `test.sh` redirects stdout to a temp file -- Harbor never sees the parser's `START_TEST_OUTPUT`/`END_TEST_OUTPUT` markers via its normal capture.
- Token usage data lives in `trajectory.json`; plain transcript parsers do not see it.
- Harbor task contract requires writing `/logs/verifier/reward.txt`.
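
As a rough sketch of the timing note above, the snippet below reads `started_at` and `finished_at` from the top level of a `result.json` payload. It assumes the timestamps are ISO 8601 strings, which the notes do not confirm; the function name and the sample values are made up for illustration.

```python
import json
from datetime import datetime


def run_duration_seconds(result_text: str) -> float:
    """Compute run duration from top-level timing fields.

    Assumption: timestamps are ISO 8601 strings without a trailing 'Z'.
    """
    result = json.loads(result_text)
    # Read from the top level, not from result["timing"].
    started = datetime.fromisoformat(result["started_at"])
    finished = datetime.fromisoformat(result["finished_at"])
    return (finished - started).total_seconds()


# Made-up sample payload, not real benchmark output.
sample = json.dumps(
    {"started_at": "2025-01-01T12:00:00", "finished_at": "2025-01-01T12:01:30"}
)
print(run_duration_seconds(sample))  # 90.0
```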
- `validators.py` is duplicated across `ccb_build` tasks. Changes must be applied to all copies (verify with `sha256sum`).
- Install scripts that print "INSTALL_SUCCESS" regardless of actual outcome are common. Always verify the binary exists and is executable.
- Agent completing in <2 seconds = agent never installed/ran (smoke test heuristic).
- Trial directory names are truncated with hash suffixes (e.g., `c_api_graphql_expert_079_archite__pm9xcPn`). The real task name lives in `config.json` at `task.path`.
- LoCoBench task IDs contain multi-word fields (e.g., `game_engine`, `cross_file_refactoring`). Use the 3-digit task number as a positional anchor for parsing instead of rigid regexes that assume single-word fields.
- no_changes_guard: Python sets `reward = 0.0` but bash `echo "$score"` uses the original variable. Write `reward.txt` inside the Python block, not after it.
- Wrap all test runners with `timeout 600`. Add `--forceExit` to Jest. Indefinite hangs (>2h) observed without timeout.
- Jest + TypeScript needs 4-6GB RAM. Set `memory_mb = 8192` in `task.toml` for front-end test suites (default 2GB causes OOM).
- CSB dual-score: agents produce file edits + `answer.json`; scored independently. Fallback: `promoted_verifier.py` → `oracle_checks.py` → heuristic.
- Rate-limited results (score=0, duration <30s): quarantine with `scripts/quarantine_invalid_tasks.py --execute`.
- Bare `$VAR` in `instruction.md` gets expanded. Use `<placeholder>` syntax.
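
One way to use the 3-digit task number as a positional anchor, as recommended in the LoCoBench note above, is sketched here. The function name and the sample ID are hypothetical, and the sketch assumes the first `_NNN_` group in the ID is the task number.

```python
import re

# Anchor on the first "_<three digits>_" group instead of assuming that
# every field is a single word. Assumption: that group is the task number.
_TASK_NUMBER = re.compile(r"^(?P<prefix>.+?)_(?P<number>\d{3})_(?P<suffix>.+)$")


def split_task_id(task_id: str) -> tuple[str, str, str]:
    """Split a task ID into (prefix, task number, suffix)."""
    match = _TASK_NUMBER.match(task_id)
    if match is None:
        raise ValueError(f"no 3-digit task number found in {task_id!r}")
    return match["prefix"], match["number"], match["suffix"]


# Hypothetical ID with multi-word fields on both sides of the number.
print(split_task_id("python_game_engine_expert_042_cross_file_refactoring_hard"))
# ('python_game_engine_expert', '042', 'cross_file_refactoring_hard')
```

Multi-word fields such as `game_engine` stay intact inside the prefix or suffix, and any further splitting can be done relative to the number rather than by counting underscores.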
- `gh auth refresh` without `-s <scope>` is a no-op for adding scopes. Must use `gh auth refresh -h github.com -s write:packages` explicitly.
- Environment variables must be explicitly exported for Harbor subprocesses. Use `set -a` before sourcing `.env.local`.
- Account readiness tracked in `runs/state/account_health.json`. Launchers source `configs/_common.sh` and filter unsafe accounts.
- GitHub push protection blocks synthetic API keys. Squash with `git reset --soft origin/main`.
- Shallow clones (`--depth 1`) fail on push. Always use full clones for repos that will be pushed.
- Some repos use `master` as default branch. Detect with `git symbolic-ref refs/remotes/origin/HEAD`.
- GitHub secret scanning blocks embedded secrets. Unblock via the `/security/secret-scanning/unblock-secret/` URL.
- `dict.get(key, default)` does NOT protect against `None` values. Use `data.get("key") or default_value`.
- `with open(log) as f: subprocess.Popen(stdout=f)` closes the handle. Use `open()` without context manager for long-running subprocesses.
- macOS Bash 3.2 lacks `declare -A`. Use pipe-delimited strings with `IFS='|' read -r`.
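
The `dict.get` pitfall above is easy to reproduce. In this small illustration the key name and default value are made up.

```python
# The key exists but holds None, as often happens with parsed JSON nulls.
data = {"timeout": None}

# The default is only used when the key is missing, so None leaks through.
print(data.get("timeout", 600))  # None

# Falling back with `or` covers both a missing key and a None value.
print(data.get("timeout") or 600)  # 600
```

Note that `or` also replaces other falsy values such as `0` or `""`, so compare against `None` explicitly where those are legitimate values.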
- Always include "Respond with valid JSON only (escape all quotes and special characters)" in judge prompts. Unescaped quotes in LLM-generated JSON break parsing.
- Judge should use task-type-aware evaluation: different rubrics for code implementation, architectural understanding, and bug fix tasks.
- Tool categorization order matters: check MCP prefix (`mcp__`) before substring checks (e.g., `deep_search`) to avoid miscategorization of `mcp__deep_search`.
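
The ordering rule above could look like the following sketch. The function name and the category labels are hypothetical; only the `mcp__` prefix and the `deep_search` substring come from the notes.

```python
def categorize_tool(tool_name: str) -> str:
    """Categorize a tool name, checking the MCP prefix first.

    Category labels are illustrative, not the repository's own.
    """
    # The prefix check must come first, or "mcp__deep_search" would be
    # caught by the substring check below.
    if tool_name.startswith("mcp__"):
        return "mcp"
    if "deep_search" in tool_name:
        return "deep_search"
    return "other"


print(categorize_tool("mcp__deep_search"))  # mcp
print(categorize_tool("deep_search"))       # deep_search
```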
- `sandbox_plugins` is a list (not property). Strip ALL plugins (`= []`) -- `agent_skills` indexes `/workspace` at startup (120s timeout on large repos). TOML config has no effect in v1.4.0.
- `shlex.quote()` breaks on shell metacharacters (0% execution). Base64-encode instructions on host, decode inside container.
- Background daemons outlive the main process and hang Daytona poll. Wrap with `pkill` cleanup; guard with `shutil.which('pkill')` (missing on minimal images).
- Alpine lacks `apt-get` (OH installer requirement). Use `bookworm` variants.
- OH MCP client has ~30s timeout. Block `deepsearch`/`deepsearch_read` in auth proxy; redirect to `keyword_search`/`nls_search`.
- `chown -R /workspace` blocks port binding >120s on large repos. Edit installed `runtime_init.py` source -- monkey-patches don't propagate to action_execution_server subprocess.
- Set `PYTHONSAFEPATH=1` to prevent repo-local packages from shadowing installed deps.
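
A minimal sketch of the base64 approach mentioned above: encode the instruction on the host and build a command that decodes it inside the container. The function name and the destination path are hypothetical, and the sketch assumes the container image ships a `base64` utility that accepts `-d`.

```python
import base64


def build_decode_command(instruction: str, dest_path: str = "/tmp/instruction.md") -> str:
    """Return a shell command that recreates `instruction` at `dest_path`.

    Assumptions: `dest_path` is a hypothetical location, and the container
    provides `base64 -d`.
    """
    encoded = base64.b64encode(instruction.encode("utf-8")).decode("ascii")
    # The base64 alphabet (A-Z, a-z, 0-9, +, /, =) needs no shell quoting.
    return f"echo {encoded} | base64 -d > {dest_path}"


command = build_decode_command('Fix the bug in `parse()`; keep "$HOME" as-is.')
print(command)
```

Because the payload never appears unencoded on the command line, backticks, quotes, and `$` in the instruction cannot be reinterpreted by the shell.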
- Secret-detection hooks false-positive on code that detects secrets. Use `--no-verify` when flagged code is detection logic.
- Classes named `TestPlan`/`TestCase`/`TestResult` get auto-collected by pytest. Rename to `EvaluationPlan` etc.
- Ralph sessions write learnings to `progress.txt` on feature branches, not main. Compound back after merge.
- Root and local `AGENTS.md` / `CLAUDE.md` files are generated from sources in `docs/ops/`.
- `docs/START_HERE_BY_TASK.md` is generated from `docs/ops/task_routes.json`.
- Regenerate after edits (single command): `python3 scripts/refresh_agent_navigation.py`