Merged
…to eval harness
- store run_config in checkpoints; warn on resume if config mismatches
- classify errors as infra/timeout/genuine; skip genuine failures on resume
- add --retry-all flag to override and re-run genuine failures too
- write per-trial JSON to results/trials/ for ls-level observability
- 13 new tests covering all three features
Entire-Checkpoint: 7fb98ee370d1
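A minimal sketch of the resume rule described in this commit, for reviewers who want to see the decision in one place. The prefix list, the `classify_error` and `should_rerun` names, and the result-dict shape are assumptions for illustration, not the harness's actual code.

```python
from enum import Enum


class ErrorKind(Enum):
    INFRA = "infra"
    TIMEOUT = "timeout"
    GENUINE = "genuine"


# Hypothetical prefixes; the real harness keeps its own list.
INFRA_PREFIXES = ("[infra]", "docker:", "ssh:")


def classify_error(message: str) -> ErrorKind:
    """Bucket an error message as infra, timeout, or genuine."""
    text = message.lower()
    if text.startswith(INFRA_PREFIXES):
        return ErrorKind.INFRA
    if "timed out" in text or "timeout" in text:
        return ErrorKind.TIMEOUT
    return ErrorKind.GENUINE


def should_rerun(result: dict, retry_all: bool = False) -> bool:
    """Decide on resume whether a checkpointed result is run again."""
    error = result.get("error")
    if not error:
        return False  # finished cleanly: keep the stored result
    if retry_all:
        return True  # --retry-all re-runs genuine failures too
    return classify_error(error) is not ErrorKind.GENUINE
```

Under these assumptions, infra and timeout failures are retried on resume, while a genuine failure is kept unless `retry_all` is set.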
Entire-Checkpoint: 5384ed119f2a
Entire-Checkpoint: 81278d45e27d
Pre-validation failures trip at task level (all conditions share Docker setup), other failures trip at task+condition level. Threshold of 2 accounts for in-flight parallel workers.
Entire-Checkpoint: b2252e9cb3d3
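The two trip levels can be sketched as a small counter keyed by scope. The class and method names here are hypothetical; only the threshold of 2 and the task versus task+condition split come from the commit message.

```python
import threading
from collections import Counter


class CircuitBreaker:
    """Counts failures per scope and opens once a scope hits the threshold."""

    def __init__(self, threshold: int = 2):
        self.threshold = threshold
        self._failures = Counter()
        self._lock = threading.Lock()  # workers record failures concurrently

    def record_failure(self, task_id: str, condition: str,
                       pre_validation: bool = False) -> None:
        # Pre-validation failures count against the whole task,
        # everything else against one (task, condition) pair.
        key = (task_id, None) if pre_validation else (task_id, condition)
        with self._lock:
            self._failures[key] += 1

    def is_open(self, task_id: str, condition: str) -> bool:
        with self._lock:
            return (self._failures[(task_id, None)] >= self.threshold
                    or self._failures[(task_id, condition)] >= self.threshold)

    def reset(self) -> None:
        with self._lock:
            self._failures.clear()
```

A worker would call `is_open(task_id, condition)` before starting a trial and skip it when the breaker is open.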
After each batch, classifies failures as infra vs genuine. If infra failures detected: checks Docker health, restarts if needed, reduces parallelism, resets circuit breaker, and retries. Max 2 retry rounds.
Entire-Checkpoint: 806ec810be18
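The recovery loop above might look roughly like the following. All callables are injected so the sketch stays independent of Docker; the function name, the callback signatures, and halving the worker count are assumptions, since the commit only says parallelism is reduced.

```python
from typing import Callable, Dict, List, Tuple

MAX_RETRY_ROUNDS = 2  # from the commit message


def run_with_infra_retries(
    tasks: List[str],
    run_batch: Callable[[List[str], int], Dict[str, object]],
    is_infra_failure: Callable[[object], bool],
    docker_healthy: Callable[[], bool],
    restart_docker: Callable[[], None],
    reset_breaker: Callable[[], None],
    workers: int,
) -> Tuple[Dict[str, object], int]:
    """Run tasks, then retry only the infra failures for up to two extra rounds."""
    results: Dict[str, object] = {}
    pending = list(tasks)
    for round_no in range(MAX_RETRY_ROUNDS + 1):
        batch = run_batch(pending, workers)
        results.update(batch)
        # Genuine failures stay in results; only infra failures are retried.
        pending = [t for t in pending if is_infra_failure(batch[t])]
        if not pending or round_no == MAX_RETRY_ROUNDS:
            break
        if not docker_healthy():
            restart_docker()
        workers = max(1, workers // 2)  # assumed reduction policy
        reset_breaker()
    return results, workers
```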
Status file (.eval-status.json) updated after each result with machine-readable state: workers, pass/fail rates, paused flag. Control dir (.eval-control/) accepts commands: pause, resume, set-workers N, skip-task <id>. Commands consumed on read. Enables Ralph Loop or any external agent to manage running evals.
Entire-Checkpoint: 5f55311d2e2a
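A sketch of the "consumed on read" idea and an atomic status write. The on-disk encoding is an assumption: here each command is a file named after the command whose content is its argument; the actual harness may encode commands differently.

```python
import json
import os
from pathlib import Path
from typing import List, Optional, Tuple


def write_status(path: Path, state: dict) -> None:
    """Write the status file atomically so readers never see a partial file."""
    tmp = path.with_name(path.name + ".tmp")
    tmp.write_text(json.dumps(state))
    os.replace(tmp, path)


def consume_commands(control_dir: Path) -> List[Tuple[str, Optional[str]]]:
    """Read pending commands and delete each file as it is read."""
    commands: List[Tuple[str, Optional[str]]] = []
    if not control_dir.is_dir():
        return commands
    for entry in sorted(control_dir.iterdir()):
        if not entry.is_file():
            continue
        arg = entry.read_text().strip() or None
        entry.unlink()  # consumed on read: a command is applied at most once
        commands.append((entry.name, arg))
    return commands
```

With this layout an external agent would pause a run by creating `.eval-control/pause`, or change parallelism by writing `3` into `.eval-control/set-workers`.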
- Wire fisher_exact_test into reporter for per-task significance testing
- Add _compute_recommendations() flagging ceiling/floor/infra-only tasks
- Add Per-Task Analysis table and Recommendations section to markdown output
- Add lib/monitor.py: polling-based eval supervisor with stall detection, Docker recovery, infra-task skipping, and worker scaling
- 79 new tests across stats, reporter, and monitor modules
Entire-Checkpoint: 722abb53d525
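For context on the per-task significance test, here is a standard-library sketch of a two-sided Fisher exact test on a 2x2 pass/fail table. It is not the repository's fisher_exact_test, whose signature is not shown in this PR; the function name and argument order are assumptions.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided p-value for the table [[a, b], [c, d]].

    Rows are two conditions, columns are pass/fail counts.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x: int) -> float:
        # Hypergeometric probability of x passes in the first row.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lo, hi = max(0, col1 - row2), min(row1, col1)
    p_obs = prob(a)
    # Sum every table that is no more likely than the observed one.
    total = sum(p for p in map(prob, range(lo, hi + 1))
                if p <= p_obs * (1 + 1e-7))
    return min(1.0, total)
```

With only a handful of trials per condition, even a clean 3/3 versus 0/3 split gives p = 0.1, which is one reason per-task results at this sample size should be read cautiously.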
- AGENTbench loader (HuggingFace dataset) and runner (Docker-based eval)
- run-agentbench CLI command with 4 conditions (none/flat/human/intent_layer)
- Dynamic condition discovery in reporter (no more hardcoded condition list)
- Path traversal protection in write_test_infrastructure
- Docker --network none for test isolation
- Thread-safe temp dirs (PID + thread ID)
- Checkpoint batching (every 10 results vs every 1)
- Monitor uses Reporter.INFRA_ERROR_PREFIXES (no drift)
- Set-based infra result filtering (replaces fragile list.remove)
- Empty dict all() guard in pre-validation
Entire-Checkpoint: 815f2403cb52
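The path traversal protection listed above is commonly done by resolving the target and checking that it stays under the workspace root. This is a generic sketch of that check, not the code in write_test_infrastructure; `safe_join` is a hypothetical helper name.

```python
from pathlib import Path


def safe_join(root: Path, relative: str) -> Path:
    """Resolve `relative` under `root`, rejecting paths that escape it."""
    root = root.resolve()
    target = (root / relative).resolve()
    # Path.is_relative_to requires Python 3.9 or newer.
    if not target.is_relative_to(root):
        raise ValueError(f"path escapes workspace: {relative!r}")
    return target
```

Both `../` segments and absolute paths from the dataset are rejected, because an absolute `relative` replaces `root` when joined and then fails the containment check.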
Entire-Checkpoint: 94115d70169e
--network none breaks setup commands that need pip install. Default to bridge (Docker's default) and expose the parameter so callers can opt into none for pure-test phases later.
Entire-Checkpoint: 8118f4f3017a
- agentbench_loader: use split="train" (dataset has no "test" split)
- docker_runner: use bash instead of sh (AGENTbench setup uses `source`)
Entire-Checkpoint: 0afc9c1d9169
AGENTbench images install tools like uv to /root/.local/bin which only gets added to PATH via /etc/profile in login shells. bash -lc instead of bash -c fixes exit 127 for repos using uv.
Entire-Checkpoint: 5d8c2af91c5d
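The difference comes down to one flag in the argv passed to Docker. A small illustration, assuming a hypothetical `docker_run_argv` helper rather than the actual docker_runner code:

```python
from typing import List


def docker_run_argv(image: str, command: str,
                    login_shell: bool = True) -> List[str]:
    """Build a `docker run` argv that executes `command` through bash."""
    # -l makes bash a login shell, so /etc/profile is sourced and PATH
    # additions made there (such as /root/.local/bin) are visible.
    flag = "-lc" if login_shell else "-c"
    return ["docker", "run", "--rm", image, "bash", flag, command]
```

Exit code 127 is the shell's "command not found" status, which is what a missing PATH entry produces.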
…ote docker support
Three fixes from run 4 analysis:
- Regression eval: only fail when a golden-passing test now fails (was requiring 100% pass rate, which is impossible when repos have 14-83 pre-existing failures in the baseline)
- strip_docs: preserve README.md variants (setup.py reads them)
- docker_runner: add EVAL_DOCKER_HOST support for remote x86 execution via rsync, avoiding QEMU emulation on Apple Silicon
Also removes .index-cache-preserve/ (context files now generated dynamically per run).
Entire-Checkpoint: 9f4d4a6248d7
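The revised regression rule can be stated as a set comparison against the golden run. The sketch below assumes per-test results are available as name-to-passed mappings and treats a test missing from the current run as failing; both are assumptions, as are the function names.

```python
from typing import Dict, List


def regressions(golden: Dict[str, bool], current: Dict[str, bool]) -> List[str]:
    """Tests that passed in the golden run but do not pass now."""
    return sorted(
        name for name, passed in golden.items()
        if passed and not current.get(name, False)
    )


def regression_eval_passed(golden: Dict[str, bool],
                           current: Dict[str, bool]) -> bool:
    # Tests that already failed in the golden baseline are ignored,
    # so pre-existing failures no longer fail the trial.
    return not regressions(golden, current)
```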
UGREEN NAS (chronos) runs an rsync daemon that intercepts all rsync connections and rejects paths outside configured modules. tar piped through SSH bypasses this entirely and works reliably.
Entire-Checkpoint: e86fbbd52da2
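A tar-over-SSH push can be sketched as two processes joined by a pipe. The helper names and the remote `mkdir -p` step are assumptions; this shows the general technique rather than the code in docker_runner.

```python
import shlex
import subprocess
from typing import List, Sequence, Tuple


def tar_over_ssh_commands(local_dir: str, host: str, remote_dir: str,
                          excludes: Sequence[str] = ()) -> Tuple[List[str], List[str]]:
    """Build the local tar argv and the ssh argv that unpacks the stream."""
    tar_cmd = ["tar", "-C", local_dir]
    tar_cmd += [f"--exclude={pattern}" for pattern in excludes]
    tar_cmd += ["-cf", "-", "."]
    quoted = shlex.quote(remote_dir)
    ssh_cmd = ["ssh", host, f"mkdir -p {quoted} && tar -xf - -C {quoted}"]
    return tar_cmd, ssh_cmd


def push_dir(local_dir: str, host: str, remote_dir: str,
             excludes: Sequence[str] = (), timeout: float = 300.0) -> None:
    """Stream local_dir to host:remote_dir without involving rsync."""
    tar_cmd, ssh_cmd = tar_over_ssh_commands(local_dir, host, remote_dir, excludes)
    tar = subprocess.Popen(tar_cmd, stdout=subprocess.PIPE)
    ssh = subprocess.Popen(ssh_cmd, stdin=tar.stdout)
    tar.stdout.close()  # let tar see SIGPIPE if ssh exits early
    ssh_rc = ssh.wait(timeout=timeout)
    tar_rc = tar.wait()
    if tar_rc != 0 or ssh_rc != 0:
        raise RuntimeError(f"tar-over-ssh failed (tar={tar_rc}, ssh={ssh_rc})")
```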
The pre-pull step was running `docker pull` locally even when Docker execution happens on a remote host via SSH. Now uses `ssh $host docker pull` when EVAL_DOCKER_HOST is configured.
Entire-Checkpoint: cfe3bf0bccca
Prevents two issues from overnight eval runs:
- SSH agent key expiry caused all workers to hang indefinitely on stale connections. Added ConnectTimeout, ServerAliveInterval, and subprocess timeout=300s so failures surface within 5 minutes.
- sync_from_remote was transferring .venv dirs (4GB+) back from chronos. Added excludes for .venv, node_modules, __pycache__, etc.
Entire-Checkpoint: aa17a443b57d
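The hang protection combines SSH client options with a subprocess-level timeout. In this sketch the option values (10 and 15 seconds), the helper names, and the `[infra]` error prefix are assumptions; only the option names and the 300-second timeout come from the commit message.

```python
import subprocess
from typing import List

# Values are illustrative; the commit does not state the ones it uses.
SSH_OPTS = [
    "-o", "ConnectTimeout=10",       # give up on unreachable hosts quickly
    "-o", "ServerAliveInterval=15",  # probe the server so dead links are noticed
]


def ssh_argv(host: str, remote_command: str) -> List[str]:
    return ["ssh", *SSH_OPTS, host, remote_command]


def run_remote(host: str, remote_command: str,
               timeout: float = 300.0) -> subprocess.CompletedProcess:
    """Run a remote command, failing loudly instead of hanging forever."""
    try:
        return subprocess.run(ssh_argv(host, remote_command),
                              capture_output=True, text=True, timeout=timeout)
    except subprocess.TimeoutExpired as exc:
        raise RuntimeError(
            f"[infra] ssh to {host} timed out after {timeout:.0f}s") from exc
```

The subprocess timeout is the backstop: it bounds the call even when the SSH options alone do not end a stalled session.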
Replace ephemeral docker run with persistent containers (docker run -d + docker exec) so setup runs once per task instead of 3x. Add start_container, exec_in_container, stop_container, copy_into_container to docker_runner.py. Fix git "dubious ownership" error (the ownership check git added for CVE-2022-24765), triggered when the macOS tar overlay changes file UIDs inside containers. Add safe.directory config before overlay. Include stderr/stdout tail in setup error messages.
Entire-Checkpoint: c11a4a4ced3d
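The persistent-container lifecycle maps onto four Docker CLI calls. The argv builders below are a sketch with hypothetical names and signatures (the real start_container and friends are not shown in this PR), and `sleep infinity` as the keep-alive command is an assumption.

```python
import shlex
from typing import List


def start_container_argv(image: str, name: str) -> List[str]:
    # Detached container kept alive so later exec calls can reuse it.
    return ["docker", "run", "-d", "--name", name, image, "sleep", "infinity"]


def exec_argv(name: str, command: str) -> List[str]:
    return ["docker", "exec", name, "bash", "-lc", command]


def copy_into_argv(name: str, src: str, dest: str) -> List[str]:
    return ["docker", "cp", src, f"{name}:{dest}"]


def stop_container_argv(name: str) -> List[str]:
    return ["docker", "rm", "-f", name]


def mark_safe_directory_argv(name: str, workdir: str) -> List[str]:
    # Tells git to trust workdir even if file ownership differs from the user.
    return exec_argv(
        name,
        f"git config --global --add safe.directory {shlex.quote(workdir)}",
    )
```

Each argv would be passed to `subprocess.run`; setup then runs once after `start_container_argv`, and each condition reuses the same container through `exec_argv`.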
Nightshift-Task: logging-audit Nightshift-Ref: https://github.com/marcus/nightshift
Nightshift-Task: logging-audit Nightshift-Ref: https://github.com/marcus/nightshift
Summary
Testing