
Audit logging coverage and diagnostics#33

Merged
orban merged 19 commits into main from
nightshift/logging-audit-fix2
Apr 15, 2026

@orban orban commented Apr 4, 2026

Summary

  • keep hook telemetry TSV rows stable by escaping control characters on write and decoding them in the dashboard
  • emit explicit start/finish generation markers on eval-harness cache hits and cover them with regression tests
  • document the revised logging contract and verify the focused hook and task-runner suites
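
The escape/decode round trip from the first bullet could look roughly like the sketch below. It assumes backslash escapes for tab, newline, carriage return, and the backslash itself; the function names and the escape scheme are illustrative assumptions, not the hook's or the dashboard's actual code.

```python
# Hypothetical helpers: the real hook may escape a different set of characters.
_ESCAPES = {"\\": "\\\\", "\t": "\\t", "\n": "\\n", "\r": "\\r"}
_UNESCAPES = {"\\": "\\", "t": "\t", "n": "\n", "r": "\r"}


def escape_field(value: str) -> str:
    """Escape characters that would break a one-row-per-line TSV file."""
    return "".join(_ESCAPES.get(ch, ch) for ch in value)


def unescape_field(value: str) -> str:
    """Reverse escape_field; unknown escape pairs are kept verbatim."""
    out = []
    i = 0
    while i < len(value):
        ch = value[i]
        if ch == "\\" and i + 1 < len(value):
            nxt = value[i + 1]
            out.append(_UNESCAPES.get(nxt, ch + nxt))
            i += 2
        else:
            out.append(ch)
            i += 1
    return "".join(out)


def encode_row(fields):
    """Writer side: one physical line per row, tabs only between fields."""
    return "\t".join(escape_field(f) for f in fields)


def decode_row(line):
    """Reader side (dashboard): split on real tabs, then decode each field."""
    return [unescape_field(f) for f in line.rstrip("\n").split("\t")]
```

Escaping the backslash first is what makes the round trip lossless: a literal `\t` in a file path stays distinguishable from an escaped tab.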

Testing

  • bash tests/test_feedback_loop.sh
  • bash tests/test_telemetry.sh
  • bash tests/test_hooks.sh
  • bash tests/test_stop_hook.sh
  • python -m pytest eval-harness/tests/test_task_runner.py

orban added 19 commits February 23, 2026 20:30
…to eval harness

- store run_config in checkpoints; warn on resume if config mismatches
- classify errors as infra/timeout/genuine; skip genuine failures on resume
- add --retry-all flag to override and re-run genuine failures too
- write per-trial JSON to results/trials/ for ls-level observability
- 13 new tests covering all three features

Entire-Checkpoint: 7fb98ee370d1
Entire-Checkpoint: 5384ed119f2a
Entire-Checkpoint: 81278d45e27d
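
A minimal sketch of the classify-then-skip behaviour described in this commit. The prefixes, the checkpoint shape (trial id mapped to an error message or `None`), and the function names are assumptions made for illustration; the harness has its own definitions.

```python
from enum import Enum


class ErrorKind(Enum):
    INFRA = "infra"
    TIMEOUT = "timeout"
    GENUINE = "genuine"


# Hypothetical prefixes; the harness maintains its own list.
INFRA_PREFIXES = ("[infra]", "docker:", "ssh:")


def classify_error(message: str) -> ErrorKind:
    """Bucket an error message as timeout, infra, or genuine."""
    text = message.strip().lower()
    if "timed out" in text or "timeout" in text:
        return ErrorKind.TIMEOUT
    if text.startswith(INFRA_PREFIXES):
        return ErrorKind.INFRA
    return ErrorKind.GENUINE


def trials_to_rerun(checkpoint: dict, retry_all: bool = False) -> list:
    """Return trial ids a resume should run again.

    Passing trials (error is None) are never re-run. Genuine failures are
    skipped unless retry_all is set, mirroring the --retry-all flag.
    """
    rerun = []
    for trial_id, error in checkpoint.items():
        if error is None:
            continue
        if retry_all or classify_error(error) is not ErrorKind.GENUINE:
            rerun.append(trial_id)
    return rerun
```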
Pre-validation failures trip at task level (all conditions share
Docker setup), other failures trip at task+condition level.
Threshold of 2 accounts for in-flight parallel workers.

Entire-Checkpoint: b2252e9cb3d3
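
The two trip levels can be sketched with a single failure counter keyed either by task alone or by task plus condition. The class and method names below are hypothetical, and conditions are assumed to be non-empty strings.

```python
from collections import Counter


class CircuitBreaker:
    """Sketch: trips after `threshold` failures recorded under the same key."""

    def __init__(self, threshold: int = 2):
        self.threshold = threshold
        self._failures = Counter()

    def record_failure(self, task_id, condition, pre_validation=False):
        # Pre-validation failures come from the shared Docker setup, so they
        # are counted against the task as a whole (condition slot is None).
        key = (task_id, None) if pre_validation else (task_id, condition)
        self._failures[key] += 1

    def is_open(self, task_id, condition):
        """True if this task, or this task+condition pair, should be skipped."""
        return (
            self._failures[(task_id, None)] >= self.threshold
            or self._failures[(task_id, condition)] >= self.threshold
        )

    def reset(self):
        self._failures.clear()
```

With a threshold of 1, a worker that was already in flight when the first failure landed could not report a second one before the breaker tripped; a threshold of 2 tolerates that overlap.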
After each batch, classifies failures as infra vs genuine. If infra
failures detected: checks Docker health, restarts if needed, reduces
parallelism, resets circuit breaker, and retries. Max 2 retry rounds.

Entire-Checkpoint: 806ec810be18
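
The retry loop described here might be shaped like the following sketch. `run_batch`, `is_infra`, and `recover` are hypothetical callables standing in for the harness's batch runner, failure classifier, and Docker health check plus circuit-breaker reset; halving the worker count is an assumed reduction policy.

```python
def run_with_infra_retries(run_batch, is_infra, tasks, workers,
                           max_rounds=2, recover=lambda: None):
    """Run a batch, then retry only infra failures for up to max_rounds.

    run_batch(tasks, workers) must return a dict mapping task -> result.
    """
    results = run_batch(tasks, workers)
    for _ in range(max_rounds):
        failed = [t for t in tasks if is_infra(results[t])]
        if not failed:
            break
        recover()                        # e.g. check/restart Docker, reset breaker
        workers = max(1, workers // 2)   # assumption: halve parallelism each round
        results.update(run_batch(failed, workers))
    return results
```

Genuine failures are left untouched by the loop, so a retry round never overwrites a real test failure with a later result.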
Status file (.eval-status.json) updated after each result with
machine-readable state: workers, pass/fail rates, paused flag.
Control dir (.eval-control/) accepts commands: pause, resume,
set-workers N, skip-task <id>. Commands consumed on read.
Enables Ralph Loop or any external agent to manage running evals.

Entire-Checkpoint: 5f55311d2e2a
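
"Commands consumed on read" can be sketched as read-then-delete over the control directory. The one-JSON-file-per-command layout, the `command`/`value` keys, and the state dictionary are assumptions for illustration; the commit does not specify the on-disk format.

```python
import json
from pathlib import Path


def consume_commands(control_dir):
    """Read every command file in name order and delete each after reading."""
    commands = []
    for path in sorted(Path(control_dir).glob("*.json")):
        try:
            commands.append(json.loads(path.read_text()))
        except (OSError, json.JSONDecodeError):
            pass  # malformed commands are dropped rather than retried forever
        path.unlink(missing_ok=True)  # consumed on read
    return commands


def apply_commands(state, commands):
    """Apply pause, resume, set-workers, and skip-task to a state dict."""
    for cmd in commands:
        name = cmd.get("command")
        if name == "pause":
            state["paused"] = True
        elif name == "resume":
            state["paused"] = False
        elif name == "set-workers":
            state["workers"] = max(1, int(cmd["value"]))
        elif name == "skip-task":
            state.setdefault("skipped", []).append(cmd["value"])
    return state
```

Deleting each file as it is read means an external agent can tell a command was picked up simply by checking that the file is gone.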
- Wire fisher_exact_test into reporter for per-task significance testing
- Add _compute_recommendations() flagging ceiling/floor/infra-only tasks
- Add Per-Task Analysis table and Recommendations section to markdown output
- Add lib/monitor.py: polling-based eval supervisor with stall detection,
  Docker recovery, infra-task skipping, and worker scaling
- 79 new tests across stats, reporter, and monitor modules

Entire-Checkpoint: 722abb53d525
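
For readers unfamiliar with the per-task significance test, here is a dependency-free sketch of a two-sided Fisher exact test on a 2x2 pass/fail table. It is not the repository's `fisher_exact_test`; the function name and signature below are hypothetical.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the table [[a, b], [c, d]].

    For example, row 1 = (passes, fails) under one condition and
    row 2 = (passes, fails) under another.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of seeing x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    observed = prob(a)
    lo = max(0, col1 - row2)
    hi = min(row1, col1)
    # Sum the probability of every table no more likely than the observed one.
    # The small relative tolerance guards against floating-point ties.
    p = sum(prob(x) for x in range(lo, hi + 1)
            if prob(x) <= observed * (1 + 1e-9))
    return min(1.0, p)
```

With only a handful of trials per task and condition, an exact test is a safer default than a chi-squared approximation, which is unreliable at small counts.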
- AGENTbench loader (HuggingFace dataset) and runner (Docker-based eval)
- run-agentbench CLI command with 4 conditions (none/flat/human/intent_layer)
- Dynamic condition discovery in reporter (no more hardcoded condition list)
- Path traversal protection in write_test_infrastructure
- Docker --network none for test isolation
- Thread-safe temp dirs (PID + thread ID)
- Checkpoint batching (every 10 results vs every 1)
- Monitor uses Reporter.INFRA_ERROR_PREFIXES (no drift)
- Set-based infra result filtering (replaces fragile list.remove)
- Empty dict all() guard in pre-validation

Entire-Checkpoint: 815f2403cb52
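
One common way to implement the path traversal protection mentioned above is to resolve the target and confirm it still sits under the base directory. The helper name is hypothetical and this is a sketch of the general technique, not the check in `write_test_infrastructure`.

```python
from pathlib import Path


def safe_join(base_dir, relative_name):
    """Return base_dir/relative_name, refusing paths that escape base_dir."""
    base = Path(base_dir).resolve()
    # Resolving collapses ".." segments and follows symlinks; an absolute
    # relative_name replaces base entirely, which the check below also rejects.
    target = (base / relative_name).resolve()
    try:
        target.relative_to(base)
    except ValueError:
        raise ValueError(f"path escapes {base}: {relative_name!r}") from None
    return target
```

Checking the resolved path rather than scanning the string for `..` also covers absolute paths and symlinks that point outside the base.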
--network none breaks setup commands that need pip install.
Default to bridge (Docker's default) and expose the parameter
so callers can opt into none for pure-test phases later.

Entire-Checkpoint: 8118f4f3017a
- agentbench_loader: use split="train" (dataset has no "test" split)
- docker_runner: use bash instead of sh (AGENTbench setup uses `source`)

Entire-Checkpoint: 0afc9c1d9169
AGENTbench images install tools like uv to /root/.local/bin which
only gets added to PATH via /etc/profile in login shells. bash -lc
instead of bash -c fixes exit 127 for repos using uv.

Entire-Checkpoint: 5d8c2af91c5d
…ote docker support

Three fixes from run 4 analysis:

- Regression eval: only fail when a golden-passing test now fails (was
  requiring 100% pass rate, which is impossible when repos have 14-83
  pre-existing failures in the baseline)
- strip_docs: preserve README.md variants (setup.py reads them)
- docker_runner: add EVAL_DOCKER_HOST support for remote x86 execution
  via rsync, avoiding QEMU emulation on Apple Silicon

Also removes .index-cache-preserve/ (context files now generated
dynamically per run).

Entire-Checkpoint: 9f4d4a6248d7
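
The revised regression rule reduces to a set comparison against the golden baseline. In the sketch below, the dict-of-booleans result shape and the function names are assumptions, as is the choice to treat a test missing from the candidate run as failing.

```python
def regression_failures(golden, candidate):
    """Tests that passed in the golden run but do not pass in the candidate.

    Both arguments map test name -> bool (passed). Tests that already failed
    in the golden run are ignored, so pre-existing failures cannot fail the
    evaluation. A test missing from `candidate` is treated as failing
    (an assumption of this sketch).
    """
    return sorted(
        name for name, passed in golden.items()
        if passed and not candidate.get(name, False)
    )


def passes_regression(golden, candidate):
    return not regression_failures(golden, candidate)
```

Under the old 100% rule, a repository with even one pre-existing baseline failure could never pass; under this rule only newly broken tests count.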
UGREEN NAS (chronos) runs an rsync daemon that intercepts all rsync
connections and rejects paths outside configured modules. tar piped
through SSH bypasses this entirely and works reliably.

Entire-Checkpoint: e86fbbd52da2
The pre-pull step was running `docker pull` locally even when Docker
execution happens on a remote host via SSH. Now uses `ssh $host docker
pull` when EVAL_DOCKER_HOST is configured.

Entire-Checkpoint: cfe3bf0bccca
Prevents two issues from overnight eval runs:
- SSH agent key expiry caused all workers to hang indefinitely on
  stale connections. Added ConnectTimeout, ServerAliveInterval, and
  subprocess timeout=300s so failures surface within 5 minutes.
- sync_from_remote was transferring .venv dirs (4GB+) back from
  chronos. Added excludes for .venv, node_modules, __pycache__, etc.

Entire-Checkpoint: aa17a443b57d
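
A sketch of how the three timeouts could be layered. `ConnectTimeout` and `ServerAliveInterval` are standard OpenSSH client options; the specific values (10 and 15 seconds) and the helper names are assumptions, and only the 300-second subprocess timeout comes from the commit message.

```python
import subprocess

# Assumed values; the commit names the options but not their settings.
SSH_OPTIONS = [
    "-o", "ConnectTimeout=10",       # give up on an unreachable host quickly
    "-o", "ServerAliveInterval=15",  # probe an established but silent connection
]


def ssh_argv(host, remote_command):
    """Build the ssh command line as an argv list (no local shell involved)."""
    return ["ssh", *SSH_OPTIONS, host, remote_command]


def run_remote(host, remote_command, timeout=300):
    """Run a remote command; raises subprocess.TimeoutExpired after `timeout` s."""
    return subprocess.run(
        ssh_argv(host, remote_command),
        capture_output=True,
        text=True,
        timeout=timeout,
    )
```

The subprocess timeout is the backstop: even if the SSH-level options fail to detect a stale connection, the worker is released after five minutes instead of hanging indefinitely.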
Replace ephemeral docker run with persistent containers (docker run -d +
docker exec) so setup runs once per task instead of 3x. Add
start_container, exec_in_container, stop_container, copy_into_container
to docker_runner.py.

Fix git "dubious ownership" error caused by macOS tar overlay changing
file UIDs inside containers (CVE-2022-24765). Add safe.directory config
before overlay. Include stderr/stdout tail in setup error messages.

Entire-Checkpoint: c11a4a4ced3d
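
The persistent-container pattern comes down to three Docker invocations, sketched here as argv builders. These helpers are hypothetical and are not the `start_container`, `exec_in_container`, or `stop_container` functions in `docker_runner.py`; keeping the container alive with `sleep infinity` assumes the image ships a `sleep` that accepts it.

```python
def start_container_argv(image, name):
    """`docker run -d`: start a long-lived container once per task."""
    return ["docker", "run", "-d", "--name", name, image, "sleep", "infinity"]


def exec_argv(name, command, workdir=None):
    """`docker exec`: run a command inside the already-running container."""
    argv = ["docker", "exec"]
    if workdir:
        argv += ["-w", workdir]
    # A login shell (-l) so PATH additions from /etc/profile are visible.
    return argv + [name, "bash", "-lc", command]


def stop_container_argv(name):
    """Force-remove the container when the task is finished."""
    return ["docker", "rm", "-f", name]


def mark_safe_directory_argv(name, repo_path):
    """Tell git to trust repo_path even if file ownership does not match."""
    return exec_argv(
        name, f"git config --global --add safe.directory {repo_path}"
    )
```

Setup runs through one `exec_argv` call after the container starts, and each later phase reuses the same container, which is what avoids repeating setup for every run.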
Nightshift-Task: logging-audit
Nightshift-Ref: https://github.com/marcus/nightshift
@orban orban merged commit 405ff2a into main Apr 15, 2026