|
16 | 16 | {"acceptance_criteria":"The readiness path for the harnesses actually in scope is documented and verified; SOURCEGRAPH_ACCESS_TOKEN is confirmed to load from .env.local for operator shells or launcher wrappers; Gemini is explicitly excluded from the immediate rerun gate; the exact commands to gate and launch the pending reruns are recorded in the issue notes or description.","actor":"","agent_state":"","assignee":null,"await_id":"","await_type":"","close_reason":"All harnesses pass readiness checks. SG token confirmed from .env.local. Gemini excluded from immediate gate.","closed_at":"2026-03-09T20:23:25Z","closed_by_session":"","compacted_at":null,"compacted_at_commit":null,"compaction_level":0,"content_hash":"7502ea94c9d230e14d139c67ca911010befb98b8cf4caa83a9a9a9710d47d945","created_at":"2026-03-09T20:19:06Z","created_by":"sjarmak","crystallizes":0,"defer_until":null,"description":"Track the operational gating work needed before treating the pending reruns as ready for harness-agnostic CI or launch checks.\n\nScope:\n- Treat active harnesses separately from the full registry-wide check when Gemini is not in scope.\n- Confirm SOURCEGRAPH_ACCESS_TOKEN is sourced from .env.local (or equivalent launcher path) before running readiness checks.\n- Validate the relevant readiness commands for the immediate rerun work, such as:\n - python3 scripts/check_harness_readiness.py --harness codex --format json\n - equivalent checks for other active harnesses as needed\n- Confirm the previously failed rerun workflow can be gated without requiring unrelated harness credentials.\n- Document any remaining blocker as either env setup, launcher bug, or harness-specific requirement.\n\nThis is separate from task-contract migration work and separate from the historical rerun execution/classification task already tracked in 
Beads.\n","design":"","due_at":null,"ephemeral":0,"estimated_minutes":null,"event_kind":"","external_ref":null,"hook_bead":"","id":"CodeScaleBench-rm3","is_template":0,"issue_type":"task","last_activity":null,"metadata":"{}","mol_type":"","notes":"Verified: all harnesses pass readiness (codex, cursor, gemini, copilot, openhands). SG token loads from .env.local (61 chars). Gemini passes but is out of scope for immediate reruns.","original_size":null,"owner":"sjarmak@users.noreply.github.com","payload":"","pinned":0,"priority":1,"quality_score":null,"rig":"","role_bead":"","role_type":"","sender":"","source_repo":"","source_system":"","spec_id":"","status":"closed","target":"","timeout_ns":0,"title":"Validate active-harness CI gating before pending rerun batches","updated_at":"2026-03-09T20:23:25Z","waiters":"","wisp_type":"","work_type":""} |
17 | 17 | {"acceptance_criteria":"","actor":"","agent_state":"","assignee":null,"await_id":"","await_type":"","close_reason":"Built unified 280-task manifest (schema v2.0). comprehension=100, implementation=90, quality=90. Overall power=84.1% at sigma=0.20. Large codebase 58.6%, multi-repo 31.8%, 20 suites, 11 languages. LOC fallback chain eliminates all unknowns.","closed_at":"2026-03-07T23:33:05Z","closed_by_session":"","compacted_at":null,"compacted_at_commit":null,"compaction_level":0,"content_hash":"e464c7d5aa11f02b2eac40dc12bfbee707add98b6882dc3f11c7d9410edd7b71","created_at":"2026-03-07T22:56:46Z","created_by":"sjarmak","crystallizes":0,"defer_until":null,"description":"Rebuild the core benchmark manifest as a single unified set (no SDLC vs Org split). Optimize selection for: (1) 80% power for overall retrieval effect, (2) balanced task-type representation (comprehension/implementation/quality), (3) multi-repo coverage in every task type, (4) LOC band diversity with emphasis on large codebases (2M+ LOC). Target ~280-300 tasks based on power analysis. Every task has both deterministic reward and IR retrieval scoring.","design":"","due_at":null,"ephemeral":0,"estimated_minutes":null,"event_kind":"","external_ref":null,"hook_bead":"","id":"CodeScaleBench-utv","is_template":0,"issue_type":"task","last_activity":null,"metadata":"{}","mol_type":"","notes":"","original_size":null,"owner":"sjarmak@users.noreply.github.com","payload":"","pinned":0,"priority":3,"quality_score":null,"rig":"","role_bead":"","role_type":"","sender":"","source_repo":"","source_system":"","spec_id":"","status":"closed","target":"","timeout_ns":0,"title":"Rebuild unified manifest with power-optimized task-type balance","updated_at":"2026-03-07T23:33:05Z","waiters":"","wisp_type":"","work_type":""} |
18 | 18 | {"acceptance_criteria":"","actor":"","agent_state":"","assignee":null,"await_id":"","await_type":"","close_reason":"Epic complete: (1) IR scoring added to SDLC tasks (ggy), (2) 67 Org tasks got deterministic verifiers (c17), (3) unified 280-task manifest built (utv). No more SDLC/Org split.","closed_at":"2026-03-07T23:33:07Z","closed_by_session":"","compacted_at":null,"compacted_at_commit":null,"compaction_level":0,"content_hash":"e3d9bf86e6f520ab604c0c7d317b708e8814f4e5505b5d360caf4591b3428e2d","created_at":"2026-03-07T22:56:15Z","created_by":"sjarmak","crystallizes":0,"defer_until":null,"description":"Converge the two halves of CodeScaleBench (SDLC with deterministic verifiers + Org with answer.json verifiers) into a single unified benchmark. Three phases: (1) add IR scoring to SDLC tasks via curator ground truth, (2) promote select Org tasks to SDLC categories with deterministic verifiers, (3) rebuild manifest optimized for multi-repo, large codebase, and task-type balance (comprehension/implementation/quality).","design":"","due_at":null,"ephemeral":0,"estimated_minutes":null,"event_kind":"","external_ref":null,"hook_bead":"","id":"CodeScaleBench-xjg","is_template":0,"issue_type":"feature","last_activity":null,"metadata":"{}","mol_type":"","notes":"","original_size":null,"owner":"sjarmak@users.noreply.github.com","payload":"","pinned":0,"priority":1,"quality_score":null,"rig":"","role_bead":"","role_type":"","sender":"","source_repo":"","source_system":"","spec_id":"","status":"closed","target":"","timeout_ns":0,"title":"[Epic] Unify SDLC + Org into single balanced benchmark","updated_at":"2026-03-07T23:33:07Z","waiters":"","wisp_type":"","work_type":""} |
19 | | -{"acceptance_criteria":"","actor":"","agent_state":"","assignee":null,"await_id":"","await_type":"","close_reason":"","closed_at":null,"closed_by_session":"","compacted_at":null,"compacted_at_commit":null,"compaction_level":0,"content_hash":"ac89868978b54a6008a99a151b8f278d8fdc393d23b13578f18cb1bd62db75e7","created_at":"2026-03-10T11:27:18Z","created_by":"sjarmak","crystallizes":0,"defer_until":null,"description":"Three distinct infra failures need fixing before rerunning OH verification tasks:\n\n1. Harbor FileNotFoundError: django-select-for-update agent ran successfully (614 lines output, 0 crashes) but Harbor crashed writing command-2/return-code.txt. Likely Daytona sandbox cleanup race in ccb_harbor.daytona:GuardedDaytonaEnvironment.\n\n2. DinD build failure: bustub-hyperloglog baseline (Claude Haiku sentinel, csb_sdlc_feature_haiku_20260309_223654) — DinD build never completed, no task-level result dir created.\n\n3. MCP 6.5hr exception: bustub-hyperloglog MCP (same sentinel run) — ran 6.5 hours then exception_raised. flagged.json shows deepsearch_unused + only 7.86% MCP ratio.\n\nAfter fixing these, rerun all 12 tasks using configs/oh_full_rerun_20260310.json. The 9 original verification subset tasks crashed due to jupyter/fget bugs (now fixed in d0fab95). The 3 extra tasks (compliance-124, agentic-122, django-select-for-update) also need rerun. 
Note: 3 tasks are csb_org_* — verify OH launcher handles org tasks (prior rerun silently skipped them).\n\nAlso audit official runs for false positives from the no_changes_guard verifier bug (fixed in c5f261f):\n grep -rl no_changes_guard runs/official/*/validation_result.json\n\nTainted runs (do NOT promote): openhands_sonnet46_20260309_{210054,223658,232947}","design":"","due_at":null,"ephemeral":0,"estimated_minutes":null,"event_kind":"","external_ref":null,"hook_bead":"","id":"CodeScaleBench-yb4","is_template":0,"issue_type":"bug","last_activity":null,"metadata":"{}","mol_type":"","notes":"","original_size":null,"owner":"sjarmak@users.noreply.github.com","payload":"","pinned":0,"priority":2,"quality_score":null,"rig":"","role_bead":"","role_type":"","sender":"","source_repo":"","source_system":"","spec_id":"","status":"open","target":"","timeout_ns":0,"title":"Investigate OH/Harbor infrastructure failures before rerun","updated_at":"2026-03-10T11:27:18Z","waiters":"","wisp_type":"","work_type":""} |
| 19 | +{"acceptance_criteria":"","actor":"","agent_state":"","assignee":null,"await_id":"","await_type":"","close_reason":"","closed_at":null,"closed_by_session":"","compacted_at":null,"compacted_at_commit":null,"compaction_level":0,"content_hash":"ac89868978b54a6008a99a151b8f278d8fdc393d23b13578f18cb1bd62db75e7","created_at":"2026-03-10T11:27:18Z","created_by":"sjarmak","crystallizes":0,"defer_until":null,"description":"Three distinct infra failures need fixing before rerunning OH verification tasks:\n\n1. Harbor FileNotFoundError: django-select-for-update agent ran successfully (614 lines output, 0 crashes) but Harbor crashed writing command-2/return-code.txt. Likely Daytona sandbox cleanup race in ccb_harbor.daytona:GuardedDaytonaEnvironment.\n\n2. DinD build failure: bustub-hyperloglog baseline (Claude Haiku sentinel, csb_sdlc_feature_haiku_20260309_223654) — DinD build never completed, no task-level result dir created.\n\n3. MCP 6.5hr exception: bustub-hyperloglog MCP (same sentinel run) — ran 6.5 hours then exception_raised. flagged.json shows deepsearch_unused + only 7.86% MCP ratio.\n\nAfter fixing these, rerun all 12 tasks using configs/oh_full_rerun_20260310.json. The 9 original verification subset tasks crashed due to jupyter/fget bugs (now fixed in d0fab95). The 3 extra tasks (compliance-124, agentic-122, django-select-for-update) also need rerun. 
Note: 3 tasks are csb_org_* — verify OH launcher handles org tasks (prior rerun silently skipped them).\n\nAlso audit official runs for false positives from the no_changes_guard verifier bug (fixed in c5f261f):\n grep -rl no_changes_guard runs/official/*/validation_result.json\n\nTainted runs (do NOT promote): openhands_sonnet46_20260309_{210054,223658,232947}","design":"","due_at":null,"ephemeral":0,"estimated_minutes":null,"event_kind":"","external_ref":null,"hook_bead":"","id":"CodeScaleBench-yb4","is_template":0,"issue_type":"bug","last_activity":null,"metadata":"{}","mol_type":"","notes":"","original_size":null,"owner":"sjarmak@users.noreply.github.com","payload":"","pinned":0,"priority":2,"quality_score":null,"rig":"","role_bead":"","role_type":"","sender":"","source_repo":"","source_system":"","spec_id":"","status":"in_progress","target":"","timeout_ns":0,"title":"Investigate OH/Harbor infrastructure failures before rerun","updated_at":"2026-03-10T12:13:47Z","waiters":"","wisp_type":"","work_type":""} |