68 | 68 | {"actor":"sjarmak","comment":null,"created_at":"2026-03-11T01:18:24Z","event_type":"created","id":68,"issue_id":"CodeScaleBench-iv9","new_value":"","old_value":""} |
69 | 69 | {"actor":"sjarmak","comment":null,"created_at":"2026-03-11T01:18:26Z","event_type":"created","id":69,"issue_id":"CodeScaleBench-6or","new_value":"","old_value":""} |
70 | 70 | {"actor":"sjarmak","comment":null,"created_at":"2026-03-11T01:40:35Z","event_type":"updated","id":70,"issue_id":"CodeScaleBench-zrs","new_value":"{\"notes\":\"Suite merge map: security(39)=sdlc_secure+org_security+org_compliance | debug(26)=sdlc_debug+org_incident | fix(19)=sdlc_fix | feature(34)=sdlc_feature+org_org | refactor(43)=sdlc_refactor+org_migration | understand(44)=sdlc_understand+sdlc_design+org_domain+org_onboarding | document(11)=sdlc_document | test(12)=sdlc_test | crossrepo(47)=org_crossrepo+org_crossrepo_tracing+org_crossorg+org_platform\"}","old_value":"{\"id\":\"CodeScaleBench-zrs\",\"title\":\"Unified dual-score benchmark: agent always produces both direct edits and answer.json\",\"description\":\"Epic: Every task run yields two independent scores (reward_direct from file edits, reward_artifact from answer.json). No mode switching — agent always does both. Requires changes to agent instructions, verifier infrastructure, result extraction, and all 275 task verifiers.\",\"status\":\"open\",\"priority\":1,\"issue_type\":\"feature\",\"owner\":\"sjarmak@users.noreply.github.com\",\"created_at\":\"2026-03-11T01:15:58Z\",\"created_by\":\"sjarmak\",\"updated_at\":\"2026-03-11T01:15:58Z\"}"} |
| 71 | +{"actor":"sjarmak","comment":null,"created_at":"2026-03-11T01:40:52Z","event_type":"created","id":71,"issue_id":"CodeScaleBench-5hc","new_value":"","old_value":""} |
| 72 | +{"actor":"sjarmak","comment":null,"created_at":"2026-03-11T01:41:02Z","event_type":"status_changed","id":72,"issue_id":"CodeScaleBench-5hc","new_value":"{\"status\":\"in_progress\"}","old_value":"{\"id\":\"CodeScaleBench-5hc\",\"title\":\"Scaffold benchmarks/csb/ with merged suites and copy all 275 tasks\",\"description\":\"Create benchmarks/csb/{security,debug,fix,feature,refactor,understand,document,test,crossrepo}/ directories. Copy all 275 tasks from existing csb_sdlc_* and csb_org_* into merged suites per the merge map. Update task.toml in each copied task to reflect new suite. Do NOT modify original benchmarks/ dirs. Merge map: security=sdlc_secure+org_security+org_compliance, debug=sdlc_debug+org_incident, fix=sdlc_fix, feature=sdlc_feature+org_org, refactor=sdlc_refactor+org_migration, understand=sdlc_understand+sdlc_design+org_domain+org_onboarding, document=sdlc_document, test=sdlc_test, crossrepo=org_crossrepo+org_crossrepo_tracing+org_crossorg+org_platform\",\"status\":\"open\",\"priority\":1,\"issue_type\":\"task\",\"owner\":\"sjarmak@users.noreply.github.com\",\"created_at\":\"2026-03-11T01:40:52Z\",\"created_by\":\"sjarmak\",\"updated_at\":\"2026-03-11T01:40:52Z\"}"} |
| 73 | +{"actor":"sjarmak","comment":null,"created_at":"2026-03-11T01:42:27Z","event_type":"closed","id":73,"issue_id":"CodeScaleBench-5hc","new_value":"275 tasks scaffolded into benchmarks/csb/ across 9 merged suites. origin_suite tracked in task.toml. Zero missing files.","old_value":""} |
| 74 | +{"actor":"sjarmak","comment":null,"created_at":"2026-03-11T01:43:02Z","event_type":"status_changed","id":74,"issue_id":"CodeScaleBench-44x","new_value":"{\"status\":\"in_progress\"}","old_value":"{\"id\":\"CodeScaleBench-44x\",\"title\":\"Agent instructions: always produce both direct edits AND answer.json\",\"description\":\"Modify claude_baseline_agent.py so ALL mcp_types tell the agent to: (1) edit files directly to solve the task, AND (2) also produce /workspace/answer.json summarizing work. Remove the either/or branching between artifact_full and direct modes. The answer.json instruction should be appended to every workflow_tail, not just artifact_full.\",\"status\":\"open\",\"priority\":1,\"issue_type\":\"task\",\"owner\":\"sjarmak@users.noreply.github.com\",\"created_at\":\"2026-03-11T01:18:09Z\",\"created_by\":\"sjarmak\",\"updated_at\":\"2026-03-11T01:18:09Z\"}"} |
| 75 | +{"actor":"sjarmak","comment":null,"created_at":"2026-03-11T01:43:02Z","event_type":"status_changed","id":75,"issue_id":"CodeScaleBench-izn","new_value":"{\"status\":\"in_progress\"}","old_value":"{\"id\":\"CodeScaleBench-izn\",\"title\":\"Dual-score verifier infrastructure: write reward_direct.txt and reward_artifact.txt\",\"description\":\"Create a shared verifier library (dual_score_lib.sh) that every test.sh sources. After the existing verifier runs and writes reward.txt (direct score), the lib: (1) parses /workspace/answer.json via answer_json_verifier_lib.sh, (2) scores it independently using the same oracle/checklist logic, (3) writes /logs/verifier/reward_direct.txt and /logs/verifier/reward_artifact.txt. Keep reward.txt as-is for backward compat (composite or direct score). Also update validation_result.json schema to include both sub-scores.\",\"status\":\"open\",\"priority\":1,\"issue_type\":\"task\",\"owner\":\"sjarmak@users.noreply.github.com\",\"created_at\":\"2026-03-11T01:18:13Z\",\"created_by\":\"sjarmak\",\"updated_at\":\"2026-03-11T01:18:13Z\"}"} |
| 76 | +{"actor":"sjarmak","comment":null,"created_at":"2026-03-11T01:46:24Z","event_type":"closed","id":76,"issue_id":"CodeScaleBench-44x","new_value":"Merged workflow_tail in claude_baseline_agent.py — all mcp_types now tell agent to edit files AND produce answer.json. Added OUTPUT ARTIFACT section to EVALUATION_CONTEXT_PROMPT for baseline (none) mode.","old_value":""} |
| 77 | +{"actor":"sjarmak","comment":null,"created_at":"2026-03-11T01:46:24Z","event_type":"closed","id":77,"issue_id":"CodeScaleBench-izn","new_value":"Created scripts/dual_score_lib.sh — sources at end of test.sh/eval.sh, captures reward_direct from existing reward.txt, independently scores answer.json as reward_artifact, writes both to /logs/verifier/.","old_value":""} |
| 78 | +{"actor":"sjarmak","comment":null,"created_at":"2026-03-11T01:46:25Z","event_type":"closed","id":78,"issue_id":"CodeScaleBench-6cv","new_value":"All 131 SDLC tasks in benchmarks/csb/ integrated via integrate_dual_score.py. 123 already had answer_json_verifier_lib, 8 skip tasks got it added. All 131 have dual_score_lib.sh sourced at end of test.sh.","old_value":""} |
| 79 | +{"actor":"sjarmak","comment":null,"created_at":"2026-03-11T01:46:25Z","event_type":"closed","id":79,"issue_id":"CodeScaleBench-csg","new_value":"All 144 Org tasks in benchmarks/csb/ integrated. 81 eval.sh tasks get dual_score appended to eval.sh. 55 promoted_verifier tasks get it in test.sh. 8 onboard-search tasks get it in test.sh. All have dual_score_lib.sh and answer_json_verifier_lib.sh in tests/.","old_value":""} |
| 80 | +{"actor":"sjarmak","comment":null,"created_at":"2026-03-11T01:46:30Z","event_type":"status_changed","id":80,"issue_id":"CodeScaleBench-iv9","new_value":"{\"status\":\"in_progress\"}","old_value":"{\"id\":\"CodeScaleBench-iv9\",\"title\":\"Extend result.json and extraction pipeline for dual scores\",\"description\":\"Update Harbor result capture and extraction pipeline: (1) result.json gets verifier_result.rewards.reward_direct and .reward_artifact alongside existing .reward; (2) extract_v2_report_data.py reads both new fields; (3) task_metrics.json includes both; (4) aggregate_status.py extended; (5) promote_run.py validates both scores present. Keep reward as composite (mean of direct+artifact) for backward compat.\",\"status\":\"open\",\"priority\":2,\"issue_type\":\"task\",\"owner\":\"sjarmak@users.noreply.github.com\",\"created_at\":\"2026-03-11T01:18:24Z\",\"created_by\":\"sjarmak\",\"updated_at\":\"2026-03-11T01:18:24Z\"}"} |
| 81 | +{"actor":"sjarmak","comment":null,"created_at":"2026-03-11T01:47:57Z","event_type":"closed","id":81,"issue_id":"CodeScaleBench-iv9","new_value":"Extended extract_v2_report_data.py to read reward_direct and reward_artifact from validation_result.json and result.json. Added dual_rewards to paired stats (bl_reward_direct, mcp_reward_direct, bl_reward_artifact, mcp_reward_artifact). Added _extract_dual_rewards() to aggregate_status.py. promote_run.py needs no changes (additive fields). Backward compatible: pre-dual runs default reward_direct=reward, reward_artifact=None.","old_value":""} |
| 82 | +{"actor":"sjarmak","comment":null,"created_at":"2026-03-11T01:48:02Z","event_type":"status_changed","id":82,"issue_id":"CodeScaleBench-6or","new_value":"{\"status\":\"in_progress\"}","old_value":"{\"id\":\"CodeScaleBench-6or\",\"title\":\"Dual-score reporting: paired stats and breakdowns for both dimensions\",\"description\":\"Extend reporting to show both score dimensions: (1) compute_paired_stats produces bl_reward_direct, mcp_reward_direct, delta_direct (and same for artifact); (2) breakdown_by generates per-language, per-difficulty, per-suite stats for each dimension; (3) Add correlation analysis between direct and artifact scores (do agents that edit well also describe well?). Output unified report with both dimensions.\",\"status\":\"open\",\"priority\":3,\"issue_type\":\"task\",\"owner\":\"sjarmak@users.noreply.github.com\",\"created_at\":\"2026-03-11T01:18:27Z\",\"created_by\":\"sjarmak\",\"updated_at\":\"2026-03-11T01:18:27Z\"}"} |
| 83 | +{"actor":"sjarmak","comment":null,"created_at":"2026-03-11T01:49:07Z","event_type":"closed","id":83,"issue_id":"CodeScaleBench-6or","new_value":"Added DUAL-SCORE ANALYSIS and DUAL-SCORE BY SUITE sections to extract_v2_report_data.py output. Shows direct vs artifact means, gap, and Pearson correlation. breakdown_by() now includes per-dimension stats (bl_mean_direct, mcp_mean_direct, delta_direct, etc.) when data available.","old_value":""} |
| 84 | +{"actor":"sjarmak","comment":null,"created_at":"2026-03-11T01:49:07Z","event_type":"closed","id":84,"issue_id":"CodeScaleBench-zrs","new_value":"Epic complete. 275 tasks in benchmarks/csb/ across 9 merged suites, all with dual-score verifiers. Agent instructions updated to always produce both direct edits and answer.json. Extraction and reporting pipelines extended for dual scores.","old_value":""} |
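The suite sizes in the merge map recorded in event 70 can be cross-checked against the 275-task total reported when the epic was closed in event 84. The following is a minimal sketch of that check, not part of the repository: the names `MERGE_MAP`, `total_tasks`, and `duplicated_sources` are hypothetical, and the counts and source-suite names are copied from the notes field as written.

```python
# Suite sizes and source suites, copied from the merge map in event 70.
# The dictionary layout and the helper names below are illustrative assumptions.
MERGE_MAP = {
    "security": (39, ["sdlc_secure", "org_security", "org_compliance"]),
    "debug": (26, ["sdlc_debug", "org_incident"]),
    "fix": (19, ["sdlc_fix"]),
    "feature": (34, ["sdlc_feature", "org_org"]),
    "refactor": (43, ["sdlc_refactor", "org_migration"]),
    "understand": (44, ["sdlc_understand", "sdlc_design", "org_domain", "org_onboarding"]),
    "document": (11, ["sdlc_document"]),
    "test": (12, ["sdlc_test"]),
    "crossrepo": (47, ["org_crossrepo", "org_crossrepo_tracing", "org_crossorg", "org_platform"]),
}


def total_tasks(merge_map):
    """Sum the per-suite task counts."""
    return sum(count for count, _ in merge_map.values())


def duplicated_sources(merge_map):
    """Return source suites that are assigned to more than one merged suite."""
    seen = {}
    for suite, (_, sources) in merge_map.items():
        for source in sources:
            seen.setdefault(source, []).append(suite)
    return {src: suites for src, suites in seen.items() if len(suites) > 1}


if __name__ == "__main__":
    print("merged suites:", len(MERGE_MAP))
    print("total tasks:", total_tasks(MERGE_MAP))
    print("duplicated sources:", duplicated_sources(MERGE_MAP))
```

Under these inputs the nine suite counts sum to 275, matching the closing comments in events 73 and 84, and no source suite appears in two merged suites. The same total agrees with the 131 SDLC tasks plus 144 Org tasks reported in events 78 and 79.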