Several cases have questionable ground truth that inflates or deflates scores:
- leo-001: "Bug" is missing code (feature gap), not reviewable wrong code
- leo-013: Buggy lines point to comments, not executable code
- leo-029: Release version bump — not a reviewable bug
- leo-074: Blame traced to module re-exports from 3 years ago, not actual logic bug
- leo-091: Accumulated technical debt in Display formatting, not a discrete defect
- leo-136: Scorer marks "caught" when tools comment on right file but wrong issue
- Copilot: Best at line-specific detection; 60% catch rate
- Greptile: Strong summaries with file:line references; 45% catch rate
- Sonnet diff-only: Highest review quality (2.09); finds most novel issues
- Opus diff-only: Highest detection precision on caught cases
- Classic Rust anti-patterns: blocking in async, wrong variable, type mismatches
- Copy-paste errors: Wrong string literal, wrong enum variant
- Missing error handling: Unwrapped Results, panics replacing errors
- External API changes: Bug requires knowing a dependency's API changed
- Missing code: Bug is absence of functionality, not wrong code
- Domain-specific: Cryptographic nonce security, ZK circuit correctness
- Large refactors: 20+ files changed, tools overwhelmed by volume
- caught=True with det=0 appears in several cases: the mechanical scorer records a hit, but the judge says the comment addresses the wrong issue. The scorer is too permissive on file-level matches.
- CodeRabbit's 3.4% catch rate is accurate even with the LLM judge: its summaries genuinely don't identify specific bugs.
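The difference between a file-level match and a line-level match can be sketched as follows. This is a minimal illustration, not the benchmark's actual scorer: the `Comment` and `GroundTruth` types, both function names, and the `tolerance` parameter are all hypothetical.

```rust
/// Hypothetical review comment: file path plus the line it is attached to.
pub struct Comment<'a> {
    pub file: &'a str,
    pub line: u32,
}

/// Hypothetical ground-truth entry: file path plus an inclusive buggy line range.
pub struct GroundTruth<'a> {
    pub file: &'a str,
    pub start: u32,
    pub end: u32,
}

/// Permissive rule: any comment on the right file counts as "caught".
pub fn caught_file_level(comments: &[Comment<'_>], truth: &GroundTruth<'_>) -> bool {
    comments.iter().any(|c| c.file == truth.file)
}

/// Stricter rule: the comment must land within `tolerance` lines of the buggy range.
pub fn caught_line_level(
    comments: &[Comment<'_>],
    truth: &GroundTruth<'_>,
    tolerance: u32,
) -> bool {
    comments.iter().any(|c| {
        c.file == truth.file
            && c.line + tolerance >= truth.start
            && c.line <= truth.end + tolerance
    })
}
```

Even the stricter rule only checks location. Whether a nearby comment describes the right issue still has to come from the judge, which is why caught=True can coexist with det=0.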
14 active cases. Tools: Copilot (PR), Greptile (PR), CodeRabbit (PR), Sonnet diff-only (run-01), Opus diff-only (run-04).
Notation: X = caught (comment near buggy lines), . = missed. Det/Qual = detection_score/review_quality (0-3 / 0-4).
Ground truth: crates/fmt/tests/harness.rs:143-176 -- Missing test infrastructure code (AST equivalence validation, type-checking validation, external repo tests, helper functions).
Blame confidence: B
| Tool | Caught | Det | Qual | Finding summary |
|---|---|---|---|---|
| Copilot | X | 0 | 1 | Nitpicked tested counter placement (FP); missed actual missing infrastructure |
| Greptile | X | 0 | 1 | Low-value doc-comment path nit; missed missing validation code entirely |
| CodeRabbit | . | 0 | 0 | No comments produced |
| Sonnet diff | . | 0 | 1 | Incorrect claim about use super::* not importing private items (FP) |
| Opus diff | . | 0 | 0 | No comments produced |
Analysis: No tool detected the actual bug. The ground truth here is unusual -- the "bug" is an absence of substantial test infrastructure that should have been added in the introducing PR. This is essentially missing code rather than incorrect code, making it extremely hard for any tool to flag from a diff review. The buggy lines span 30+ lines of code that simply don't exist yet. This case tests whether tools can identify that a refactoring PR failed to carry over important functionality. Ground truth validity: questionable for code review detection -- this is more of a feature gap than a reviewable bug in the introducing diff.
Ground truth: errors/src/errors/type_checker/type_checker_error.rs:1181-1183 -- The invalid_operation_inside_final_block error has an unclear message that should provide guidance about capturing values in local variables.
Blame confidence: A
| Tool | Caught | Det | Qual | Finding summary |
|---|---|---|---|---|
| Copilot | . | 0 | 2 | Found 7 real grammar/formatting issues in error messages (all TP-novel) |
| Greptile | . | 0 | 0 | No comments at all |
| CodeRabbit | . | 0 | 2 | 2 novel findings (DCE dropping FinalFn, missing type-checking guards) + 1 FP + 4 low-value |
| Sonnet diff | . | 0 | 3 | 11 novel findings: caught panic!() replacing emit_err() across refactoring + grammar errors |
| Opus diff | . | 0 | 2 | 10 novel findings: same panic! vs unreachable!() pattern, grammar issues |
Analysis: No tool caught the target bug, but every tool that produced output found genuine novel issues in the diff. Sonnet diff-only stood out with 11 TP-novel findings at quality=3. The bug is about improving an error message's clarity -- a UX concern that's hard to flag mechanically since the existing message isn't technically wrong, just unhelpful. The PR is a large vocabulary refactoring with many secondary issues, which drew all tools' attention away from the specific error message that needed improvement. Ground truth is valid but represents a subtle "error message quality" bug that tools reasonably missed.
Ground truth: crates/leo/src/cli/commands/devnet/mod.rs:352-639 -- Missing --storage/--path arguments and per-node storage directories needed for snarkOS compatibility.
Blame confidence: A
| Tool | Caught | Det | Qual | Finding summary |
|---|---|---|---|---|
| Copilot | X | 1 | 3 | 5 TP-novel: touched cleanup area (det=1), plus real finds (data corruption risk, u16 overflow, unchecked HTTP) |
| Greptile | X | 0 | 2 | 1 TP-novel (exit code concern), 3 low-value u16 overflow nits |
| CodeRabbit | . | 1 | 2 | 7 TP-novel + 2 FP; many tangential issues (shell escaping, port overflow) but missed core bug |
| Sonnet diff | . | 0 | 2 | 2 TP-novel (Darwin arch, silenced HTTP); mostly speculative edge cases |
| Opus diff | X | 0 | 2 | 3 TP-novel (u16 truncation, Darwin arch, CI config); missed actual --storage/--path issue |
Analysis: Copilot came closest (det=1) by touching the cleanup area, but no tool identified the root cause: missing --storage and --path CLI arguments for snarkOS compatibility. This is an external API compatibility bug -- the introducing PR built against one snarkOS version, and the fix adapts to a newer release. Tools can't detect this without knowing the snarkOS API changed. All tools instead found secondary issues (u16 overflow, HTTP error handling, etc.) which are legitimate but tangential. Ground truth is valid but requires external knowledge.
Ground truth: leo/cli/commands/devnode/advance.rs:40-63 and related -- Blocking reqwest client in async context, silently discarded HTTP responses, missing error propagation.
Blame confidence: C
| Tool | Caught | Det | Qual | Finding summary |
|---|---|---|---|---|
| Copilot | X | 3 | 3 | Direct hit: blocking reqwest in async fn, discarded .send() result. 1 TP + 10 TP-novel |
| Greptile | X | 3 | 3 | Direct hit: blocking client will panic in async runtime. 1 TP + 3 TP-novel |
| CodeRabbit | X | 2 | 3 | Found blocking client + missing timeout. 1 TP + 9 TP-novel + 1 FP |
| Sonnet diff | X | 2 | 4 | 3 TPs on buggy lines (blocking client, ignored response, fragile rate-limit). 8 TP-novel. Exceptional quality |
| Opus diff | X | 3 | 3 | 2 TPs (blocking in async, discarded response). 5 TP-novel. Correct fix suggested |
Analysis: Best case in the group -- all 5 tools caught the bug. The pattern (blocking HTTP client inside async fn + ignored response) is a well-known Rust anti-pattern that LLMs and static analyzers can easily detect. Sonnet diff-only achieved the highest quality (4) with 3 direct TPs and 8 novel findings. Despite blame confidence C, the ground truth is clearly valid -- every tool independently identified the same core issues. This case demonstrates that obvious API misuse bugs are reliably caught across all tools.
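The "ignored response" half of this pattern can be shown with the standard library alone. The sketch below is an assumption-laden stand-in: `send` is a hypothetical replacement for the real HTTP call, and the blocking-client-in-async half is not reproduced because it needs an async runtime.

```rust
/// Hypothetical stand-in for an HTTP call; the real code uses a reqwest client.
fn send(status: u16) -> Result<u16, String> {
    if status < 400 {
        Ok(status)
    } else {
        Err(format!("request failed with status {}", status))
    }
}

/// Buggy shape: the result is dropped, so a failed request looks like success.
pub fn advance_ignoring_errors(status: u16) -> Result<(), String> {
    let _ = send(status);
    Ok(())
}

/// Fixed shape: `?` propagates the failure to the caller.
pub fn advance_propagating_errors(status: u16) -> Result<(), String> {
    send(status)?;
    Ok(())
}
```

The `let _ = ...` form is a visible, local signal in a diff, which may be part of why every tool flagged it.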
Ground truth: leo-fmt/src/format.rs:1458-1476 and many other locations -- Missing line-wrapping support for binary operator chains exceeding 100 characters.
Blame confidence: A
| Tool | Caught | Det | Qual | Finding summary |
|---|---|---|---|---|
| Copilot | X | 0 | 2 | 7 TP-novel: found real issues (missing newlines after block comments, dropped COLON_COLON tokens) but missed wrapping |
| Greptile | . | 0 | 1 | 1 TP-novel (removed test_parse_safety), 1 low-value |
| CodeRabbit | . | 0 | 0 | Rate-limited, never reviewed |
| Sonnet diff | X | 1 | 2 | 4 FP + 1 TP-novel; touched format_binary area but diagnosed wrong issue (DOT_DOT missing) |
| Opus diff | . | 1 | 2 | 2 TP-novel (duplicate trailing comments, dropped operator tokens); flagged right function but wrong issue |
Analysis: No tool identified the actual missing feature: binary op chain wrapping at 100 chars. This is a feature-level bug -- the formatter produces syntactically valid but poorly formatted output. Tools can't detect this without understanding the formatter's design goals. Both Sonnet and Opus touched the format_binary function (det=1) but diagnosed different issues. Copilot found the most novel issues (7) in the same code area. Ground truth is valid but represents a design-level omission that's extremely hard to catch from diff review alone.
Ground truth: Multiple: tests/expectations/cli/test_add/contents/build/imports/credits.aleo:53-96 (test expectations), deploy.rs/execute.rs/upgrade.rs (endpoint version-stripping regex).
Blame confidence: B
| Tool | Caught | Det | Qual | Finding summary |
|---|---|---|---|---|
| Copilot | X | 1 | 2 | Flagged regex code duplication (right area, wrong diagnosis). 3 TP-novel |
| Greptile | X | 2 | 2 | Identified problematic regex in deploy.rs (right location). Suggested wrong fix (refine regex vs remove it) |
| CodeRabbit | . | 0 | 0 | Rate-limited, never reviewed |
| Sonnet diff | X | 2 | 3 | 6 TPs on regex lines + 4 TP-novel. Found right location, suggested improving regex rather than removing |
| Opus diff | X | 1 | 2 | 6 FPs: found exact right location but fundamentally misdiagnosed (suggested refining regex instead of removing) |
Analysis: Tools consistently found the version-stripping regex but misdiagnosed the fix direction. The actual fix is to remove the stripping entirely (VM no longer expects bare endpoints), but tools suggested improving the regex pattern. This is a classic "found the code, missed the intent" situation. Sonnet diff scored highest (det=2, qual=3) with the most TPs. Opus diff is notable as a cautionary tale: it found the exact right lines (6 comments) but every one was scored FP because the suggested fix was backwards. Scorer may be too strict here -- locating buggy code and noting it's problematic should count even if the fix direction is wrong.
Ground truth: compiler/passes/src/function_inlining/program.rs:32-36 -- Unfiltered post_order() call in function inlining includes external program functions that pollute ordering.
Blame confidence: A
| Tool | Caught | Det | Qual | Finding summary |
|---|---|---|---|---|
| Copilot | . | 0 | 1 | 1 FP: meta-concern about PR description vs changes (not a code defect) |
| Greptile | . | 1 | 2 | Summary correctly described the fix but no comment flagged the buggy lines directly |
| CodeRabbit | . | 0 | 0 | Rate-limited, never reviewed |
| Sonnet diff | . | 0 | 0 | No comments at all |
| Opus diff | . | 0 | 0 | No comments at all |
Analysis: Total miss across all tools. This is a subtle compiler pass ordering bug where external program functions pollute the call graph traversal. The bug is in 5 lines that call .post_order() without filtering -- understanding why this is wrong requires deep knowledge of the compiler's function inlining architecture. Both diff-only agents produced zero comments, suggesting the diff was too small or domain-specific to trigger findings. Greptile's summary showed it understood the change but couldn't surface it as an actionable finding. Ground truth is valid -- this is a genuinely hard-to-catch compiler bug.
Ground truth: Multiple files: compiler/ast/src/passes/visitor.rs:343-375 (storage variable visitor), compiler/passes/src/type_checking/visitor.rs:309-345 (Get/Set type checking for vectors/mappings).
Blame confidence: B
| Tool | Caught | Det | Qual | Finding summary |
|---|---|---|---|---|
| Copilot | . | 0 | 1 | 2 FPs: speculative swap_remove and storage classification concerns |
| Greptile | X | 0 | 2 | 3 TP-novel (missing VectorClear guard, swap_remove analysis, nested vector rejection) |
| CodeRabbit | . | 0 | 0 | Rate-limited, never reviewed |
| Sonnet diff | . | 0 | 0 | No comments at all |
| Opus diff | . | 0 | 2 | 2 TP-novel (missing VectorClear check, new_mappings scope leak) |
Analysis: No tool caught the actual bugs (SSA incorrectly pathing external global variables, type inference for get_or_use with unsuffixed numerics). The SSA bug requires understanding how the compiler resolves variable names across program boundaries -- deep domain knowledge. Interestingly, both Greptile and Opus independently found a missing check_access_allowed guard for VectorClear, suggesting this is a real novel issue. Ground truth is valid but the bugs are deep compiler semantics that require understanding Leo's SSA pass and type inference system.
Ground truth: leo/cli/commands/build.rs:108,112 -- Logic for distinguishing network/local .aleo dependencies from local Leo dependencies.
Blame confidence: C
| Tool | Caught | Det | Qual | Finding summary |
|---|---|---|---|---|
| Copilot | . | 0 | 2 | 3 TP-novel: name-consistency check, confusing error path, semver-breaking API removal |
| Greptile | . | 0 | 2 | 1 TP-novel: validation gap in from_aleo_path_impl |
| CodeRabbit | . | 0 | 0 | Rate-limited, never reviewed |
| Sonnet diff | . | 0 | 2 | 2 TP-novel + 1 FP: found real issues in dependency resolution but missed build.rs |
| Opus diff | . | 0 | 2 | 1 TP-novel: edge case in .aleo path handling |
Analysis: Total miss on the target bug across all tools, but every tool that produced output achieved quality=2 with novel findings. The buggy lines are two comment lines in build.rs that describe incorrect dependency-type discrimination logic -- the comments are misleading about when bytecode vs Leo compilation should be used. With blame confidence C, this ground truth is the weakest in the group. The bug may be more about missing cases in the match logic than about those specific comment lines. Ground truth validity: questionable -- the buggy lines point to comments rather than executable code, and the fix PR description focuses on leo upgrade skip behavior, not dependency classification.
Ground truth: leo/tests/integration.rs:18-54 and extensive sections -- Stale test infrastructure that should switch from snarkOS/devnet to devnode, fix ErasedJson::pretty, and refactor to per-test unit tests.
Blame confidence: C
| Tool | Caught | Det | Qual | Finding summary |
|---|---|---|---|---|
| Copilot | X | 0 | 2 | 5 TP-novel: zombie process, stale directory, argument mismatches. Missed core devnode migration |
| Greptile | X | 0 | 2 | 2 TP-novel: timeout on polling loop, argument count mismatch |
| CodeRabbit | . | 0 | 0 | Rate-limited, never reviewed |
| Sonnet diff | X | 0 | 2 | 6 TP-novel: infinite loop, version-blind check, stale files, thread-safety. Missed devnet-to-devnode |
| Opus diff | X | 0 | 2 | 6 TP-novel: same secondary issues (infinite loop, stale files, CwdRaii thread-safety) |
Analysis: Every tool except CodeRabbit produced comments near the buggy area but none identified the core issue: the test infrastructure needs to be rewritten to use devnode instead of devnet. Tools focused on code quality issues within the existing code (infinite loops, thread safety, stale files) rather than recognizing the architectural problem. Both diff-only agents found 6 novel issues each. Ground truth has blame confidence C and the "bug" is essentially "this entire test file needs rewriting" -- more of an architectural debt item than a point bug. This is inherently hard for code review tools to flag.
Ground truth: compiler/ast/src/struct/mod.rs:82-123 -- Inefficient vec! + collect_vec + concat pattern for constructing record members; should use preallocated Vec with push/extend.
Blame confidence: A
| Tool | Caught | Det | Qual | Finding summary |
|---|---|---|---|---|
| Copilot | X | 3 | 3 | Direct hit: identified vec+concat allocation inefficiency, suggested preallocated Vec (exact fix). Plus 2 TP-novel, 1 FP |
| Greptile | X | 0 | 3 | 6 TP-novel (inverted owner mode, dropped visibility, Array panic, silently omitted structs). Missed allocation bug |
| CodeRabbit | . | 0 | 0 | Rate-limited, never reviewed |
| Sonnet diff | X | 0 | 2 | 2 TP-novel (inverted mode, Array panic). Missed allocation bug |
| Opus diff | X | 0 | 2 | 2 TP-novel (dropped visibility, Array panic). 1 FP on owner mode |
Analysis: Only Copilot caught the actual bug with det=3, suggesting the exact preallocated Vec fix that matches the ground truth. This is a performance bug (unnecessary allocations) rather than a correctness bug, which most tools don't prioritize. Interestingly, multiple tools (Greptile, Sonnet, Opus) all independently found what appears to be a real novel issue: the inverted owner mode mapping (is_private() -> Mode::Public). Whether this is intentional Aleo semantics or a real bug is unclear. The Array type panic finding also appeared across 3 tools independently. Ground truth is valid -- Copilot's detection proves it's findable.
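For readers unfamiliar with the allocation pattern in question, here is a minimal sketch of the two shapes. The function names and the string-based members are hypothetical simplifications; the real code builds record members in the AST, and itertools' `collect_vec` is replaced here by the standard `collect`.

```rust
/// Allocation-heavy shape: two temporary vectors are built, then copied by `concat`.
pub fn members_with_concat(owner: &str, fields: &[&str]) -> Vec<String> {
    let head = vec![owner.to_string()];
    let tail: Vec<String> = fields.iter().map(|f| f.to_string()).collect();
    [head, tail].concat()
}

/// Preallocated shape: one vector sized up front, filled with `push` and `extend`.
pub fn members_preallocated(owner: &str, fields: &[&str]) -> Vec<String> {
    let mut members = Vec::with_capacity(1 + fields.len());
    members.push(owner.to_string());
    members.extend(fields.iter().map(|f| f.to_string()));
    members
}
```

Both functions return the same vector, so the difference is purely in how many intermediate allocations and copies occur.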
Ground truth: compiler/parser-lossless/src/tokens.rs:106-117 -- #[token] used instead of #[regex] for group::\*, signature::\*, Future::\* patterns. Logos #[token] matches literal strings, not regex metacharacters.
Blame confidence: A
| Tool | Caught | Det | Qual | Finding summary |
|---|---|---|---|---|
| Copilot | X | 3 | 4 | Direct hit: #[token] vs #[regex] with root cause explanation. 1 TP + 7 TP-novel (typos, error spans) |
| Greptile | . | 0 | 2 | 4 TP-novel (duplicate NodeID, BitXorAssign typo, missing quote, unhandled ParseError). Missed #[token] bug |
| CodeRabbit | . | 0 | 0 | Rate-limited, never reviewed |
| Sonnet diff | . | 0 | 3 | 6 TP-novel (BitXorAssign typo, missing quote, duplicate node ID, etc.). Missed core bug |
| Opus diff | X | 3 | 4 | Direct hit: #[token] vs #[regex] with example (group::GEN). 1 TP + 10 TP-novel. Exceptional quality |
Analysis: Copilot and Opus diff both achieved perfect detection (3) with exceptional quality (4). Both explained the Logos framework semantics: #[token] matches exact literal strings while #[regex] interprets regex patterns. This requires knowing the Logos crate's API, which apparently both models have in training data. Greptile and Sonnet diff missed the core bug but both found the same secondary issues (BitXorAssign display string typo, missing closing quote in Address). This case is a tiny (3 lines changed) high-signal bug -- the simplest ground truth in the group, and the tools that caught it did so perfectly.
Ground truth: leo/cli/commands/execute.rs:314-321 -- Edition defaulting to 1 via unwrap_or(1) before proper network fetching; incorrect print logic for credits.aleo.
Blame confidence: C
| Tool | Caught | Det | Qual | Finding summary |
|---|---|---|---|---|
| Copilot | X | 0 | 1 | 1 low-value comment about emoji spacing |
| Greptile | X | 1 | 2 | 3 TP-novel (forbid->deny downgrade, OnceLock safety, CI coverage). Touched line 321 but only flagged style issue |
| CodeRabbit | . | 0 | 0 | Rate-limited, never reviewed |
| Sonnet diff | X | 2 | 3 | 1 TP on buggy lines (misleading "already included" message). 1 TP-novel (emoji spacing) |
| Opus diff | . | 0 | 0 | No comments at all |
Analysis: Sonnet diff-only came closest with det=2, identifying a real logic issue in the flagged buggy region (the "already included" message for credits.aleo is misleading since the program is still passed to add_programs_with_editions). However, it didn't identify the broader edition-handling problem. Greptile touched the right area but only flagged a style issue. Opus produced nothing. The bug is about how editions are fetched from the network vs defaulted -- requires understanding the Leo deployment model to recognize unwrap_or(1) as wrong. With blame confidence C, the ground truth may be imprecise about exactly which lines are buggy.
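The reason `unwrap_or(1)` is risky here is that it makes "not fetched yet" indistinguishable from "edition is 1". A minimal sketch of the contrast, with hypothetical function names and an error message invented for illustration (this is not the actual fix):

```rust
/// Buggy shape: a missing edition silently becomes 1.
pub fn edition_defaulted(fetched: Option<u16>) -> u16 {
    fetched.unwrap_or(1)
}

/// Alternative shape: a missing edition is reported so the caller can fetch it first.
pub fn edition_required(program: &str, fetched: Option<u16>) -> Result<u16, String> {
    fetched.ok_or_else(|| format!("edition for `{}` has not been fetched", program))
}
```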
Ground truth: leo/cli/cli.rs:244-264 -- leo test exits with code 0 when tests fail; needs CliError::tests_failed variant and exit code propagation.
Blame confidence: A
| Tool | Caught | Det | Qual | Finding summary |
|---|---|---|---|---|
| Copilot | X | 1 | 2 | 6 TP-novel: alignment issues in deploy/upgrade, regex patterns, JSON output error. Touched cli.rs but flagged different issues |
| Greptile | X | 0 | 2 | 2 TP-novel: JSON serialization expect, structurally always-true field |
| CodeRabbit | . | 0 | 0 | Skipped review due to branch config |
| Sonnet diff | . | 0 | 3 | 8 TP-novel: enum misuse, silent no-op for --json-output, integer truncation. Strong quality but missed exit code bug |
| Opus diff | . | 0 | 2 | 3 TP-novel: silent failure UX, lost broadcast info, behavioral change on failure |
Analysis: No tool caught the exit-code bug despite it being conceptually simple: leo test should return non-zero when tests fail. The fix adds a new CliError::tests_failed variant. This is a missing error handling path -- the code doesn't have the wrong behavior, it's missing behavior entirely. Tools focused on the JSON output feature (the larger portion of the diff) rather than the exit-code issue. Sonnet diff achieved the highest quality (3) with 8 novel findings. The ground truth is valid -- the fix PR explicitly lists "Add CliError::tests_failed" as a to-do item.
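The missing path can be sketched in a few lines. Only the name `CliError::tests_failed` comes from the fix PR; the enum layout, the `check_tests` and `exit_status` helpers, and the choice of exit status 1 are assumptions made for illustration.

```rust
/// Hypothetical error type; the source only names a `CliError::tests_failed` variant.
#[derive(Debug, PartialEq)]
pub enum CliError {
    TestsFailed { failed: usize },
}

/// Turns a test summary into a result instead of always returning `Ok`.
pub fn check_tests(failed: usize) -> Result<(), CliError> {
    if failed == 0 {
        Ok(())
    } else {
        Err(CliError::TestsFailed { failed })
    }
}

/// Maps the result to a process exit status: 0 on success, 1 on failure.
pub fn exit_status(result: &Result<(), CliError>) -> u8 {
    match result {
        Ok(()) => 0,
        Err(_) => 1,
    }
}
```

A `main` function could pass the returned value to `std::process::ExitCode::from` so that CI sees a non-zero status when tests fail.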
| Tool | Cases caught (det>0) | Avg detection | Avg quality | Total TPs | Total novel | Total FPs |
|---|---|---|---|---|---|---|
| Copilot | 4 (004,018,020,003) | 1.07 | 2.0 | 4 | 63 | 6 |
| Greptile | 3 (004,009,020-partial) | 0.71 | 1.71 | 2 | 28 | 0 |
| CodeRabbit | 1 (004) | 0.14 | 0.50 | 1 | 11 | 12 |
| Sonnet diff | 3 (004,009,021) | 0.64 | 1.86 | 10 | 44 | 5 |
| Opus diff | 3 (004,020,003-partial) | 0.86 | 1.57 | 3 | 51 | 8 |
Note: CodeRabbit produced no review in 10 of 14 cases (rate-limited in 9, skipped in 1 due to branch config), severely impacting its results.
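As a worked example of how a per-tool average is derived, the sketch below averages a list of per-case scores. The `average` helper is hypothetical, not the benchmark's aggregation code. Feeding it CodeRabbit's quality scores from the per-case tables (0, 2, 2, 3, then ten zeros) gives 7 / 14 = 0.50, matching the table.

```rust
/// Mean of per-case scores; returns 0.0 for an empty slice.
pub fn average(scores: &[u8]) -> f64 {
    if scores.is_empty() {
        return 0.0;
    }
    let total: u32 = scores.iter().map(|&s| u32::from(s)).sum();
    f64::from(total) / scores.len() as f64
}
```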
- Only one case (leo-004) was caught by all tools. The blocking-HTTP-in-async pattern is a well-known anti-pattern that all models recognize. This sets a baseline for "easily catchable" bugs.
- CodeRabbit was rate-limited in most cases, producing no actual review in 10/14 cases. Its results are not comparable to the other tools and should be excluded or re-run.
- Sonnet diff-only and Opus diff-only diverge significantly on specific cases. Opus caught leo-020 (token vs regex) perfectly while Sonnet missed it; Sonnet caught leo-021 (edition handling) while Opus produced nothing. This suggests model-level variance matters.
- Copilot was the only tool to catch the performance bug (leo-018) with a perfect det=3. Copilot's strength appears to be pattern-based code quality checks (allocations, API misuse).
- Ground truth validity concerns:
  - leo-001: Bug is "missing code" -- inherently undetectable from diff review.
  - leo-013 (blame C): Buggy lines point to comments, not executable code.
  - leo-016 (blame C): Bug is "rewrite entire file" -- not a point defect.
  - Cases with blame confidence C should be flagged for ground truth review.
- Scorer strictness on leo-009: Opus left 6 comments on the exact buggy lines, but all were scored FP because the suggested fix direction was wrong (refine regex vs remove it). Consider whether locating the bug with a wrong-direction fix should score det=1 instead of det=0 with 6 FPs.
- Novel findings are consistently high quality. Even when tools miss the target bug, they find 2-10 legitimate secondary issues per case. The "owner mode inversion" finding appeared independently in 3 tools (Greptile, Sonnet diff, Opus diff) for leo-018, suggesting it may be a real undiscovered bug.
- Diff-only context is sufficient for pattern bugs (leo-004, leo-020) but insufficient for architectural bugs (leo-016) or external-API-change bugs (leo-003, leo-009).
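The scoring question raised for leo-009 can be made concrete with a small sketch of a detection floor. Everything here is hypothetical: the `Verdict` fields, the `detection_floor` function, and the specific values are one possible reading of the proposal, not the current rubric.

```rust
/// Hypothetical judge verdict for one comment.
pub struct Verdict {
    pub on_buggy_lines: bool,
    pub flags_code_as_problematic: bool,
    pub fix_direction_correct: bool,
}

/// Proposed floor: locating the bug earns det=1 even when the suggested fix is wrong.
/// A correct fix direction starts at 2 here; the judge could still raise it to 3.
pub fn detection_floor(v: &Verdict) -> u8 {
    let located = v.on_buggy_lines && v.flags_code_as_problematic;
    match (located, v.fix_direction_correct) {
        (false, _) => 0,
        (true, false) => 1,
        (true, true) => 2,
    }
}
```

Under this rule, a comment on the right lines that flags the code but suggests the wrong fix direction would score det=1 rather than being counted only as an FP.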
14 active cases. Tools evaluated: Sonnet diff-only, Opus diff-only, Sonnet repo (diff+repo), Opus repo (diff+repo). Copilot, Greptile, and CodeRabbit had no results for any case in this group (only ran on leo-002, leo-020, leo-022).
Scoring key: Det = detection (0-3), Qual = review quality (0-4), X = caught, . = missed.
Ground truth: leo/cli/commands/deploy.rs:386-389 -- When a program is skipped during deployment, it is not added to the VM, causing downstream failures.
Blame confidence: A
| Tool | Caught | Det | Qual | Finding summary |
|---|---|---|---|---|
| Copilot | -- | -- | -- | No result for this case |
| Greptile | -- | -- | -- | No result for this case |
| Sonnet diff | . | 0 | 0 | No comments produced |
| Opus diff | . | 0 | 0 | No comments produced |
| Sonnet repo | . | 0 | 0 | No comments produced |
| Opus repo | . | 0 | 0 | No comments produced |
| CodeRabbit | -- | -- | -- | No result for this case |
Analysis: Total wipeout -- no tool produced any comments. The bug is a missing vm.process().write().add_program() call in a continue path during deployment iteration. This is a logic omission in control flow that requires understanding the deploy loop's semantics. Diff-only tools see the fix being added but have no context for why the existing code was wrong. Even repo-level tools failed, likely because the bug is domain-specific (understanding that skipped programs still need VM registration). Ground truth appears solid (confidence A, clear fix PR).
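The control-flow shape of this bug can be sketched without any Leo internals. The `Vm` struct, the `deploy_all` function, and the `register_skipped` flag below are hypothetical stand-ins; the real fix adds a `vm.process().write().add_program()` call on the skip path.

```rust
use std::collections::BTreeSet;

/// Hypothetical stand-in for the VM's program registry.
#[derive(Default)]
pub struct Vm {
    pub programs: BTreeSet<String>,
}

/// Deploys `(name, skip)` pairs. With `register_skipped = false` the loop mirrors the
/// buggy shape: a skipped program hits `continue` before it is added to the VM.
pub fn deploy_all(tasks: &[(&str, bool)], register_skipped: bool) -> Vm {
    let mut vm = Vm::default();
    for (name, skip) in tasks {
        if *skip {
            if register_skipped {
                vm.programs.insert(name.to_string());
            }
            continue;
        }
        // Deployment work would happen here before registration.
        vm.programs.insert(name.to_string());
    }
    vm
}
```

In the buggy shape, any later program that depends on a skipped one would not find it in the registry, which matches the "downstream failures" described in the ground truth.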
Ground truth: compiler/parser/src/parser/expression.rs:689, compiler/passes/src/loop_unrolling/statement.rs:41, plus many test expectation files -- Parser and type-checker reject empty arrays ([], [u32;0]) when they should be allowed.
Blame confidence: C
| Tool | Caught | Det | Qual | Finding summary |
|---|---|---|---|---|
| Copilot | -- | -- | -- | No result for this case |
| Greptile | -- | -- | -- | No result for this case |
| Sonnet diff | . | 0 | 3 | 4 comments on other issues: to_usize closure error msg bug, missing integer type assertion for repeat count, repeat_count_not_evaluated flag logic, no early return on oversized array |
| Opus diff | . | 0 | 0 | No comments produced |
| Sonnet repo | . | 0 | 0 | 3 comments on unrelated issues (same themes as Sonnet diff) |
| Opus repo | . | 0 | 2 | 4 comments on unrelated issues: to_usize closure bug, missing type validation, Type::Err not returning early, repeat_count_not_evaluated flag |
| CodeRabbit | -- | -- | -- | No result for this case |
Analysis: This is a feature-addition PR (adding empty array support), not a traditional bug fix. The ground truth (blame confidence C) reflects uncertainty -- the "bug" is really that the language did not support [u32;0]. All tools missed the actual ground truth but Sonnet diff and Opus repo found genuine secondary issues in the new code. Sonnet diff scored qual=3 despite det=0, recognizing that its findings were real issues in the fix implementation. The ground truth may be too diffuse for meaningful detection scoring -- it spans parser, loop unrolling, and many test expectation files.
Ground truth: Spans many files -- compiler/passes/src/const_prop_unroll_and_morphing.rs:52-54, compiler/passes/src/common/symbol_table/mod.rs:111-117, compiler/passes/src/const_propagation/ast.rs:496-500, plus parser, path resolution, type checking, and symbol table creation files. The core bug: local constants were being deleted during reset_but_consts, losing them between iteration steps.
Blame confidence: A
| Tool | Caught | Det | Qual | Finding summary |
|---|---|---|---|---|
| Copilot | -- | -- | -- | No result for this case |
| Greptile | -- | -- | -- | No result for this case |
| Sonnet diff | . | 0 | 0 | Timed out (600s), no comments |
| Opus diff | . | 0 | 2 | 10 comments, none on actual bug. Found: visit_member_access path bug, lookup_path fallback issue, eq_user returning false on None path, fs::read_to_string unwrap panic, visit_module not saving program_name, etc. Scored 6 FP. |
| Sonnet repo | . | 0 | 0 | Timed out (600s), no comments |
| Opus repo | . | 0 | 1 | 5 comments: lookup_path prioritization, regex compilation perf, legalize_path fallthrough, reconstruct_path prefix issue, check_shadow_variable limitation |
| CodeRabbit | -- | -- | -- | No result for this case |
Analysis: This is a massive diff (the PR touches 20+ files across parser, passes, AST, and symbol table). Both Sonnet runs timed out at 600s. Opus diff produced 10 comments (6 FP) but none identified the actual const propagation retention bug. The ground truth is spread across many files, making it hard for any tool to identify the conceptual thread. The core issue (local constants lost during reset_but_consts) requires understanding the multi-pass compilation architecture. Opus diff's comments, while not hitting the target, showed some genuine code quality findings (unwrap panics, regex perf).
Ground truth: Version strings stuck at 3.3.0 across .resources/release-version, Cargo.toml (many workspace members), and test expectation JSON files.
Blame confidence: A
| Tool | Caught | Det | Qual | Finding summary |
|---|---|---|---|---|
| Copilot | -- | -- | -- | No result for this case |
| Greptile | -- | -- | -- | No result for this case |
| Sonnet diff | . | 0 | 0 | No comments produced |
| Opus diff | . | 0 | 0 | No comments produced |
| Sonnet repo | . | 0 | 0 | No comments produced |
| Opus repo | . | 0 | 0 | No comments produced |
| CodeRabbit | -- | -- | -- | No result for this case |
Analysis: Total wipeout. This is a version bump PR -- the "bug" is that the version was 3.3.0 and needed to be 3.3.1. All tools correctly produced no bug-related comments because the diff is purely mechanical version string updates. This case arguably should not be in the dataset -- a version bump is not a code logic bug. The ground truth is technically correct but tests an inappropriate category for AI review tools.
Ground truth: compiler/passes/src/type_checking/program.rs:50-73 and 164-171, 428-432, plus test files -- Missing validation pass for names containing "aleo" keyword and record name prefix conflicts.
Blame confidence: A
| Tool | Caught | Det | Qual | Finding summary |
|---|---|---|---|---|
| Copilot | -- | -- | -- | No result for this case |
| Greptile | -- | -- | -- | No result for this case |
| Sonnet diff | X | 0 | 3 | 4 comments: found real bug in tuple_windows prefix detection (non-adjacent prefixes missed), span discarded for prefix record, perf issue with sym::aleo.to_string() allocation, error definition limitations |
| Opus diff | X | 3 | 4 | 2 comments: correctly identified adjacent-only prefix check bug (same as Sonnet), plus case-sensitive "aleo" check bypass. TP=1 |
| Sonnet repo | . | 0 | 0 | Timed out (600s), no comments |
| Opus repo | . | 0 | 0 | No comments produced |
| CodeRabbit | -- | -- | -- | No result for this case |
Analysis: Interesting divergence: diff-only tools outperformed repo-level tools. Opus diff scored det=3/qual=4, the highest in this group, by precisely identifying the tuple_windows bug in record name prefix checking. Sonnet diff found the same bug but was scored caught=True/det=0 -- this seems like scorer inconsistency since both identified the same algorithmic flaw. The scorer may have been stricter with Sonnet because its comment focused on the fix implementation rather than the pre-existing missing validation. Sonnet repo timed out; Opus repo produced nothing despite having full repo access. This suggests that for localized logic bugs in new code, diff-only review is more effective than repo-level review.
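The adjacent-only flaw that both diff-only tools identified can be reproduced in a few lines. This is a sketch under assumptions: the function names are hypothetical, names are plain strings in declaration order, and the standard library's `windows(2)` stands in for itertools' `tuple_windows`.

```rust
/// Adjacent-only check: compares each name only with its immediate neighbor.
pub fn has_prefix_conflict_adjacent(names: &[&str]) -> bool {
    names
        .windows(2)
        .any(|w| w[0].starts_with(w[1]) || w[1].starts_with(w[0]))
}

/// All-pairs check: compares every name with every later name.
pub fn has_prefix_conflict_all_pairs(names: &[&str]) -> bool {
    names.iter().enumerate().any(|(i, a)| {
        names[i + 1..]
            .iter()
            .any(|b| a.starts_with(*b) || b.starts_with(*a))
    })
}
```

With the names `token`, `balance`, `token_v2` in that order, the adjacent check sees no conflict because the two related names are separated by `balance`, while the all-pairs check finds it.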
Ground truth: errors/src/common/formatted.rs -- extensive changes across lines 50-391. Clippy lint failures due to new Rust version: unreachable conditions, unnecessary unwraps, formatting issues in error display code.
Blame confidence: A
| Tool | Caught | Det | Qual | Finding summary |
|---|---|---|---|---|
| Copilot | -- | -- | -- | No result for this case |
| Greptile | -- | -- | -- | No result for this case |
| Sonnet diff | X | 1 | 2 | 5 comments: coordinate system mismatch in multiline detection, double-printing underline, \r\n line ending assumption, env var perf, repeated lines().count() |
| Opus diff | X | 2 | 3 | 5 comments: unreachable condition (line 219), ANSI escape alignment bug, unnecessary unwrap (line 219), NOCOLOR perf, \r\n assumption. TP=2 |
| Sonnet repo | . | 0 | 0 | No comments produced |
| Opus repo | X | 2 | 3 | 4 comments: line_num coordinate mismatch, format string bug with {:start$}, same format bug in print_multiline_underline, \r\n assumption |
| CodeRabbit | -- | -- | -- | No result for this case |
Analysis: Diff-only tools performed well here. Opus diff (det=2, qual=3) found the unreachable condition and unnecessary unwrap on line 219, both exactly the kinds of issues Clippy would flag. Sonnet diff (det=1, qual=2) flagged the coordinate mismatch that Opus repo also raised but was scored lower. Opus repo also scored det=2/qual=3 with 4 focused comments. The Clippy-style nature of this bug (code smell, dead code, unnecessary operations) plays to AI strengths. Sonnet repo again produced nothing. Ground truth is valid but very broad (many lines).
Ground truth: leo/package/src/package.rs:152-164 -- main_path.display() used instead of test_file_path.display() in error message for test file write failure.
Blame confidence: B
| Tool | Caught | Det | Qual | Finding summary |
|---|---|---|---|---|
| Copilot | -- | -- | -- | No result for this case |
| Greptile | -- | -- | -- | No result for this case |
| Sonnet diff | X | 3 | 4 | 3 comments: C0 precisely identifies the wrong variable in error message (exact bug). C1 found handle_test always returns Ok(). C2 found reconstruct_call panic. TP=1 |
| Opus diff | X | 3 | 3 | 6 comments: C1 identifies the bug (main_path vs test_file_path). Also found handle_test Ok() issue, catch_unwind RNG state, handler emit without advancing ledger. TP=1 |
| Sonnet repo | . | 0 | 0 | Timed out (600s) |
| Opus repo | X | 3 | 4 | 6 comments: C0 precisely identifies copy-paste bug. Also found interpreter state leak, HashMap iteration order, indirect script calling, RNG after panic, @test on async transition |
| CodeRabbit | -- | -- | -- | No result for this case |
Analysis: Strong performance across diff-only and Opus repo. All three producing tools identified the exact copy-paste bug (wrong variable in error message). This is a classic "wrong variable" pattern that AI tools excel at detecting. Both Sonnet diff and Opus repo scored det=3/qual=4 (perfect). Opus diff scored slightly lower on quality (3) despite also finding the bug. The additional findings (handle_test returning Ok, RNG state after panic) are genuine secondary issues. Ground truth is clear and well-defined (confidence B only because it's a minor cosmetic bug).
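The wrong-variable pattern in this case is small enough to show side by side. The sketch below is an illustration only: the function names and message text are hypothetical, and only the `main_path` / `test_file_path` mix-up comes from the ground truth.

```rust
use std::path::Path;

/// Buggy shape: the message for a failed test-file write names the main file,
/// most likely a copy-paste from the main-file branch.
fn write_error_buggy(main_path: &Path, _test_file_path: &Path) -> String {
    format!("failed to write {}", main_path.display())
}

/// Fixed shape: the message names the file whose write actually failed.
fn write_error_fixed(_main_path: &Path, test_file_path: &Path) -> String {
    format!("failed to write {}", test_file_path.display())
}
```

Both versions compile and differ by a single identifier, which is why the bug is easy to spot in a diff but invisible to the compiler.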
Ground truth: tests/tests/compiler/finalize/unknown_mapping_operation_fail.leo:11-18 -- Test expectation change from Fail to Pass, reflecting a fix to return-in-constructor type checking.
Blame confidence: A
| Tool | Caught | Det | Qual | Finding summary |
|---|---|---|---|---|
| Copilot | -- | -- | -- | No result for this case |
| Greptile | -- | -- | -- | No result for this case |
| Sonnet diff | X* | 0 | 1 | 3 comments: missing newline (trivial), unused variable (FP), test expectation change without compiler changes. Scored caught=True but det=0 |
| Opus diff | . | 0 | 0 | No comments produced |
| Sonnet repo | . | 0 | 0 | Timed out (600s) |
| Opus repo | X* | 0 | 1 | 2 comments: noted test expectation change implications (finalize blocks with return types), missing newline. Scored caught=True but det=0 |
| CodeRabbit | -- | -- | -- | No result for this case |
Analysis: Problematic ground truth. The "bug" is a test expectation change with no visible compiler source changes in the diff. The ground truth points only to test file lines, making it nearly impossible for tools to identify the underlying compiler bug. Sonnet diff and Opus repo were scored caught=True/det=0, meaning they noticed something was off with the test expectation change but could not identify the actual bug. This case has weak signal -- the fix PR presumably includes compiler changes that are not captured in the buggy_lines. This case should be reviewed for ground truth completeness.
Ground truth: leo/cli/commands/clean.rs:39-48 -- remove_dir_all called without existence checks, fails on fresh/cleaned projects. Also missing manifest check.
Blame confidence: A
| Tool | Caught | Det | Qual | Finding summary |
|---|---|---|---|---|
| Copilot | -- | -- | -- | No result for this case |
| Greptile | -- | -- | -- | No result for this case |
| Sonnet diff | X | 3 | 4 | 10 comments: C0 precisely identifies missing existence check for remove_dir_all. Also found .DS_Store count issue, TestnetV0 hardcoding, env file handling, unwrap on network response. TP=1 |
| Opus diff | X | 3 | 4 | 6 comments: C0 precisely identifies missing existence check. Also found program name format mismatch, missing build dir cleanup, wrong error type in symbol(). TP=1 |
| Sonnet repo | X | 3 | 3 | 7 comments: C0 identifies missing existence check (at different line). Also found SourcePath/build path confusion, TestnetV0 hardcoding, program name mismatch, wrong error type |
| Opus repo | X | 3 | 3 | 7 comments: C1 identifies missing existence check. Also found program name format mismatch, missing validation in from_program, unwrap panics, assert_eq in release, deploy path name mismatch |
| CodeRabbit | -- | -- | -- | No result for this case |
Analysis: Best-performing case in the group -- all four tools caught the bug with det=3. The missing existence check before remove_dir_all is a straightforward defensive programming issue that all tools identified as their first or second comment. This is the ideal case for AI code review: a clear, localized error in error handling that is visible in the diff. The diff is large (refactoring of CLI commands) which also surfaced many secondary findings. Ground truth is solid.
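A minimal sketch of the defensive check the tools asked for, assuming the goal is simply to make cleaning a fresh or already-cleaned project a no-op. The helper name `clean_dir` and its return convention are hypothetical and are not taken from `clean.rs`.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Removes `dir` and its contents if the directory exists.
/// Returns Ok(true) when something was removed and Ok(false) when there was
/// nothing to clean, instead of surfacing an error for the missing directory.
fn clean_dir(dir: &Path) -> io::Result<bool> {
    if !dir.exists() {
        // Fresh or already-cleaned project: nothing to do.
        return Ok(false);
    }
    fs::remove_dir_all(dir)?;
    Ok(true)
}
```

An alternative design is to call `remove_dir_all` unconditionally and treat a not-found error as success, which avoids the gap between the existence check and the removal.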
Ground truth: leo/cli/commands/add.rs:28-34 -- Broken --clear flag with default_value = "false" on bool, plus broken --local and --network flag definitions.
Blame confidence: A
| Tool | Caught | Det | Qual | Finding summary |
|---|---|---|---|---|
| Copilot | -- | -- | -- | No result for this case |
| Greptile | -- | -- | -- | No result for this case |
| Sonnet diff | . | 2 | 3 | 4 comments: found is_valid_program_name empty string bug, identified broken clap default_value="false" on bool (line 34 = buggy line). TP=1, but scored caught=False because broader flag deprecation issue was missed |
| Opus diff | . | 2 | 3 | 4 comments: same empty string bug, same clap bool issue (line 124/34), plus unwrap on non-UTF-8 path. TP=1 |
| Sonnet repo | . | 0 | 0 | 4 comments: found removed error variants still referenced (compile error), empty string validation, digit-starting names, unwrap on to_str(). All FP relative to ground truth |
| Opus repo | . | 0 | 2 | 4 comments: same removed error variants compile error, empty string, byte indexing smell, digit-starting names |
| CodeRabbit | -- | -- | -- | No result for this case |
Analysis: Both diff-only tools found the default_value = "false" issue on the clear bool flag (one of the buggy lines) and scored det=2. However, neither was scored as "caught" because they missed the broader intent of the fix (hiding/deprecating the broken flags). The scorer distinguished between "found a bug in one of the buggy lines" and "understood the full scope of the issue." Sonnet repo and Opus repo found a genuine compile error (removed error variants still referenced in execute.rs) that is arguably a more severe bug than the one being fixed, but it was scored as FP relative to the ground truth.
Ground truth: errors/src/errors/compiler/compiler_errors.rs:117-119 (new error definition) plus test expectation files -- The monomorphization logic fails to const-evaluate expressions like 2 * N before generating monomorphized names, causing Foo::[2 * 8] and Foo::[16] to be treated as different types.
Blame confidence: A
| Tool | Caught | Det | Qual | Finding summary |
|---|---|---|---|---|
| Copilot | -- | -- | -- | No result for this case |
| Greptile | -- | -- | -- | No result for this case |
| Sonnet diff | . | 0 | 2 | 6 comments: typo "argment", misleading hash comment, regex compilation perf, TODO in type table, missing call graph update, &Vec instead of &[]. All miss core bug |
| Opus diff | . | 0 | 2 | 5 comments: same misleading hash comment, regex perf, same typo, same TODO issue, same &Vec clippy lint. All miss core bug |
| Sonnet repo | . | 0 | 0 | Timed out (600s) |
| Opus repo | . | 0 | 2 | 7 comments: found eq_user ignoring const_arguments (significant!), TODO in type table, misleading hash comment, regex perf, typo, monomorphized_structs never read, &Vec clippy lint |
| CodeRabbit | -- | -- | -- | No result for this case |
Analysis: All tools missed the core monomorphization bug. The ground truth points to an error definition and test expectation files, but the actual conceptual bug (expressions not const-evaluated before monomorphization) is a design-level issue that requires understanding the compilation pipeline. Opus repo's C0 (eq_user ignoring const_arguments) is the closest any tool came -- it identifies a real type equality problem that is conceptually related to the monomorphization issue. Several tools found the same set of secondary issues (typo, regex perf, TODO). Sonnet repo timed out again. The ground truth may be too narrowly defined -- the error definition lines are more symptom than cause.
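To make the conceptual bug concrete, here is a toy model of name generation for a const-generic type. Everything in it is hypothetical (the `ConstExpr` type, the function names, the `Foo::[..]` rendering); it only illustrates why skipping const evaluation makes `Foo::[2 * 8]` and `Foo::[16]` look like different types.

```rust
/// Hypothetical miniature of a const-argument expression.
enum ConstExpr {
    Lit(u32),
    Mul(Box<ConstExpr>, Box<ConstExpr>),
}

/// Const-evaluates the expression; None signals overflow.
fn eval(expr: &ConstExpr) -> Option<u32> {
    match expr {
        ConstExpr::Lit(n) => Some(*n),
        ConstExpr::Mul(a, b) => eval(a)?.checked_mul(eval(b)?),
    }
}

/// Renders the expression as written in the source.
fn render(expr: &ConstExpr) -> String {
    match expr {
        ConstExpr::Lit(n) => n.to_string(),
        ConstExpr::Mul(a, b) => format!("{} * {}", render(a), render(b)),
    }
}

/// Buggy shape: the name is built from the unevaluated expression text.
fn mono_name_unevaluated(base: &str, arg: &ConstExpr) -> String {
    format!("{}::[{}]", base, render(arg))
}

/// Fixed shape: the argument is const-evaluated before the name is built.
fn mono_name_evaluated(base: &str, arg: &ConstExpr) -> Option<String> {
    Some(format!("{}::[{}]", base, eval(arg)?))
}
```

Under this model the unevaluated names differ for `2 * 8` and `16`, while the evaluated names coincide, which is the property the fix needs.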
Ground truth: leo/cli/commands/execute.rs:140-162 -- Wrong program ID used in leo execute (using last() instead of correct program lookup), plus test_network feature addition.
Blame confidence: A
| Tool | Caught | Det | Qual | Finding summary |
|---|---|---|---|---|
| Copilot | -- | -- | -- | No result for this case |
| Greptile | -- | -- | -- | No result for this case |
| Sonnet diff | . | 0 | 3 | 9 comments: found parse_delimited_list empty string bug, uninterpolated format strings (2x), truncate panic, confirm prompt not interpolated, shadowed path variable, missing sleep, dead match arm, consensus version 5 mapped to V4 |
| Opus diff | . | 0 | 2 | 6 comments: confirm prompt not interpolated, missing sleep, uninterpolated error message, same empty string bug, consensus version 5->V4, dead match arm |
| Sonnet repo | . | 0 | 3 | 7 comments: same themes -- remove_dir_all without check, SourcePath bug, TestnetV0 hardcoding, missing sleep, uninterpolated error, confirm prompt, consensus version |
| Opus repo | . | 0 | 3 | 6 comments: parse_delimited_list, uninterpolated error, confirm prompt (2 bugs), consensus version, missing sleep, reversed DFS not valid topo sort |
| CodeRabbit | -- | -- | -- | No result for this case |
Analysis: Total miss on the ground truth bug (wrong program ID from last()), but every tool found many genuine secondary bugs in this large refactoring diff. The uninterpolated format strings, missing sleep call, and consensus version mapping error are all real bugs. Opus repo's C5 (reversed pre-order DFS not being a valid topological sort) is a particularly sophisticated finding. The ground truth bug (using last() to get program name) is subtle and requires understanding the program dependency ordering. This is a case where tools provided high value despite missing the specific tracked bug.
Ground truth: compiler/ast/src/interpreter_value/evaluate.rs:750-757, 884-888 -- checked_rem (mod/remainder) incorrectly allowed on signed integers (I8-I128), and missing comparison support for addresses and structs.
Blame confidence: A
| Tool | Caught | Det | Qual | Finding summary |
|---|---|---|---|---|
| Copilot | -- | -- | -- | No result for this case |
| Greptile | -- | -- | -- | No result for this case |
| Sonnet diff | . | 0 | 2 | 5 comments: integer parsing expect() panic, Mul with unsuffixed Group operands, struct map key collision (HashMap vs HashMap), ternary not propagating expected_ty, function call argument truncation. FP=2 |
| Opus diff | . | 0 | 1 | 4 comments: ternary expected_ty, tuple expected_ty, struct map collision, assignment expected_ty. All about type propagation, none about the actual bug |
| Sonnet repo | . | 0 | 0 | Timed out (600s) |
| Opus repo | . | 0 | 2 | 5 comments: struct init value stack leak, struct map collision, ternary expected_ty, integer Pow with unsuffixed rhs, tuple expected_ty |
| CodeRabbit | -- | -- | -- | No result for this case |
Analysis: Complete miss across all tools. The bug (signed integer mod and missing address/struct comparison) is in the evaluate.rs match arms -- a very pattern-heavy file with dozens of similar (Value::XX(x), Value::XX(y)) => x.checked_rem(y).map(Value::XX) lines. Tools focused on type propagation issues in the interpreter instead. The struct map key collision (HashMap vs HashMap) was flagged by three tools and is a genuine concern, but it is not the tracked bug. The mod-on-signed-integers bug requires domain knowledge (Aleo's instruction set does not support signed mod). Ground truth is clear and well-targeted.
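The pattern-heavy shape of the match arms, and why one wrong arm hides so easily among them, can be sketched as follows. This is a reduced, hypothetical `Value` enum with four variants rather than the interpreter's real type; the restriction to unsigned operands follows the statement above that signed mod is not supported.

```rust
/// Hypothetical, reduced stand-in for the interpreter's value type.
#[derive(Debug, PartialEq)]
enum Value {
    U8(u8),
    U32(u32),
    I8(i8),
    I32(i32),
}

/// Remainder restricted to unsigned operands of matching type.
/// Returns None for division by zero, mismatched types, and signed operands.
fn rem(lhs: &Value, rhs: &Value) -> Option<Value> {
    match (lhs, rhs) {
        (Value::U8(x), Value::U8(y)) => x.checked_rem(*y).map(Value::U8),
        (Value::U32(x), Value::U32(y)) => x.checked_rem(*y).map(Value::U32),
        // The buggy shape would add look-alike arms here, e.g.
        // (Value::I8(x), Value::I8(y)) => x.checked_rem(*y).map(Value::I8),
        // which compile fine and read like every other arm.
        _ => None,
    }
}
```

Nothing in the type system distinguishes the signed arms from the unsigned ones, so catching the bug depends on knowing the target instruction set.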
Ground truth: leo/cli/cli.rs:317, leo/cli/commands/execute.rs:64-97, leo/cli/commands/run.rs:19-98 -- Removing file-based input handling and adding network dispatch for run/execute commands. Also NETWORK default changed to "mainnet".
Blame confidence: C
| Tool | Caught | Det | Qual | Finding summary |
|---|---|---|---|---|
| Copilot | -- | -- | -- | No result for this case |
| Greptile | -- | -- | -- | No result for this case |
| Sonnet diff | . | 0 | 3 | 8 comments: move out of &self (compile error), open_manifest before init, Package::initialize with wrong path, removed name validation, missing NETWORK in .env, removed dotenv fallback, move in test (compile error), case-sensitive network name |
| Opus diff | X | 3 | 4 | 6 comments: C0 move out of &self (compile error), C1 handle_execute moves from &Execute (compile error), C2 handle_run moves from &Run (compile error), C3 open_manifest before init, C4 missing NETWORK in .env, C5 removed name validation. TP=2 |
| Sonnet repo | X | 3 | 3 | 8 comments: C0 move from &self, C1 open_manifest before init, C2 handle_run moves from &Run, C3 handle_execute moves from &Execute, C4 missing NETWORK, C5 removed dotenv fallback, C6 removed validation, C7 case-sensitive network |
| Opus repo | X | 3 | 4 | 7 comments: C0 open_manifest before init, C1 wrong path for Package::initialize, C2 move from &self, C3 moves from &Execute, C4 moves from &Run, C5 missing NETWORK, C6 case-sensitive network |
| CodeRabbit | -- | -- | -- | No result for this case |
Analysis: Three of four tools caught the bug (only Sonnet diff missed). The key finding is that handle_execute and handle_run take &self/&Execute/&Run (shared references) but attempt to move owned String/Vec fields out of them -- a Rust compile error. This is exactly the kind of bug AI tools excel at: type-system violations visible in the diff. Opus diff scored det=3/qual=4. Sonnet diff found the same individual issues but was scored caught=False, possibly because its comments were less precisely targeted at the buggy lines. Ground truth confidence C reflects that this is a refactoring PR with many changes, making it hard to define "the bug" precisely.
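The compile error the tools reported has a standard shape in Rust. The sketch below uses a hypothetical two-field `Execute` struct, not the real command type; the rejected line is kept as a comment so the block still compiles, and cloning is shown as one possible fix (taking the command by value would be another).

```rust
/// Hypothetical, reduced stand-in for the CLI command struct.
struct Execute {
    name: String,
    inputs: Vec<String>,
}

fn handle_execute(command: &Execute) -> (String, Vec<String>) {
    // let name = command.name;
    // ^ rejected by rustc (E0507): `command.name` is behind a shared
    //   reference, so the String cannot be moved out of it.
    let name = command.name.clone();
    let inputs = command.inputs.clone();
    (name, inputs)
}
```

Because the error is fully determined by the signature and the field access, it is visible in the diff alone, with no repo context needed.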
| Tool | Cases with results | Caught | Avg Det (caught) | Avg Qual |
|---|---|---|---|---|
| Copilot | 0/14 | -- | -- | -- |
| Greptile | 0/14 | -- | -- | -- |
| CodeRabbit | 0/14 | -- | -- | -- |
| Sonnet diff | 14/14 | 4 | 1.0 | 1.7 |
| Opus diff | 14/14 | 5 | 2.8 | 1.8 |
| Sonnet repo | 14/14 | 2 | 1.5 | 0.4 |
| Opus repo | 14/14 | 5 | 2.2 | 1.8 |
Copilot, Greptile, and CodeRabbit had no results for any of these 14 cases. All analysis is limited to the Anthropic SDK tools (Sonnet and Opus in diff-only and diff+repo modes).
Opus diff caught 5/14 vs Sonnet diff's 4/14. More importantly, Opus diff's average detection score when catching was 2.8 vs Sonnet diff's 1.0. Opus produced more precise, higher-confidence findings.
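For readers reproducing the summary columns, the aggregation can be sketched as below. This is an assumption about how the columns are derived, not the actual scorer: "Avg Det (caught)" is taken as the mean detection score over caught cases only, and "Avg Qual" as the mean quality over all cases. The type and function names are hypothetical.

```rust
/// One tool's result on one case (hypothetical record layout).
struct CaseResult {
    caught: bool,
    det: u8,  // detection_score, 0-3
    qual: u8, // review_quality, 0-4
}

struct Summary {
    caught: usize,
    avg_det_caught: Option<f64>, // None when no case was caught
    avg_qual: Option<f64>,       // None when there are no cases
}

fn mean(values: &[u8]) -> Option<f64> {
    if values.is_empty() {
        return None;
    }
    let sum: u32 = values.iter().map(|v| u32::from(*v)).sum();
    Some(f64::from(sum) / values.len() as f64)
}

/// Assumed aggregation: detection averaged over caught cases only,
/// quality averaged over every case.
fn summarize(results: &[CaseResult]) -> Summary {
    let caught_det: Vec<u8> = results.iter().filter(|r| r.caught).map(|r| r.det).collect();
    let all_qual: Vec<u8> = results.iter().map(|r| r.qual).collect();
    Summary {
        caught: caught_det.len(),
        avg_det_caught: mean(&caught_det),
        avg_qual: mean(&all_qual),
    }
}
```

Averaging detection over caught cases only means a tool with few, precise catches can show a high "Avg Det (caught)" alongside a low catch count, so the two columns should be read together.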
Sonnet repo caught only 2/14 cases and timed out on 6/14 (600s limit). The extra repo context appears to have led Sonnet to spend its budget exploring the codebase rather than analyzing the diff. Opus repo (5/14) matched Opus diff (5/14) in catch count, though not on identical cases. Opus repo caught leo-033, leo-044, leo-049, and leo-050, but a diff-only tool also caught each of them, so repo context uniquely helped on no case that diff-only review missed.
For well-defined bugs (leo-044 wrong variable, leo-050 missing existence check, leo-064 move from shared reference), diff-only review was sufficient and faster. Repo context added value mainly for understanding broader impact.
Sonnet repo timed out (600s) on 6 of 14 cases (leo-027, leo-030, leo-044, leo-049, leo-054, leo-062), producing zero comments in each. These were generally larger diffs. This is a significant reliability issue.
Cases requiring Aleo-specific knowledge (leo-023 VM registration, leo-062 signed mod semantics) were universally missed. Tools performed best on general programming bugs: wrong variables (leo-044), missing existence checks (leo-050), type system violations (leo-064), and code quality (leo-033).
- leo-029 (version bump) is arguably not a reviewable bug -- no tool should be penalized for missing it.
- leo-049 has incomplete ground truth -- only test file lines, no compiler source lines.
- leo-026 (confidence C) is a feature addition, not a bug fix.
- leo-064 (confidence C) is a broad refactoring with diffuse bug definition.
Even when tools missed the tracked bug, they frequently found genuine issues: compile errors, panic-on-unwrap, format string bugs, performance problems, dead code. In cases like leo-060, the secondary findings may be more practically valuable than the tracked bug.
Cases leo-066 through leo-095 (14 active cases; excluded cases are omitted).
Tool legend:
- Copilot = GitHub Copilot (PR review, run-04-pr-tools)
- Greptile = Greptile (PR review, run-04-pr-tools)
- CodeRabbit = CodeRabbit (PR review, run-04-pr-tools)
- Sonnet diff = Agent SDK with Sonnet, diff-only (run-01-sdk-diffonly)
- Opus diff = Agent SDK with Opus, diff-only (run-04-opus-diffonly)
- Sonnet repo = Agent SDK with Sonnet, diff+repo (run-03-sdk-repo-v2)
- Opus repo = Agent SDK with Opus, diff+repo (run-05-opus-repo)
Ground truth: leo/cli/commands/deploy.rs:244-298, leo/cli/commands/execute.rs:350-380
Blame confidence: B
| Tool | Caught | Det | Qual | Finding summary |
|---|---|---|---|---|
| Copilot | Y | 0 | 2 | Flagged unwrap() panic risk and unreachable code paths, missed dependency ordering |
| Greptile | Y | 0 | 2 | Found novel issues in deploy/execute but missed core dependency-ordering bug |
| CodeRabbit | N | 0 | 0 | Rate-limited, never reviewed |
| Sonnet diff | Y | 1 | 3 | Identified issues in the right area, flagged network I/O robustness |
| Opus diff | Y | 0 | 2 | Found novel issues but missed the dependency ordering bug |
| Sonnet repo | Y | 3 | 0 | One TP matching a buggy line in deploy.rs |
| Opus repo | N | 0 | 2 | Completely missed the dependency ordering bug |
Analysis: The ground truth covers a large span (54 buggy lines across two files) about missing network dependency resolution before deployments. This is a high-level architectural issue -- programs must have their dependencies deployed first. Most tools flagged code quality issues in the same files (unwrap on network I/O, unreachable branches) but missed the core ordering logic. The B-tier blame confidence is justified: the introducing PR "First attempt to report transaction status" added the deployment flow without proper dependency ordering, but the connection is indirect. Sonnet repo got det=3 but qual=0, suggesting it identified the right code mechanically but provided no useful review context -- a scorer inconsistency where detection and quality diverge sharply.
Ground truth: interpreter/src/cursor.rs:634, 654-660
Blame confidence: A
| Tool | Caught | Det | Qual | Finding summary |
|---|---|---|---|---|
| Copilot | N | 0 | 2 | Reviewed other areas, missed ArrayAccess bug entirely |
| Greptile | N | 0 | 3 | Thorough review but completely missed the cursor.rs ArrayAccess bug |
| CodeRabbit | N | 0 | 0 | Rate-limited, never reviewed |
| Sonnet diff | N | 0 | 2 | Missed ArrayAccess/CoreConstant handling bug |
| Opus diff | N | 0 | 2 | Missed ArrayAccess handling bug |
| Sonnet repo | N | 0 | 0 | No useful output |
| Opus repo | N | 0 | 3 | Thorough review but missed the ArrayAccess bug in cursor.rs |
Analysis: Zero tools caught this bug. The ground truth is strong (A-tier blame) -- the Debugger PR (#28441) introduced interpreter code where ArrayAccess fell through to a CoreConstant handler instead of being handled as an array index operation. The bug is subtle: it requires understanding the interpreter's expression evaluation dispatch and noticing that ArrayAccess was missing its own handler, falling into the constant lookup path at line 654. The introducing PR was large ("Debugger" -- no description) making it hard for diff-only tools to understand context. Even repo-access tools missed it because the fix is a small logic addition that requires understanding the interpreter's value stack semantics.
Ground truth: compiler/passes/src/code_generation/expression.rs:87-88, compiler/passes/src/type_checking/expression.rs:85-86, plus many test expectation files
Blame confidence: A
| Tool | Caught | Det | Qual | Finding summary |
|---|---|---|---|---|
| Copilot | Y | 1 | 2 | Touched buggy line 88 in codegen but addressed typo, not the unsuffixed literal bug |
| Greptile | Y | 0 | 1 | Caught typo in 'lierals' on buggy line but missed the actual type inference bug |
| CodeRabbit | N | 0 | 0 | Rate-limited, never reviewed |
| Sonnet diff | N | 0 | 1 | Missed the unsuffixed literal inference for array indices |
| Opus diff | Y | 1 | 2 | Flagged related area but never identified the core bug |
| Sonnet repo | N | 0 | 0 | No useful output |
| Opus repo | N | 1 | 2 | Flagged related area but missed the core bug |
Analysis: The bug is that the unsuffixed-literals feature (#28642) failed to handle array indices -- visit_expression_reject_numeric was used instead of allowing inference to u32. Copilot and Greptile both landed on the buggy lines but for superficial reasons (a typo "lierals" in a comment on line 88). The judge awarded caught=Y to both (det=1 for Copilot, det=0 for Greptile) for touching the right lines, but neither tool understood the actual type inference issue. This is a case where surface-level findings accidentally overlap with ground truth lines without identifying the real bug. The A-tier blame is valid -- the unsuffixed literals PR directly introduced the gap.
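The accidental overlap follows directly from how a line-proximity check works. The sketch below is a hypothetical reconstruction of such a mechanical "caught" test, based only on the notation "caught = comment near buggy lines"; the struct names, the function name, and the `tolerance` parameter are assumptions, not the real scorer.

```rust
/// Inclusive buggy line range in one file, e.g. `expression.rs:87-88`.
struct BuggyRange<'a> {
    file: &'a str,
    start: u32,
    end: u32,
}

/// Location of one review comment.
struct Comment<'a> {
    file: &'a str,
    line: u32,
}

/// Mechanical check: true if any comment sits within `tolerance` lines of a
/// buggy range in the same file. The comment's content is never inspected.
fn mechanically_caught(
    comments: &[Comment<'_>],
    truth: &[BuggyRange<'_>],
    tolerance: u32,
) -> bool {
    comments.iter().any(|c| {
        truth.iter().any(|r| {
            c.file == r.file
                && c.line.saturating_add(tolerance) >= r.start
                && c.line <= r.end.saturating_add(tolerance)
        })
    })
}
```

A typo comment on line 88 satisfies this check exactly as well as a comment that explains the type inference bug, which is why the judge's detection score has to be read alongside the caught flag.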
Ground truth: Multiple files in compiler/parser, compiler/passes (destructuring, codegen, SSA, type_checking), compiler/ast
Blame confidence: A
| Tool | Caught | Det | Qual | Finding summary |
|---|---|---|---|---|
| Copilot | N | 0 | 1 | Missed unsuffixed integer handling in codegen entirely |
| Greptile | N | 0 | 1 | Missed the unsuffixed integer codegen bug |
| CodeRabbit | N | 0 | 0 | Rate-limited, never reviewed |
| Sonnet diff | N | 0 | 0 | No comments produced |
| Opus diff | N | 0 | 0 | No comments produced |
| Sonnet repo | N | 0 | 0 | No comments produced |
| Opus repo | N | 0 | 1 | Discussed ownership/move semantics, irrelevant to the bug |
Analysis: Universal miss across all tools. The bug is in the code_generation pass: when a constant is initialized with an unsuffixed integer and then used as an array index, the codegen emits incorrect VM instructions. The introducing PR (#28557, "Refactoring, especially Expression and Statement") was a massive refactoring that changed Expression, Statement, Access, and Literal types -- the buggy lines are scattered across 6+ files in parser, destructuring, SSA, and codegen passes. Both Sonnet and Opus diff-only produced zero comments, likely overwhelmed by the size of the refactoring diff. The A-tier blame is appropriate -- the refactoring directly restructured the expression types that the later unsuffixed-integer feature depended on.
Ground truth: errors/src/common/formatted.rs:91-92, 95, 110, 113-122, 125, 128
Blame confidence: B
| Tool | Caught | Det | Qual | Finding summary |
|---|---|---|---|---|
| Copilot | Y | 2 | 3 | Directly targeted formatted.rs line 123, identified Display impl issue |
| Greptile | N | 0 | 0 | No comments produced |
| CodeRabbit | N | 0 | 0 | Rate-limited, never reviewed |
| Sonnet diff | N | 0 | 3 | Good review quality but missed formatted.rs Display bug |
| Opus diff | N | 0 | 0 | No comments produced |
| Sonnet repo | N | 0 | 0 | No useful output |
| Opus repo | N | 0 | 2 | Missed the formatted.rs bug entirely |
Analysis: Only Copilot caught this. The bug is in the Formatted error type's Display impl: when the source file cannot be found, it returned early without printing the error message. The introducing PR (#28566, "Revise SourceMap and related types") rewrote how source locations were tracked, and the new Display impl silently swallowed errors when source files were missing. Copilot correctly identified the Display implementation issues in formatted.rs (det=2, qual=3). The B-tier blame is appropriate -- the SourceMap rewrite introduced the new Display code, but the connection between the SourceMap changes and the error display fallback requires understanding the error reporting pipeline.
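The swallowed-message behaviour can be illustrated with a reduced Display impl. The struct below is hypothetical (two fields, with `source_line` set to `None` when the source file cannot be found) and the output format is invented; it shows only the ordering that keeps the message visible.

```rust
use std::fmt;

/// Hypothetical, reduced stand-in for the formatted error type.
struct Formatted {
    message: String,
    /// None when the source file cannot be found.
    source_line: Option<String>,
}

impl fmt::Display for Formatted {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        // Print the message first so it survives a missing source file.
        // The buggy shape described above returned early on a failed lookup,
        // before anything had been written.
        write!(f, "Error: {}", self.message)?;
        if let Some(line) = &self.source_line {
            write!(f, "\n    | {}", line)?;
        }
        Ok(())
    }
}
```

Writing the message before attempting the source lookup makes the source context strictly optional, so a failed lookup can only lose the excerpt and never the error itself.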
Ground truth: compiler/ast/src/expressions/mod.rs:38-44
Blame confidence: B
| Tool | Caught | Det | Qual | Finding summary |
|---|---|---|---|---|
| Copilot | N | 0 | 2 | Missed the expressions/mod.rs module declaration bug |
| Greptile | N | 0 | 2 | Missed the expressions/mod.rs bug |
| CodeRabbit | N | 0 | 0 | Rate-limited, never reviewed |
| Sonnet diff | N | 0 | 1 | Missed module declaration/re-export issue |
| Opus diff | N | 0 | 1 | Identified the area (ternary, unary, value modules) but not the bug |
| Sonnet repo | N | 0 | 0 | No useful output |
| Opus repo | Y | 1 | 2 | Flagged the right file and line in expressions/mod.rs |
Analysis: The ground truth here is questionable. The buggy lines are module declarations (mod ternary, pub use ternary::*, etc.) in expressions/mod.rs, blamed back to "Core circuits" PR #1885 from 2022. The fix PR title says "Clone all internal record inputs as outputs" which is about codegen correctness for record types, yet the ground truth points to basic module re-exports. The B-tier blame is generous -- the introducing PR is from 3 years before the fix, suggesting the blame traced to the original file structure rather than the actual logic bug. Only Opus repo (det=1) flagged anything in the right file, getting partial credit. This case likely has a ground truth validity issue: the real bug is in record input/output cloning logic, not in the module declarations that blame happened to trace to.
Ground truth: compiler/passes/src/write_transforming/mod.rs:32-38, compiler/passes/src/write_transforming/visitor.rs:36-42, compiler/passes/src/type_checking/expression.rs:875
Blame confidence: C
| Tool | Caught | Det | Qual | Finding summary |
|---|---|---|---|---|
| Copilot | Y | 3 | 3 | Correctly identified 'assignemnts' -> 'assignments' and other typos |
| Greptile | Y | 3 | 3 | Correctly identified the 'assignemnts' typo |
| CodeRabbit | N | 0 | 0 | Rate-limited, never reviewed |
| Sonnet diff | N | 0 | 1 | Missed typos in comments entirely |
| Opus diff | N | 0 | 1 | Missed documentation typos |
| Sonnet repo | N | 0 | 0 | No useful output |
| Opus repo | N | 0 | 1 | Missed documentation typos |
Analysis: This is a typo-fix case: the fix PR corrects spelling errors in comments ('assignemnts' -> 'assignments') and a panic message. Copilot and Greptile both caught the typos perfectly (det=3, qual=3). All agent SDK configurations missed them, which is notable -- the in-house agent focused on logic and correctness patterns rather than comment/doc spelling. The C-tier blame is correct: the typos were introduced in the "Write to array and struct members" PR (#28559) which added the write_transforming pass with misspelled comments. This is a low-severity case (severity=low) where commercial PR tools excel at surface-level catches that the agent SDK deprioritizes.
Ground truth: compiler/ast/src/expressions/literal.rs:116-150, 157, 192-197
Blame confidence: B
| Tool | Caught | Det | Qual | Finding summary |
|---|---|---|---|---|
| Copilot | Y | 0 | 2 | Found novel issues but missed group literal type-checking validation |
| Greptile | Y | 0 | 2 | Missed group literal type checking, found other issues |
| CodeRabbit | N | 0 | 0 | Skipped review (too many changes) |
| Sonnet diff | N | 0 | 2 | Missed literal.rs Display formatting and leading zeros bug |
| Opus diff | N | 0 | 2 | Missed the literal.rs bug |
| Sonnet repo | N | 0 | 0 | No useful output |
| Opus repo | N | 1 | 2 | Flagged related area but never identified actual bug |
Analysis: The bug involves missing validation for group literals and improper handling of leading zeros in field/scalar literals in the Display implementation. The introducing PR (#28383, "hex and binary literals for integers") added hex/binary/octal literal support but the Display implementation for non-integer literals (field, group, scalar) did not strip leading zeros or validate group literal syntax. The fix was 189 lines added across 12 files (large PR). Copilot and Greptile both marked caught=Y but det=0, meaning the judge found novel findings in the right area but nothing that identified the actual leading-zeros/group-validation issue. The large diff size (PR size=large) likely diluted attention. This is a domain-specific Aleo type system bug that requires understanding the VM's literal format requirements.
Ground truth: leo/cli/commands/deploy.rs:24, 45-64, leo/cli/commands/mod.rs:30-31
Blame confidence: C
| Tool | Caught | Det | Qual | Finding summary |
|---|---|---|---|---|
| Copilot | N | 0 | 0 | No comments produced |
| Greptile | Y | 0 | 3 | Found novel issues but missed Deploy->LeoDeploy rename and negative parsing |
| CodeRabbit | N | 0 | 0 | Rate-limited, never reviewed |
| Sonnet diff | Y | 0 | 3 | Found novel issues, missed the actual rename/parsing bug |
| Opus diff | Y | 0 | 2 | Missed the Deploy rename and negative parsing fix |
| Sonnet repo | Y | 3 | 0 | Mechanically matched a buggy line |
| Opus repo | N | 0 | 2 | Missed the Deploy rename bug |
Analysis: The fix renames Deploy to LeoDeploy and fixes negative value parsing for leo run. The C-tier blame traces to "Leo Deploy" PR #26901 from 2024 -- a naming collision where the Deploy struct name was too generic. Multiple tools scored caught=Y but det=0, meaning they found novel things in the same files without identifying the actual naming issue. Sonnet repo got det=3 but qual=0, another case of mechanical line matching without meaningful review. Copilot produced zero comments, an unusual total miss. The ground truth is somewhat weak: renaming a struct is more of a code quality improvement than a bug fix, and the blame back to the original Deploy introduction (C-tier) reflects this ambiguity.
Ground truth: compiler/passes/src/type_checking/checker.rs:274, 288-292, 297, 311-316
Blame confidence: A
| Tool | Caught | Det | Qual | Finding summary |
|---|---|---|---|---|
| Copilot | N | 0 | 0 | No comments produced |
| Greptile | N | 0 | 2 | Missed Pedersen hash type-checking bug |
| CodeRabbit | N | 0 | 0 | Rate-limited, never reviewed |
| Sonnet diff | N | 0 | 2 | Missed Pedersen hash type checking |
| Opus diff | N | 0 | 2 | Missed Pedersen hash type checking |
| Sonnet repo | N | 0 | 0 | No useful output |
| Opus repo | N | 0 | 1 | Missed the overly permissive type checking |
Analysis: Universal miss. The bug is in type checking for Pedersen hash functions: the checker was too permissive, allowing composite types (structs/records) as inputs to Pedersen64/Pedersen128 when only primitive types should be allowed. The fix PR is tiny (6 lines added, 2 deleted, 1 file) -- just updating comments and tightening type checks. The A-tier blame correctly traces to PR #28481 ("Refactor and improve type checking") which introduced the overly permissive check. This is a domain-specific bug requiring knowledge of Aleo's Pedersen hash circuit constraints. The introducing PR was a large type-checker refactoring, so the specific Pedersen constraint was likely an oversight in a sea of changes. Copilot produced zero comments, suggesting the diff was too small or uninteresting for its heuristics.
Ground truth: compiler/ast/src/functions/mod.rs:139-145, compiler/ast/src/struct/mod.rs:138, 140
Blame confidence: B
| Tool | Caught | Det | Qual | Finding summary |
|---|---|---|---|---|
| Copilot | N | 0 | 0 | No comments produced |
| Greptile | Y | 1 | 3 | Flagged Display-related issues in the right area |
| CodeRabbit | N | 0 | 0 | Rate-limited, never reviewed |
| Sonnet diff | Y | 3 | 3 | Correctly identified struct/mod.rs:138 missing struct keyword in Display |
| Opus diff | Y | 3 | 3 | Correctly identified the struct Display bug |
| Sonnet repo | N | 0 | 0 | No useful output |
| Opus repo | N | 1 | 3 | Flagged Display issues but missed specific known bugs |
Analysis: The bugs are in Display implementations: the function output formatting (mod.rs:139-145) and struct Display missing the struct keyword (struct/mod.rs:138). Both Sonnet diff and Opus diff scored det=3, qual=3 -- correctly identifying the struct Display bug at the exact line. This is one of the strongest agent SDK performances in the dataset. Greptile got det=1 (in the right area but not exact). Copilot produced zero comments. The repo-access variants both degraded: Sonnet repo produced nothing, Opus repo got det=1. This reinforces the finding that diff-only mode outperforms repo mode -- the focused diff allowed the agent to scrutinize the Display implementations closely.
Ground truth: compiler/parser/src/parser/expression.rs:266-293
Blame confidence: A
| Tool | Caught | Det | Qual | Finding summary |
|---|---|---|---|---|
| Copilot | Y | 1 | 2 | Touched negation-folding logic but didn't identify double-negation bug |
| Greptile | N | 0 | 1 | Completely missed the double-negation parsing bug |
| CodeRabbit | N | 0 | 0 | Rate-limited, never reviewed |
| Sonnet diff | Y | 1 | 2 | Identified negation folding area, partial detection |
| Opus diff | N | 0 | 0 | No comments produced |
| Sonnet repo | N | 0 | 0 | No useful output |
| Opus repo | N | 0 | 0 | No comments produced |
Analysis: The bug is that the parser folds -(-x) by absorbing the minus sign into the literal, producing --x instead of negating a negative value. This was a long-standing issue (introduced in 2023, PR #2522 "Add id to AST nodes", A-tier blame). The fix correctly separates the unary negation from the literal value. Only Copilot and Sonnet diff scored det=1, touching the negation-folding code without fully identifying the double-negation issue. The introducing PR is interesting: adding id fields to AST nodes didn't directly create the bug, but the blame traces to the parser code that was reorganized during that change. The real bug pattern (absorbing negation into literals) was present earlier, making the A-tier blame somewhat generous.
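The folding problem described above can be reproduced in a few lines. This is a sketch in Python, not Leo's parser: the `Literal` and `Neg` node types and both fold functions are assumptions made for illustration, and the real fix may be structured differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Literal:
    text: str  # literal source text, e.g. "5" or "-5"


@dataclass(frozen=True)
class Neg:
    operand: object  # a Literal or another Neg


def fold_buggy(node):
    """Always absorbs a unary minus into the literal text."""
    if isinstance(node, Neg):
        inner = fold_buggy(node.operand)
        if isinstance(inner, Literal):
            return Literal("-" + inner.text)
        return Neg(inner)
    return node


def fold_fixed(node):
    """Absorbs at most one minus; further negations stay unary nodes."""
    if isinstance(node, Neg):
        inner = fold_fixed(node.operand)
        if isinstance(inner, Literal) and not inner.text.startswith("-"):
            return Literal("-" + inner.text)
        return Neg(inner)
    return node


double_neg = Neg(Neg(Literal("5")))
```

With these definitions, `fold_buggy(double_neg)` yields `Literal("--5")`, while `fold_fixed(double_neg)` yields `Neg(Literal("-5"))`, keeping the outer negation separate from the literal value.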
Ground truth: compiler/ast/src/functions/mod.rs:139, compiler/ast/src/program/mod.rs:43, compiler/ast/src/statement/block.rs:40, compiler/ast/src/statement/conditional.rs:40, compiler/ast/src/statement/iteration.rs:56-57, errors/src/emitter/mod.rs:46, compiler/ast/src/program/program_scope.rs:45, 48, 51
Blame confidence: C
| Tool | Caught | Det | Qual | Finding summary |
|---|---|---|---|---|
| Copilot | N | 0 | 0 | No comments produced |
| Greptile | N | 0 | 1 | Mischaracterized PR as mechanical formatting |
| CodeRabbit | N | 0 | 0 | Rate-limited, never reviewed |
| Sonnet diff | N | 0 | 0 | No comments produced |
| Opus diff | N | 0 | 0 | No comments produced |
| Sonnet repo | N | 0 | 0 | No useful output |
| Opus repo | N | 0 | 0 | No comments produced |
Analysis: Universal miss -- no tool caught anything. The fix PR is large (291 lines added, 72 deleted, 23 files changed) and completely rewrites Display implementations across the entire AST. The buggy lines are spread across 7 files covering functions, programs, statements, and error formatting. The C-tier blame traces to PR #2142 ("Fix output type for finalize block") from 2022 -- the original Display implementations were wrong from the beginning. This is essentially a complete rewrite case where the "bug" is accumulated technical debt in Display formatting rather than a discrete defect. The massive diff (23 files) with many incremental formatting changes likely overwhelmed all tools. Most tools produced zero comments, suggesting the diff appeared as routine code cleanup.
Ground truth: compiler/passes/src/common/graph/mod.rs:108
Blame confidence: C
| Tool | Caught | Det | Qual | Finding summary |
|---|---|---|---|---|
| Copilot | Y | 3 | 3 | Correctly identified the 'to to' duplicate word typo |
| Greptile | N | 0 | 1 | Missed the 'to to' typo |
| CodeRabbit | N | 0 | 0 | Auto reviews disabled, skipped |
| Sonnet diff | Y | 3 | 4 | Correctly identified the exact 'to to' typo on line 108 |
| Opus diff | Y | 3 | 4 | Correctly identified the exact 'to to' typo on line 108 |
| Sonnet repo | N | 0 | 0 | No useful output |
| Opus repo | N | 0 | 0 | No comments produced |
Analysis: A single-line typo case: "to to" should be "to" on line 108 of the graph module. Copilot and both diff-only agents caught it perfectly (det=3). Both agent diff-only variants scored qual=4 (exceptional) -- the highest quality score in the dataset. Greptile missed it, and both repo-access agents produced nothing. The C-tier blame traces to PR #2178 ("Add DiGraph data structure") from 2023. This case validates that simple, focused typo detection works well in diff-only mode. The stark contrast between diff-only (3/4) and repo-access (0/0) for both Sonnet and Opus is a clean demonstration of the focus-dilution effect.
| Case | Difficulty | Blame | Copilot | Greptile | CodeRabbit | Sonnet-diff | Opus-diff | Sonnet-repo | Opus-repo |
|---|---|---|---|---|---|---|---|---|---|
| leo-066 | hard | B | Y(0/2) | Y(0/2) | N(0/0) | Y(1/3) | Y(0/2) | Y(3/0) | N(0/2) |
| leo-067 | medium | A | N(0/2) | N(0/3) | N(0/0) | N(0/2) | N(0/2) | N(0/0) | N(0/3) |
| leo-071 | hard | A | Y(1/2) | Y(0/1) | N(0/0) | N(0/1) | Y(1/2) | N(0/0) | N(1/2) |
| leo-072 | hard | A | N(0/1) | N(0/1) | N(0/0) | N(0/0) | N(0/0) | N(0/0) | N(0/1) |
| leo-073 | medium | B | Y(2/3) | N(0/0) | N(0/0) | N(0/3) | N(0/0) | N(0/0) | N(0/2) |
| leo-074 | hard | B | N(0/2) | N(0/2) | N(0/0) | N(0/1) | N(0/1) | N(0/0) | Y(1/2) |
| leo-075 | medium | C | Y(3/3) | Y(3/3) | N(0/0) | N(0/1) | N(0/1) | N(0/0) | N(0/1) |
| leo-082 | hard | B | Y(0/2) | Y(0/2) | N(0/0) | N(0/2) | N(0/2) | N(0/0) | N(1/2) |
| leo-085 | medium | C | N(0/0) | Y(0/3) | N(0/0) | Y(0/3) | Y(0/2) | Y(3/0) | N(0/2) |
| leo-086 | easy | A | N(0/0) | N(0/2) | N(0/0) | N(0/2) | N(0/2) | N(0/0) | N(0/1) |
| leo-088 | medium | B | N(0/0) | Y(1/3) | N(0/0) | Y(3/3) | Y(3/3) | N(0/0) | N(1/3) |
| leo-090 | medium | A | Y(1/2) | N(0/1) | N(0/0) | Y(1/2) | N(0/0) | N(0/0) | N(0/0) |
| leo-091 | hard | C | N(0/0) | N(0/1) | N(0/0) | N(0/0) | N(0/0) | N(0/0) | N(0/0) |
| leo-095 | medium | C | Y(3/3) | N(0/1) | N(0/0) | Y(3/4) | Y(3/4) | N(0/0) | N(0/0) |
Format: Caught(Det/Qual)
| Tool | Caught | Rate |
|---|---|---|
| Copilot | 7/14 | 50% |
| Greptile | 6/14 | 43% |
| CodeRabbit | 0/14 | 0% |
| Sonnet diff | 5/14 | 36% |
| Opus diff | 5/14 | 36% |
| Sonnet repo | 2/14 | 14% |
| Opus repo | 1/14 | 7% |
- CodeRabbit completely failed -- rate-limited or auto-reviews disabled on every single case in this group, producing zero actionable reviews.
- Repo access severely degrades performance -- Sonnet drops from 36% to 14%, Opus from 36% to 7%. The two Sonnet-repo "catches" both had det=3 but qual=0 (mechanical line matching, no useful review), making them arguably false catches.
- Diff-only agents excel at Display/formatting bugs -- leo-088 is the standout: both Sonnet and Opus diff-only scored det=3, qual=3 on the struct Display bug. Greptile reached only det=1, and the remaining tools did not catch it.
- Typo detection split -- Copilot caught the typos in both leo-075 and leo-095. The diff-only agents missed leo-075 but caught leo-095 (a single-line typo) perfectly. The difference is that leo-075's typos were in documentation comments that the agents deprioritized.
- Universal misses on deep bugs -- leo-067 (interpreter ArrayAccess), leo-072 (unsuffixed integer codegen), leo-086 (Pedersen type checking), and leo-091 (complete Display rewrite) were missed by all tools, even though only leo-072 and leo-091 are rated hard (leo-067 is medium, leo-086 easy). These require deep domain knowledge of Aleo's compiler architecture.
- Scorer inconsistency: caught=Y with det=0 -- Multiple cases show tools marked as "caught" with detection=0 (e.g., leo-066 Copilot, leo-082 Copilot/Greptile). The judge awarded caught=Y for TP-novel findings in the same file but det=0 because those findings did not match the specific ground truth bug. This creates a misleading catch rate where "caught" does not mean "identified the known bug."
- Ground truth quality varies -- leo-074 (B-tier, module declarations blamed from 3 years ago), leo-085 (C-tier, struct rename), and leo-091 (C-tier, complete rewrite) have ground truth that does not cleanly map to a single discrete bug, making tool evaluation against these cases less reliable.
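The gap between "caught" and "identified the known bug" can be measured directly from the Caught(Det/Qual) cells. Below is a minimal Python sketch using the Copilot column from the summary table above. The `min_det=1` threshold and the function names are assumptions for illustration, not the scorer's actual rule.

```python
import re

# Matches cells such as "Y(0/2)" or "N(3/4)": Caught(Det/Qual).
CELL = re.compile(r"^([YN])\((\d)/(\d)\)$")

# Copilot column, copied from the per-case summary table.
COPILOT = {
    "leo-066": "Y(0/2)", "leo-067": "N(0/2)", "leo-071": "Y(1/2)",
    "leo-072": "N(0/1)", "leo-073": "Y(2/3)", "leo-074": "N(0/2)",
    "leo-075": "Y(3/3)", "leo-082": "Y(0/2)", "leo-085": "N(0/0)",
    "leo-086": "N(0/0)", "leo-088": "N(0/0)", "leo-090": "Y(1/2)",
    "leo-091": "N(0/0)", "leo-095": "Y(3/3)",
}


def parse_cell(cell):
    """Returns (caught, det, qual) for one table cell."""
    match = CELL.match(cell.strip())
    if match is None:
        raise ValueError(f"unrecognized cell: {cell!r}")
    return match.group(1) == "Y", int(match.group(2)), int(match.group(3))


def catch_counts(cells, min_det=1):
    """Counts catches as scored, and catches that also reach min_det."""
    permissive = 0
    strict = 0
    for cell in cells.values():
        caught, det, _qual = parse_cell(cell)
        if caught:
            permissive += 1
            if det >= min_det:
                strict += 1
    return permissive, strict
```

For this column `catch_counts(COPILOT)` returns `(7, 5)`: requiring det of at least 1 drops the two catches (leo-066, leo-082) where the comment landed in the right file but on a different issue.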
Cases analyzed: leo-104, leo-115, leo-116, leo-117, leo-119, leo-136, leo-139, leo-157, leo-159, leo-162, leo-164, leo-168, leo-181, leo-193, leo-200, leo-202
Tool abbreviations used in tables:
- S-diff = Sonnet agent-sdk diff-only (run-01)
- S-repo = Sonnet agent-sdk diff+repo (run-02)
- S-v2 = Sonnet agent-sdk diff+repo v2 (run-03)
- O-diff = Opus agent-sdk diff-only (run-04)
- O-repo = Opus agent-sdk diff+repo (run-05)
- CR = CodeRabbit PR-tool (run-04)
- Copilot = GitHub Copilot PR-tool (run-04)
- Greptile = Greptile PR-tool (run-04)
Ground truth: leo/cli/commands/mod.rs:287-291; utils/retriever/src/retriever/mod.rs:525-528
Blame confidence: B
| Tool | Caught | Det | Qual | Finding summary |
|---|---|---|---|---|
| S-diff | Yes | 3 | 4 | 301 branch is dead code; ureq returns non-2xx as Err, making match unreachable |
| S-repo | No* | 3 | 4 | Same finding -- 301 match arms unreachable in both files |
| S-v2 | Yes | 3 | 0 | Same core finding; also flagged catch-all arm for non-200 2xx codes |
| O-diff | No* | 3 | 4 | 301 branch dead code in both files; error info lost via broad map_err |
| O-repo | No* | 3 | 4 | map_err()? catches 4xx/5xx before status match; 301 branch dead code |
| CR | No | 0 | 0 | Rate-limited, no review produced |
| Copilot | Yes | 2 | 3 | send_json returns Error::Status for non-2xx; map_err swallows it |
| Greptile | Yes | 3 | 4 | 301 match arms unreachable in both files (P1 severity) |
Analysis: Strong ground truth -- the ureq 2.x behavior where non-2xx responses are Err variants makes the status match arms dead code. This was the easiest case for tools: every agent run identified the core bug (det=3 across the board), though some scored caught=False due to scorer disagreement (likely a threshold issue in the judge). CodeRabbit was rate-limited here, as in most cases in this group. S-v2 scored qual=0 despite finding the bug, which appears to be a scorer anomaly (missing reasoning field). Greptile and Sonnet diff-only performed best here.
Ground truth: leo/cli/cli.rs:55-58; leo/cli/commands/example.rs:20-40; leo/package/src/example.rs:20-40; leo/cli/commands/mod.rs:27
Blame confidence: B
| Tool | Caught | Det | Qual | Finding summary |
|---|---|---|---|---|
| S-diff | No | 2 | 3 | Flagged tic_tac_toe vs tictactoe naming inconsistency, wrong input paths |
| S-repo | No | 3 | 4 | Wrong include_str! paths (tic_tac_toe -> tictactoe), wrong input filenames |
| S-v2 | Yes | 3 | 0 | context.dir() returns parent dir not project dir; include_str! path wrong |
| O-diff | Yes | 2 | 3 | Inconsistent directory names between example.rs files; dead parameter |
| O-repo | Yes | 3 | 4 | include_str! references non-existent tic_tac_toe path; wrong input filenames |
| CR | No | 0 | 0 | Rate-limited, no review produced |
| Copilot | Yes | 1 | 2 | Flagged unused parameters but missed core path bugs |
| Greptile | Yes | 2 | 3 | Wrong input file path, wrong directory naming, dead code |
Analysis: Multi-file bug involving incorrect example paths and naming inconsistencies. Agent models performed well, especially with repo context -- Sonnet diff+repo and Opus diff+repo both achieved det=3 by verifying actual filesystem paths against include_str! macros. Copilot found the right files but focused on surface-level issues (unused parameters) rather than the path bugs. The bug spans many files, making it easier to catch partially but hard to identify completely.
Ground truth: leo/cli/cli.rs:76-80; leo/cli/commands/mod.rs:38-40; leo/cli/mod.rs:26-28; leo/cli/query_commands/*.rs (multiple files, 8+ files)
Blame confidence: C
| Tool | Caught | Det | Qual | Finding summary |
|---|---|---|---|---|
| S-diff | Yes* | 0 | 3 | Found &&/\|\| logic bug in validation functions (novel, not the known bug) |
| S-repo | Yes* | 0 | 3 | Same &&/\|\| logic bug; missed clap ArgGroup naming conflicts |
| S-v2 | Yes* | 3 | 0 | Same &&/\|\| logic bug plus more findings |
| O-diff | Yes* | 0 | 3 | &&/\|\| logic bug, leading slash in URL, integer underflow |
| O-repo | Yes* | 0 | 3 | &&/\|\| logic bug in all three validation functions |
| CR | No | 0 | 0 | Rate-limited, no review produced |
| Copilot | Yes* | 2 | 2 | Found required_unless_present_any referencing non-existent "range" arg |
| Greptile | Yes* | 0 | 3 | Inverted validation logic; leading slash in URL paths |
Analysis: Low-confidence ground truth (C). The known bug is about clap ArgGroup naming conflicts that only manifest in the dev profile, plus argument short-name conflicts -- a very framework-specific issue. All five agent runs and Greptile independently discovered the &&/|| logic bug in validation functions (is_valid_hash, is_valid_transaction_id, etc.), which is a real bug but not the known one. This is a strong case where tools found genuinely important novel bugs while missing the specific known issue. The scoring (det=0 for most) is technically correct but undersells the tools' value. Copilot uniquely caught a clap-specific issue (missing "range" argument reference). The C-tier blame confidence suggests the ground truth itself may be imprecise for this large refactoring PR.
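The report does not reproduce the offending validation code, so the following is a purely hypothetical illustration of how swapping the two boolean operators changes what a validator accepts. The rule itself (a two-character prefix plus a fixed length) and both function names are invented for the example. Python spells the operators `or` and `and` where Rust uses `||` and `&&`.

```python
def is_valid_id_buggy(value):
    # Hypothetical bug: passing EITHER check is enough to be accepted.
    return value.startswith("id") or len(value) == 6


def is_valid_id_fixed(value):
    # Intended rule in this sketch: BOTH checks must pass.
    return value.startswith("id") and len(value) == 6
```

In this model the input `"id"` has the right prefix but the wrong length. The buggy validator accepts it and the fixed one rejects it. A single-token difference of this kind is local to one expression, which may be why so many tools found it from the diff alone.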
Ground truth: leo/cli/commands/mod.rs:115,117
Blame confidence: C
| Tool | Caught | Det | Qual | Finding summary |
|---|---|---|---|---|
| S-diff | No | 0 | 2 | Focused on update.rs: silent error swallowing in quiet mode |
| S-repo | No | 0 | 2 | Same update.rs issues; updater.rs return type mismatch |
| S-v2 | No | 0 | 0 | All comments on update.rs error handling |
| O-diff | No | 0 | 2 | Silent error discard in quiet mode; circular glob re-export |
| O-repo | No | 0 | 2 | Same quiet-mode error swallowing; circular re-export |
| CR | No | 0 | 0 | Rate-limited (FP=2 for meta-comments) |
| Copilot | No | 0 | 0 | No comments produced |
| Greptile | No | 0 | 2 | update.rs error silently discarded in quiet mode (P0) |
Analysis: Universal miss. The bug is incorrect help message text (cosmetic string content), which is essentially impossible for automated tools to catch without domain knowledge of what the correct help text should say. Every tool that produced comments instead focused on the update.rs changes in the same PR, finding legitimate issues there (error swallowing in quiet mode was flagged by 5+ tools). This case demonstrates a fundamental limitation: tools cannot validate semantic correctness of human-facing strings. The C-tier blame confidence and "easy" difficulty rating are contradictory -- easy for a human reviewer who reads help text, but near-impossible for tools.
Ground truth: compiler/parser/src/parser/file.rs:144-156,182-194
Blame confidence: C
| Tool | Caught | Det | Qual | Finding summary |
|---|---|---|---|---|
| S-diff | No | 0 | 2 | Stale parameter name in visitor; grammar nit ("an struct") |
| S-repo | No | 0 | 2 | Token::Circuit Display arm removed but variant kept; no error for circuit token |
| S-v2 | No | 0 | 0 | No comments |
| O-diff | No | 0 | 0 | No comments produced |
| O-repo | No | 0 | 1 | Wrong claim about Member struct usage; Token::Circuit Display issue |
| CR | No | 0 | 0 | Rate-limited (FP=2) |
| Copilot | No | 0 | 1 | Grammar nits only ("an struct", message text) |
| Greptile | No | 0 | 0 | No comments produced |
Analysis: Universal miss. The bug is that the parser accepts both commas and semicolons as struct member separators when it should only accept commas. This is a parser-level semantic issue embedded in a large circuit-to-struct renaming PR. The signal-to-noise ratio is very low: the actual parser logic change is buried among many mechanical renames. Even with repo context, no tool identified the separator acceptance logic. The C-tier blame confidence and "hard" difficulty are well-justified. This is a case where only a domain expert who understands Leo's grammar specification would catch the issue.
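The separator issue can be modelled with a toy member splitter. This is an illustrative Python sketch, not the Leo parser: the `parse_members` function, its `separators` parameter, and the member syntax in the example strings are assumptions.

```python
def parse_members(body, separators=","):
    """Splits a struct body into members, rejecting disallowed separators."""
    members = []
    current = ""
    for ch in body:
        if ch in ",;":
            if ch not in separators:
                raise SyntaxError(f"unexpected separator {ch!r}")
            members.append(current.strip())
            current = ""
        else:
            current += ch
    if current.strip():
        members.append(current.strip())
    return members
```

Calling `parse_members("x: u8; y: u8", separators=",;")` models the permissive behaviour and returns both members, while the default comma-only call raises `SyntaxError` on the semicolon. In a rename-heavy diff, the difference between the two modes is a single accepted token.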
Ground truth: leo/cli/commands/account.rs (lines 24-380, many locations -- seed flag definition and all usage sites)
Blame confidence: A
| Tool | Caught | Det | Qual | Finding summary |
|---|---|---|---|---|
| S-diff | Yes* | 0 | 2 | PathBuf::parse infallible; private key as CLI arg; missed --seed |
| S-repo | Yes* | 0 | 2 | Misleading error messages; errors silently discarded; missed --seed |
| S-v2 | Yes* | 3 | 0 | Error propagation issues; cli_runtime_error misuse |
| O-diff | Yes* | 0 | 2 | Misleading test comments; private key exposure; missed --seed |
| O-repo | No | 0 | 1 | Only comment about error code insertion order |
| CR | No | 0 | 0 | Rate-limited (FP=2) |
| Copilot | Yes* | 0 | 2 | Error propagation issues; misleading test comments; missed --seed |
| Greptile | Yes* | 0 | 2 | Misleading test comments; private key exposure; missed --seed (FP=1) |
Analysis: Despite A-tier blame confidence, every tool missed the actual security bug: the --seed flag allowing deterministic nonce generation in cryptographic signatures. The caught=True flags in the scorer are misleading -- they appear to be triggered by comments in the right file (account.rs) but none identify the seed/nonce security issue. This is a cryptography-domain bug that requires understanding why deterministic nonces in signatures are dangerous. Tools instead found legitimate but unrelated issues (error handling, private key exposure via CLI args). The scorer's caught=True with det=0 pattern across multiple tools is a red flag suggesting the caught field is too permissive when a tool comments on the right file but the wrong issue.
Ground truth: compiler/ast/src/types/type_.rs:106,117,122,128; compiler/passes/src/common/symbol_table/mod.rs:125; compiler/passes/src/flattening/flatten_expression.rs:86; compiler/passes/src/type_checking/checker.rs:185
Blame confidence: B
| Tool | Caught | Det | Qual | Finding summary |
|---|---|---|---|---|
| S-diff | Yes | 1 | 2 | eq_flat_relax_composite doc mismatch; asymmetric Mapping comparison |
| S-repo | Yes | 1 | 2 | Mapping arm uses eq_flat for value instead of eq_flat_relax_composite |
| S-v2 | Yes | 3 | 0 | Same Mapping arm inconsistency; dead error variants |
| O-diff | No | 0 | 0 | No comments produced |
| O-repo | Yes | 2 | 3 | Mapping comparison uses eq_flat (strict) instead of relaxed variant |
| CR | No | 0 | 0 | Rate-limited (FP=2) |
| Copilot | Yes | 1 | 1 | Flagged same Mapping asymmetry at type_.rs:122 |
| Greptile | No | 0 | 2 | Dead error variants only; missed eq_flat issue |
Analysis: The change introduces eq_flat_relax_composite for relaxed type equality checking; the core bug is its inconsistent application (the Mapping value still uses strict eq_flat). Multiple tools identified the asymmetry in the Mapping arm at type_.rs:122, which is a genuine part of the bug. However, the broader changes across symbol_table, flatten_expression, and checker were missed by all tools. Opus diff-only produced zero comments (unusual), while Opus diff+repo gave the cleanest single-comment analysis (det=2, qual=3). This is a case where partial detection was common but full understanding of the cross-cutting type system change was not achieved by any tool.
Ground truth: Cargo.toml:124,142-148
Blame confidence: A
| Tool | Caught | Det | Qual | Finding summary |
|---|---|---|---|---|
| S-diff | No | 0 | 0 | No comments produced |
| S-repo | No | 0 | 0 | No comments produced |
| S-v2 | No | 0 | 0 | No comments produced |
| O-diff | No | 0 | 1 | Stray dependencies = { } in linter/Cargo.toml (unrelated) |
| O-repo | No | 0 | 0 | No comments produced |
| CR | No | 0 | 0 | Rate-limited |
| Copilot | No | 0 | 1 | Same stray dependencies line; repeated dependency versions |
| Greptile | No | 0 | 1 | Same stray dependencies line |
Analysis: Universal miss. The bug is a missing feature flag (deflate) on the self_update crate's zip dependency in Cargo.toml. This is a build configuration issue that requires understanding that the update binary needs deflate decompression support. None of the tools evaluated here reasoned about the runtime effects of Cargo feature flags from the diff alone. The few comments produced all targeted an unrelated linter/Cargo.toml formatting issue. Despite A-tier blame confidence (the fix is localized and clear), the bug type (dependency feature configuration) appears to be outside the capability envelope of current code review tools.
Ground truth: compiler/parser/src/parser/file.rs:298-313,331
Blame confidence: B
| Tool | Caught | Det | Qual | Finding summary |
|---|---|---|---|---|
| S-diff | No | 0 | 2 | Wrong type name in simple_node_impl! macro; external input mapping gap |
| S-repo | No | 3 | 4 | Found doc bug (parse_input says "output" not "input") plus battleship bugs |
| S-v2 | No | 0 | 0 | simple_node_impl! wrong type; SSA/flattening iteration issues |
| O-diff | No | 0 | 1 | simple_node_impl! wrong type; external input not mapped |
| O-repo | Yes | 0 | 2 | simple_node_impl! wrong type; external input not mapped; parse_input eat() |
| CR | No | 0 | 0 | Rate-limited |
| Copilot | Yes | 3 | 3 | Directly identified parse_input doc saying "output" instead of "input" |
| Greptile | Yes | 3 | 3 | Identified parse_input doc bug at file.rs:299 |
Analysis: The bug is a copy-paste error in doc comments (parse_input's docstring says "function output" instead of "function input"). Sonnet diff+repo actually found this (det=3) but was scored caught=No -- a scorer inconsistency. Copilot and Greptile both correctly identified the exact bug. Most agent runs were distracted by the simple_node_impl!(FunctionOutputExternal) issue (a wrong type name in a macro invocation), which is a real novel bug but not the known one. This case shows how doc bugs in a PR with many real code changes get buried -- the tools often found more "interesting" bugs and missed the simpler doc fix.
Ground truth: README.md:87-109
Blame confidence: A
| Tool | Caught | Det | Qual | Finding summary |
|---|---|---|---|---|
| S-diff | No | 0 | 2 | MACOSX_DEPLOYMENT_TARGET issues in release.yml |
| S-repo | No | 0 | 1 | MACOSX_DEPLOYMENT_TARGET in release.yml |
| S-v2 | No | 0 | 0 | MACOSX_DEPLOYMENT_TARGET in release.yml |
| O-diff | No | 0 | 0 | No comments produced |
| O-repo | No | 0 | 0 | No comments produced |
| CR | No | 0 | 0 | Rate-limited (FP=2) |
| Copilot | Yes* | 1 | 1 | Touched README crates.io section but misidentified issue |
| Greptile | Yes* | 1 | 2 | Banner comment syntax; MACOSX_DEPLOYMENT_TARGET; missing version constraint |
Analysis: The bug is that the crates.io/cargo install section should be removed from README until snarkvm releases a compatible version. Copilot touched the right area but flagged a link/name mismatch rather than the need to remove the section entirely. Greptile commented on README but focused on different issues. Agent runs were entirely distracted by the CI workflow changes (MACOSX_DEPLOYMENT_TARGET), missing the README content issue. This is a domain-knowledge bug: you need to know that snarkvm hasn't released a new version to understand why cargo install instructions are problematic. CodeRabbit was rate-limited here; its one useful review in the run-04 PR-tool data (det=2 with tp=6 on the testnet defaults) belongs to leo-164, not this case. Documentation correctness bugs that require external context remain very hard for tools.
Ground truth: errors/src/errors/utils/util_errors.rs:117; many .env files; run.sh scripts; leo/cli/commands/add.rs:31
Blame confidence: A
| Tool | Caught | Det | Qual | Finding summary |
|---|---|---|---|---|
| S-diff | Yes* | 0 | 2 | Flagged hardcoded private keys in .env; missed testnet default issue |
| S-repo | No | 0 | 2 | add.rs panic on invalid network; parser error message change |
| S-v2 | No | 0 | 0 | Hardcoded network in env.rs; panic on invalid network |
| O-diff | No | 2 | 3 | Correctly identified hardcoded mainnet should be testnet in env.rs/account.rs |
| O-repo | No | 1 | 2 | Same hardcoded mainnet issue but in env.rs (not in ground truth lines) |
| CR | Yes | 2 | 2 | Identified mainnet exposure with private keys across .env files |
| Copilot | Yes | 2 | 3 | Flagged util_errors.rs:117 hardcoded mainnet; risky default for deploy |
| Greptile | No | 1 | 2 | Hardcoded mainnet in env.rs (not in ground truth line list) |
Analysis: A sprawling bug across 30+ files where "mainnet" should be "testnet". The ground truth spans .env files, shell scripts, error messages, and CLI defaults. Opus diff-only gave the best agent analysis (det=2), correctly identifying the pattern even though it targeted env.rs/account.rs rather than the exact ground truth lines. CodeRabbit, in its only completed review in this group, identified the mainnet exposure pattern. Copilot correctly flagged util_errors.rs:117. The wide ground truth makes scoring tricky: tools that identify the correct class of bug in slightly different files than the ground truth list get det=1 instead of det=2+. This case highlights how ground truth for bulk-change PRs can be somewhat arbitrary in which specific lines are listed.
Ground truth: asg/src/lib.rs:27; ast/src/lib.rs:23-24; errors/src/common/mod.rs:34-35,45; errors/src/lib.rs:24,32; parser/src/lib.rs:23; test-framework/src/lib.rs:25-26
Blame confidence: C
| Tool | Caught | Det | Qual | Finding summary |
|---|---|---|---|---|
| S-diff | No | 0 | 2 | circuit_self removal; deleted error variants |
| S-repo | No | 0 | 3 | Extra forward-slash in doc; 'cricuit' typo; stale error message prefix |
| S-v2 | No | 0 | 0 | No comments |
| O-diff | No | 0 | 3 | Deleted error variants shift error codes; 'covnert' typo unfixed |
| O-repo | No | 0 | 3 | 'covnert' typos; "should be be" double word; stale error prefix |
| CR | No | 0 | 0 | Skipped review (PR exceeded 150-file limit, 299 files) |
| Copilot | No | 0 | 0 | No comments produced |
| Greptile | No | 0 | 3 | Malformed doc comment; duplicated word; multiple error code shifts |
Analysis: Universal miss on the known bugs, which are doc-attribute fixes (include_str paths for lib.rs crate-level docs), a 'deserialze' typo, a double period, and a 'cleaneronce' missing space. This is a massive 299-file PR where CodeRabbit refused to review due to file count limits. The known bugs are needle-in-haystack documentation issues. However, every tool that produced comments found genuinely valuable novel bugs: 'covnert' typos, "should be be" duplications, malformed doc comments, and error code shifts from deleted variants. The C-tier blame confidence is appropriate -- the ground truth is documentation quality issues that are hard to attribute to a single introducing PR. The tools' novel findings (qual=3 from multiple tools) were arguably more valuable than the known bugs.
Ground truth: compiler/passes/src/type_checking/check_expressions.rs:480,482,648,711
Blame confidence: B
| Tool | Caught | Det | Qual | Finding summary |
|---|---|---|---|---|
| S-diff | No | 0 | 0 | No comments produced |
| S-repo | No | 0 | 0 | No comments produced |
| S-v2 | No | 0 | 0 | No comments produced |
| O-diff | No | 0 | 0 | No comments produced |
| O-repo | No | 0 | 0 | No comments produced |
| CR | No | 0 | 0 | Rate-limited |
| Copilot | No | 0 | 1 | Only commented on .rustfmt.toml/.rusty-hook.toml config |
| Greptile | No | 0 | 1 | Only commented on .rusty-hook.toml and .rustfmt.toml config |
Analysis: Complete failure across all tools. The bug is that type inference panics (unwrap) when encountering an unknown variable -- the fix adds graceful error handling. This is a high-severity runtime bug that requires understanding the control flow: unwrap() on a lookup that can fail if the user writes invalid code. The introducing PR appears to be a large reformatting/tooling PR (.rustfmt.toml, .rusty-hook.toml changes), which explains why tools focused on config files rather than the type checker. The actual buggy lines in check_expressions.rs may not have had visible changes in the introducing PR's diff, making this bug invisible to diff-based review. This case strongly argues for whole-file analysis beyond just the diff.
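The crash-versus-diagnostic distinction can be sketched as follows. This is a Python analogue of the Rust pattern, not the code in check_expressions.rs: the symbol table is modelled as a plain dict, and the function names and error message format are assumptions.

```python
def infer_buggy(symbols, name):
    # Analogue of unwrap() on a failed lookup: an unknown name
    # raises KeyError and aborts the whole type-checking pass.
    return symbols[name]


def infer_fixed(symbols, name, errors):
    # Graceful handling: record a diagnostic and keep going.
    ty = symbols.get(name)
    if ty is None:
        errors.append(f"unknown variable `{name}`")
        return None
    return ty
```

The buggy form only misbehaves on invalid user input, so nothing in a diff of unrelated formatting changes points a reviewer at it. Finding it means reading the lookup's failure path in the surrounding file.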
Ground truth: compiler/ast/src/stub/function_stub.rs:271
Blame confidence: B
| Tool | Caught | Det | Qual | Finding summary |
|---|---|---|---|---|
| S-diff | No | 3 | 3 | Found _program unused parameter (the bug) but in FP-heavy review |
| S-repo | No | 0 | 0 | No comments produced |
| S-v2 | No | 0 | 0 | Wrong key_name usage; _is_async unused |
| O-diff | No | 0 | 0 | No comments produced |
| O-repo | No | 0 | 0 | No comments produced |
| CR | No | 0 | 0 | Rate-limited |
| Copilot | Yes | 2 | 2 | Directly flagged _program unused parameter at line 271 |
| Greptile | No | 0 | 1 | Only style nits (unnecessary braces, extra imports) |
Analysis: The bug is that the program parameter in from_finalize was renamed to _program, silently discarding what should be used for external struct resolution. Copilot was the only tool to directly and correctly flag this. Sonnet diff-only found it in comment 4 (det=3, tp=1) but was scored caught=No, likely because the overall review had 4 FPs diluting the signal. Multiple agent runs produced zero comments -- the diff may have appeared as a simple refactoring that didn't warrant review. This case demonstrates Copilot's strength at catching unused/renamed parameters, a pattern it appears to handle well.
Ground truth: compiler/passes/src/flattening/flatten_program.rs:19; many tests/expectations/compiler/**/*.out files
Blame confidence: A
| Tool | Caught | Det | Qual | Finding summary |
|---|---|---|---|---|
| S-diff | Yes | 2 | 3 | Flagged flatten_program.rs:19 -- ProgramScope/Statement imports added without visible usage |
| S-repo | No | 0 | 0 | No comments produced |
| S-v2 | No | 0 | 0 | No comments produced |
| O-diff | Yes* | 0 | 1 | Flagged same line but misunderstood (imports are part of fix, not bug) |
| O-repo | No | 0 | 2 | Test expectation hashes changed; pre-existing error rename |
| CR | No | 0 | 0 | Rate-limited |
| Copilot | No | 0 | 1 | Test case assertion logic; error message text |
| Greptile | No | 0 | 1 | Debug assertion left in test; missing error variant |
Analysis: The bug is that the flattening pass was not processing finalize blocks, fixed by adding Statement/StatementReconstructor imports. Sonnet diff-only was the best performer (det=2), correctly noting the imports were added without corresponding function changes visible in the diff. Opus diff-only flagged the same line but misinterpreted it. The bulk of the ground truth is test expectation file changes, which no tool meaningfully analyzed. This is a compiler-internals bug where the diff is almost entirely test output changes with one small import fix -- the signal-to-noise ratio is extremely low. Tools with repo context paradoxically performed worse, possibly because they got lost exploring the large test output changes.
Ground truth: leo/cli/cli.rs:21,118-133
Blame confidence: A
| Tool | Caught | Det | Qual | Finding summary |
|---|---|---|---|---|
| S-diff | No | 0 | 2 | Bare exit in run.sh scripts (novel finding, wrong file) |
| S-repo | No | 0 | 2 | Same bare exit issue across example run.sh scripts |
| S-v2 | No | 0 | 0 | Same bare exit issue (10 FPs) |
| O-diff | No | 0 | 2 | Premature success message in build.rs; bare exit |
| O-repo | No | 0 | 2 | Package::open validation order change in build.rs |
| CR | No | 0 | 0 | Rate-limited (FP=2) |
| Copilot | Yes* | 1 | 2 | Touched cli.rs deprecation block but flagged wrong issue |
| Greptile | Yes* | 1 | 2 | Touched cli.rs:133 deprecation block but suggested simplification not removal |
Analysis: The bug is a deprecation warning for leo build that should be removed entirely. Copilot and Greptile both commented on the exact buggy lines but misdiagnosed: Copilot flagged private keys in .env files, while Greptile suggested simplifying the tracing span rather than removing the deprecation. Agent runs were entirely distracted by the bare exit pattern in example shell scripts (a real but unrelated issue). The S-v2 run produced 10 FPs all about bare exit, demonstrating how a repeating pattern across files can dominate the review. This is another case where understanding the intent (deprecation should be removed, not simplified) requires context that tools lack.
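One way to keep a repeated pattern from dominating the false-positive count is to group comments by a normalized message before counting. The following is a minimal sketch of that idea, not part of the existing scorer: the `Comment` type, `normalize`, and `group_false_positives` are hypothetical names, and the normalization rules are an assumption.

```python
import re
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Comment:
    path: str      # file the comment was left on (hypothetical field)
    message: str   # free-text finding (hypothetical field)


def normalize(message: str) -> str:
    """Collapse case, paths, and numbers so near-identical findings share a key."""
    text = message.lower()
    text = re.sub(r"\S+/\S+", "<path>", text)  # anything that looks like a path
    text = re.sub(r"\d+", "<n>", text)         # line numbers, counts
    return re.sub(r"\s+", " ", text).strip()


def group_false_positives(comments):
    """Map each normalized message to the list of files it was raised on."""
    groups = defaultdict(list)
    for c in comments:
        groups[normalize(c.message)].append(c.path)
    return dict(groups)
```

With this grouping, ten comments that differ only in which run.sh they point at would be reported as one pattern seen in ten files, rather than ten independent FPs.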
CodeRabbit produced a review on only 1 of 16 cases (leo-164). It was rate-limited on 14 and refused leo-168 due to file count, so CodeRabbit data is essentially unusable for this group.
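Runs that never produced a review are recorded in the tables as misses, which makes a rate-limited tool look the same as one that reviewed and found nothing. The sketch below shows one way to separate the two when computing a catch rate. It is an assumption about how the data could be modeled: `RunStatus` and `catch_rate` are hypothetical names, not fields of the current scorer.

```python
from enum import Enum


class RunStatus(Enum):
    REVIEWED = "reviewed"
    RATE_LIMITED = "rate_limited"
    REFUSED = "refused"


def catch_rate(runs):
    """Catch rate over runs that actually produced a review.

    `runs` is a list of (status, caught) pairs. Returns None when no run
    produced a review, so an unusable tool is not reported as 0%.
    """
    reviewed = [caught for status, caught in runs if status is RunStatus.REVIEWED]
    if not reviewed:
        return None
    return sum(reviewed) / len(reviewed)
```

Returning `None` instead of `0.0` keeps "no data" distinct from "reviewed and missed everything" in any summary table built from these numbers.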
The scorer's caught field is unreliable. Multiple cases show caught=True with det=0 (leo-116, leo-136), where tools commented in the right file but on unrelated issues. Conversely, leo-159's Sonnet diff+repo run scored caught=No with det=3. The caught boolean and detection_score should be better aligned.
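A simple consistency check can surface these disagreements automatically. The sketch below is illustrative only: the `Score` record and its field names are hypothetical, and treating det >= 2 as "should have counted as caught" is an assumed threshold, not a rule taken from the scorer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Score:
    case: str     # e.g. "leo-159"
    tool: str     # e.g. "S-repo"
    caught: bool  # mechanical scorer: comment landed near the buggy lines
    det: int      # judge's detection_score, 0-3


def misaligned(scores, det_threshold=2):
    """Return (score, reason) pairs where caught and det disagree."""
    flagged = []
    for s in scores:
        if s.caught and s.det == 0:
            # Right location, wrong issue.
            flagged.append((s, "caught_without_detection"))
        elif not s.caught and s.det >= det_threshold:
            # Judge credits the finding, but the location match failed.
            flagged.append((s, "detection_without_caught"))
    return flagged
```

Running this over every (case, tool) row before reporting catch rates would list both failure modes described above, so they can be reviewed by hand or excluded.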
S-v2 (Sonnet repo v2) has an empty reasoning field in every case and many qual=0 scores despite finding bugs, which suggests a systematic scorer issue with this run.
Novel findings were pervasive. In 8 of 16 cases (leo-116, leo-117, leo-136, leo-159, leo-168, leo-181, leo-193, leo-202), tools found legitimate bugs that were not in the ground truth. The &&/|| validation logic bug in leo-116 was found by every tool. The 'covnert' typos in leo-168 were caught by multiple agents.
Difficulty vs. tool performance:
- Easy cases (leo-104, leo-117): Tools caught leo-104 easily but universally missed leo-117 (help message text).
- Medium cases (leo-115, leo-157, leo-159, leo-162, leo-202): Mixed results. Path/naming bugs (leo-115) were catchable; config bugs (leo-157) were not.
- Hard cases (leo-116, leo-119, leo-136, leo-139, leo-164, leo-168, leo-181, leo-193, leo-200): Mostly missed. Compiler internals (leo-119, leo-139, leo-181) and security bugs (leo-136) were hardest.
Repo context did not consistently help. In at least two cases (leo-104, leo-200), diff-only outperformed diff+repo. The additional context may lead tools to explore tangential code paths instead of focusing on the changes.