Skip to content

eth/fetcher: refactor witness_manager to reduce complexity#2188

Open
cffls wants to merge 4 commits intodevelopfrom
witness_manager_refactor
Open

eth/fetcher: refactor witness_manager to reduce complexity#2188
cffls wants to merge 4 commits intodevelopfrom
witness_manager_refactor

Conversation

@cffls
Copy link
Copy Markdown
Contributor

@cffls cffls commented Apr 15, 2026

Description

Structural refactor of witness_manager.go plus targeted tests, guided by diffguard (https://github.com/0xPolygon/diffguard). Max cognitive complexity 44 → 10; Tier 1 / Tier 2 mutation score 100%.

eth/fetcher/witness_manager.go — Refactoring Impact

Refactor + test additions guided by diffguard.

Aggregate metrics

Metric Before After Change
File LOC 889 1,119 +230 (helpers + doc comments)
Function count 19 36 +17 extracted helpers
Max cognitive complexity 44 10 -77%
Mean complexity 7.6 3.9 -49%
Median complexity 4 3 -25%
Functions above threshold (10) 5 0 all cleared
Functions above 50 lines 5 3 -40%

The four worst offenders — before vs after split

tick() — was the worst

Lines Complexity
Before: tick() 130 44
After: tick() ~25 2
logPendingStatesAtTickLocked 14 8
collectReadyHashesLocked 38 10
extractPrematureOpsLocked 11 4
buildPeerRequestsLocked 20 6
dispatchPeerRequests 8 2
dispatchPeerFetches 20 6

loop()

Lines Complexity
Before: loop() 84 22
After: loop() 52 9
armTimerChan 20 2
logTimerTick 5 1

fetchWitness()

Lines Complexity
Before: fetchWitness() 83 15
After: fetchWitness() ~15 1
initiateWitnessFetch 33 7
awaitWitnessResponse 15 1
processWitnessResponse 23 3

rescheduleWitness()

Lines Complexity
Before: rescheduleWitness() ~40 14
After: rescheduleWitness() ~25 3
earliestPendingAnnounceLocked 12 7
stopAndDrainTimer 8 3

Tests

Before After Delta
Test file LOC 1,962 3,447 +1,485
Test function count 39 61 +22

New tests target specific mutation-testing gaps (~15) plus regression tests for bugs found during review (the loop() end-to-end retry test and the fixed nil-deref test guards).

Mutation testing

Before refactor After refactor + tests
Raw score ~54.7% (baseline) 96.1%
Tier 1 (logic) ~43% (many survivors) 100.0% (98/98 killed)
Tier 2 (semantic) ~56% 100.0% (22/22)
Tier 3 (observability) mostly noise 94.4% (remaining 15 are log.Debug removals, report-only)
Severity verdict FAIL PASS

diffguard section status

Check Before After
Cognitive complexity FAIL (4 violations, max 44) PASS
Code sizes FAIL (6 violations) FAIL (3 function size, 1 file size — structural, pre-existing)
Dependency structure PASS PASS
Churn-weighted complexity WARN (top score 351) PASS (top score 96)
Mutation testing FAIL PASS

…test gaps

Structural refactor of witness_manager.go plus targeted tests, guided by
diffguard (https://github.com/0xPolygon/diffguard). Max cognitive
complexity 44 → 10; Tier 1 / Tier 2 mutation score 100%.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown

@claude claude bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude Code Review

This repository is configured for manual code reviews. Comment @claude review to trigger a review and subscribe this PR to future pushes, or @claude review once for a one-time review.

Tip: disable this comment in your organization's Code Review settings.

Runs diffguard (https://github.com/0xPolygon/diffguard) on every PR to
report cognitive complexity, function/file sizes, dependency structure,
churn-weighted complexity, and mutation testing at 10% sample rate
against the base branch.

Text report goes to the step summary (visible in the PR check UI) and a
JSON artifact is retained for 14 days for offline triage. Uses the
16-core bor runner to keep mutation turnaround bounded.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
lucca30
lucca30 previously approved these changes Apr 15, 2026
Copy link
Copy Markdown
Contributor

@lucca30 lucca30 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@claude
Copy link
Copy Markdown

claude bot commented Apr 15, 2026

Code Review

Bug: Inverted condition in armTimerChan -- witness timer is disabled when items are pending

File: eth/fetcher/witness_manager.go lines 223-241

The condition in armTimerChan is inverted relative to the original code it replaces.

Original logic (removed):

if pendingCount > 0 {
    timerChan = m.witnessTimer.C   // timer ACTIVE when pending
} else {
    // stop/drain timer
}

New logic (armTimerChan):

if pendingCount != 0 {
    m.stopAndDrainTimer()
    return nil, lastTick           // timer DISABLED when pending
}
return m.witnessTimer.C, lastTick  // timer ACTIVE when nothing pending

The function doc comment says "Returns nil channel (never fires) when nothing is pending" but the code returns nil when something is pending. This means the timer that drives tick() (which dispatches witness fetch requests) will never fire when there are pending witness requests, breaking the core witness fetch retry loop.

Suggested fix -- change line 228 from if pendingCount != 0 to if pendingCount == 0:

if pendingCount == 0 {
    m.stopAndDrainTimer()
    return nil, lastTick
}

Also checked for CLAUDE.md compliance -- no other issues found.

The refactor introduced an inverted condition where armTimerChan
returned a nil timer channel when pending requests existed, preventing
tick() from ever firing through the timer path. The retry loop only
appeared to work because rescheduleWitness also pokes pokeCh.

Reported in PR #2188 review. Adds TestWitnessLoopDrivesFetchesForPending
to guard against regressions — it exercises the full loop→tick→
fetchWitness pipeline via real channels, which the existing TestLoop
did not.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@claude
Copy link
Copy Markdown

claude bot commented Apr 15, 2026

test

@claude
Copy link
Copy Markdown

claude bot commented Apr 15, 2026

test2

@claude
Copy link
Copy Markdown

claude bot commented Apr 15, 2026

test3

@claude
Copy link
Copy Markdown

claude bot commented Apr 15, 2026

Code review -- Found 4 issues: 1 critical logic bug in the refactored timer logic, 1 boundary-condition inconsistency, and 2 nil dereference risks in tests. Detailed inline comments will follow.

@claude
Copy link
Copy Markdown

claude bot commented Apr 15, 2026

Code review

Found 4 issues: 1 critical logic bug in the refactored timer logic, 1 boundary-condition inconsistency, and 2 nil dereference risks in tests. See detailed comments below.


Issue 1 - Critical bug in armTimerChan

File: eth/fetcher/witness_manager.go, line 228

Bug (critical): The condition here is inverted compared to the original logic. The original code activates the timer when pendingCount > 0 and stops it otherwise. The new code does the opposite -- it calls stopAndDrainTimer and returns nil when pendingCount != 0, and activates the timer when nothing is pending.

The branches are swapped. The timer is disabled when items are pending and enabled when nothing is pending. This means tick() (which dispatches witness fetch requests) will never fire via the timer when blocks are waiting for witnesses, breaking witness fetching entirely.

The doc comment on line 220 describes the correct intended behavior, confirming this is an unintentional inversion. Fix: change the condition to pendingCount == 0.


Issue 2 - Boundary inconsistency

File: eth/fetcher/witness_manager.go, line 281

Bug (boundary inconsistency): This changes < to <=, but block_fetcher.go uses < for the same check (lines 571 and 1087). A block at exactly dist == -maxUncleDist (-7) will be accepted by the block fetcher but rejected by the witness manager, meaning it could be imported without its witness being fetched.

If this is an intentional tightening, it should be applied consistently to block_fetcher.go as well. Otherwise, revert to < to match.


Issue 3 - Test nil dereference A

File: eth/fetcher/witness_manager_test.go, lines 3230-3232

Bug (nil dereference): When m.pending[hash] is nil, the if body executes and m.pending[hash].op.origin in the Errorf call will panic. This test is specifically designed to catch mutations that remove pending entries, exactly the scenario where the nil dereference occurs.

Fix: add a nil guard -- check for nil first with t.Fatal, then check the origin field in an else-if branch.


Issue 4 - Test nil dereference B

File: eth/fetcher/witness_manager_test.go, lines 3270-3272

Bug (nil dereference): No nil guard before m.pending[hash].op.origin. If checkCompleting removes the entry (the exact failure scenario this test catches), this panics instead of reporting a clean test failure.

Fix: same pattern as Issue 3 -- add a nil guard with t.Fatal before accessing the origin field.

@claude
Copy link
Copy Markdown

claude bot commented Apr 15, 2026

Detailed code suggestions for the 4 issues found

Issue 1 fix (witness_manager.go:228): Change the condition from pendingCount != 0 to pendingCount == 0. The current code inverts the timer logic -- it disables the timer when items are pending and enables it when nothing is pending, which is backwards from the original loop() behavior and the doc comment on line 220.

Issue 2 fix (witness_manager.go:281): Change <= back to < to match block_fetcher.go lines 571 and 1087, or update block_fetcher.go consistently if the tightening is intentional.

Issues 3 and 4 fix (witness_manager_test.go:3230-3232 and 3270-3272): Add a nil guard before accessing m.pending[hash].op.origin. Check for nil first with t.Fatal, then check the origin field in an else-if branch. Both locations have the same pattern -- accessing a struct field on a potentially nil map value without checking nil first.

@claude
Copy link
Copy Markdown

claude bot commented Apr 15, 2026

test

Three issues from PR #2188 code review:

- handleNeed's distance check was accidentally tightened from
  `dist < -maxUncleDist` to `<=` during the refactor, introducing a
  boundary mismatch with block_fetcher.go (which uses `<`). A block at
  exactly dist == -maxUncleDist would be accepted by block_fetcher but
  dropped by witness_manager, potentially importing blocks without
  their witnesses. Reverted to match block_fetcher.

- TestWitnessHandleFilterResultSkipsAlreadyPending dereferenced
  `m.pending[hash].op.origin` inside the nil-check branch, panicking on
  the exact mutation it was meant to catch (pending entry removal).
  Now fails cleanly with t.Fatal when entry is nil.

- Same nil deref pattern in TestWitnessCheckCompletingSkipsAlreadyPending.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@claude
Copy link
Copy Markdown

claude bot commented Apr 15, 2026

Inline Comment 1 -- eth/fetcher/witness_manager.go line 228

Bug: Logic inversion -- timer disabled when items ARE pending

The condition is inverted from the original code. The original loop() armed the timer (timerChan = m.witnessTimer.C) when pendingCount > 0 and stopped it when nothing was pending. This refactored version does the opposite:

  • pendingCount != 0 (items ARE pending) -> stops timer, returns nil channel
  • pendingCount == 0 (nothing pending) -> returns live timer channel

Since tick() is only called from case <-timerChan: in the select loop, a nil channel means tick() never fires when there are pending witness requests -- silently breaking all timer-driven witness fetching.

The function own doc comment confirms the intent: "Returns nil channel (never fires) when nothing is pending" -- but the code returns nil when pendingCount != 0.

Suggested fix:

if pendingCount == 0 {

@claude
Copy link
Copy Markdown

claude bot commented Apr 15, 2026

Inline Comment 2 -- eth/fetcher/witness_manager.go line 281

Inconsistent distance check -- block_fetcher.go uses strict <

This was changed from < to <=, but block_fetcher.go uses dist < -maxUncleDist in both its distance checks (line 571, line 1087). With <=, a block at exactly maxUncleDist (7) blocks behind the chain head is now rejected by the witness manager while being accepted by the block fetcher -- creating a gap where a block is queued for import but its witness is dropped.

If this change is intentional, consider aligning block_fetcher.go as well and adding a test for the boundary case.

Suggested fix:

if dist := int64(number) - int64(m.parentChainHeight()); dist < -maxUncleDist {

@claude
Copy link
Copy Markdown

claude bot commented Apr 15, 2026

Review Summary

This PR refactors witness_manager.go to reduce complexity by extracting helper methods. The structure is cleaner, but there are two issues that need to be addressed before merging:

1. CRITICAL: Logic inversion in timerChan() (line 228)
The condition pendingCount != 0 is inverted -- it disables the timer when items ARE pending and enables it when nothing is pending. This silently breaks all timer-driven witness fetching. The fix is to change != to ==.

2. Inconsistent distance check in tooOld() (line 281)
The operator was changed from < to <=, creating an off-by-one inconsistency with block_fetcher.go which still uses strict <. A block at exactly maxUncleDist behind the chain head would be accepted by the block fetcher but have its witness dropped by the witness manager.

Requesting changes -- the logic inversion in timerChan() is a correctness bug that would prevent witness fetching from working.

@sonarqubecloud
Copy link
Copy Markdown

Quality Gate Failed Quality Gate failed

Failed conditions
11.3% Duplication on New Code (required ≤ 3%)

See analysis details on SonarQube Cloud

@cffls
Copy link
Copy Markdown
Contributor Author

cffls commented Apr 15, 2026

@claude Please review again. The previously identified bugs have been addressed and fixed.

@codecov
Copy link
Copy Markdown

codecov bot commented Apr 15, 2026

Codecov Report

❌ Patch coverage is 84.56790% with 25 lines in your changes missing coverage. Please review.
✅ Project coverage is 51.96%. Comparing base (438596d) to head (d855a08).

Files with missing lines Patch % Lines
eth/fetcher/witness_manager.go 84.56% 18 Missing and 7 partials ⚠️

❌ Your patch check has failed because the patch coverage (84.56%) is below the target coverage (90.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files

Impacted file tree graph

@@             Coverage Diff             @@
##           develop    #2188      +/-   ##
===========================================
+ Coverage    51.93%   51.96%   +0.02%     
===========================================
  Files          884      884              
  Lines       155446   155455       +9     
===========================================
+ Hits         80731    80778      +47     
+ Misses       69509    69476      -33     
+ Partials      5206     5201       -5     
Files with missing lines Coverage Δ
eth/fetcher/witness_manager.go 88.62% <84.56%> (+5.09%) ⬆️

... and 17 files with indirect coverage changes

Files with missing lines Coverage Δ
eth/fetcher/witness_manager.go 88.62% <84.56%> (+5.09%) ⬆️

... and 17 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@claude
Copy link
Copy Markdown

claude bot commented Apr 15, 2026

Claude finished @cffls's task in 2m 3s —— View job


Re-review: witness_manager refactor

  • Verify previously identified bugs are fixed
  • Read full diff against origin/develop
  • Review witness_manager.go changes
  • Review witness_manager_test.go changes
  • Review diffguard.yml workflow
  • Post final review

Previously identified issues — all resolved

Issue Status
Critical: inverted condition in armTimerChan (!= 0== 0) Fixed (line 228)
Boundary inconsistency (<=< to match block_fetcher.go) Fixed (line 284), with a clear explanatory comment
Test nil deref A (m.pending[hash].op.origin without nil guard) Fixed (line 3231–3234: nil check with t.Fatal before field access)
Test nil deref B (same pattern) Fixed (line 3275–3278: same nil-guard pattern)

All four fixes are correct.


Fresh review of current state

Overall assessment: LGTM. The refactor is well-structured — the extracted helpers have clear ownership of lock scope, the method decomposition follows the original control flow faithfully, and the new tests are precisely targeted at mutation-testing gaps.

One nit (pre-existing, cosmetic)

witness_manager.go line 592 — when the request creation succeeds (err == nil) but the hash is no longer pending, the error message wraps a nil error:

m.handleWitnessFetchFailureExt(hash, "", fmt.Errorf("request initiation failed: %w", err), false)

This produces "request initiation failed: <nil>". Not a bug (the removePending=false path just bumps the announce time for backoff), but a slightly confusing error string. This was carried over from the original code, so not a regression — just noting it for a future cleanup pass.

What looks good

  • Lock discipline: collectReadyHashesLocked no longer drops and re-acquires the lock mid-iteration to call markWitnessUnavailable — it collects hashes and marks them after releasing the lock. This eliminates a subtle class of bugs where the map could change between unlock/relock.
  • stopAndDrainTimer: Properly extracted and reused in newWitnessManager, armTimerChan, and rescheduleWitness — eliminates three copies of the stop-drain idiom.
  • TestWitnessLoopDrivesFetchesForPending: End-to-end regression test that would have caught the armTimerChan inversion. Good addition.
  • Test helpers (newWitnessManagerForTest, addPendingEntry): Clean, reduce boilerplate across the 22 new tests.
  • diffguard workflow: Reasonable configuration — 10% mutation sample rate, 5m test timeout, tier-based gating.

No further issues found. The PR is ready to merge.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants