Skip to content

db/state: fix Aggregator.Close vs background MergeLoop WaitGroup race#21528

Draft
yperbasis wants to merge 1 commit into
mainfrom
yperbasis/aggregator-close-merge-race
Draft

db/state: fix Aggregator.Close vs background MergeLoop WaitGroup race#21528
yperbasis wants to merge 1 commit into
mainfrom
yperbasis/aggregator-close-merge-race

Conversation

@yperbasis
Copy link
Copy Markdown
Member

@yperbasis yperbasis commented May 30, 2026

Summary

Fixes #21521 — the data race between Aggregator.Close (wg.Wait) and the background MergeLoop goroutine spawned by BuildFiles2 / buildFilesInBackground.

Root cause

Both background-build sites spawn the merge goroutine inside the build goroutine:

a.wg.Add(1)
go func() {            // build goroutine
    defer a.wg.Done()
    ... build ...
    if doMerge {
        go func() { a.MergeLoop(ctx) }()   // merge goroutine — MergeLoop does its own wg.Add
    }
}()                    // build goroutine returns → wg.Done

MergeLoop's wg.Add(1) runs in the merge goroutine, which can be scheduled after the build goroutine's wg.Done() has already dropped the counter to zero. A positive Add from a zero counter, concurrent with Close's wg.Wait(), is sync.WaitGroup reuse — undefined behavior, flagged by -race and by the runtime's own panic: sync: WaitGroup is reused before previous Wait has returned.

It is pre-existing (the structure predates #21499, which only touched commitment_context.go); CI load just made the merge goroutine starve long enough to surface it.

Fix

Register the merge goroutine on wg before spawning it, so the Add happens-before the build goroutine's Done and the counter never reaches zero while a merge is pending. MergeLoop is split into a self-accounting public wrapper (unchanged contract for the external callers — snapshots_cmd, kv_temporal, backend, tests) and a private mergeLoop body used by the two in-file background sites.

Test (TDD)

db/state/aggregator_close_test.go drives the real BuildFiles2(…, doMerge=true) then wg.Wait() (what Close does at aggregator.go:541). The empty-aggregator merge is a clean no-op that touches only agg.wg, so the only possible race is the reported one.

  • Red (before the fix): reproduces the exact trace — MergeLoop wg.Add at aggregator.go:1113 via BuildFiles2.func1.1 at :1057 racing the wg.Wait, plus the runtime WaitGroup is reused panic.
  • Green (after the fix): deterministic — counter path 1→2→1→0, no reuse.

Note on buildFilesInBackground

buildFilesInBackground (aggregator.go:2155) has the identical bug and gets the identical fix, but isn't covered by an isolated test: its early-return guards (hasData, committed-txNum) mean an empty aggregator never reaches the merge spawn, so a focused test would need heavy real-data setup. The full db/state -race suite, which exercises that path, stays green.

Verification

  • go test -race -count=3 -run TestAggregatorCloseWaitsForBackgroundMerge ./db/state/ — pass
  • go test -race ./db/state/ — pass (no race)
  • go test -race -run 'Merge' ./db/test/ (direct MergeLoop callers) — pass
  • make lint — 0 issues
  • make erigon integration — both binaries build

🤖 Generated with Claude Code

BuildFiles2 and buildFilesInBackground spawn the merge goroutine inside the
build goroutine, so MergeLoop's wg.Add could run after the build goroutine's
wg.Done — a positive Add from a zero counter, concurrent with Close's wg.Wait
(sync.WaitGroup reuse, flagged by -race).

Register the merge goroutine on wg before spawning it so that Add happens-before
the build goroutine's Done and the counter never reaches zero with a merge
pending. MergeLoop is split into a self-accounting public wrapper (kept for the
external callers) and a private mergeLoop body used by the in-file background
sites.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fixes a sync.WaitGroup reuse race between Aggregator.Close and the background merge goroutines spawned by BuildFiles2 / buildFilesInBackground. Previously, wg.Add(1) was performed inside MergeLoop itself (i.e., on the merge goroutine), which could be scheduled after the parent build goroutine's wg.Done() had already dropped the counter to zero, racing Close's wg.Wait().

Changes:

  • Move a.wg.Add(1) to before the merge goroutine is spawned in both BuildFiles2 and buildFilesInBackground, and call a new private mergeLoop body.
  • Split MergeLoop into a public wrapper (preserving the contract for external callers like snapshots_cmd, kv_temporal, backend) that does its own wg.Add(1)/Done, and a private mergeLoop body used by the in-file background spawn sites.
  • Add db/state/aggregator_close_test.go reproducing the race against an empty aggregator over 64 iterations.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

File Description
db/state/aggregator.go Pre-register merge goroutines on wg before spawning; split MergeLoop into self-accounting public wrapper + private mergeLoop body.
db/state/aggregator_close_test.go New regression test driving BuildFiles2(..., doMerge=true) and asserting wg.Wait/Close ordering is race-free.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Race: Aggregator.Close vs MergeLoop background worker on test teardown

2 participants