db/state: fix Aggregator.Close vs background MergeLoop WaitGroup race#21528
Draft
yperbasis wants to merge 1 commit into
Draft
db/state: fix Aggregator.Close vs background MergeLoop WaitGroup race#21528yperbasis wants to merge 1 commit into
yperbasis wants to merge 1 commit into
Conversation
BuildFiles2 and buildFilesInBackground spawn the merge goroutine inside the build goroutine, so MergeLoop's wg.Add could run after the build goroutine's wg.Done — a positive Add from a zero counter, concurrent with Close's wg.Wait (sync.WaitGroup reuse, flagged by -race). Register the merge goroutine on wg before spawning it so that Add happens-before the build goroutine's Done and the counter never reaches zero with a merge pending. MergeLoop is split into a self-accounting public wrapper (kept for the external callers) and a private mergeLoop body used by the in-file background sites. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Contributor
There was a problem hiding this comment.
Pull request overview
Fixes a sync.WaitGroup reuse race between Aggregator.Close and the background merge goroutines spawned by BuildFiles2 / buildFilesInBackground. Previously, wg.Add(1) was performed inside MergeLoop itself (i.e., on the merge goroutine), which could be scheduled after the parent build goroutine's wg.Done() had already dropped the counter to zero, racing Close's wg.Wait().
Changes:
- Move
a.wg.Add(1)to before the merge goroutine is spawned in bothBuildFiles2andbuildFilesInBackground, and call a new privatemergeLoopbody. - Split
MergeLoopinto a public wrapper (preserving the contract for external callers likesnapshots_cmd,kv_temporal,backend) that does its ownwg.Add(1)/Done, and a privatemergeLoopbody used by the in-file background spawn sites. - Add
db/state/aggregator_close_test.goreproducing the race against an empty aggregator over 64 iterations.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| db/state/aggregator.go | Pre-register merge goroutines on wg before spawning; split MergeLoop into self-accounting public wrapper + private mergeLoop body. |
| db/state/aggregator_close_test.go | New regression test driving BuildFiles2(..., doMerge=true) and asserting wg.Wait/Close ordering is race-free. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Fixes #21521 — the data race between
Aggregator.Close(wg.Wait) and the backgroundMergeLoopgoroutine spawned byBuildFiles2/buildFilesInBackground.Root cause
Both background-build sites spawn the merge goroutine inside the build goroutine:
MergeLoop'swg.Add(1)runs in the merge goroutine, which can be scheduled after the build goroutine'swg.Done()has already dropped the counter to zero. A positiveAddfrom a zero counter, concurrent withClose'swg.Wait(), issync.WaitGroupreuse — undefined behavior, flagged by-raceand by the runtime's ownpanic: sync: WaitGroup is reused before previous Wait has returned.It is pre-existing (the structure predates #21499, which only touched
commitment_context.go); CI load just made the merge goroutine starve long enough to surface it.Fix
Register the merge goroutine on
wgbefore spawning it, so theAddhappens-before the build goroutine'sDoneand the counter never reaches zero while a merge is pending.MergeLoopis split into a self-accounting public wrapper (unchanged contract for the external callers —snapshots_cmd,kv_temporal,backend, tests) and a privatemergeLoopbody used by the two in-file background sites.Test (TDD)
db/state/aggregator_close_test.godrives the realBuildFiles2(…, doMerge=true)thenwg.Wait()(whatClosedoes ataggregator.go:541). The empty-aggregator merge is a clean no-op that touches onlyagg.wg, so the only possible race is the reported one.MergeLoopwg.Addataggregator.go:1113viaBuildFiles2.func1.1at:1057racing thewg.Wait, plus the runtimeWaitGroup is reusedpanic.1→2→1→0, no reuse.Note on
buildFilesInBackgroundbuildFilesInBackground(aggregator.go:2155) has the identical bug and gets the identical fix, but isn't covered by an isolated test: its early-return guards (hasData, committed-txNum) mean an empty aggregator never reaches the merge spawn, so a focused test would need heavy real-data setup. The fulldb/state-racesuite, which exercises that path, stays green.Verification
go test -race -count=3 -run TestAggregatorCloseWaitsForBackgroundMerge ./db/state/— passgo test -race ./db/state/— pass (no race)go test -race -run 'Merge' ./db/test/(directMergeLoopcallers) — passmake lint— 0 issuesmake erigon integration— both binaries build🤖 Generated with Claude Code