discv5: guard Node with a mutex to prevent poll/injectDatagram data race#47
Closed
ch4r10t33r wants to merge 2 commits into
Closed
discv5: guard Node with a mutex to prevent poll/injectDatagram data race#47ch4r10t33r wants to merge 2 commits into
ch4r10t33r wants to merge 2 commits into
Conversation
The build already depends only on zquic (build.zig / build.zig.zon); the lsquic+BoringSSL vendor tree was dead code. Drop vendor/ and the vendor path from build.zig.zon; refresh README, UPSTREAM, CI label, and justfile for zquic-only QUIC.
Node is a single-threaded poll-based design, but callers commonly run poll() in a dedicated thread while delivering packets via injectDatagram() from a separate UDP receive thread. Without synchronisation the two paths race on `pending` and `challenge_pending` (both ArrayListUnmanaged), which can trigger an integer-overflow panic inside swapRemove when one thread observes items.len == 0 mid-operation: thread NNNNN panic: integer overflow std/array_list.zig:964: if (self.items.len - 1 == i) ... discv5/node.zig:expireRequests: _ = self.pending.swapRemove(i); Add a std.Thread.Mutex field `mu` to Node (zero-initialised, no change to the init() signature) and acquire it for the full duration of poll() and injectDatagram(). All internal helpers called from those two entry points are private and therefore automatically protected.
Collaborator
Author
|
Closing — root cause is in zeam's pollLoop (double discv5.poll call), not in zig-ethp2p itself. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Root cause
Nodeis designed as a poll-based, single-threaded system, but callers routinely runpoll()in a dedicated thread (the drive loop) while delivering packets viainjectDatagram()from a separate thread (e.g. a shared UDP receive loop or a simulation transport that forwards packets between nodes).Without synchronisation the two paths race on
pendingandchallenge_pending(bothArrayListUnmanaged). A concurrentappendininjectDatagramcan reallocate the backing buffer whileexpireRequests(called frompoll) is mid-iteration. The drive-loop thread can then observe a stale or partially-writtenitems.len, causingswapRemoveto be invoked on what appears to be an empty list:This was the crash observed in zeam when running a multi-node simulation (
zeam beam).Fix
Add a
std.Thread.MutexfieldmutoNode(zero-initialised at declaration, no API change toinit()). Both public entry points that mutate node state acquire the lock for their full duration:poll()– holdsmuacrossrecvLoop,expireRequests, andrefreshBucketsinjectDatagram()– holdsmuacross the packet-dispatch switchAll internal helpers (
handleOrdinary,handleWhoareyou,handleHandshake,expireRequests,sendPing,sendFindNode, …) are private and therefore automatically protected when called from either entry point.Test plan
zig build test --summary all– all 145 tests passzig fmt --check src/discovery/discv5/node.zig– clean