Skip to content

discv5: guard Node with a mutex to prevent poll/injectDatagram data race#47

Closed
ch4r10t33r wants to merge 2 commits into
mainfrom
discv5/node-mutex
Closed

discv5: guard Node with a mutex to prevent poll/injectDatagram data race#47
ch4r10t33r wants to merge 2 commits into
mainfrom
discv5/node-mutex

Conversation

@ch4r10t33r
Copy link
Copy Markdown
Collaborator

Root cause

Node is designed as a poll-based, single-threaded system, but callers routinely run poll() in a dedicated thread (the drive loop) while delivering packets via injectDatagram() from a separate thread (e.g. a shared UDP receive loop or a simulation transport that forwards packets between nodes).

Without synchronisation the two paths race on pending and challenge_pending (both ArrayListUnmanaged). A concurrent append in injectDatagram can reallocate the backing buffer while expireRequests (called from poll) is mid-iteration. The drive-loop thread can then observe a stale or partially-written items.len, causing swapRemove to be invoked on what appears to be an empty list:

thread NNNNN panic: integer overflow
std/array_list.zig:964:32: if (self.items.len - 1 == i) return self.pop().?;
discv5/node.zig:expireRequests: _ = self.pending.swapRemove(i);

This was the crash observed in zeam when running a multi-node simulation (zeam beam).

Fix

Add a std.Thread.Mutex field mu to Node (zero-initialised at declaration, no API change to init()). Both public entry points that mutate node state acquire the lock for their full duration:

  • poll() – holds mu across recvLoop, expireRequests, and refreshBuckets
  • injectDatagram() – holds mu across the packet-dispatch switch

All internal helpers (handleOrdinary, handleWhoareyou, handleHandshake, expireRequests, sendPing, sendFindNode, …) are private and therefore automatically protected when called from either entry point.

Test plan

  • zig build test --summary all – all 145 tests pass
  • zig fmt --check src/discovery/discv5/node.zig – clean

The build already depends only on zquic (build.zig / build.zig.zon); the
lsquic+BoringSSL vendor tree was dead code. Drop vendor/ and the vendor
path from build.zig.zon; refresh README, UPSTREAM, CI label, and justfile
for zquic-only QUIC.
Node is a single-threaded poll-based design, but callers commonly run
poll() in a dedicated thread while delivering packets via injectDatagram()
from a separate UDP receive thread.  Without synchronisation the two paths
race on `pending` and `challenge_pending` (both ArrayListUnmanaged), which
can trigger an integer-overflow panic inside swapRemove when one thread
observes items.len == 0 mid-operation:

  thread NNNNN panic: integer overflow
  std/array_list.zig:964: if (self.items.len - 1 == i) ...
  discv5/node.zig:expireRequests: _ = self.pending.swapRemove(i);

Add a std.Thread.Mutex field `mu` to Node (zero-initialised, no change to
the init() signature) and acquire it for the full duration of poll() and
injectDatagram().  All internal helpers called from those two entry points
are private and therefore automatically protected.
@ch4r10t33r
Copy link
Copy Markdown
Collaborator Author

Closing — root cause is in zeam's pollLoop (double discv5.poll call), not in zig-ethp2p itself.

@ch4r10t33r ch4r10t33r closed this Apr 15, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant