fix(keyexchange): demote same-session PILA log to Debug#174
Merged
Conversation
Observed against list-agents (node 179172) on 2026-05-29: a fresh
daemon sends its first PILA, peer replies and trust is established,
but the relayed data plane drops our PILS replies. Peer retransmits
its PILA every ~8 s as a keepalive. Each arrival carries the SAME
X25519 ephemeral (hadCrypto=true, keyChanged=false) and lands well
outside DuplicateHandshakeDebounce (250 ms), so the existing duplicate
gate doesn't catch it — every retransmit fires another 'encrypted
tunnel established' Info log and another tunnel.established bus event
even though structurally nothing was installed. 35 false-positive
'established' lines per peer per 5 minutes in field measurement.
Fix demotes the log to Debug for the same-session case while keeping
the bus event + postInstall hook firing (existing endpoint-refresh
contract pinned by TestDuplicatePILAOutsideDebounceFiresHookAgain).
Mirrors the demotion in HandleUnauthFrame (PILK) for the same reason.
Adds TestSameSessionPILASuppressesInfoButFiresHookAndDebug that pins:
- first PILA still produces an Info 'established' line + hook count 1
- second same-key PILA past debounce: hook count = 2 (endpoint refresh
preserved), 'established' Info line count stays at 1, Debug log
'same-session keepalive' present.
Does NOT change crypto/network behaviour. The asymmetric-recovery
reply path (TestAsymmetricRecoveryRepliesOnDuplicatePILAWhenStale)
and reply-rate-limit (TestReplyRateLimit*) gates are independent and
remain intact.
…-existing) The fan-out loop reused a single msg buffer across iterations and did the per-iteration copy INSIDE the spawned goroutine. The main goroutine's next msg[0] = byte(i & 0xFF) raced with the previous goroutine's copy(m2, m). Caught by go test -race in the Architecture gates job (report: zz_ipc_write_deadline_test.go:75 read vs :72 write). This was added in 1eff4fa (PILOT-218 write-deadline fix) before this branch existed; every PR opened since then has been failing the race-detector check. Hoisting the copy out of the goroutine fixes it without changing the test's intent (still floods ic.ipcWrite with ipcSendBuffer+10 distinct messages to fill the kernel send buffer). Touched alongside the keyexchange log-spam fix because the same PR job runs both and we can't merge until -race is green.
PR #156 / commit 1eff4fa was titled 'add write deadline to IPC writeLoop and bypass semaphore for CmdHealth', and its commit message correctly described the design (10 s per active write, 3 s drain on Close). The CmdHealth inline-dispatch half landed; the writeLoop SetWriteDeadline half didn't. Result: TestWriteLoopExitsOnWriteDeadline has been failing since #156 merged — first with a -race report (a buffer-reuse bug in the test's fan-out loop, fixed in the prior commit on this branch), then with 'writeLoop did not exit within deadline window' because the deadline the test waits on doesn't actually exist. Every PR opened since #156 has been silently blocked on Architecture gates for this reason. Adds: - ipcWriteTimeout = 10 s (the active-write deadline) - ipcDrainTimeout = 3 s (the Close-drain deadline) - SetWriteDeadline calls in writeLoop's both arms, matching the contract the test expects. SetWriteDeadline errors are deliberately swallowed — net.Pipe ignores the call, and any real socket that doesn't support it is already broken in ways the next Write will surface. Semantically a no-op for the happy path (normal clients read fast, the deadline never trips). For a stalled client it does what PILOT-218 wanted: writeLoop times out → c.Conn.Close() → writeDone closes → every parked ipcWrite caller returns ErrIPCClosed → semaphore drains → daemon is responsive again.
Companion to the prior commit (actually-set the writeLoop deadline). After writeLoop hits ipcWriteTimeout and closes writeDone, ipcWrite still had a sendCh-with-room enqueue path that returned nil — the message would land in the channel, sit there orphaned, and the caller would think it succeeded. The slow-path select did catch writeDone, but only after the buffer filled. Added a non-blocking writeDone check next to the existing c.done fast-fail, so any ipcWrite after writeLoop exits returns ErrIPCClosed immediately regardless of sendCh capacity. Pinned by TestWriteLoopExitsOnWriteDeadline's final assertion.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
A fresh daemon connected to list-agents (node 179172) via the production registry/beacon was logging `encrypted tunnel established` at Info every ~8 seconds — 35 false-positive 'established' lines for one peer in 5 minutes. The session was actually live the whole time; the noise came from a gap in the duplicate-coalescing logic.
Field observation (2026-05-29)
After PR #155-era code shipped, build my freshly-compiled daemon (`audit-build-e05aa1f7`) and run `pilotctl handshake list-agents` + `pilotctl send-message list-agents --wait`. Logs show:
```
encrypted tunnel established auth=true peer_node_id=179172 endpoint=34.71.57.205:9001 relay=true
encrypted tunnel established auth=true peer_node_id=179172 endpoint=34.71.57.205:9001 relay=true
encrypted tunnel established auth=true peer_node_id=179172 endpoint=34.71.57.205:9001 relay=true
... (35 events total, ~one every 8 s)
```
Daemon stats: `tunnel_encryption_success: 44`, `pkts_recv: 0`, `peers: 1 (relay)`. So the AEAD path works but the data plane is dropping our PILS replies. Peer (list-agents) interprets that as us being stale and keeps retransmitting its PILA every ~8 s.
Root cause
`pkg/daemon/keyexchange/handle.go`:
```go
duplicate := hadCrypto && !keyChanged && time.Since(oldPC.CreatedAt) < DuplicateHandshakeDebounce
```
`DuplicateHandshakeDebounce = 250 ms`. An 8-second-cadence retransmit lands well past that, so `duplicate=false`. The frame carries the same X25519 ephemeral though — `keyChanged=false` — so nothing is installed (`!hadCrypto || keyChanged` is false). But the code still fires the full side-effect path: Info log, `tunnel.established` bus event, `PostInstallHook`.
Fix
Distinguish two slow-path cases:
Mirrored in `HandleUnauthFrame` (PILK) for symmetry.
Test
New: `TestSameSessionPILASuppressesInfoButFiresHookAndDebug` — full-flow regression:
Test is not `t.Parallel()` because `slog.Default` is process-global; parallel `SetDefault` calls race.
Existing tests still green:
What does NOT change
Test plan