fleetnode: server-driven discovery via ControlStream (stack 3/3)#324
Conversation
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: bc9502fcf4
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
🔐 Codex Security Review
Review SummaryOverall Risk: HIGH Findings[HIGH] Fleet node executes server-supplied discovery targets without a local allowlist
[MEDIUM]
|
2226f77 to
3694b1f
Compare
bc9502f to
99da7c8
Compare
3694b1f to
dc50f50
Compare
2ecc137 to
ce5370e
Compare
ce5370e to
bb5c562
Compare
bb5c562 to
83c42e7
Compare
e31e9ee to
b86dc70
Compare
fbfb62d to
3e0c139
Compare
66ef99d to
38dd55c
Compare
There was a problem hiding this comment.
Pull request overview
PR 3 of 3 completes the fleet-node agent's server-driven discovery surface by wiring up the bidirectional ControlStream, adding nmap support, and centralizing IP-list normalization for reuse between the agent and the existing pairing service.
Changes:
- New
runControlLoop/runControlSession/handleCommandincontrol.gothat consumes serializedpairing.v1.DiscoverRequestcommands, dispatches IPList/IPRange/Nmap discovery, batches reports throughReportDiscoveredDevices, and emits typedAckCodeacknowledgements (with proto + generated Go/TS updated for the new enum). - New
nmap.go(with unix/windows ownership-check shims) implementing strict target validation, adjacent-binary safety checks, hostname → IPv4-preferred resolution, IPv6 CIDR rejection, and bounded probe fan-out from open ports. - New shared
netutil.NormalizeIPListEntryconsumed by bothpairing.Service.DiscoverWithIPList(replacing ~45 lines of inline logic) and the agent's IPList command path;run.goaddsnmapPath, the control-loop goroutine, and aWaitGroup/errChshutdown pattern alongside the heartbeat.
Reviewed changes
Copilot reviewed 12 out of 14 changed files in this pull request and generated no comments.
Show a summary per file
| File | Description |
|---|---|
| server/cmd/fleetnode/control.go | Adds full ControlStream consumer, command dispatch, port/IP validation, IPv4 range expansion, fan-out probing, batched report uploads, and AckCode plumbing. |
| server/cmd/fleetnode/control_test.go | Extensive new coverage: probe/report happy path, mode rejections, port/IP validation, range expansion, partial-timeout reporting, reconnect/EOF, serialization, ctx-cancel, and stuck-plugin supervisor. |
| server/cmd/fleetnode/nmap.go | New nmap target validation, safe binary resolution, hostname resolution, and scan-to-probe pipeline. |
| server/cmd/fleetnode/nmap_unix.go / nmap_windows.go | Per-OS ownership/permission validation (no-op on Windows). |
| server/cmd/fleetnode/nmap_test.go | Target grammar, binary resolution safety, hostname substitution, IPv6 flagging, and CIDR rejection tests. |
| server/cmd/fleetnode/run.go | Adds nmapPath, resolver fields; runs heartbeat + control loop concurrently when a discoverer is available. |
| server/cmd/fleetnode/README.md | Documents control stream, nmap mode, security caveats, and troubleshooting entries. |
| server/internal/domain/netutil/iplist.go(_test.go) | New shared NormalizeIPListEntry helper with full unit-test coverage. |
| server/internal/domain/pairing/service.go | DiscoverWithIPList delegates to the new shared normalizer. |
| proto/fleetnodegateway/v1/fleetnodegateway.proto | Adds AckCode enum and ControlAck.code field. |
| server/generated/grpc/.../fleetnodegateway.pb.go | Regenerated Go bindings for the new enum/field. |
| client/src/protoFleet/api/generated/.../fleetnodegateway_pb.ts | Regenerated TS bindings for the new enum/field. |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: ab56c775e8
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: f269298d2a
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
PR 3/3 of the fleetnode agent stack. Completes the agent surface for server-driven discovery: bidi ControlStream consumer, nmap- and IpList-based scan paths, plugin-manager probe orchestration, and report streaming back through ReportDiscoveredDevices. Original implementation plus all review feedback. Control loop. - runControlLoop opens a bidi ControlStream and reconnects with exponential backoff (1s -> 30s) plus half-jitter. CodeNotFound and ErrBeginAuthRejected are treated as permanent; CodeUnimplemented drops the agent to heartbeat-only against pre-ControlStream servers. stableSessionThreshold (30s) resets the backoff once a session stabilises so flapping connections keep growing. - runControlSession does Hello/Accepted and then offloads command execution to a single-slot commandWorker goroutine. The dispatch channel is buffer=1; if the worker is already busy with a queued command, the receive loop ack-busies the new arrival rather than parking on a full channel and hiding stream events. - Worker runs on a session-scoped ctx so a dropped stream cancels the in-flight scan immediately. Without this the agent would burn up to commandTimeout (10 min) finishing an old scan before opening a new session. - All ControlStreamRequest sends go through a lockedAcker wrapper so the worker's completion ack and the receive loop's busy ack can't race on connect-go's bidi stream. - A watcher goroutine closes the bidi stream when the parent ctx fires; required because stream.Receive parks in http2.pipe on a sync.Cond that ctx alone can't unblock. Command dispatch. - discoverForCommand routes by oneof: IpList -> NormalizeIPListEntry + bounded plugin probes (probeConcurrency=32, perProbeTimeout=10s matching server-side pairing.perDeviceDiscoveryTimeout); IpRange -> expandIPv4Range with the same .0/.1 skip the server applies; Nmap -> validateNmapTarget + buildNmapOptions + Ullaakut/nmap/v3 with 16-way probe concurrency; Mdns -> rejected as AGENT_INCAPABLE. - Untrusted-server input is treated as adversarial: resolveAndValidatePorts emits canonical decimals (strconv.Itoa) so non-canonical inputs like "+80" or "080" don't poison reports; multi-octet nmap ranges (10.0-255.0-255.1-254) are rejected before the hostname fallback so they can't slip past the /22 CIDR cap; IPv4-mapped IPv6 collapses to v4 in NormalizeIPListEntry so v4-only checks see v4. - maxIPsPerCommand=1024, maxPortsPerIP=10 mirror server-side caps. commandTimeout=10m overall; per-batch report upload bounded at 30s. - fanOutProbes has a wall-clock supervisor (perProbeTimeout*2) so a plugin Probe that ignores ctx can't pin the agent. - Per-device protovalidate runs before each report is added to the batch; one bad device-supplied url_scheme/driver_name no longer fails the whole ReportDiscoveredDevices batch. - Scanned (ip, port) overrides anything the plugin reports for those fields so a malicious plugin can't spoof endpoints. - Truncated scans (cmdCtx hit commandTimeout) still upload their partial reports and ack PROBE_PARTIAL via the daemon ctx; the intent is operators see partial discovery rather than nothing. Nmap binary safety. - resolveNmapPath prefers an adjacent <exe-dir>/nmap and fails closed if that candidate is unsafe (no PATH fall-through past a staged binary). PATH fallback resolves once via LookPath. - validateNmapBinary returns the symlink-resolved path so a symlink swap after startup can't redirect exec to an unvalidated binary. - Both candidates pass validatePathChain (no writable ancestor without sticky bit), mirroring the plugin loader. - Empty nmap path fails closed in runNmapDiscovery rather than letting the library re-resolve via PATH at scan time. - POSIX-mode safety (uid + write bits + exec bit) lives in checkNmapBinaryOwnership on Unix; Windows .exe doesn't model these bits, so the windows build of the helper is a no-op. Wire contract. - ControlAck.code is an enum: OK, PARTIAL, BAD_REQUEST, AGENT_INCAPABLE, SCAN_FAILED, REPORT_FAILED, INTERNAL. Codes track consumer-visible distinctions only; error_message carries the human-readable detail and is utf8-safe-truncated at the proto's 4096-byte max_len. - discoveryReportTimeout per-batch (30s) bounds an individual ReportDiscoveredDevices call; the per-batch context is derived from the daemon ctx so partial-success uploads survive a session drop. - streamReports chunks at maxDevicesPerReport=1024 via slices.Chunk. Shared helpers in netutil. - NormalizeIPListEntry centralises ip_addresses handling (rejects scoped IPv6, link-local fe80::/10, hostname-prefer-IPv4); the pairing service uses it too, replacing ~45 lines of inline logic. - ParseIPv4 / IPv4ToUint32 / Uint32ToIPv4 / AdjustIPv4RangeStart consolidate the four bijection helpers previously duplicated between control.go and pairing/service.go. pairing's DiscoverWithIPRange now uses the shared helpers. - IsIPv4NetworkOrGateway mirrors the server-side .0/.1 filter so agent nmap result rows where a gateway answers for the whole subnet don't poison the batch. Tests. - control_test.go drives the loop end-to-end against a fake bidi gateway (controlFakeGateway) with knobs for permanent-error rejection, post-handshake stream close, report-upload failure, and a signal-based mid-stream close used to assert in-flight cancellation. Table-driven TestControlLoop_AcksAndReports covers the ack-code/report invariants across IpList happy path, IPv6 normalization, all-unusable IPs, over-cap IPs, over-cap ports, mdns rejection, nmap port-range bypass, IpRange expansion, corrupt payload, and report-upload failure in one pass. - Discrete tests for partial-results-survive-scan-deadline, permanent-error propagation, Unimplemented->heartbeat-only, reconnect-after-EOF, long-command-doesn't-block-receive, ctx-cancel-during-in-flight, fanOutProbes supervisor budget, drop-invalid-report-instead-of-poisoning-batch, override-plugin-supplied-endpoint, dropped-stream-cancels-scan, pipelined-command-gets-busy-ack, concurrent-acks-serialize, sendAck UTF-8-boundary truncation. - nmap_test.go covers resolveNmapPath happy path, adjacent unsafe fails closed, world-writable adjacent rejected, world-writable ancestor rejected, validateNmapTarget table (multi-octet ranges rejected, shell metacharacters rejected, etc.), resolveNmapTarget hostname substitution, IPv6 CIDR rejected, nmapPath="" fails closed, IPv6 scanning option assertions on actual scanner.Args(). - iputil_test.go: NormalizeIPListEntry table (scoped/link-local/ trailing-whitespace/IPv4-mapped IPv6 cases), ParseIPv4 + IPv4Uint32 round-trip, AdjustIPv4RangeStart with loopback carve-out, IsIPv4NetworkOrGateway. README updates document the control flow, nmap target rules, discovery security model, and the operator troubleshooting paths. Verification. - go build ./server/... clean - go vet ./server/... clean - go test -race -count=1 -timeout 180s ./server/cmd/fleetnode/... ./server/internal/domain/netutil/... green - golangci-lint on the touched packages 0 issues Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
abbbf5f to
faf1da1
Compare
The gateway proto restricted DiscoveredDeviceReport.url_scheme to
{"http","https"} via buf-validate; pairing.Device (the pre-existing
source schema in the pairing service) has no such restriction. The
packaged virtual plugin reports URLScheme: "virtual", and just
build-fleetnode includes native plugins, so server-driven discovery of
virtual miners had its report dropped by the agent's protovalidate gate
in fanOutProbes -- silently, with no record reaching the server.
Loosen the gateway constraint to match the source schema. Keep the
field bounded via max_len = 32 (long enough for any real URI scheme,
short enough that a buggy plugin can't sneak a multi-KB string into the
batch).
Regenerate the gateway pb.go and pb.ts.
Adjust TestFanOutProbes_DropsInvalidReportInsteadOfPoisoningBatch:
"virtual" is now valid, so use an oversized model string (>255 chars)
as the proto-rejected case. Add
TestFanOutProbes_AcceptsNonHTTPPluginScheme to lock the new behavior in.
Verification on the touched scope:
- go build clean
- go vet clean
- go test -race -count=1 -timeout 180s green
- golangci-lint: 0 issues
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Trim comments that restated what well-named identifiers already say or were verbose where one line suffices. Load-bearing WHY comments (http2.pipe watcher, sessionCtx rationale, lockedAcker concurrency note, UTF-8 truncation rationale, partial-results ctx choice, dev- fixture loopback carve-outs, etc.) are intact. - control.go: drop commandError struct doc, shrink resolveAndValidatePorts doc and inline canonical-form note from 3 lines to 2/1. - control_test.go: drop discoverIPList doc, shrink runControlLoopOnce doc, drop "All fixtures carry url_scheme..." prose, shrink the DroppedStream test setup explanation. - netutil/iputil.go: drop pure-WHAT docs on IPv4ToUint32 / Uint32ToIPv4 (the WHY-non-obvious panic warning stays as a one-liner). No behavior change. Verification on the touched scope: - go build clean - go vet clean - go test -race -count=1 -timeout 180s green - golangci-lint: 0 issues Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Five tests dropped where the compiler, a stronger sibling test, or the Go language itself already covered the same ground. - TestNormalizeIPListEntry_DefaultResolverSatisfiesInterface Pure `var _ IPListResolver = net.DefaultResolver` compile-time assertion. The compiler runs this on every build; the wrapping test function added nothing. - TestIPv4Uint32RoundTrip Asserted Uint32ToIPv4(IPv4ToUint32(addr)) == addr.String() for a handful of IPs. The encoding is the contract -- TestIPv4ToUint32_ KnownValues and TestUint32ToIPv4_KnownValues pin specific bytes for both directions, which is strictly more informative; a swapped byte order would fool the round-trip property but fail the known-values tables. - TestSendAck_ClampsErrorMessageToProtoLimit Strict subset of TestSendAck_TruncationPreservesUTF8Boundaries, which checks every invariant the ASCII variant did plus utf8. ValidString. - TestBuildNmapOptions_RejectsIPv6CIDR TestResolveNmapTarget already has the `"ipv6 cidr rejected"` case at the layer where the rejection lives; the wrapping test was just verifying Go's `return err` propagation up one call frame. - TestResolveNmapPath_EmptyExeDir_NoPath TestResolveNmapPath_Default_AdjacentNotExecutableFallsBack reaches the same final code path (empty PATH lookup returns ""); the preceding branch (skip-adjacent-when-exeDir-empty) is just an `if` with no logic worth a dedicated test. The substantive coverage stays intact: every business rule, every security path, every regression behavior. No code change, no production behavior change. -74 test lines. Verification on the touched scope: - go build clean - go vet clean - go test -race -count=1 -timeout 180s green - golangci-lint: 0 issues Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 245600be43
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
…indows Two of the four Codex findings from this round; the other two are unchanged-from-prior-rounds product decisions surfaced below. handleCommand now runs protovalidate.Validate(cmd) before using any field. This catches oversized command_id (>128 bytes), empty command_id (min_len=1), and oversized payload (>1 MiB max_len) before they propagate into ReportDiscoveredDevices/ControlAck messages that would themselves fail validation at the gateway. If command_id is the field that's invalid, echoing it back in an ack would also fail validation and close the stream -- in that case the agent drops silently and lets the server time out. New regression queues an empty- command_id command followed by a valid one and asserts only the valid one is acked, proving the receive loop kept running normally past the dropped command. resolveNmapPath disables PATH fallback on Windows. The Windows checkNmapBinaryOwnership is a no-op (Windows ACLs aren't modeled), so any nmap.exe the process resolves on PATH was previously trusted -- an elevated service plus a less-privileged-writable PATH entry would let an attacker swap the binary. New build-tag-split constants nmapBinaryName and nmapAllowPATHFallback: Windows uses "nmap.exe" with no PATH fallback (require an installer-staged adjacent binary under an Administrator-only directory); Unix keeps the existing "nmap" + PATH fallback. Cross-compile verification: GOOS=windows go build clean. Two findings deferred: - [HIGH] CIDR allowlist for probe targets. Fourth review round this has come up; the decision (where the allowlist lives, default behavior, operator UX) isn't a code change I can land unilaterally. Incremental defenses already in this PR: IPv4-mapped IPv6 collapse, /22 nmap CIDR cap, multi-octet nmap range rejection, agent-side .0/.1 result filter. Broader allowlist remains a product decision for a separate PR. - [MEDIUM] Straggler-goroutine accumulation. fanOutProbes already has the perProbeTimeout*2 supervisor that returns the partial batch, but the in-flight goroutine itself keeps running until the plugin RPC returns. Real fix is plugin-manager territory (signal plugin health, kill+restart misbehaving subprocesses); doesn't belong in the control loop. Verification on the touched scope: - go build clean (unix + GOOS=windows cross-compile) - go vet clean - go test -race -count=1 -timeout 180s green - golangci-lint: 0 issues Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: c99010faf0
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
cmdCh capacity 1 had a real race: the receive loop can land a second command before the worker has drained the first, so the buffer briefly looks full to the non-blocking send and a perfectly valid command gets busy-acked. TestControlLoop_LongCommandDoesNotBlockNextReceiveAndSerializes hit this nondeterministically. Bump to capacity 2 (1 in flight + 1 queued); a 3rd outstanding command still hits the busy-ack path. TestControlLoop_PipelinedCommandGetsBusyAck adjusted to queue four commands so cmd-D is the one that overflows. Stress: 200 iterations of both tests under -race, all green. Comment cleanup pass on the protovalidate / Windows-nmap / cmdCh additions: the WHY survives, the prose around it gets trimmed (11 lines of comment removed). Verification on the touched scope: - go build clean (unix + GOOS=windows cross-compile) - go test -race -count=1 -timeout 180s green - go test -race -count=200 on the race-sensitive tests green - golangci-lint: 0 issues Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
When a plugin Probe ignores its context, the fanOutProbes supervisor budget (perProbeTimeout*2) fires and returns the reports collected so far. Before this change, handleCommand only sent ACK_CODE_PARTIAL when cmdCtx itself hit commandTimeout; supervisor-truncated batches were acked OK and the server had no signal that some endpoints were silently dropped. Thread a truncated bool from waitWithSupervisor -> fanOutProbes -> probeIPsAndPorts / runNmapDiscovery -> discoverForCommand -> handleCommand. handleCommand now branches: ack PARTIAL if cmdCtx deadline exceeded OR truncated, OK otherwise. Two distinct PARTIAL error_message strings so the server can tell command-deadline vs. supervisor-budget apart. New end-to-end TestControlLoop_SupervisorTruncatedScanAcksPartial drives the full bidi flow with a probe that ignores ctx and asserts the agent emits ACK_CODE_PARTIAL with the supervisor reason. The pure-unit TestFanOutProbes_SupervisorReturnsPartialOnStuckPlugin now also asserts the truncated bool. Verification on the touched scope: - go build clean - go test -race -count=1 -timeout 180s green - golangci-lint: 0 issues Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Four-line PARTIAL ack rationale -> two lines. fanOutProbes signature doc dropped the bool's "(supervisor fired or ctx cancelled mid-fan-out so the caller can ack PARTIAL)" tail; the variable name "truncated" and the body carry it. Test assertions and arrange/setup comments tightened the same way. ctxIgnoringDiscoverer doc collapsed to one line. Behavior unchanged. -8 lines. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Asserting cmd-D specifically was timing-dependent: whether cmd-D or cmd-C (or cmd-E) lands the busy ack depends on whether the worker goroutine drains cmd-A before the receive loop reads the next command. CI hit the cmd-C-busy-acked path while my laptop favored cmd-D. Queue 5 commands and assert ANY trailing one gets the busy ack. With cmd-A's probe blocking and buffer capacity 2, at least one of cmd-C / cmd-D / cmd-E must overflow regardless of scheduling, so the test is deterministic now. Also bumped the eventually budget 2s -> 3s for CI slack. 200 iterations under -race green for both TestControlLoop_PipelinedCommandGetsBusyAck and TestControlLoop_LongCommandDoesNotBlockNextReceiveAndSerializes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Reusing ACK_CODE_INTERNAL for the "agent already has a command in
flight" rejection forced callers to string-match the error_message
("in flight") to distinguish it from real internal errors, and the
test was doing exactly that. Promote it to a first-class code.
ACK_CODE_BUSY = 8 in the proto. The busy-ack site in control.go now
emits the new code; the pipelined-command test asserts on the code
directly and drops the error_message substring check.
Semantics: BUSY is transient (retry after the prior ack); INTERNAL
remains the catch-all for unexpected errors.
Verification on the touched scope:
- go build clean
- go test -race -count=1 -timeout 180s green
- 200 iterations under -race on the pipelined + long-command tests
- golangci-lint: 0 issues
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
PR 3 of 3. Stacked on #323. Completes the agent surface: `ControlStream` consumer, nmap, IPList normalizer, control-loop wiring in `run.go`.
Test plan
Stack
🤖 Generated with Claude Code