Skip to content

feat(fleetnode): server-initiated discovery via ControlStream [PR 2/2]#235

Open
ankitgoswami wants to merge 1 commit into
mainfrom
ankitg/discovery-pairing
Open

feat(fleetnode): server-initiated discovery via ControlStream [PR 2/2]#235
ankitgoswami wants to merge 1 commit into
mainfrom
ankitg/discovery-pairing

Conversation

@ankitgoswami
Copy link
Copy Markdown
Contributor

@ankitgoswami ankitgoswami commented May 14, 2026

PR 2 of a stack — stacks on top of #332 (PR 1: pairing + agent reporting). Layers operator-initiated discovery on top of the pairing + agent-reporting surface; builds on the existing fleetnodepairing.UpsertDiscoveredDevices ingestion path so the agent's batches flow through the registry to the operator's waiting stream.

control

Summary

  • fleetnodecontrol.Registry — single-instance in-memory map of fleet_node_id → active ControlStream + per-command_id event channel. Newest-wins eviction, 64-slot buffered events, atomic droppedEvents counter, plus HasInFlightCommand for command-id binding.
  • FleetNodeGateway.ControlStream — bidi handler. After Hello, registers the stream and pumps outgoing ControlCommand + incoming ControlAck through a side goroutine (buffer 2 to avoid linger on exit).
  • ReportDiscoveredDevices command-id binding — rejects reports with empty or unknown command_id (FailedPrecondition). Only persists when an in-flight server-issued command matches the reporting fleet node.
  • FleetNodeAdmin.DiscoverOnFleetNode — operator-facing server-streaming RPC. Validates CONFIRMED, normalizes IPRange to IPList (cap 4096), rejects MDNS, forwards IPList/Nmap. command_id via id.GenerateID(); payload via proto.Marshal. Bounded by a 5-minute per-command deadline.
  • pairing.protobuf.validate count caps on DiscoverRequest modes (4096 IPs, 256 ports per mode).
  • ipscanner.GenerateIPsFromCIDR — exported for cross-package reuse.
  • middleware/rpc_permissionsDiscoverOnFleetNodefleetnode:manage.
  • discovered_by_fleet_node_id immutable origin attribution — migration 000062 adds the column + index. Set once at first agent report, never cleared. GetActiveUnpairedDiscoveredDevices filters on IS NULL so the legacy cloud-side pairing flow never dials an IP a fleet node reported. UpsertDiscoveredDeviceFromFleetNode carries attribution + the per-row guard against cross-fleet-node overwrites.

Codex Security Review — round-2 findings addressed

  • HIGH — UnpairDevice / RevokeFleetNode were clearing attribution. Split origin from ownership: fleet_node_device tracks ownership; discovered_by_fleet_node_id is immutable origin. Cloud pairing keeps refusing agent-reported IPs after unpair/revoke. (commit 83feac26)
  • MEDIUM — publish/PublishAck channel-close race. Hold the mutex through the non-blocking send so cleanup/Unregister can't close the channel between lookup and send. New TestPublish_RaceWithCleanup exercises the path under -race. (commit 81aedebf)
  • MEDIUM — DiscoverOnFleetNode could wait forever. Wrapped the operator ctx in context.WithTimeout(DiscoverCommandTimeout) (default 5min); returns DeadlineExceeded distinctly from operator cancel. New TestDiscoverOnFleetNode_TimesOutWhenAgentNeverResponds. (commit 81c225c6)

Deliberately deferred

  • HIGH — normalizeDiscoverRequest doesn't validate scan targets (loopback / link-local / multicast / public IPs / hostnames / oversized Nmap CIDRs). Operator with fleetnode:manage is treated as a trusted role; the threat model assumes operators don't deliberately turn their own fleet node into a port scanner. Revisit if/when a less-trusted role gets fleet-node access.

Stack

Test plan

  • cd server && go build ./... && go vet ./...
  • just lint clean (buf, eslint, golangci-lint)
  • DB_PASSWORD=fleet go test -count=1 -race across all affected packages — green
  • CI green

🤖 Generated with Claude Code

Copilot AI review requested due to automatic review settings May 14, 2026 18:02
@ankitgoswami ankitgoswami requested a review from a team as a code owner May 14, 2026 18:02
@github-actions github-actions Bot added javascript Pull requests that update javascript code client server shared labels May 14, 2026
@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 14, 2026

🔐 Codex Security Review

Note: This is an automated security-focused code review generated by Codex.
It should be used as a supplementary check alongside human review.
False positives are possible - use your judgment.

Scope summary

  • Reviewed pull request diff only (9b5bf09537c612bc74af28d389f87f01d9174b0e...493c1575371181334d120665463aadf58e0e06d5, exact PR three-dot diff)
  • Model: gpt-5.5

💡 Click "edited" above to see previous reviews for this PR.


Review Summary

Overall Risk: MEDIUM

Findings

[MEDIUM] Fleet-node reports can claim un-attributed discovered devices

  • Category: Network Discovery
  • Location: server/sqlc/queries/fleetnodepairing.sql:57
  • Description: The fleet-node upsert allows a reporting node to update an existing discovered_device whenever discovered_by_fleet_node_id IS NULL, then sets attribution and endpoint fields from the report. Existing rows after migration, and server-local discoveries, are NULL-attributed. A compromised authenticated fleet node with any active discovery command can therefore report the identifier of an unpaired NULL-attributed device and claim/redirect it.
  • Impact: Discovery poisoning and ownership confusion. The original row can be removed from server-local pairing views, attributed to the hostile node, and later paired/routed through the wrong fleet node.
  • Recommendation: On conflict, only allow remote reports to update rows already attributed to the same fleet node. Handle legacy/NULL attribution through an explicit operator-approved claim path or strong immutable identity verification before setting attribution.

[MEDIUM] Remote-discovered devices are stranded from pairing

  • Category: Reliability
  • Location: server/internal/domain/pairing/service.go:1204
  • Description: PairingService.Pair now refuses remote-fleet-node-attributed discovered rows and directs callers to PairDeviceToFleetNode, but the remote discovery path only persists discovered_device rows and streams pairing.Device objects. PairDeviceToFleetNode requires an existing device_id, and the changed unpaired-device queries exclude remote-attributed rows, so newly discovered fleet-node devices have no usable follow-up pairing path.
  • Impact: Operators can see a miner from DiscoverOnFleetNode but cannot pair it unless a device row already exists through some other flow.
  • Recommendation: Add an atomic remote-pair endpoint that accepts the attributed discovered_device/device_identifier, promotes it to device, and inserts fleet_node_device, or return a stable pairable database identifier from remote discovery and expose it in the admin flow.

[LOW] Server accepts remote scan sizes the fleet-node executor rejects

  • Category: Reliability
  • Location: server/internal/handlers/fleetnodeadmin/handler.go:277
  • Description: DiscoverOnFleetNode expands IP ranges up to 4096 targets, matching the new proto cap, but the fleet-node executor hard-caps IP-list commands at 1024. Requests with 1025-4096 targets pass server validation, get dispatched, then fail on the agent and are surfaced as internal discovery failures.
  • Impact: Valid-looking operator requests fail late and waste a command slot; the client gets an internal error instead of a validation error.
  • Recommendation: Align server/proto and agent limits, or have the server enforce the agent’s actual maximum before dispatch. Also map agent BAD_REQUEST acks to InvalidArgument.

Notes

No cryptostealing/pool hijack logic, hardcoded payout addresses, SQL injection, shell command injection, or protobuf wire-format breakage was evident in the reviewed diff. Generated proto/sqlc outputs appear to be included with their source changes.


Generated by Codex Security Review |
Triggered by: @ankitgoswami |
Review workflow run

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 44d1f14dad

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread server/cmd/fleetnode/discovery.go Outdated
Comment thread server/cmd/fleetnode/discovery.go Outdated
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR wires up end-to-end fleet-node device discovery reporting and operator-driven device pairing. It introduces new gateway/admin RPCs, persists discovery attribution to the reporting fleet node, and adds an agent-side discovery loop that periodically scans CIDRs and reports discovered devices back to the server.

Changes:

  • Add new RPCs: ReportDiscoveredDevices (gateway) and PairDeviceToFleetNode / UnpairDevice / ListFleetNodeDevices (admin), plus server auth interceptor allowlists.
  • Add DB support for discovery attribution (discovered_device.discovered_by_fleet_node_id) and pairing queries + domain service/store for pairing operations.
  • Add fleetnode agent discovery scanning loop with plugin-backed probing and unit tests.

Reviewed changes

Copilot reviewed 21 out of 32 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
server/sqlc/queries/fleetnodepairing.sql New sqlc queries for discovery upsert, pairing/unpairing, and listing pairings.
server/sdk/v1/python/proto_fleet_sdk/generated/pb/driver_pb2_grpc.py Regenerated Python gRPC stubs (version constant change).
server/migrations/000050_add_fleet_node_attribution_to_discovered_device.up.sql Adds nullable discovered_by_fleet_node_id with FK + index for attribution.
server/migrations/000050_add_fleet_node_attribution_to_discovered_device.down.sql Rollback for attribution column, FK, and index.
server/internal/handlers/interceptors/config.go Adds new admin/gateway procedures to auth allowlists.
server/internal/handlers/fleetnodegateway/handler.go Adds ReportDiscoveredDevices handler and injects pairing service into gateway handler.
server/internal/handlers/fleetnodegateway/handler_heartbeat_test.go Updates handler construction to include pairing service dependency.
server/internal/handlers/fleetnodegateway/handler_discovery_test.go New tests for discovery reporting persistence + missing subject rejection.
server/internal/handlers/fleetnodeadmin/handler.go Adds pairing RPC handlers and injects pairing service dependency.
server/internal/handlers/fleetnodeadmin/handler_pairing_test.go New integration tests covering admin pairing/unpair/list and auth gating.
server/internal/domain/stores/sqlstores/fleetnodepairing.go New SQL store implementing pairing/discovery upsert operations.
server/internal/domain/fleetnodepairing/service.go New domain service for pairing/listing and discovery upsert batching with error mapping.
server/internal/domain/fleetnodepairing/models.go New domain models for discovery reports and pairing summaries.
server/internal/domain/fleetnodepairing/integration_test.go Integration tests for service behavior (pair/unpair/list + discovery attribution).
server/generated/sqlc/models.go Regenerated sqlc models to include DiscoveredByFleetNodeID.
server/generated/sqlc/fleetnodepairing.sql.go Regenerated sqlc code for new pairing/discovery queries.
server/generated/sqlc/discovered_device.sql.go Regenerated sqlc code to select new attribution column.
server/generated/sqlc/db.go Regenerated sqlc statement preparation/closing for new queries.
server/generated/grpc/fleetnodegateway/v1/fleetnodegatewayv1connect/fleetnodegateway.connect.go Regenerated Connect stubs for new gateway RPC.
server/generated/grpc/fleetnodeadmin/v1/fleetnodeadminv1connect/fleetnodeadmin.connect.go Regenerated Connect stubs for new admin RPCs.
server/generated/grpc/fleetnodeadmin/v1/fleetnodeadmin.pb.go Regenerated protobuf Go types for pairing RPCs/messages.
server/cmd/fleetnode/run.go Adds discovery flags and runs discovery tick loop alongside heartbeat.
server/cmd/fleetnode/run_test.go Extends stub gateway client to record discovery RPC calls/batches.
server/cmd/fleetnode/fake_gateway_test.go Extends fake gateway server to handle ReportDiscoveredDevices.
server/cmd/fleetnode/discovery.go Implements CIDR expansion, probing, and reporting logic (plugin-backed discoverer).
server/cmd/fleetnode/discovery_test.go New unit tests for CIDR enumeration and discovery tick behavior/error handling.
server/cmd/fleetd/main.go Wires new pairing service into server handlers.
proto/fleetnodegateway/v1/fleetnodegateway.proto Adds ReportDiscoveredDevices RPC + request/response/messages.
proto/fleetnodeadmin/v1/fleetnodeadmin.proto Adds pairing/admin RPCs + request/response/messages.
client/src/protoFleet/api/generated/fleetnodegateway/v1/fleetnodegateway_pb.ts Regenerated TS API types/service for new gateway RPC.
client/src/protoFleet/api/generated/fleetnodeadmin/v1/fleetnodeadmin_pb.ts Regenerated TS API types/service for new admin RPCs.

Comment thread server/cmd/fleetnode/discovery.go Outdated
Comment thread server/cmd/fleetnode/discovery.go Outdated
@ankitgoswami ankitgoswami marked this pull request as draft May 14, 2026 18:10
@ankitgoswami ankitgoswami force-pushed the ankitg/discovery-pairing branch from 9839915 to 0f7b702 Compare May 14, 2026 22:37
@ankitgoswami ankitgoswami changed the title feat(fleetnode): discovery + pairing E2E (server + agent) feat(fleetnode): discovery + pairing E2E (server) May 18, 2026
@ankitgoswami ankitgoswami force-pushed the ankitg/discovery-pairing branch from b4d4ad4 to be68841 Compare May 19, 2026 21:01
@github-actions github-actions Bot removed javascript Pull requests that update javascript code client shared labels May 19, 2026
@ankitgoswami ankitgoswami force-pushed the ankitg/discovery-pairing branch from e5c338a to 717ac76 Compare May 27, 2026 17:34
@ankitgoswami ankitgoswami force-pushed the ankitg/discovery-pairing branch from 65c5fa2 to e48be40 Compare May 28, 2026 17:26
ankitgoswami added a commit that referenced this pull request May 28, 2026
Three findings from a fresh re-review of PR #235 after the server/agent split (the original Codex inline comments on this PR target a file that now lives on a different branch).

- ProcedurePermissions catalog drift: Pair/Unpair/ListFleetNodeDevices/DiscoverOnFleetNode were listed as "UNIMPLEMENTED STUB" in ProceduresPendingMigration even though their handlers are fully implemented and gated via RequirePermission. The contract test reads this map as the source of truth, so a regression that dropped the gate would have gone unnoticed. Moved all four entries into ProcedurePermissions with the right key (manage / read).
- Unbounded IPList / ports counts in DiscoverOnFleetNode: an operator with fleetnode:manage could submit 1M IPs * 10k ports. Added (buf.validate.field).repeated.max_items on the proto (4096 on ip_addresses, 256 on ports in all three modes) plus defense-in-depth maxIPListEntries / maxPortsPerMode checks in normalizeDiscoverRequest so the limits hold even if the validator interceptor is misconfigured.
- Silent event drop in fleetnodecontrol.Registry: the non-blocking publish to a 16-slot buffer dropped batches silently when the operator stream fell behind. Bumped the buffer to 64 and added an atomic dropped-event counter exposed via Registry.DroppedEvents so callers and tests have a signal that batches were lost.

Two items deliberately deferred (see PR description): RFC1918 / private-range gating on discovery targets, and surfacing the dropped-batch count to the operator UX.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions github-actions Bot added javascript Pull requests that update javascript code client shared labels May 28, 2026
@ankitgoswami ankitgoswami force-pushed the ankitg/discovery-pairing branch from 8f24a99 to 9254c2d Compare May 28, 2026 20:54
ankitgoswami added a commit that referenced this pull request May 28, 2026
Three findings from a fresh re-review of PR #235 after the server/agent split (the original Codex inline comments on this PR target a file that now lives on a different branch).

- ProcedurePermissions catalog drift: Pair/Unpair/ListFleetNodeDevices/DiscoverOnFleetNode were listed as "UNIMPLEMENTED STUB" in ProceduresPendingMigration even though their handlers are fully implemented and gated via RequirePermission. The contract test reads this map as the source of truth, so a regression that dropped the gate would have gone unnoticed. Moved all four entries into ProcedurePermissions with the right key (manage / read).
- Unbounded IPList / ports counts in DiscoverOnFleetNode: an operator with fleetnode:manage could submit 1M IPs * 10k ports. Added (buf.validate.field).repeated.max_items on the proto (4096 on ip_addresses, 256 on ports in all three modes) plus defense-in-depth maxIPListEntries / maxPortsPerMode checks in normalizeDiscoverRequest so the limits hold even if the validator interceptor is misconfigured.
- Silent event drop in fleetnodecontrol.Registry: the non-blocking publish to a 16-slot buffer dropped batches silently when the operator stream fell behind. Bumped the buffer to 64 and added an atomic dropped-event counter exposed via Registry.DroppedEvents so callers and tests have a signal that batches were lost.

Two items deliberately deferred (see PR description): RFC1918 / private-range gating on discovery targets, and surfacing the dropped-batch count to the operator UX.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@ankitgoswami ankitgoswami force-pushed the ankitg/discovery-pairing branch from 1d79036 to bc2bb30 Compare May 28, 2026 21:48
@ankitgoswami ankitgoswami changed the title feat(fleetnode): discovery + pairing E2E (server) feat(fleetnode): server-initiated discovery via ControlStream [PR 2/2] May 28, 2026
@ankitgoswami ankitgoswami changed the base branch from main to ankitg/fleetnode-pairing May 28, 2026 21:48
ankitgoswami added a commit that referenced this pull request May 28, 2026
Address security review on PR #332: the validator allows empty scheme but rejects "virtual", the scheme the virtual plugin emits (plugin/virtual/internal/driver/driver.go:160,186). Every legitimate virtual-plugin discovery report currently fails validation.

- Add "virtual" to the allowlist so virtual-plugin reports round-trip cleanly.
- Drop "" from the allowlist — empty was an undocumented placeholder, not a graceful default. The agent's plugin driver always knows the scheme at probe time.
- Tests: TestUpsertDiscoveredDevices_AcceptsVirtualScheme (new positive case), TestUpsertDiscoveredDevices_RejectsEmptyScheme (new negative case), existing RejectsDisallowedScheme (ftp) untouched.

The other two findings from the same review (command_id binding, attribution-based cloud-pairing quarantine) live in PR #235.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ankitgoswami added a commit that referenced this pull request May 28, 2026
…and_id

Address two security-review findings on PR #235:

1. Discovery reports must bind to a server-issued command. Previously ReportDiscoveredDevices persisted any authenticated agent's report; an attacker (or buggy agent) could spam the org's discovered_device table without operator initiation. Now the handler rejects with FailedPrecondition when command_id is empty or not currently in flight in the registry. Added Registry.HasInFlightCommand getter scoped per fleet_node_id; the existing newest-wins + per-fleet-node command map already provides the ownership invariant we need.

2. Agent-reported rows must not leak into cloud-side pairing. Restored the discovered_by_fleet_node_id column (migration 000062), the SetDiscoveredDeviceAttributionForDevice / ClearAttributionForFleetNode queries, the corresponding store + service methods, and the dd.discovered_by_fleet_node_id IS NULL filter on GetActiveUnpairedDiscoveredDevices / Count. Cloud-side PairDevices short-circuits on dd.DiscoveredByFleetNodeID != nil so the server never dials an IP a fleet node reported. UpsertDiscoveredDeviceFromFleetNode keeps the same per-fleet-node NOT EXISTS pairing guard plus the per-row attribution guard so another fleet node can't overwrite an existing row.

Tests:
- handler_discovery_test.go: replaced PersistsRows with RejectsMissingCommandID + RejectsUnknownCommandID (the happy-path persistence is already covered end-to-end by PublishesBatchToInFlightCommand).
- fleetnodepairing/integration_test.go: restored the attribution-era tests we deleted in the prior simplification commit (Reconciles, AttributesFleetNode, RejectsReportFromOtherFleetNode, PairUnpair_SyncsDiscoveredDeviceAttribution, RevokeClearsPairingsAndAttribution renamed back).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@ankitgoswami ankitgoswami force-pushed the ankitg/discovery-pairing branch from bc2bb30 to 46ce86b Compare May 28, 2026 22:50
ankitgoswami added a commit that referenced this pull request May 29, 2026
…and_id

Address two security-review findings on PR #235:

1. Discovery reports must bind to a server-issued command. Previously ReportDiscoveredDevices persisted any authenticated agent's report; an attacker (or buggy agent) could spam the org's discovered_device table without operator initiation. Now the handler rejects with FailedPrecondition when command_id is empty or not currently in flight in the registry. Added Registry.HasInFlightCommand getter scoped per fleet_node_id; the existing newest-wins + per-fleet-node command map already provides the ownership invariant we need.

2. Agent-reported rows must not leak into cloud-side pairing. Restored the discovered_by_fleet_node_id column (migration 000062), the SetDiscoveredDeviceAttributionForDevice / ClearAttributionForFleetNode queries, the corresponding store + service methods, and the dd.discovered_by_fleet_node_id IS NULL filter on GetActiveUnpairedDiscoveredDevices / Count. Cloud-side PairDevices short-circuits on dd.DiscoveredByFleetNodeID != nil so the server never dials an IP a fleet node reported. UpsertDiscoveredDeviceFromFleetNode keeps the same per-fleet-node NOT EXISTS pairing guard plus the per-row attribution guard so another fleet node can't overwrite an existing row.

Tests:
- handler_discovery_test.go: replaced PersistsRows with RejectsMissingCommandID + RejectsUnknownCommandID (the happy-path persistence is already covered end-to-end by PublishesBatchToInFlightCommand).
- fleetnodepairing/integration_test.go: restored the attribution-era tests we deleted in the prior simplification commit (Reconciles, AttributesFleetNode, RejectsReportFromOtherFleetNode, PairUnpair_SyncsDiscoveredDeviceAttribution, RevokeClearsPairingsAndAttribution renamed back).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@ankitgoswami ankitgoswami force-pushed the ankitg/discovery-pairing branch from a4130d5 to 63852e0 Compare May 29, 2026 17:04
ankitgoswami added a commit that referenced this pull request May 29, 2026
Adds the server-side surfaces operators need to manage fleet-node device pairings and the gateway endpoint agents call to report devices they discovered on their LAN. This is PR 1 of a stack — PR 2 (open as #235) layers server-initiated discovery via ControlStream on top.

What's in this PR:
- fleetnodepairing domain (Service + Store) with PairDevice, UnpairDevice, ListPairs, ListDevicesForFleetNode, UpsertDiscoveredDevices, plus IP/port/scheme validation on agent-reported devices.
- fleet_node_device pairing table queries and UpsertDiscoveredDeviceFromFleetNode with a NOT EXISTS guard that prevents fleet node B from overwriting a device already paired with fleet node A.
- FleetNodeGateway.ReportDiscoveredDevices RPC: agents authenticated via fleetnodeauth submit batches of devices; ip_address must be RFC1918/RFC4193, port 1-65535, url_scheme http or https.
- FleetNodeAdmin.PairDeviceToFleetNode, UnpairDevice, ListFleetNodeDevices RPCs, gated by fleetnode:manage / fleetnode:read via middleware.RequirePermission.
- RevocationCleanupStore extracted from fleetnodeenrollment.Store so RevokeFleetNode now deletes the fleet node's pairings as part of the same TX.
- Integration tests for the pairing CRUD round trip, double-pair rejection, soft-deleted/pending node rejection, cross-org isolation, agent-report validation (invalid IP, port, scheme, non-private ranges, RFC4193 IPv6), the NOT EXISTS pairing guard, and revoke-clears-pairings.

What's deferred to PR 2:
- fleetnodecontrol.Registry (in-memory ControlStream + per-command_id event dispatch).
- FleetNodeGateway.ControlStream bidi handler.
- FleetNodeAdmin.DiscoverOnFleetNode operator-initiated discovery, plus the proto-level max_items caps on DiscoverRequest modes.
- The command_id correlation hook in ReportDiscoveredDevices that fans batches to the operator's waiting stream.

Build, vet, lint, and tests for middleware, fleetnodeadmin, fleetnodegateway, fleetnodepairing, and fleetnodeenrollment are green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ankitgoswami added a commit that referenced this pull request May 29, 2026
Address security review on PR #332: the validator allows empty scheme but rejects "virtual", the scheme the virtual plugin emits (plugin/virtual/internal/driver/driver.go:160,186). Every legitimate virtual-plugin discovery report currently fails validation.

- Add "virtual" to the allowlist so virtual-plugin reports round-trip cleanly.
- Drop "" from the allowlist — empty was an undocumented placeholder, not a graceful default. The agent's plugin driver always knows the scheme at probe time.
- Tests: TestUpsertDiscoveredDevices_AcceptsVirtualScheme (new positive case), TestUpsertDiscoveredDevices_RejectsEmptyScheme (new negative case), existing RejectsDisallowedScheme (ftp) untouched.

The other two findings from the same review (command_id binding, attribution-based cloud-pairing quarantine) live in PR #235.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@ankitgoswami ankitgoswami force-pushed the ankitg/fleetnode-pairing branch from 35e7a64 to ad35e6c Compare May 29, 2026 17:50
@ankitgoswami ankitgoswami force-pushed the ankitg/discovery-pairing branch 2 times, most recently from f079e46 to 6e78884 Compare May 29, 2026 18:40
Base automatically changed from ankitg/fleetnode-pairing to main May 29, 2026 19:39
@ankitgoswami ankitgoswami force-pushed the ankitg/discovery-pairing branch from 43337b0 to faa18a1 Compare May 29, 2026 19:43
PR 2 of a stack. Layers operator-initiated discovery on top of the
pairing + agent-reporting surface in PR 1 (#332). Builds on the existing
fleetnodepairing.UpsertDiscoveredDevices ingestion path; an in-memory
registry correlates server-issued ControlCommand requests with the
agent's eventual ReportDiscoveredDevices batches.

What's in this PR:

- fleetnodecontrol.Registry: single-instance in-memory map of
  fleet_node_id -> active ControlStream + per-command_id event channel
  (CommandEvent { Batch | Ack }). Newest-wins eviction signaled via a
  done channel (so outgoing channel is never closed under a publisher);
  Send selects on done to bail cleanly. Publishers hold the mutex
  through the bounded non-blocking send to avoid panicking on a
  closed channel when cleanup races. Dropped-event counter on a 64-slot
  buffer, exposed via DroppedEvents().

- FleetNodeGateway.ControlStream: bidi handler. Hello receive is wrapped
  in a 5s timeout (HelloTimeout var) so an authenticated-but-idle agent
  cannot hold a server goroutine + HTTP/2 stream indefinitely. After
  Hello, registers the stream and pumps outgoing ControlCommand
  requests + incoming ControlAck responses through a side goroutine
  (2-buffer to avoid linger on exit).

- ReportDiscoveredDevices: rejects reports without a command_id or
  whose command_id is not in flight for this fleet_node (binds to
  server-issued ControlCommand). UpsertDiscoveredDevices now returns
  acceptedIdx []int instead of an opaque count; only the rows the store
  actually accepted are forwarded to the operator's command stream so
  ownership-rejected rows can't leak.

- FleetNodeAdmin.DiscoverOnFleetNode: operator-facing streaming RPC.
  Validates target is CONFIRMED, normalizes IPRange to IPList (capped
  at 4096 expanded addresses), rejects MDNS, forwards IPList/Nmap.
  Wraps the operator ctx with DiscoverCommandTimeout (5m default, var
  for test override) so a buggy/silent agent cannot pin operator
  streams + registry entries forever. Returns CodeDeadlineExceeded on
  timeout. Uses id.GenerateID() for command_id and proto.Marshal for
  the payload.

- discovered_by_fleet_node_id is immutable origin tracking. Set on
  first agent report; never cleared by PairDevice / UnpairDevice /
  RevokeFleetNode. Cloud-side pairing.PairDevices refuses to dial any
  discovered_device with DiscoveredByFleetNodeID != nil so an
  agent-reported private IP cannot redirect cloud credentialing later.
  Migration 000064 adds the column + FK + partial index.

- UpsertDiscoveredDeviceFromFleetNode reconciles auto:* identifiers
  per (fleet_node, ip, port) endpoint so re-keyed scans collapse onto
  one row; mac:/serial: identifiers pass through unchanged.

- pairing.proto: buf.validate count caps on DiscoverRequest modes
  (4096 IPs, 256 ports per mode).

- middleware: DiscoverOnFleetNode gated on fleetnode:manage.

Review fixes folded in:

- Migration 000065 widens discovered_device.url_scheme from VARCHAR(10)
  to VARCHAR(32) to match the gateway proto's advertised max_len. Schemes
  of 11-32 chars (e.g. "stratum+tcp") passed validation but overflowed the
  column, failing the whole batch as an internal error.

- UpsertDiscoveredDevices tallies accepted/rejected into per-attempt
  locals reset on closure entry, so a RunInTx retry after a retryable
  Postgres/commit failure can no longer double-count a batch. Adds a unit
  test for the retry path and a DB-backed test for the 32-char scheme.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@ankitgoswami ankitgoswami force-pushed the ankitg/discovery-pairing branch from faa18a1 to 493c157 Compare May 29, 2026 19:59
@ankitgoswami ankitgoswami marked this pull request as ready for review May 29, 2026 20:03
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 493c157537

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

message IPListModeRequest {
// List of IP addresses (IPv4, IPv6, or hostnames) to check
repeated string ip_addresses = 1;
repeated string ip_addresses = 1 [(buf.validate.field).repeated.max_items = 4096];
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Align advertised scan limits with agent caps

When DiscoverOnFleetNode is used with an IP list/range above 1024 targets (or more than 10 ports), the request passes server-side validation because this proto cap is 4096 and the port caps are 256, but the fleetnode agent rejects it in server/cmd/fleetnode/control.go via maxIPsPerCommand = 1024 and maxPortsPerIP = 10. That makes operator requests that are valid according to the API consistently fail with an agent BAD_REQUEST ack instead of being rejected up front or chunked; keep these limits in sync or split the command before sending it to the agent.

Useful? React with 👍 / 👎.

}
}
if ev.Ack != nil {
if msg := ev.Ack.GetErrorMessage(); !ev.Ack.GetSucceeded() && msg != "" {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Treat every failed ack as command failure

If a fleet node sends ControlAck{Succeeded:false} with an empty error_message (for example from a buggy/custom agent or a failure path that only sets code), this condition is false and the operator stream returns success even though the command explicitly failed. The handler should key off !Succeeded regardless of whether a message string is present, otherwise failed scans can be reported as successful with no batches or partial results.

Useful? React with 👍 / 👎.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

client javascript Pull requests that update javascript code server shared

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants