Skip to content

feat(sqs): listenless SQSServer for admin-only deployments#740

Open
bootjp wants to merge 1 commit intomainfrom
feat/sqs-admin-listenless
Open

feat(sqs): listenless SQSServer for admin-only deployments#740
bootjp wants to merge 1 commit intomainfrom
feat/sqs-admin-listenless

Conversation

@bootjp
Copy link
Copy Markdown
Owner

@bootjp bootjp commented May 5, 2026

Summary

Make /admin/api/v1/sqs/* reachable on builds that start without --sqsAddress. The admin handlers (list, describe, delete queues) only need the coordinator and store — not the public SigV4 listener — so this PR teaches startSQSServer to construct a listenless *adapter.SQSServer instead of returning (nil, nil) and dropping the admin bridge.

Concretely:

  • startSQSServer no longer short-circuits when sqsAddr is empty. It still skips lc.Listen (no port is opened), but it constructs NewSQSServer(nil, …) and threads the result back through the runtime.
  • (*SQSServer).Run() detects the nil listener and blocks on reaperCtx.Done() instead of calling httpServer.Serve. The reaper and throttle-sweep goroutines still run so retention math behind the admin counters stays correct.
  • main_admin.go's newSqsQueuesSource now sees a non-nil *adapter.SQSServer and wires /admin/api/v1/sqs/*; the if sqsServer == nil { return nil } guard stays as a defensive check for tests that still pass nil.
  • The runtime-struct comment for r.sqsServer updated to reflect that the field is always non-nil after startup.

Why this is "option A" of the prior discussion

We discussed three options earlier:

  • A — listenless construction across all three adapters (SQS / S3 / Dynamo).
  • B — refactor admin to use the catalog/coordinator directly, no adapter dependency. Crosses adapter and admin package boundaries; would warrant a *_proposed_*.md design doc per CLAUDE.md.
  • C — keep current behaviour, document the gating.

You picked A. This PR scopes to SQS only since that was the immediate ask. Symmetric Dynamo/S3 changes can land as a parallel PR if you want; both follow the same structural pattern (drop early-return in start*Server, teach (*XxxServer).Run() to handle nil listener) but each adapter's Run() body has different background work to keep alive (S3 manifest cleanup workers, etc.) so a separate review pass per adapter is the cleaner cut.

Caller / semantic audit

Per the standing rule on semantic-changing edits:

  • adapter/sqs_catalog.go::queueURL already nil-guards s.listen; no other reads of s.listen exist in the adapter package (verified via grep).
  • main.go runtime field: only comment updated. The Go type is unchanged; the only change is that the zero-value pathway is no longer reachable.
  • newSqsQueuesSource still nil-guards its argument; no caller in main.go passes nil any more, but the guard remains for tests that construct admin servers without an SQS adapter.
  • (*SQSServer).Run() now has two terminating paths (reaper-ctx-block and httpServer.Serve return). Stop() cancels the reaper ctx, which both paths observe — same shutdown semantics on either branch.

Tests

  • TestSQSServer_ListenlessRunStopRoundTrip (new) — pins the nil-listener Run/Stop round trip. Constructs NewSQSServer(nil, nil, nil), launches Run() in a goroutine, calls Stop(), and asserts Run() returns nil within 2s. Without the new s.listen == nil branch in Run(), httpServer.Serve(nil) would panic and this test would fail.
  • TestSQSServer_StopShutsDown (existing) — unchanged. Anchors the listening-path round trip so a regression in either branch does not silently swap them.

Self-review (5 lenses)

  1. Data loss — N/A; no storage / Raft / FSM touch.
  2. ConcurrencyRun() adds a second terminating path; both paths observe the same reaperCtx cancellation that Stop() triggers.
  3. Performance — listenless mode adds zero work to the listening path. Listenless construction adds no Raft round-trips on startup.
  4. Data consistency — admin handlers take a fresh nextTxnReadTS per call; the listener's presence never affected snapshot semantics.
  5. Test coverageTestSQSServer_ListenlessRunStopRoundTrip pins the new branch; the existing TestSQSServer_StopShutsDown anchors the listening branch.

Test plan

  • go test -race -count=1 -run 'TestSQS' ./adapter/... (all SQS tests pass)
  • golangci-lint --config=.golangci.yaml run ./... clean
  • go build ./... clean
  • Live manual smoke: start a cluster with --sqsAddress="", confirm /admin/api/v1/sqs/queues returns the catalog list (operator-driven, out of scope for the merge gate)

Refs

  • adapter/sqs_admin.go — admin entrypoints
  • main_admin.go — admin bridge

Before: a build started without --sqsAddress short-circuited
startSQSServer to (nil, nil); newSqsQueuesSource then received
nil and admin.NewServer left /admin/api/v1/sqs/* off the wire,
so operators who wanted admin visibility into queues without
exposing the public SigV4 surface had no way to do that.

After: startSQSServer constructs the SQSServer even when
sqsAddr is empty, passing a nil net.Listener through to
NewSQSServer. (*SQSServer).Run() detects the nil listener and
blocks on reaperCtx.Done() instead of httpServer.Serve, so:

- The public SigV4 HTTP listener is suppressed (matches the
  prior behaviour for that surface).
- The reaper and throttle-sweep goroutines still run, so
  retention math behind the admin counters stays correct.
- The admin bridge in main_admin.go now sees a non-nil
  *adapter.SQSServer and wires /admin/api/v1/sqs/* on the
  wire — admin operations (list, describe, delete) work the
  same way they would on a build with a public listener.

Caller / semantic audit:

- adapter/sqs_catalog.go queueURL nil-guards s.listen
  already; no other reads of s.listen exist.
- main.go runtime field: comment updated to reflect that
  r.sqsServer is always non-nil after startup. The actual
  field type is unchanged; the only change is that its zero-
  value pathway is no longer reachable.
- newSqsQueuesSource still nil-guards its argument as a
  defensive check for test fixtures that pass nil; no
  caller in main.go passes nil any more.
- Dynamo and S3 adapters keep the prior listener-required
  semantics. Symmetric refactor for those is a deliberate
  follow-up — this PR scopes to the immediate ask (SQS
  admin reachable on listenless builds).

Tests: TestSQSServer_ListenlessRunStopRoundTrip pins the
nil-listener Run/Stop round trip — Run() must enter the
reaperCtx-block branch, Stop() must wind down both reaper
and throttle-sweep goroutines, and Run() must return nil
within a 2s deadline. TestSQSServer_StopShutsDown stays
unchanged, anchoring the listening-path round trip alongside
it.

Self-review (5 lenses):
1. Data loss — N/A; no storage / Raft / FSM touch.
2. Concurrency — Run() now has two terminating paths
   (reaperCtx.Done() vs httpServer.Serve return); Stop()
   cancels the reaper ctx, which both paths observe.
3. Performance — listenless mode adds zero work to the
   listening path. Listenless construction adds no Raft
   round-trips on startup.
4. Data consistency — admin handlers continue to take a
   fresh nextTxnReadTS per call; the listener's presence
   never affected snapshot semantics.
5. Test coverage — new TestSQSServer_ListenlessRunStopRoundTrip
   pins the new branch end-to-end; the existing
   TestSQSServer_StopShutsDown anchors the listening branch
   so a regression in either does not silently swap them.
@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented May 5, 2026

@claude review

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 5, 2026

Warning

Rate limit exceeded

@bootjp has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 6 minutes and 33 seconds before requesting another review.

To continue reviewing without waiting, purchase usage credits in the billing tab.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: da7df1d4-71a0-47d1-ac54-128f31fb6f57

📥 Commits

Reviewing files that changed from the base of the PR and between 81b0202 and edeec0f.

📒 Files selected for processing (4)
  • adapter/sqs.go
  • adapter/sqs_test.go
  • main.go
  • main_sqs.go
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/sqs-admin-listenless

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@claude
Copy link
Copy Markdown

claude Bot commented May 5, 2026

Claude finished @bootjp's task in 3m 14s —— View job


Code Review

  • Gather context and read changed files
  • Review adapter/sqs.go changes
  • Review adapter/sqs_test.go changes
  • Review main.go changes
  • Review main_sqs.go changes
  • Cross-cutting checks (concurrency, correctness, conventions)
  • Post final review

The implementation is sound and clean. The nil-listener branch in Run() is logically tight, the two-goroutine shutdown pattern correctly mirrors startS3Server, and the new test properly pins the regression surface. A few things worth examining:

Behavior change: credential file is now loaded even when --sqsAddress="" (main_sqs.go:47-53)

Previously startSQSServer short-circuited with return nil, nil before ever calling loadSigV4StaticCredentialsFile. Now it always calls the loader. From main_sigv4_creds.go:

if path == "" {
    return nil, nil  // no-op for empty path
}

So operators who omit --sqsCredentialsFile are unaffected. But operators who previously ran with --sqsCredentialsFile=/some/bad/path and --sqsAddress="" — where the bad path was silently ignored because the early-return fired first — will now hit a startup failure. This is arguably better (fail-fast on misconfiguration) but it's an undocumented behavior change. Worth a one-liner in the PR description or a comment in startSQSServer noting that credential loading now runs regardless of whether the listener is opened.

Fix this →

httpServer.Shutdown on an unstarted server (adapter/sqs.go:347-353)

Stop() calls s.httpServer.Shutdown(context.Background()) unconditionally. In listenless mode, httpServer.Serve is never called, so Shutdown runs against a server with zero active listeners and zero connections — it returns immediately. This is safe per Go's implementation, but slightly surprising to a future reader of Stop(). A one-liner like // Shutdown is a no-op when Serve was never called would make the intent explicit without adding overhead.

newSqsQueuesSource comment is now slightly misleading (main_admin.go:182-192)

// Returns nil when sqsServer is nil so admin.NewServer leaves /admin/api/v1/sqs/* off the wire.
func newSqsQueuesSource(sqsServer *adapter.SQSServer) admin.QueuesSource {
    if sqsServer == nil {
        return nil
    }

The doc comment implies this nil path is the SQS-disabled-state branch. After this PR, sqsServer is always non-nil in production; the guard exists only for tests that construct admin servers without a SQS adapter. A small update like "Returns nil only when called from tests that pass nil directly; the production wiring always passes a non-nil *SQSServer." would prevent a future reader from re-introducing the return (nil, nil) early-return thinking the nil branch is the intended disable path.

Fix this →

Minor: time.Sleep in listenless test (adapter/sqs_test.go:226)

time.Sleep(20 * time.Millisecond)
srv.Stop()

For the listenless path this sleep isn't strictly necessary for correctness: if Stop() fires before Run() reaches <-s.reaperCtx.Done(), the context is already canceled when Run() gets there, so it unblocks immediately. The sleep is fine as defensive symmetry with TestSQSServer_StopShutsDown, but a comment noting this would help distinguish the two cases for future readers of the test.

httpServer and full mux initialized in all NewSQSServer calls including listenless

The mux, all 17 handler closures, and httpServer are wired even when listen == nil. The footprint is negligible and avoiding it would complicate the constructor. No action needed — just noting it's a known tradeoff.

Five-lens self-review

  1. Data loss — No Raft/FSM/store touch. Clean. ✅
  2. Concurrency — Both Run() paths observe the same reaperCtx cancellation. Stop() cancels reaperCancel (set in NewSQSServer, not Run()) so a Stop() before Run() is safe. The go test -race run in the test plan is the right gate here. ✅
  3. Performance — Listenless adds zero cost to the hot path; reaper/sweep goroutines start but do no work until messages exist. ✅
  4. Data consistency — Admin handlers take fresh nextTxnReadTS per call; listener presence never affected snapshot semantics. ✅
  5. Test coverage — New test covers the nil-listener Run/Stop round trip. The existing TestSQSServer_StopShutsDown anchors the listening branch so a regression in either direction would be caught. ✅

Summary

Three small items: (1) the credential-loader behavior change deserves a comment or a note in the description, (2) Stop() calling Shutdown on an unstarted server could use a clarifying comment, (3) newSqsQueuesSource's doc comment is now misleading. None are blockers; the core logic is correct and the test coverage is appropriate.

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a 'listenless' mode for the SQS adapter, allowing the SQS server to be initialized without a public HTTP listener. This ensures that SQS admin endpoints remain functional even when the public SigV4 surface is disabled. The changes include updates to the Run method to handle nil listeners, a new regression test for this deployment shape, and modifications to startSQSServer to always instantiate the server. Feedback was provided regarding the silent ignoring of errors when closing the network listener, suggesting that these should be logged to improve visibility into potential resource leaks.

Comment thread main_sqs.go
if err != nil {
_ = sqsL.Close()
if sqsL != nil {
_ = sqsL.Close()
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The error from sqsL.Close() is silently ignored. According to the general rules, errors from Close() methods on resources like network listeners should be logged to ensure that resource leaks or other cleanup problems are visible to operators.

References
  1. Do not silently ignore errors from Close() methods on resources like network connections. Log them to ensure resource leaks or other cleanup problems are visible.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: edeec0ff3d

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread main_sqs.go
Comment on lines 47 to +51
staticCreds, err := loadSigV4StaticCredentialsFile(credentialsFile, "sqs")
if err != nil {
_ = sqsL.Close()
if sqsL != nil {
_ = sqsL.Close()
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Skip SQS credentials validation in listenless mode

When --sqsAddress is empty, this function still parses --sqsCredentialsFile and returns an error on any file issue before constructing the listenless server. That makes admin-only SQS deployments fail to start if the credentials path is stale/malformed, even though admin queue handlers do not use SigV4 credentials. This is a behavior regression from the previous "SQS disabled" path and can unintentionally keep /admin/api/v1/sqs/* unavailable in the exact mode this change is meant to support.

Useful? React with 👍 / 👎.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant