fix: timer-based cleanup of listenQueues after transient exporter disconnect #417
ambient-code[bot] wants to merge 1 commit into main
@ambient-code please rebase this
6ec44ae to d55da90
…connect

When an exporter's Listen() gRPC stream fails with a transient error, the queue is no longer deleted immediately. Instead a cleanup timer (default 2 min) is scheduled. If the exporter reconnects before the timer fires, Listen() cancels the timer and inherits the existing queue, ensuring that any router token already buffered there by a concurrent Dial() call is delivered to the reconnected exporter.

On clean shutdown (ctx.Done(): lease ended or server stopping) the timer is cancelled and the queue is removed straight away, so there is no memory leak for the normal lifecycle.

Fixes #414
d55da90 to 65c93a4
Rebased onto latest
No conflicts during rebase. All CI checks were passing before the rebase; the rebased commit ( Status summary:
Summary
Fixes the race condition in `listenQueues` cleanup that caused intermittent `Error: Connection to exporter lost` in E2E tests (issue #414).

This is the proper follow-up to the revert in #416.
Root Cause
When an exporter's `Listen()` gRPC stream exits with a transient error, the queue for that lease must not be deleted immediately: a concurrent `Dial()` call may have already loaded the same queue and be about to write (or have already written) a router token into its buffer. If the queue is deleted before the reconnecting exporter calls `Listen()` again, the token is lost and the client times out after 20 s with "Connection to exporter lost".

Fix
Instead of cleaning up immediately on stream error, a `time.AfterFunc` timer is scheduled for `listenQueueCleanupDelay` (default 2 minutes). The reconnect path in `Listen()` cancels this timer via `listenTimers.LoadAndDelete` before calling `LoadOrStore`, so the reconnected exporter inherits the existing queue and any buffered `Dial` token.

On clean shutdown (`ctx.Done()`: lease ended or server stopping) the timer is cancelled and the queue is removed straight away, so there is no memory leak for the normal lifecycle.

Changes
`controller/internal/service/controller_service.go`
- `listenQueueCleanupDelay` (var, default 2 min; overridable in tests)
- `listenTimers sync.Map` field added to `ControllerService`
- `Listen()`: cancel pending timer on reconnect; schedule timer on stream error; immediate cleanup on `ctx.Done()`

`controller/internal/service/controller_service_test.go`
- `TestListenQueueTimerCleanup`: queue survives transient error; reconnect cancels timer; timer fires when exporter never returns
- `TestListenQueueCleanShutdown`: clean `ctx.Done()` path removes queue immediately

Testing
Closes #414
🤖 Generated with Claude Code