Skip to content

fix: timer-based cleanup of listenQueues after transient exporter disconnect#417

Open
ambient-code[bot] wants to merge 1 commit intomainfrom
fix-listen-queue-race
Open

fix: timer-based cleanup of listenQueues after transient exporter disconnect#417
ambient-code[bot] wants to merge 1 commit intomainfrom
fix-listen-queue-race

Conversation

@ambient-code
Copy link
Copy Markdown
Contributor

@ambient-code ambient-code bot commented Apr 7, 2026

Summary

Fixes the race condition in listenQueues cleanup that caused intermittent Error: Connection to exporter lost in E2E tests (issue #414).

This is the proper follow-up to the revert in #416.

Root Cause

When an exporter's Listen() gRPC stream exits with a transient error, the queue for that lease must not be deleted immediately — a concurrent Dial() call may have already loaded the same queue and be about to (or have already) written a router token into its buffer. If the queue is deleted before the reconnecting exporter calls Listen() again, the token is lost and the client times out after 20 s with "Connection to exporter lost".

Fix

Instead of cleaning up immediately on stream error, a time.AfterFunc timer is scheduled for listenQueueCleanupDelay (default 2 minutes). The reconnect path in Listen() cancels this timer via listenTimers.LoadAndDelete before calling LoadOrStore, so the reconnected exporter inherits the existing queue — and any buffered Dial token.

On clean shutdown (ctx.Done() — lease ended or server stopping) the timer is cancelled and the queue removed straight away, so there is no memory leak for the normal lifecycle.

Transient error path:
  Listen() stream error
    → schedule cleanup timer (2 min)
    → return error
  Exporter reconnects within 2 min:
    Listen() → cancel timer → LoadOrStore(existing queue) → reads Dial token ✓
  Exporter gone for > 2 min:
    timer fires → queue deleted (bounded leak) ✓

Clean shutdown path:
  ctx.Done() fires
    → cancel any timer
    → delete queue immediately ✓

Changes

  • controller/internal/service/controller_service.go
    • Add listenQueueCleanupDelay (var, default 2 min — overridable in tests)
    • Add listenTimers sync.Map field to ControllerService
    • Listen(): cancel pending timer on reconnect; schedule timer on stream error; immediate cleanup on ctx.Done()
  • controller/internal/service/controller_service_test.go
    • TestListenQueueTimerCleanup: queue survives transient error; reconnect cancels timer; timer fires when exporter never returns
    • TestListenQueueCleanShutdown: clean ctx.Done() path removes queue immediately

Testing

=== RUN   TestListenQueueTimerCleanup
--- PASS: TestListenQueueTimerCleanup (0.40s)
=== RUN   TestListenQueueCleanShutdown
--- PASS: TestListenQueueCleanShutdown (0.00s)

Closes #414

🤖 Generated with Claude Code

@netlify
Copy link
Copy Markdown

netlify bot commented Apr 7, 2026

Deploy Preview for jumpstarter-docs ready!

Name Link
🔨 Latest commit 65c93a4
🔍 Latest deploy log https://app.netlify.com/projects/jumpstarter-docs/deploys/69d6845f3e4fc8000820ac6a
😎 Deploy Preview https://deploy-preview-417--jumpstarter-docs.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify project configuration.

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai bot commented Apr 7, 2026

Important

Review skipped

Bot user detected.

To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 662e6785-1034-4f55-b861-962f3b4035a7

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix-listen-queue-race

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@mangelajo
Copy link
Copy Markdown
Member

@ambient-code please rebase this

@ambient-code ambient-code bot force-pushed the fix-listen-queue-race branch from 6ec44ae to d55da90 Compare April 8, 2026 14:17
…connect

When an exporter's Listen() gRPC stream fails with a transient error the
queue is no longer deleted immediately.  Instead a cleanup timer (default
2 min) is scheduled.  If the exporter reconnects before the timer fires,
Listen() cancels the timer and inherits the existing queue — ensuring that
any router token already buffered there by a concurrent Dial() call is
delivered to the reconnected exporter.

On clean shutdown (ctx.Done() — lease ended or server stopping) the timer
is cancelled and the queue is removed straight away, so there is no memory
leak for the normal lifecycle.

Fixes #414
@ambient-code ambient-code bot force-pushed the fix-listen-queue-race branch from d55da90 to 65c93a4 Compare April 8, 2026 16:37
@ambient-code
Copy link
Copy Markdown
Contributor Author

ambient-code bot commented Apr 8, 2026

Rebased onto latest main (force-pushed). The branch now includes the two new commits from main:

No conflicts during rebase. All CI checks were passing before the rebase; the rebased commit (65c93a41) should be clean as well.

Status summary:

  • No unaddressed review comments requesting code changes
  • All CI checks passed on the previous push
  • Rebase onto main completed as requested by @mangelajo

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

race condition in listenQueues cleanup causes intermittent 'Connection to exporter lost'

1 participant