Skip to content

fix(stream): goroutine handoff to prevent post-SyncResponse notification drops#441

Open
steiler wants to merge 2 commits into
mainfrom
fix/440-streamsync-notification-drop
Open

fix(stream): goroutine handoff to prevent post-SyncResponse notification drops#441
steiler wants to merge 2 commits into
mainfrom
fix/440-streamsync-notification-drop

Conversation

@steiler
Copy link
Copy Markdown
Collaborator

@steiler steiler commented May 28, 2026

Fixes #440.

Summary

  • buildTreeSyncWithDatastore previously called syncToRunning inline, blocking the consumer goroutine for the entire duration of ApplyToRunningperformRevert. Notifications arriving during this window waited up to notificationSendTimeout (5 s) and were then silently and permanently dropped, leaving the Running view of the device incomplete until the datastore was recreated.
  • The CI failure in StreamSync: post-SyncResponse notifications silently dropped when syncToRunning blocks the consumer goroutine #440 (mandatory password leaf absent for 34+ minutes) was caused exactly by this: a post-SyncResponse onChange notification was dropped while performRevert held the goroutine.

Changes

pkg/datastore/target/gnmi/stream.go

  • On SyncResponse and ticker events, snapshot the local syncTree, reset it to a fresh empty tree, and hand syncToRunning off to a background goroutine. The main select loop never blocks on a commit and keeps draining updChan at all times.
  • atomic.Bool syncRunning — ticker skips if a goroutine is already in flight (skip-if-busy, not queue).
  • tickerChan starts as nil (never fires in Go select) and is activated lazily on the first SyncResponse, replacing the previous tickerActive bool flag — removes one redundant variable.
  • syncToRunning signature simplified: no mutex parameter (each goroutine owns its tree exclusively after handoff), no tree return value.
  • updChanBufSize and notifSendTimeout are configurable fields (defaults unchanged) to enable deterministic regression tests.
  • NewEmptyTree failure after handoff exits the loop rather than continuing with a shared tree that the goroutine is still reading.

pkg/datastore/target/gnmi/stream_test.go (new)

  • TestBuildTreeSyncWithDatastore_PostSyncNotificationsNotDropped — regression test for this bug. Uses a zero-buffer updChan and a 20 ms send timeout: with the old code the consumer goroutine is blocked so pool workers time out and the second ApplyToRunning is never called (test fails); with the fix the goroutine handoff keeps the loop draining and the ticker commits the post-sync notifications (test passes).
  • TestBuildTreeSyncWithDatastore_NewEmptyTreeFailureExits — verifies the loop exits cleanly on NewEmptyTree failure.

docs/adr/0003-streamsync-goroutine-handoff-for-notification-drop-prevention.md (new)

…ion drops

Fixes #440.

buildTreeSyncWithDatastore previously called syncToRunning inline, blocking the
consumer goroutine for the duration of ApplyToRunning → performRevert. Any
notification arriving while the goroutine was blocked waited up to
notificationSendTimeout (5 s) and was then silently and permanently dropped,
leaving the running view of the device incomplete until the datastore was
recreated.

Fix: on SyncResponse and ticker events, snapshot the local syncTree, reset it
to a fresh empty tree, and run syncToRunning in a background goroutine. The
main select loop never blocks on a commit and keeps draining updChan at all
times. Post-SyncResponse notifications land in the fresh syncTree and are
committed by the next ticker (≤ 5 s lag, self-healing, no permanent loss).

Implementation details:
- atomic.Bool syncRunning prevents ticker goroutines from stacking while a
  commit is in flight (skip-if-busy)
- tickerChan starts as nil (never selects) and is activated lazily on the first
  SyncResponse, replacing the previous tickerActive bool flag
- syncToRunning signature simplified: no mutex parameter, no tree return value
- updChanBufSize and notifSendTimeout are configurable fields (defaults
  unchanged) to enable deterministic tests
- NewEmptyTree failure after handoff exits the loop rather than continuing with
  a shared tree

Decision rationale: docs/adr/0003-streamsync-goroutine-handoff-for-notification-drop-prevention.md

Co-authored-by: Cursor <cursoragent@cursor.com>
@steiler steiler requested a review from a team as a code owner May 28, 2026 14:49
@codecov
Copy link
Copy Markdown

codecov Bot commented May 29, 2026

Codecov Report

❌ Patch coverage is 80.85106% with 9 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
pkg/datastore/target/gnmi/stream.go 80.85% 6 Missing and 3 partials ⚠️

📢 Thoughts on this report? Let us know!

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

StreamSync: post-SyncResponse notifications silently dropped when syncToRunning blocks the consumer goroutine

1 participant