fix(stream): goroutine handoff to prevent post-SyncResponse notification drops#441
Open
steiler wants to merge 2 commits into
Open
fix(stream): goroutine handoff to prevent post-SyncResponse notification drops#441steiler wants to merge 2 commits into
steiler wants to merge 2 commits into
Conversation
…ion drops Fixes #440. buildTreeSyncWithDatastore previously called syncToRunning inline, blocking the consumer goroutine for the duration of ApplyToRunning → performRevert. Any notification arriving while the goroutine was blocked waited up to notificationSendTimeout (5 s) and was then silently and permanently dropped, leaving the running view of the device incomplete until the datastore was recreated. Fix: on SyncResponse and ticker events, snapshot the local syncTree, reset it to a fresh empty tree, and run syncToRunning in a background goroutine. The main select loop never blocks on a commit and keeps draining updChan at all times. Post-SyncResponse notifications land in the fresh syncTree and are committed by the next ticker (≤ 5 s lag, self-healing, no permanent loss). Implementation details: - atomic.Bool syncRunning prevents ticker goroutines from stacking while a commit is in flight (skip-if-busy) - tickerChan starts as nil (never selects) and is activated lazily on the first SyncResponse, replacing the previous tickerActive bool flag - syncToRunning signature simplified: no mutex parameter, no tree return value - updChanBufSize and notifSendTimeout are configurable fields (defaults unchanged) to enable deterministic tests - NewEmptyTree failure after handoff exits the loop rather than continuing with a shared tree Decision rationale: docs/adr/0003-streamsync-goroutine-handoff-for-notification-drop-prevention.md Co-authored-by: Cursor <cursoragent@cursor.com>
Codecov Report❌ Patch coverage is
📢 Thoughts on this report? Let us know! 🚀 New features to boost your workflow:
|
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Fixes #440.
Summary
buildTreeSyncWithDatastorepreviously calledsyncToRunninginline, blocking the consumer goroutine for the entire duration ofApplyToRunning→performRevert. Notifications arriving during this window waited up tonotificationSendTimeout(5 s) and were then silently and permanently dropped, leaving the Running view of the device incomplete until the datastore was recreated.passwordleaf absent for 34+ minutes) was caused exactly by this: a post-SyncResponseonChange notification was dropped whileperformRevertheld the goroutine.Changes
pkg/datastore/target/gnmi/stream.goSyncResponseand ticker events, snapshot the localsyncTree, reset it to a fresh empty tree, and handsyncToRunningoff to a background goroutine. The mainselectloop never blocks on a commit and keeps drainingupdChanat all times.atomic.Bool syncRunning— ticker skips if a goroutine is already in flight (skip-if-busy, not queue).tickerChanstarts asnil(never fires in Go select) and is activated lazily on the firstSyncResponse, replacing the previoustickerActive boolflag — removes one redundant variable.syncToRunningsignature simplified: no mutex parameter (each goroutine owns its tree exclusively after handoff), no tree return value.updChanBufSizeandnotifSendTimeoutare configurable fields (defaults unchanged) to enable deterministic regression tests.NewEmptyTreefailure after handoff exits the loop rather than continuing with a shared tree that the goroutine is still reading.pkg/datastore/target/gnmi/stream_test.go(new)TestBuildTreeSyncWithDatastore_PostSyncNotificationsNotDropped— regression test for this bug. Uses a zero-bufferupdChanand a 20 ms send timeout: with the old code the consumer goroutine is blocked so pool workers time out and the secondApplyToRunningis never called (test fails); with the fix the goroutine handoff keeps the loop draining and the ticker commits the post-sync notifications (test passes).TestBuildTreeSyncWithDatastore_NewEmptyTreeFailureExits— verifies the loop exits cleanly onNewEmptyTreefailure.docs/adr/0003-streamsync-goroutine-handoff-for-notification-drop-prevention.md(new)