Summary
StreamSync.buildTreeSyncWithDatastore is a single goroutine that both drains the incoming notification channel (updChan) and calls syncToRunning (which calls ApplyToRunning → performRevert). When syncToRunning blocks, the goroutine cannot drain updChan. Notifications that arrive during this window wait up to notificationSendTimeout (5 s) and are then silently and permanently dropped with no recovery path.
Two sub-problems
Problem A — syncToRunning blocks the consumer goroutine
buildTreeSyncWithDatastore is a single for/select loop. Both the notification-reading case and the commit cases share the same goroutine:
// pkg/datastore/target/gnmi/stream.go
for {
select {
case noti, ok := <-cUS:
syncTree.AddUpdatesRecursive(...) // fast, in-memory
case <-syncResponse:
syncTree, err = s.syncToRunning(...) // SLOW – blocks the entire loop
case <-ticker.C:
syncTree, err = s.syncToRunning(...) // SLOW – blocks the entire loop
}
}
While syncToRunning executes (inside ApplyToRunning / performRevert), the goroutine cannot consume from updChan.
Problem B — SyncResponse signal bypasses the task pool
gnmiSubscribe submits each SubscribeResponse_Update to a task pool (async, multiple concurrent workers) but sends the SyncResponse signal directly (bypassing the pool). The signal can therefore reach buildTreeSyncWithDatastore before some in-flight pool workers have finished writing their notifications to updChan. Those notifications then arrive while the goroutine is blocked in syncToRunning.
Note: a sentinel task routed through the pool does not fix this — with multiple concurrent workers a free worker can execute the sentinel before other workers finish their notification tasks. There is no ordering guarantee across workers.
The silent drop
func (t *notificationProcessorTask) sendNotificationData(ctx context.Context, data *NotificationData) error {
select {
case t.params.notificationResult <- data:
return nil
case <-time.After(notificationSendTimeout):
log.Error(context.DeadlineExceeded, "notification delivery timeout, dropping update", ...)
return nil // silently drops; no retry, no re-fetch, no resync trigger
}
}
Once an update is dropped the syncTree is permanently incomplete until the datastore is recreated.
Observed failure (CI reproduction)
From a real CI run (config-server run, 03-deviations suite, datastore default.sr2):
09:12:26 CreateDataStore (full sr2 restart)
09:12:26 Starting Sync
09:12:26 SyncResponse flag received (initial sync duration: 124ms)
└─ goroutine enters syncToRunning → performRevert (blocks)
09:12:31 "notification delivery timeout, dropping update" x4 (5 s after block)
09:12:36 "notification delivery timeout, dropping update" x3 (performRevert still running)
09:13:21 "SyncRunning due to ticker" → "sync no config changes" (steady state after drops)
09:13:39 "transaction validation failed … mandatory child [password] does not exist,
path: /configure/system/security/user-params/local-user/user[user-name=admin]"
└─ error repeats every ~60 s for 34 minutes until the CI job was cancelled
The admin user entry reached the syncTree (from the initial sync burst) but the password leaf arrived in a post-SyncResponse onChange notification that was dropped. Every subsequent TransactionSet validation failed because the mandatory password field was missing from the running tree.
A concurrent data-server CI run with identical code but different timing did not trigger the drop (zero notification delivery timeout log entries), confirming this is a non-deterministic race driven by how long performRevert holds the goroutine.
Impact
- Any notification dropped during a
syncToRunning call permanently corrupts the running view of the device until the datastore is fully recreated.
- In the observed case this caused
config-server to mark intents FailedUnRecoverable and stall finalizer removal for 34+ minutes.
- The problem is most likely to surface after a gNMI stream reconnect (which triggers a full resync and a potentially slow
performRevert), but it can also occur on every 5-second ticker commit if performRevert takes longer than 5 s.
Options considered
Option 1 — Goroutine handoff ✅ recommended
Snapshot the local tree, reset to a fresh empty tree immediately, and run syncToRunning in a background goroutine. The main loop never blocks on a commit.
case <-syncResponse:
treeToCommit := syncTree
syncTree, _ = s.runningStore.NewEmptyTree(s.ctx)
tickerActive = true
go func() {
if _, err := s.syncToRunning(treeToCommit, syncTreeMutex, true); err != nil {
log.Error(err, "failed committing synctree to running")
}
}()
case <-ticker.C:
if !tickerActive { ... }
treeToCommit := syncTree
syncTree, _ = s.runningStore.NewEmptyTree(s.ctx)
go func() {
if _, err := s.syncToRunning(treeToCommit, syncTreeMutex, true); err != nil {
log.Error(err, "failed committing synctree to running")
}
}()
syncTreeMutex already serialises concurrent calls to syncToRunning, so back-to-back SyncResponse and ticker commits queue safely. The main loop keeps draining updChan at all times.
Fixes problem A entirely. After this fix, problem B's impact shrinks to benign: any notification that races past the SyncResponse signal lands in the fresh tree and is committed by the next 5-second ticker (at most 5 s late), rather than being permanently dropped.
Option 2 — Increase updChan buffer size
Change make(chan *NotificationData, 20) to a larger value (e.g. 200).
Trivially safe, zero risk. Only raises the threshold at which drops occur — does not eliminate the root cause. Useful as an additional safety margin on top of Option 1, not as a standalone fix.
Option 3 — Block indefinitely (remove drop timeout)
Replace time.After(notificationSendTimeout) with pure blocking, relying on context cancellation for shutdown.
Eliminates drops but pushes the problem upstream: if the consumer goroutine is permanently stuck, pool workers block, the pool fills, and gnmiSubscribe eventually freezes the entire stream pipeline. A stuck performRevert would then silently freeze the whole sync rather than logging a drop. Does not fix problem A.
Option 4 — Per-window drain (VirtualPool or WaitGroup) ✅ complement to Option 1
Before signaling syncResponse, drain all in-flight notification tasks for the current window so the commit sees a complete snapshot. Two equivalent implementations:
VirtualPool per window — uses the existing CloseAndWait() API; no changes to task code:
case *gnmi.SubscribeResponse_SyncResponse:
taskPool.CloseAndWait() // drain all in-flight tasks
syncResponse <- struct{}{}
taskPool = s.vpoolFactory.NewVirtualPool(...) // fresh pool for next window
WaitGroup per window — lighter allocation (~24 bytes vs full VirtualPool struct); requires threading wg into task construction and calling wg.Done() inside Run:
// on each notification submission
wg.Add(1)
taskPool.Submit(newNotificationProcessorTask(..., wg))
// in notificationProcessorTask.Run
defer wg.Done()
// on SyncResponse
wg.Wait()
syncResponse <- struct{}{}
wg = &sync.WaitGroup{}
Both are functionally identical. The VirtualPool approach touches fewer files (coordination stays entirely in gnmiSubscribe). The WaitGroup is a lighter primitive but requires changes to the task type. Performance difference is negligible at one operation per 5 s.
Important: this option only works safely when combined with Option 1. Without the goroutine handoff, updChan can fill while syncToRunning is blocking, causing pool workers to stall on sendNotificationData. CloseAndWait() / wg.Wait() then blocks gnmiSubscribe for up to 5 s, freezing the gNMI stream reader and causing messages to back up in the receive channel. With Option 1 in place, updChan is always being drained, workers complete almost instantly, and the drain call returns in milliseconds.
Together, Options 1 + 4 give a complete guarantee: the SyncResponse commit always contains the full pre-sync state (problem B eliminated), and post-sync notifications are never dropped (problem A eliminated).
Option 5 — Decouple performRevert from ApplyToRunning
performRevert loads all non-running intents, validates the merged tree, and pushes revert deltas to the device. This is not a concern of the sync path — it is a side-effect triggered because the running state changed. Moving it to an async worker (similar to how deviation calc is already structured) would make ApplyToRunning return almost immediately, eliminating the blocking at the source.
This is the architecturally cleanest long-term direction but is a larger refactor. Options 1 (+ optionally 4) are the right near-term fix; Option 5 is worth pursuing as a follow-up.
Recommendation
Apply Option 1 (goroutine handoff) as the primary fix — it eliminates permanent drops and makes problem B benign. Add Option 4 (per-window drain, either VirtualPool or WaitGroup flavour) to close out the remaining 5-second lag from problem B. Optionally add a modestly larger buffer (Option 2) as a belt-and-suspenders measure. Track Option 5 (decouple performRevert) as a follow-up architectural improvement.
Summary
StreamSync.buildTreeSyncWithDatastoreis a single goroutine that both drains the incoming notification channel (updChan) and callssyncToRunning(which callsApplyToRunning→performRevert). WhensyncToRunningblocks, the goroutine cannot drainupdChan. Notifications that arrive during this window wait up tonotificationSendTimeout(5 s) and are then silently and permanently dropped with no recovery path.Two sub-problems
Problem A —
syncToRunningblocks the consumer goroutinebuildTreeSyncWithDatastoreis a singlefor/selectloop. Both the notification-reading case and the commit cases share the same goroutine:While
syncToRunningexecutes (insideApplyToRunning/performRevert), the goroutine cannot consume fromupdChan.Problem B — SyncResponse signal bypasses the task pool
gnmiSubscribesubmits eachSubscribeResponse_Updateto a task pool (async, multiple concurrent workers) but sends theSyncResponsesignal directly (bypassing the pool). The signal can therefore reachbuildTreeSyncWithDatastorebefore some in-flight pool workers have finished writing their notifications toupdChan. Those notifications then arrive while the goroutine is blocked insyncToRunning.Note: a sentinel task routed through the pool does not fix this — with multiple concurrent workers a free worker can execute the sentinel before other workers finish their notification tasks. There is no ordering guarantee across workers.
The silent drop
Once an update is dropped the syncTree is permanently incomplete until the datastore is recreated.
Observed failure (CI reproduction)
From a real CI run (
config-serverrun,03-deviationssuite, datastoredefault.sr2):The
adminuser entry reached the syncTree (from the initial sync burst) but thepasswordleaf arrived in a post-SyncResponse onChange notification that was dropped. Every subsequentTransactionSetvalidation failed because the mandatorypasswordfield was missing from the running tree.A concurrent
data-serverCI run with identical code but different timing did not trigger the drop (zeronotification delivery timeoutlog entries), confirming this is a non-deterministic race driven by how longperformRevertholds the goroutine.Impact
syncToRunningcall permanently corrupts the running view of the device until the datastore is fully recreated.config-serverto mark intentsFailedUnRecoverableand stall finalizer removal for 34+ minutes.performRevert), but it can also occur on every 5-second ticker commit ifperformReverttakes longer than 5 s.Options considered
Option 1 — Goroutine handoff ✅ recommended
Snapshot the local tree, reset to a fresh empty tree immediately, and run
syncToRunningin a background goroutine. The main loop never blocks on a commit.syncTreeMutexalready serialises concurrent calls tosyncToRunning, so back-to-back SyncResponse and ticker commits queue safely. The main loop keeps drainingupdChanat all times.Fixes problem A entirely. After this fix, problem B's impact shrinks to benign: any notification that races past the SyncResponse signal lands in the fresh tree and is committed by the next 5-second ticker (at most 5 s late), rather than being permanently dropped.
Option 2 — Increase
updChanbuffer sizeChange
make(chan *NotificationData, 20)to a larger value (e.g. 200).Trivially safe, zero risk. Only raises the threshold at which drops occur — does not eliminate the root cause. Useful as an additional safety margin on top of Option 1, not as a standalone fix.
Option 3 — Block indefinitely (remove drop timeout)
Replace
time.After(notificationSendTimeout)with pure blocking, relying on context cancellation for shutdown.Eliminates drops but pushes the problem upstream: if the consumer goroutine is permanently stuck, pool workers block, the pool fills, and
gnmiSubscribeeventually freezes the entire stream pipeline. A stuckperformRevertwould then silently freeze the whole sync rather than logging a drop. Does not fix problem A.Option 4 — Per-window drain (VirtualPool or WaitGroup) ✅ complement to Option 1
Before signaling
syncResponse, drain all in-flight notification tasks for the current window so the commit sees a complete snapshot. Two equivalent implementations:VirtualPool per window — uses the existing
CloseAndWait()API; no changes to task code:WaitGroup per window — lighter allocation (~24 bytes vs full VirtualPool struct); requires threading
wginto task construction and callingwg.Done()insideRun:Both are functionally identical. The VirtualPool approach touches fewer files (coordination stays entirely in
gnmiSubscribe). The WaitGroup is a lighter primitive but requires changes to the task type. Performance difference is negligible at one operation per 5 s.Important: this option only works safely when combined with Option 1. Without the goroutine handoff,
updChancan fill whilesyncToRunningis blocking, causing pool workers to stall onsendNotificationData.CloseAndWait()/wg.Wait()then blocksgnmiSubscribefor up to 5 s, freezing the gNMI stream reader and causing messages to back up in the receive channel. With Option 1 in place,updChanis always being drained, workers complete almost instantly, and the drain call returns in milliseconds.Together, Options 1 + 4 give a complete guarantee: the SyncResponse commit always contains the full pre-sync state (problem B eliminated), and post-sync notifications are never dropped (problem A eliminated).
Option 5 — Decouple
performRevertfromApplyToRunningperformRevertloads all non-running intents, validates the merged tree, and pushes revert deltas to the device. This is not a concern of the sync path — it is a side-effect triggered because the running state changed. Moving it to an async worker (similar to how deviation calc is already structured) would makeApplyToRunningreturn almost immediately, eliminating the blocking at the source.This is the architecturally cleanest long-term direction but is a larger refactor. Options 1 (+ optionally 4) are the right near-term fix; Option 5 is worth pursuing as a follow-up.
Recommendation
Apply Option 1 (goroutine handoff) as the primary fix — it eliminates permanent drops and makes problem B benign. Add Option 4 (per-window drain, either VirtualPool or WaitGroup flavour) to close out the remaining 5-second lag from problem B. Optionally add a modestly larger buffer (Option 2) as a belt-and-suspenders measure. Track Option 5 (decouple
performRevert) as a follow-up architectural improvement.