Skip to content

StreamSync: post-SyncResponse notifications silently dropped when syncToRunning blocks the consumer goroutine #440

@steiler

Description

@steiler

Summary

StreamSync.buildTreeSyncWithDatastore is a single goroutine that both drains the incoming notification channel (updChan) and calls syncToRunning (which calls ApplyToRunningperformRevert). When syncToRunning blocks, the goroutine cannot drain updChan. Notifications that arrive during this window wait up to notificationSendTimeout (5 s) and are then silently and permanently dropped with no recovery path.

Two sub-problems

Problem A — syncToRunning blocks the consumer goroutine

buildTreeSyncWithDatastore is a single for/select loop. Both the notification-reading case and the commit cases share the same goroutine:

// pkg/datastore/target/gnmi/stream.go
for {
    select {
    case noti, ok := <-cUS:
        syncTree.AddUpdatesRecursive(...)       // fast, in-memory
    case <-syncResponse:
        syncTree, err = s.syncToRunning(...)    // SLOW – blocks the entire loop
    case <-ticker.C:
        syncTree, err = s.syncToRunning(...)    // SLOW – blocks the entire loop
    }
}

While syncToRunning executes (inside ApplyToRunning / performRevert), the goroutine cannot consume from updChan.

Problem B — SyncResponse signal bypasses the task pool

gnmiSubscribe submits each SubscribeResponse_Update to a task pool (async, multiple concurrent workers) but sends the SyncResponse signal directly (bypassing the pool). The signal can therefore reach buildTreeSyncWithDatastore before some in-flight pool workers have finished writing their notifications to updChan. Those notifications then arrive while the goroutine is blocked in syncToRunning.

Note: a sentinel task routed through the pool does not fix this — with multiple concurrent workers a free worker can execute the sentinel before other workers finish their notification tasks. There is no ordering guarantee across workers.

The silent drop

func (t *notificationProcessorTask) sendNotificationData(ctx context.Context, data *NotificationData) error {
    select {
    case t.params.notificationResult <- data:
        return nil
    case <-time.After(notificationSendTimeout):
        log.Error(context.DeadlineExceeded, "notification delivery timeout, dropping update", ...)
        return nil   // silently drops; no retry, no re-fetch, no resync trigger
    }
}

Once an update is dropped the syncTree is permanently incomplete until the datastore is recreated.

Observed failure (CI reproduction)

From a real CI run (config-server run, 03-deviations suite, datastore default.sr2):

09:12:26  CreateDataStore (full sr2 restart)
09:12:26  Starting Sync
09:12:26  SyncResponse flag received  (initial sync duration: 124ms)
           └─ goroutine enters syncToRunning → performRevert (blocks)
09:12:31  "notification delivery timeout, dropping update" x4   (5 s after block)
09:12:36  "notification delivery timeout, dropping update" x3   (performRevert still running)
09:13:21  "SyncRunning due to ticker" → "sync no config changes"  (steady state after drops)
09:13:39  "transaction validation failed … mandatory child [password] does not exist,
           path: /configure/system/security/user-params/local-user/user[user-name=admin]"
           └─ error repeats every ~60 s for 34 minutes until the CI job was cancelled

The admin user entry reached the syncTree (from the initial sync burst) but the password leaf arrived in a post-SyncResponse onChange notification that was dropped. Every subsequent TransactionSet validation failed because the mandatory password field was missing from the running tree.

A concurrent data-server CI run with identical code but different timing did not trigger the drop (zero notification delivery timeout log entries), confirming this is a non-deterministic race driven by how long performRevert holds the goroutine.

Impact

  • Any notification dropped during a syncToRunning call permanently corrupts the running view of the device until the datastore is fully recreated.
  • In the observed case this caused config-server to mark intents FailedUnRecoverable and stall finalizer removal for 34+ minutes.
  • The problem is most likely to surface after a gNMI stream reconnect (which triggers a full resync and a potentially slow performRevert), but it can also occur on every 5-second ticker commit if performRevert takes longer than 5 s.

Options considered

Option 1 — Goroutine handoff ✅ recommended

Snapshot the local tree, reset to a fresh empty tree immediately, and run syncToRunning in a background goroutine. The main loop never blocks on a commit.

case <-syncResponse:
    treeToCommit := syncTree
    syncTree, _ = s.runningStore.NewEmptyTree(s.ctx)
    tickerActive = true
    go func() {
        if _, err := s.syncToRunning(treeToCommit, syncTreeMutex, true); err != nil {
            log.Error(err, "failed committing synctree to running")
        }
    }()
case <-ticker.C:
    if !tickerActive { ... }
    treeToCommit := syncTree
    syncTree, _ = s.runningStore.NewEmptyTree(s.ctx)
    go func() {
        if _, err := s.syncToRunning(treeToCommit, syncTreeMutex, true); err != nil {
            log.Error(err, "failed committing synctree to running")
        }
    }()

syncTreeMutex already serialises concurrent calls to syncToRunning, so back-to-back SyncResponse and ticker commits queue safely. The main loop keeps draining updChan at all times.

Fixes problem A entirely. After this fix, problem B's impact shrinks to benign: any notification that races past the SyncResponse signal lands in the fresh tree and is committed by the next 5-second ticker (at most 5 s late), rather than being permanently dropped.

Option 2 — Increase updChan buffer size

Change make(chan *NotificationData, 20) to a larger value (e.g. 200).

Trivially safe, zero risk. Only raises the threshold at which drops occur — does not eliminate the root cause. Useful as an additional safety margin on top of Option 1, not as a standalone fix.

Option 3 — Block indefinitely (remove drop timeout)

Replace time.After(notificationSendTimeout) with pure blocking, relying on context cancellation for shutdown.

Eliminates drops but pushes the problem upstream: if the consumer goroutine is permanently stuck, pool workers block, the pool fills, and gnmiSubscribe eventually freezes the entire stream pipeline. A stuck performRevert would then silently freeze the whole sync rather than logging a drop. Does not fix problem A.

Option 4 — Per-window drain (VirtualPool or WaitGroup) ✅ complement to Option 1

Before signaling syncResponse, drain all in-flight notification tasks for the current window so the commit sees a complete snapshot. Two equivalent implementations:

VirtualPool per window — uses the existing CloseAndWait() API; no changes to task code:

case *gnmi.SubscribeResponse_SyncResponse:
    taskPool.CloseAndWait()                        // drain all in-flight tasks
    syncResponse <- struct{}{}
    taskPool = s.vpoolFactory.NewVirtualPool(...)  // fresh pool for next window

WaitGroup per window — lighter allocation (~24 bytes vs full VirtualPool struct); requires threading wg into task construction and calling wg.Done() inside Run:

// on each notification submission
wg.Add(1)
taskPool.Submit(newNotificationProcessorTask(..., wg))

// in notificationProcessorTask.Run
defer wg.Done()

// on SyncResponse
wg.Wait()
syncResponse <- struct{}{}
wg = &sync.WaitGroup{}

Both are functionally identical. The VirtualPool approach touches fewer files (coordination stays entirely in gnmiSubscribe). The WaitGroup is a lighter primitive but requires changes to the task type. Performance difference is negligible at one operation per 5 s.

Important: this option only works safely when combined with Option 1. Without the goroutine handoff, updChan can fill while syncToRunning is blocking, causing pool workers to stall on sendNotificationData. CloseAndWait() / wg.Wait() then blocks gnmiSubscribe for up to 5 s, freezing the gNMI stream reader and causing messages to back up in the receive channel. With Option 1 in place, updChan is always being drained, workers complete almost instantly, and the drain call returns in milliseconds.

Together, Options 1 + 4 give a complete guarantee: the SyncResponse commit always contains the full pre-sync state (problem B eliminated), and post-sync notifications are never dropped (problem A eliminated).

Option 5 — Decouple performRevert from ApplyToRunning

performRevert loads all non-running intents, validates the merged tree, and pushes revert deltas to the device. This is not a concern of the sync path — it is a side-effect triggered because the running state changed. Moving it to an async worker (similar to how deviation calc is already structured) would make ApplyToRunning return almost immediately, eliminating the blocking at the source.

This is the architecturally cleanest long-term direction but is a larger refactor. Options 1 (+ optionally 4) are the right near-term fix; Option 5 is worth pursuing as a follow-up.

Recommendation

Apply Option 1 (goroutine handoff) as the primary fix — it eliminates permanent drops and makes problem B benign. Add Option 4 (per-window drain, either VirtualPool or WaitGroup flavour) to close out the remaining 5-second lag from problem B. Optionally add a modestly larger buffer (Option 2) as a belt-and-suspenders measure. Track Option 5 (decouple performRevert) as a follow-up architectural improvement.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions