StreamSync: post-SyncResponse notifications silently dropped when syncToRunning blocks the consumer goroutine

## Summary

`StreamSync.buildTreeSyncWithDatastore` is a single goroutine that both drains the incoming notification channel (`updChan`) **and** calls `syncToRunning` (which calls `ApplyToRunning` → `performRevert`). When `syncToRunning` blocks, the goroutine cannot drain `updChan`. Notifications that arrive during this window wait up to `notificationSendTimeout` (5 s) and are then **silently and permanently dropped** with no recovery path.

## Two sub-problems

### Problem A — `syncToRunning` blocks the consumer goroutine

`buildTreeSyncWithDatastore` is a single `for/select` loop. Both the notification-reading case and the commit cases share the same goroutine:

```go
// pkg/datastore/target/gnmi/stream.go
for {
    select {
    case noti, ok := <-cUS:
        syncTree.AddUpdatesRecursive(...)       // fast, in-memory
    case <-syncResponse:
        syncTree, err = s.syncToRunning(...)    // SLOW – blocks the entire loop
    case <-ticker.C:
        syncTree, err = s.syncToRunning(...)    // SLOW – blocks the entire loop
    }
}
```

While `syncToRunning` executes (inside `ApplyToRunning` / `performRevert`), the goroutine cannot consume from `updChan`.

### Problem B — SyncResponse signal bypasses the task pool

`gnmiSubscribe` submits each `SubscribeResponse_Update` to a task pool (async, multiple concurrent workers) but sends the `SyncResponse` signal **directly** (bypassing the pool). The signal can therefore reach `buildTreeSyncWithDatastore` before some in-flight pool workers have finished writing their notifications to `updChan`. Those notifications then arrive while the goroutine is blocked in `syncToRunning`.

Note: a sentinel task routed through the pool does **not** fix this — with multiple concurrent workers a free worker can execute the sentinel before other workers finish their notification tasks. There is no ordering guarantee across workers.

### The silent drop

```go
func (t *notificationProcessorTask) sendNotificationData(ctx context.Context, data *NotificationData) error {
    select {
    case t.params.notificationResult <- data:
        return nil
    case <-time.After(notificationSendTimeout):
        log.Error(context.DeadlineExceeded, "notification delivery timeout, dropping update", ...)
        return nil   // silently drops; no retry, no re-fetch, no resync trigger
    }
}
```

Once an update is dropped the syncTree is permanently incomplete until the datastore is recreated.

## Observed failure (CI reproduction)

From a real CI run (`config-server` run, `03-deviations` suite, datastore `default.sr2`):

```
09:12:26  CreateDataStore (full sr2 restart)
09:12:26  Starting Sync
09:12:26  SyncResponse flag received  (initial sync duration: 124ms)
           └─ goroutine enters syncToRunning → performRevert (blocks)
09:12:31  "notification delivery timeout, dropping update" x4   (5 s after block)
09:12:36  "notification delivery timeout, dropping update" x3   (performRevert still running)
09:13:21  "SyncRunning due to ticker" → "sync no config changes"  (steady state after drops)
09:13:39  "transaction validation failed … mandatory child [password] does not exist,
           path: /configure/system/security/user-params/local-user/user[user-name=admin]"
           └─ error repeats every ~60 s for 34 minutes until the CI job was cancelled
```

The `admin` user entry reached the syncTree (from the initial sync burst) but the `password` leaf arrived in a post-SyncResponse onChange notification that was dropped. Every subsequent `TransactionSet` validation failed because the mandatory `password` field was missing from the running tree.

A concurrent `data-server` CI run with identical code but different timing did **not** trigger the drop (zero `notification delivery timeout` log entries), confirming this is a non-deterministic race driven by how long `performRevert` holds the goroutine.

## Impact

- Any notification dropped during a `syncToRunning` call permanently corrupts the running view of the device until the datastore is fully recreated.
- In the observed case this caused `config-server` to mark intents `FailedUnRecoverable` and stall finalizer removal for 34+ minutes.
- The problem is most likely to surface after a gNMI stream reconnect (which triggers a full resync and a potentially slow `performRevert`), but it can also occur on every 5-second ticker commit if `performRevert` takes longer than 5 s.

## Options considered

### Option 1 — Goroutine handoff ✅ recommended

Snapshot the local tree, reset to a fresh empty tree immediately, and run `syncToRunning` in a background goroutine. The main loop never blocks on a commit.

```go
case <-syncResponse:
    treeToCommit := syncTree
    syncTree, _ = s.runningStore.NewEmptyTree(s.ctx)
    tickerActive = true
    go func() {
        if _, err := s.syncToRunning(treeToCommit, syncTreeMutex, true); err != nil {
            log.Error(err, "failed committing synctree to running")
        }
    }()
case <-ticker.C:
    if !tickerActive { ... }
    treeToCommit := syncTree
    syncTree, _ = s.runningStore.NewEmptyTree(s.ctx)
    go func() {
        if _, err := s.syncToRunning(treeToCommit, syncTreeMutex, true); err != nil {
            log.Error(err, "failed committing synctree to running")
        }
    }()
```

`syncTreeMutex` already serialises concurrent calls to `syncToRunning`, so back-to-back SyncResponse and ticker commits queue safely. The main loop keeps draining `updChan` at all times.

**Fixes problem A entirely.** After this fix, problem B's impact shrinks to benign: any notification that races past the SyncResponse signal lands in the fresh tree and is committed by the next 5-second ticker (at most 5 s late), rather than being permanently dropped.

### Option 2 — Increase `updChan` buffer size

Change `make(chan *NotificationData, 20)` to a larger value (e.g. 200).

Trivially safe, zero risk. Only raises the threshold at which drops occur — does not eliminate the root cause. Useful as an additional safety margin on top of Option 1, not as a standalone fix.

### Option 3 — Block indefinitely (remove drop timeout)

Replace `time.After(notificationSendTimeout)` with pure blocking, relying on context cancellation for shutdown.

Eliminates drops but pushes the problem upstream: if the consumer goroutine is permanently stuck, pool workers block, the pool fills, and `gnmiSubscribe` eventually freezes the entire stream pipeline. A stuck `performRevert` would then silently freeze the whole sync rather than logging a drop. Does not fix problem A.

### Option 4 — Per-window drain (VirtualPool or WaitGroup) ✅ complement to Option 1

Before signaling `syncResponse`, drain all in-flight notification tasks for the current window so the commit sees a complete snapshot. Two equivalent implementations:

**VirtualPool per window** — uses the existing `CloseAndWait()` API; no changes to task code:

```go
case *gnmi.SubscribeResponse_SyncResponse:
    taskPool.CloseAndWait()                        // drain all in-flight tasks
    syncResponse <- struct{}{}
    taskPool = s.vpoolFactory.NewVirtualPool(...)  // fresh pool for next window
```

**WaitGroup per window** — lighter allocation (~24 bytes vs full VirtualPool struct); requires threading `wg` into task construction and calling `wg.Done()` inside `Run`:

```go
// on each notification submission
wg.Add(1)
taskPool.Submit(newNotificationProcessorTask(..., wg))

// in notificationProcessorTask.Run
defer wg.Done()

// on SyncResponse
wg.Wait()
syncResponse <- struct{}{}
wg = &sync.WaitGroup{}
```

Both are functionally identical. The VirtualPool approach touches fewer files (coordination stays entirely in `gnmiSubscribe`). The WaitGroup is a lighter primitive but requires changes to the task type. Performance difference is negligible at one operation per 5 s.

**Important: this option only works safely when combined with Option 1.** Without the goroutine handoff, `updChan` can fill while `syncToRunning` is blocking, causing pool workers to stall on `sendNotificationData`. `CloseAndWait()` / `wg.Wait()` then blocks `gnmiSubscribe` for up to 5 s, freezing the gNMI stream reader and causing messages to back up in the receive channel. With Option 1 in place, `updChan` is always being drained, workers complete almost instantly, and the drain call returns in milliseconds.

**Together, Options 1 + 4 give a complete guarantee:** the SyncResponse commit always contains the full pre-sync state (problem B eliminated), and post-sync notifications are never dropped (problem A eliminated).

### Option 5 — Decouple `performRevert` from `ApplyToRunning`

`performRevert` loads all non-running intents, validates the merged tree, and pushes revert deltas to the device. This is not a concern of the sync path — it is a side-effect triggered because the running state changed. Moving it to an async worker (similar to how deviation calc is already structured) would make `ApplyToRunning` return almost immediately, eliminating the blocking at the source.

This is the architecturally cleanest long-term direction but is a larger refactor. Options 1 (+ optionally 4) are the right near-term fix; Option 5 is worth pursuing as a follow-up.

## Recommendation

Apply **Option 1** (goroutine handoff) as the primary fix — it eliminates permanent drops and makes problem B benign. Add **Option 4** (per-window drain, either VirtualPool or WaitGroup flavour) to close out the remaining 5-second lag from problem B. Optionally add a modestly larger buffer (Option 2) as a belt-and-suspenders measure. Track Option 5 (decouple `performRevert`) as a follow-up architectural improvement.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

StreamSync: post-SyncResponse notifications silently dropped when syncToRunning blocks the consumer goroutine #440

Summary

Two sub-problems

Problem A — `syncToRunning` blocks the consumer goroutine

Problem B — SyncResponse signal bypasses the task pool

The silent drop

Observed failure (CI reproduction)

Impact

Options considered

Option 1 — Goroutine handoff ✅ recommended

Option 2 — Increase `updChan` buffer size

Option 3 — Block indefinitely (remove drop timeout)

Option 4 — Per-window drain (VirtualPool or WaitGroup) ✅ complement to Option 1

Option 5 — Decouple `performRevert` from `ApplyToRunning`

Recommendation

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

StreamSync: post-SyncResponse notifications silently dropped when syncToRunning blocks the consumer goroutine #440

Description

Summary

Two sub-problems

Problem A — syncToRunning blocks the consumer goroutine

Problem B — SyncResponse signal bypasses the task pool

The silent drop

Observed failure (CI reproduction)

Impact

Options considered

Option 1 — Goroutine handoff ✅ recommended

Option 2 — Increase updChan buffer size

Option 3 — Block indefinitely (remove drop timeout)

Option 4 — Per-window drain (VirtualPool or WaitGroup) ✅ complement to Option 1

Option 5 — Decouple performRevert from ApplyToRunning

Recommendation

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions

Problem A — `syncToRunning` blocks the consumer goroutine

Option 2 — Increase `updChan` buffer size

Option 5 — Decouple `performRevert` from `ApplyToRunning`