fix: prevent reflect loop semaphore from getting stuck on cleanup hang#323
Conversation
The inner finally block in `SendViaOrchestratorReflectAsync` could hang indefinitely if `SendPromptAndWaitAsync` blocked on a broken SDK connection (e.g., StreamJsonRpc serialization errors). This prevented the outer finally from releasing the semaphore, causing all future messages to queue with "Reflect loop already running" while `IsProcessing` was false.

Root cause: the cleanup code awaited `SendPromptAndWaitAsync` for leftover queued prompts before releasing the loop lock. If the SDK connection was broken, this await would never complete because `SendAsync` passes `CancellationToken.None`.

Fix: collect leftover prompts synchronously, then fire-and-forget their delivery in a background task. This ensures the semaphore is always released immediately, even if cleanup operations fail or time out.

Symptoms fixed:
- Messages queueing forever with "Reflect loop already running"
- `IsProcessing: False` but semaphore held
- StreamJsonRpc cancellation errors in the crash log

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
PureWeen
left a comment
Review Summary
This PR fixes a real production deadlock: the inner finally in `SendViaOrchestratorReflectAsync` was awaiting `SendPromptAndWaitAsync` on leftover queued prompts. On a broken SDK connection, that await would never complete, permanently holding `loopLock` and causing all subsequent messages to queue with "Reflect loop already running" while `IsProcessing` was false. The fix is mechanically correct in removing the blocking await. However, it introduces a new concurrency hazard that causes silent data loss, and the added tests do not cover the production code path.
Findings
🔴 CRITICAL -- CopilotService.Organization.cs:~2355 -- Concurrent sends to the same orchestrator session cause silent message loss
After loopLock.Release() runs (outer finally), a new caller can immediately acquire the semaphore and start a new reflect loop -- calling SendPromptAndWaitAsync(orchestratorName, ...) for its planning phase. The background cleanup task is simultaneously calling SendPromptAndWaitAsync(orchName, ...) on the same session without holding the lock.
`SendPromptAsync` uses `Interlocked.CompareExchange(ref state.SendingFlag, 1, 0)` -- the losing concurrent caller receives `InvalidOperationException`. That exception is caught and silently logged at Debug level, permanently dropping the user's queued prompt -- the very prompt that was already acknowledged to the user with "📨 New user message queued". This is worse than the original bug: the deadlock was recoverable by restart; this causes unrecoverable, silent message loss.
Additionally, the cleanup task injects "[User sent a message -- the reflection loop has completed]" into the orchestrator mid-planning of the new loop, poisoning its context with a stale sentinel.
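The single-flight send guard described in this finding could be sketched roughly as follows. This is a minimal reconstruction for illustration: `SessionState`, the delay stand-in, and the method body are hypothetical; only the `CompareExchange` guard and the exception type come from the review.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

class SessionState
{
    public int SendingFlag; // 0 = idle, 1 = a send is in flight
}

static class Sender
{
    public static async Task SendPromptAsync(SessionState state, string prompt)
    {
        // Atomically flip 0 -> 1; a concurrent caller finds 1 and loses the race.
        if (Interlocked.CompareExchange(ref state.SendingFlag, 1, 0) != 0)
            throw new InvalidOperationException("A send is already in progress.");
        try
        {
            await Task.Delay(10); // stand-in for the real SDK call
        }
        finally
        {
            Interlocked.Exchange(ref state.SendingFlag, 0); // always clear the flag
        }
    }
}
```

If the losing caller's `InvalidOperationException` is swallowed at Debug level, the prompt it carried is dropped silently -- which is exactly the data-loss path this finding describes.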
Suggested fix (two options):
- Option A: keep cleanup inside the semaphore but add a 30s `CancellationTokenSource` so the semaphore releases promptly.
- Option B (safest): instead of fire-and-forget sending, re-enqueue leftovers back into `_reflectQueuedPrompts`. The next loop iteration already drains that queue. Zero cost, race-free, no data loss.
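Option A could look roughly like the fragment below. This is a sketch, not the actual patch: `queue`, `loopLock`, and the `SendPromptAndWaitAsync` signature are assumed names, and it presumes `SendPromptAndWaitAsync` can be made to honor a `CancellationToken` -- which, given that the root cause is `SendAsync` passing `CancellationToken.None`, may itself require a change.

```csharp
// Option A sketch: bound cleanup with a 30s CTS so loopLock is
// released promptly even if the SDK connection is dead.
using var cleanupCts = new CancellationTokenSource(TimeSpan.FromSeconds(30));
try
{
    while (queue.TryDequeue(out var leftover))
        await SendPromptAndWaitAsync(orchestratorName, leftover, cleanupCts.Token);
}
catch (OperationCanceledException)
{
    // Timed out: abandon delivery rather than hold the lock.
}
finally
{
    loopLock.Release(); // reached within ~30s worst case
}
```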
🟡 MODERATE -- MultiAgentRegressionTests.cs:1731 and 1773 -- Tests provide zero regression coverage for the actual fix
ReflectLoopLock_ReleasedEvenIfCleanupFails creates a local SemaphoreSlim, manually releases it, and asserts it's acquirable -- a tautology. It never calls SendViaOrchestratorReflectAsync. If the fix is reverted, this test still passes.
QueuedPrompts_CollectedBeforeSemaphoreRelease validates that ConcurrentQueue.TryDequeue drains into a List -- testing standard library behavior.
🟡 MODERATE -- CopilotService.Organization.cs:~2333-2335 -- Comment "the user will see them in history anyway" misrepresents failure behavior
When a prompt is enqueued, AddOrchestratorSystemMessage adds "📨 queued" to the UI, but the actual content is only delivered to the model by SendPromptAndWaitAsync. If the fire-and-forget fails, the user sees the notification with no follow-up error and their message is gone.
🟢 MINOR -- CopilotService.Organization.cs:2351-2352 -- var orchName = orchestratorName is unnecessary
orchestratorName is a method parameter. In an async method the compiler lifts it onto the state machine class. The copy is a no-op and the comment implies a risk that doesn't exist.
🟢 MINOR -- CopilotService.Organization.cs:2362 -- Per-prompt CancellationTokenSource creates unbounded worst-case latency
Each leftover prompt gets an independent 30-second CTS. With N queued prompts, worst-case cleanup takes N × 30s. A single aggregate CTS would cap total batch time.
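A single shared CTS for the batch, sketched below with the same hypothetical names as the per-prompt version it would replace, caps total cleanup time at one 30-second budget:

```csharp
// One aggregate CTS caps the whole batch at 30s instead of N × 30s.
using var batchCts = new CancellationTokenSource(TimeSpan.FromSeconds(30));
foreach (var leftover in leftovers)
{
    if (batchCts.IsCancellationRequested)
        break; // budget spent; stop rather than keep poking a dead connection
    try { await SendPromptAndWaitAsync(orchName, leftover, batchCts.Token); }
    catch (OperationCanceledException) { break; }
}
```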
Code Review — PR #323 (fix: prevent reflect loop semaphore from getting stuck on cleanup hang)

🔴 CRITICAL — Fire-and-forget cleanup races with new reflect loop → silent data loss
CopilotService.Organization.cs:~2355 (4/4 models + prior reviewer)

The fix moves leftover prompt delivery into a fire-and-forget background task that runs after the semaphore is released.
This is strictly worse than the original bug: the deadlock was recoverable by restart; this causes unrecoverable silent message loss for prompts the user was already told were queued.

Recommended fix (Option B from prior review): re-enqueue the leftovers back into `_reflectQueuedPrompts`:

```csharp
// Instead of fire-and-forget: drain into a local list first, then
// re-enqueue. (Dequeuing and enqueueing the same ConcurrentQueue in
// one loop would spin forever. Equivalently, simply leave the queue
// untouched so the next loop drains it.)
if (_reflectQueuedPrompts.TryGetValue(groupId, out var remainingQueue))
{
    var leftovers = new List<string>();
    while (remainingQueue.TryDequeue(out var leftover))
        leftovers.Add(leftover);

    foreach (var leftover in leftovers)
        _reflectQueuedPrompts
            .GetOrAdd(groupId, _ => new ConcurrentQueue<string>())
            .Enqueue(leftover); // re-enqueue for the next loop
}
```

Or, if a new queue entry is cleaner: drain, then re-enqueue after the semaphore is released.

🟡 MODERATE — New tests provide zero regression coverage
MultiAgentRegressionTests.cs:1731, 1773 (2/4 models + prior reviewer)
Suggested tests:
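One shape such a regression test could take is sketched below in xUnit style. Everything here is hypothetical scaffolding -- `CreateServiceWithHangingSend`, `EnqueueReflectPrompt`, and `TryAcquireReflectLoopLockAsync` are invented test hooks that would need to exist -- but the key property is real: the loop lock must become acquirable even when the cleanup-phase send hangs.

```csharp
[Fact]
public async Task ReflectLoop_ReleasesLock_WhenCleanupSendHangs()
{
    // Arrange: a service whose SendPromptAndWaitAsync never completes,
    // simulating the broken StreamJsonRpc connection from the bug report.
    var service = CreateServiceWithHangingSend(); // hypothetical test hook
    service.EnqueueReflectPrompt(groupId, "queued while processing");

    // Act: run the reflect loop; cleanup will hit the hanging send.
    await service.SendViaOrchestratorReflectAsync(groupId, "initial prompt");

    // Assert: the loop lock must be free within a bounded time;
    // if the fix is reverted, this acquisition times out.
    var reacquired = await service.TryAcquireReflectLoopLockAsync(
        TimeSpan.FromSeconds(5)); // hypothetical accessor for tests
    Assert.True(reacquired, "loopLock was not released after cleanup hang");
}
```

Unlike the current tests, this exercises `SendViaOrchestratorReflectAsync` itself, so reverting the fix makes it fail.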
Verdict

🔴 Do not merge — the fix trades an infrequent deadlock (recoverable by restart) for guaranteed silent data loss in any queued-prompt scenario where a new loop starts within ~30 seconds. Option B (re-enqueue) is a safer, simpler fix that eliminates the race entirely.
🔧 Status Update

This PR merges cleanly after #302. The semaphore-stuck fix is independent of #302's watchdog changes. Addressing Round 1 review findings now.
The inner finally block in `SendViaOrchestratorReflectAsync` awaited `SendPromptAndWaitAsync` for leftover queued prompts while still holding the semaphore. If the SDK connection was broken (StreamJsonRpc errors), this await would never complete, permanently blocking all future messages with "Reflect loop already running".

Fix: collect leftover prompts synchronously in the finally block, then deliver them in a fire-and-forget task with a 30s timeout AFTER the semaphore is released. The semaphore is now always released promptly regardless of SDK connection health.

Also includes PR #352 fixes:
- Accept [[GROUP_REFLECT_COMPLETE]] when all workers were attempted (not just succeeded), preventing infinite delegation loops
- Chat history fallback for empty worker responses (SDK bug #299)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Closing — this fix has been incorporated into PR #352 (fix/orchestrator-delegation-failures). The semaphore hang fix is applied with the same approach: collect prompts synchronously, release semaphore immediately, fire-and-forget delivery with 30s timeout.
Bug Report
[Session: Implement & Challenge-orchestrator] Messages kept queueing with 'Reflect loop already running' but nothing was actually running (IsProcessing: False).
Root Cause
The inner finally block in `SendViaOrchestratorReflectAsync` could hang indefinitely if `SendPromptAndWaitAsync` blocked on a broken SDK connection (e.g., StreamJsonRpc serialization errors during cancellation). This prevented the outer finally from releasing the semaphore.
The issue was that:
- The cleanup code awaited `SendPromptAndWaitAsync` for leftover queued prompts before releasing the loop lock.
- `SendAsync` passes `CancellationToken.None`, so with a broken connection the await could never complete.
Fix
Refactored the cleanup to be non-blocking:
- Collect leftover prompts synchronously in the finally block.
- Release the semaphore immediately.
- Deliver the collected prompts in a fire-and-forget task with a 30s timeout.
This ensures the semaphore is always released within a predictable time, even if cleanup operations fail.
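The refactored cleanup could be sketched as below. Names (`queue`, `loopLock`, `SendPromptAndWaitAsync`) are assumptions standing in for the real members, and the sketch presumes the send method honors the token it is given:

```csharp
// Non-blocking cleanup sketch:
// 1) collect synchronously, 2) release the lock, 3) deliver in background.
var leftovers = new List<string>();
while (queue.TryDequeue(out var p))
    leftovers.Add(p);

loopLock.Release(); // released immediately, regardless of SDK health

_ = Task.Run(async () =>
{
    foreach (var leftover in leftovers)
    {
        using var cts = new CancellationTokenSource(TimeSpan.FromSeconds(30));
        try
        {
            await SendPromptAndWaitAsync(orchestratorName, leftover, cts.Token);
        }
        catch
        {
            // Delivery failed or timed out; log and drop. The semaphore
            // is already free, so the system stays responsive either way.
        }
    }
});
```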
Testing
Symptoms Fixed
- Messages queueing forever with "Reflect loop already running"
- `IsProcessing: False` while the semaphore was still held
- StreamJsonRpc cancellation errors in the crash log