fix: prevent reflect loop semaphore from getting stuck on cleanup hang by PureWeen · Pull Request #323 · PureWeen/PolyPilot

PureWeen · 2026-03-09T05:29:50Z

Bug Report

[Session: Implement & Challenge-orchestrator] Messages kept queueing with 'Reflect loop already running' but nothing was actually running (IsProcessing: False).

Root Cause

The inner finally block in \SendViaOrchestratorReflectAsync\ could hang indefinitely if \SendPromptAndWaitAsync\ blocked on a broken SDK connection (e.g., StreamJsonRpc serialization errors during cancellation). This prevented the outer finally from releasing the semaphore.

The issue was that:

Cleanup code awaited \SendPromptAndWaitAsync\ for leftover queued prompts
If the SDK connection was broken, this await would never complete
The 10-minute timeout didn't help because \SendAsync\ passes \CancellationToken.None\ internally
The semaphore release in the outer finally never ran

Fix

Refactored the cleanup to be non-blocking:

Collect leftover prompts synchronously (drain the queue)
Release the semaphore immediately (outer finally runs)
Fire-and-forget the prompt delivery in a background task with a 30-second timeout

This ensures the semaphore is always released within a predictable time, even if cleanup operations fail.

Testing

Added \ReflectLoopLock_ReleasedEvenIfCleanupFails\ test verifying semaphore release timing
Added \QueuedPrompts_CollectedBeforeSemaphoreRelease\ test verifying prompt collection behavior
All existing tests pass (2 pre-existing failures in InputValidationTests unrelated to this change)
Windows build verified

Symptoms Fixed

Messages queueing forever with 'Reflect loop already running'
\IsProcessing: False\ but semaphore held
StreamJsonRpc cancellation errors in crash log

The inner finally block in SendViaOrchestratorReflectAsync could hang indefinitely if SendPromptAndWaitAsync blocked on a broken SDK connection (e.g., StreamJsonRpc serialization errors). This prevented the outer finally from releasing the semaphore, causing all future messages to queue with 'Reflect loop already running' while IsProcessing was false. Root cause: The cleanup code awaited SendPromptAndWaitAsync for leftover queued prompts before releasing the loop lock. If the SDK connection was broken, this await would never complete because SendAsync passes CancellationToken.None. Fix: Collect leftover prompts synchronously, then fire-and-forget their delivery in a background task. This ensures the semaphore is always released immediately, even if cleanup operations fail or timeout. Symptoms fixed: - Messages queuing forever with 'Reflect loop already running' - IsProcessing: False but semaphore held - StreamJsonRpc cancellation errors in crash log Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

PureWeen

Review Summary

This PR fixes a real production deadlock: the inner finally in SendViaOrchestratorReflectAsync was awaiting SendPromptAndWaitAsync on leftover queued prompts. On a broken SDK connection, that await would never complete, permanently holding loopLock and causing all subsequent messages to queue with "Reflect loop already running" while IsProcessing was false. The fix is mechanically correct about removing the blocking await. However, it introduces a new concurrency hazard that causes silent data loss and the added tests do not cover the production code path.

Findings

🔴 CRITICAL -- CopilotService.Organization.cs:~2355 -- Concurrent sends to the same orchestrator session cause silent message loss

After loopLock.Release() runs (outer finally), a new caller can immediately acquire the semaphore and start a new reflect loop -- calling SendPromptAndWaitAsync(orchestratorName, ...) for its planning phase. The background cleanup task is simultaneously calling SendPromptAndWaitAsync(orchName, ...) on the same session without holding the lock.

SendPromptAsync uses Interlocked.CompareExchange(ref state.SendingFlag, 1, 0) -- the losing concurrent caller receives InvalidOperationException. That exception is caught and silently logged as Debug(...), permanently dropping the user's queued prompt -- the very prompt that was already acknowledged to the user with "📨 New user message queued". This is worse than the original bug: the deadlock was recoverable by restart; this causes unrecoverable, silent message loss.

Additionally, the cleanup task injects "[User sent a message -- the reflection loop has completed]" into the orchestrator mid-planning of the new loop, poisoning its context with a stale sentinel.

Suggested fix (two options):

Option A: Keep cleanup inside the semaphore but add a 30s CancellationTokenSource so the semaphore releases promptly.
Option B (safest): Instead of fire-and-forget sending, re-enqueue leftovers back into _reflectQueuedPrompts. The next loop iteration already drains that queue. Zero-cost, race-free, no data loss.

🟡 MODERATE -- MultiAgentRegressionTests.cs:1731 and 1773 -- Tests provide zero regression coverage for the actual fix

ReflectLoopLock_ReleasedEvenIfCleanupFails creates a local SemaphoreSlim, manually releases it, and asserts it's acquirable -- a tautology. It never calls SendViaOrchestratorReflectAsync. If the fix is reverted, this test still passes.

QueuedPrompts_CollectedBeforeSemaphoreRelease validates that ConcurrentQueue.TryDequeue drains into a List -- testing standard library behavior.

🟡 MODERATE -- CopilotService.Organization.cs:~2333-2335 -- Comment "the user will see them in history anyway" misrepresents failure behavior

When a prompt is enqueued, AddOrchestratorSystemMessage adds "📨 queued" to the UI, but the actual content is only delivered to the model by SendPromptAndWaitAsync. If the fire-and-forget fails, the user sees the notification with no follow-up error and their message is gone.

🟢 MINOR -- CopilotService.Organization.cs:2351-2352 -- var orchName = orchestratorName is unnecessary

orchestratorName is a method parameter. In an async method the compiler lifts it onto the state machine class. The copy is a no-op and the comment implies a risk that doesn't exist.

🟢 MINOR -- CopilotService.Organization.cs:2362 -- Per-prompt CancellationTokenSource creates unbounded worst-case latency

Each leftover prompt gets an independent 30-second CTS. With N queued prompts, worst-case cleanup takes N × 30s. A single aggregate CTS would cap total batch time.

PureWeen · 2026-03-10T20:39:07Z

Code Review — PR #323 (fix: prevent reflect loop semaphore from getting stuck on cleanup hang)

CI: ⚠️ No checks on branch (pre-existing gap)
Prior review: 1 COMMENTED review already present — findings confirmed below.

🔴 CRITICAL — Fire-and-forget cleanup races with new reflect loop → silent data loss

CopilotService.Organization.cs:~2355 (4/4 models + prior reviewer)

The fix moves leftover prompt delivery into _ = Task.Run(...) that runs after loopLock.Release(). This creates a real, high-probability race:

Inner finally calls Task.Run(...) and returns immediately
Outer finally calls loopLock.Release()
A waiting caller acquires the lock and starts a new reflect loop — calling SendPromptAsync(orchestratorName, ...)
The cleanup task simultaneously calls SendPromptAndWaitAsync(orchName, ...) on the same session
SendPromptAsync uses Interlocked.CompareExchange(ref state.SendingFlag, 1, 0) — only one call wins; the other throws InvalidOperationException
The cleanup task catches it as Debug(...) — the user's queued prompt is permanently and silently dropped, despite being acknowledged with "📨 queued"

This is strictly worse than the original bug: the deadlock was recoverable by restart; this causes unrecoverable silent message loss for prompts the user was already told were queued.

Recommended fix (Option B from prior review): Re-enqueue the leftovers back into _reflectQueuedPrompts instead of fire-and-forgetting them. The next loop iteration already drains that queue — zero cost, race-free, no data loss:

// Instead of fire-and-forget:
if (_reflectQueuedPrompts.TryGetValue(groupId, out var remainingQueue))
{
    while (remainingQueue.TryDequeue(out var leftover))
        _reflectQueuedPrompts
            .GetOrAdd(groupId, _ => new ConcurrentQueue<string>())
            .Enqueue(leftover);  // re-enqueue for next loop
}

Or if a new queue entry is cleaner: drain, then after loopLock.Release(), allow the next natural SendViaOrchestratorReflectAsync invocation to handle them. The caller that re-queues a prompt will trigger the loop again naturally.

🟡 MODERATE — New tests provide zero regression coverage

MultiAgentRegressionTests.cs:1731, 1773 (2/4 models + prior reviewer)

ReflectLoopLock_ReleasedEvenIfCleanupFails creates a local SemaphoreSlim, manually releases it, and asserts it's acquirable — a tautology. It never calls SendViaOrchestratorReflectAsync. If the fix were reverted, the test still passes.

QueuedPrompts_CollectedBeforeSemaphoreRelease validates that ConcurrentQueue.TryDequeue drains into a List — testing standard library behavior, not the production code path.

Suggested tests:

Verify that _reflectQueuedPrompts is empty after the loop exits (no silent discard)
Verify semaphore is released even when SendPromptAndWaitAsync throws a non-cancellation exception
If Option B is adopted: verify re-enqueued prompts are delivered on the next loop invocation

Verdict

🔴 Do not merge — the fix trades an infrequent deadlock (recoverable by restart) for guaranteed silent data loss on any queued-prompt scenario where a new loop starts within ~30 seconds. Option B (re-enqueue) is a safer, simpler fix that eliminates the race entirely.

PureWeen · 2026-03-11T05:25:00Z

🔧 Status Update

This PR merges cleanly after #302. The semaphore-stuck fix is independent of #302's watchdog changes. Addressing Round 1 review findings now:

Fire-and-forget cleanup race condition (capture local state before semaphore release)
Tautological test improvement
SafeFireAndForget wrapper for cleanup task

The inner finally block in SendViaOrchestratorReflectAsync awaited SendPromptAndWaitAsync for leftover queued prompts while still holding the semaphore. If the SDK connection was broken (StreamJsonRpc errors), this await would never complete, permanently blocking all future messages with 'Reflect loop already running'. Fix: Collect leftover prompts synchronously in the finally block, then deliver them in a fire-and-forget task with a 30s timeout AFTER the semaphore is released. The semaphore is now always released promptly regardless of SDK connection health. Also includes PR #352 fixes: - Accept [[GROUP_REFLECT_COMPLETE]] when all workers were attempted (not just succeeded), preventing infinite delegation loops - Chat history fallback for empty worker responses (SDK bug #299) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

PureWeen · 2026-03-11T13:48:29Z

Closing — this fix has been incorporated into PR #352 (fix/orchestrator-delegation-failures). The semaphore hang fix is applied with the same approach: collect prompts synchronously, release semaphore immediately, fire-and-forget delivery with 30s timeout.

PureWeen commented Mar 9, 2026

View reviewed changes

PureWeen closed this Mar 11, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: prevent reflect loop semaphore from getting stuck on cleanup hang#323

fix: prevent reflect loop semaphore from getting stuck on cleanup hang#323
PureWeen wants to merge 1 commit intomainfrom
fix/session-implement--challenge-orchestrato-20260309-0516

PureWeen commented Mar 9, 2026

Uh oh!

PureWeen left a comment

Uh oh!

PureWeen commented Mar 10, 2026

Uh oh!

PureWeen commented Mar 11, 2026

Uh oh!

PureWeen commented Mar 11, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

PureWeen commented Mar 9, 2026

Bug Report

Root Cause

Fix

Testing

Symptoms Fixed

Uh oh!

PureWeen left a comment

Choose a reason for hiding this comment

Review Summary

Findings

Uh oh!

PureWeen commented Mar 10, 2026

Code Review — PR #323 (fix: prevent reflect loop semaphore from getting stuck on cleanup hang)

🔴 CRITICAL — Fire-and-forget cleanup races with new reflect loop → silent data loss

🟡 MODERATE — New tests provide zero regression coverage

Verdict

Uh oh!

PureWeen commented Mar 11, 2026

🔧 Status Update

Uh oh!

PureWeen commented Mar 11, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant