Fix orchestrator delegation infinite loop and empty worker responses by PureWeen · Pull Request #352 · PureWeen/PolyPilot

PureWeen · 2026-03-11T12:36:25Z

Problem

The Evaluate-pr-tests-orchestrator fails to complete with two symptoms:

Orchestrator loops indefinitely with "0 raw assignments"
Workers return empty responses despite having processed content

Root Cause 1: Completion Override Infinite Loop

When workers fail/return empty (SDK bug #299), dispatchedWorkers stays empty → allWorkersDispatched = false → [[GROUP_REFLECT_COMPLETE]] is overridden → orchestrator re-prompted but says "nothing to delegate" → repeat until MaxIterations.

Root Cause 2: Empty Worker Responses

Workers complete but SDK never sends SessionIdleEvent. Watchdog fires and completes the session, but FlushedResponse/CurrentResponse are empty because content was in delta events that ended up in chat history but not the response buffers.

Fixes

1. Accept completion when all workers were attempted

Changed both evaluator and self-eval paths: if all workers were attempted (even if some failed/returned empty), accept [[GROUP_REFLECT_COMPLETE]] instead of overriding it. Uses allWorkersAttempted (attemptedWorkers set) alongside allWorkersDispatched (dispatchedWorkers set).

2. Chat history fallback for empty responses

When a worker returns empty after completion + revival, extract the last assistant message from chat history as a fallback. This recovers content that was streamed via delta events but lost when the watchdog completed the session with empty response buffers.

Testing

All 2422 tests pass
Built and relaunched successfully

Two fixes for the Evaluate-pr-tests-orchestrator failure pattern: 1. **Completion override loop fix**: When all workers were attempted (even if some failed/returned empty), accept [[GROUP_REFLECT_COMPLETE]] instead of overriding indefinitely. Previously, workers that failed due to SDK bug #299 (missing SessionIdleEvent) kept dispatchedWorkers empty, causing the orchestrator to loop forever with '0 raw assignments'. Changed both the evaluator path and self-eval path to check allWorkersAttempted (not just allWorkersDispatched) as a valid completion condition. 2. **Chat history fallback for empty responses**: When a worker returns empty (common with SDK bug #299 where watchdog completes the session but FlushedResponse/CurrentResponse are empty), extract the last assistant message from chat history as a fallback. The SDK may have streamed content via delta events that ended up in history even though the response buffers were empty. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

The inner finally block in SendViaOrchestratorReflectAsync awaited SendPromptAndWaitAsync for leftover queued prompts while still holding the semaphore. If the SDK connection was broken (StreamJsonRpc errors), this await would never complete, permanently blocking all future messages with 'Reflect loop already running'. Fix: Collect leftover prompts synchronously in the finally block, then deliver them in a fire-and-forget task with a 30s timeout AFTER the semaphore is released. The semaphore is now always released promptly regardless of SDK connection health. Also includes PR #352 fixes: - Accept [[GROUP_REFLECT_COMPLETE]] when all workers were attempted (not just succeeded), preventing infinite delegation loops - Chat history fallback for empty worker responses (SDK bug #299) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

PureWeen

PR #352 Review — Multi-Model Consensus (5 models × 2 Opus + Sonnet + Gemini + GPT)

CI Status: ⚠️ No checks reported yet

Consensus Findings (2+ model agreement)

🔴 CRITICAL — Thread-unsafe History access (5/5 models)
CopilotService.Organization.cs:~1407

The new chat history fallback reads histState.Info.History.LastOrDefault(...) directly on a background thread. History is a List<ChatMessage> that can be mutated concurrently by SDK event handlers. This risks InvalidOperationException: Collection was modified during enumeration.

The existing code 50 lines below (line 1458) already follows the safe pattern:

// ✅ Existing pattern (line 1458):
var lastAssistant = session.History.ToArray().LastOrDefault(
    m => m.Role == "assistant" && m.Timestamp >= dispatchTimeLocal);

// ❌ New code (line 1407):
var lastAssistant = histState.Info.History
    .LastOrDefault(m => m.Role == "assistant" && !string.IsNullOrWhiteSpace(m.Content));

Fix: Use .ToArray() snapshot and add a timestamp filter to avoid returning stale content from previous turns:

var lastAssistant = histState.Info.History.ToArray()
    .LastOrDefault(m => m.Role == "assistant" 
        && !string.IsNullOrWhiteSpace(m.Content)
        && m.Timestamp >= dispatchTimeLocal);

🟡 MODERATE — SafeFireAndForget convention violation (5/5 models)
CopilotService.Organization.cs:~2603

Leftover prompt delivery uses bare _ = Task.Run(async () => { ... }) instead of the mandated SafeFireAndForget() wrapper. Every other fire-and-forget in the codebase uses this wrapper to ensure unhandled exceptions are observed and logged.

Fix:

SafeFireAndForget(Task.Run(async () => { ... }), "[DISPATCH] leftover-prompts-cleanup");

🟡 MODERATE — Race condition with leftover prompt delivery (4/5 models)
CopilotService.Organization.cs:~2603

After loopLock.Release(), a new reflection loop for the same group can immediately acquire the semaphore and begin sending prompts to the same orchestrator session. The fire-and-forget leftover delivery is concurrently calling SendPromptAndWaitAsync on that session, risking interleaved/out-of-order messages.

Suggestion: Either re-acquire the semaphore briefly for delivery (with the 30s CTS), or accept the race as best-effort and document it.

�� MODERATE — Stale history fallback (2/5 models)
CopilotService.Organization.cs:~1407

The history fallback has no timestamp filter — it grabs the last assistant message ever, not from the current dispatch. The existing pattern at line 1458 filters m.Timestamp >= dispatchTimeLocal. (Addressed by the fix suggested in Finding #1.)

🟢 MINOR — Duplicate allWorkersAttempted (3/5 models)
CopilotService.Organization.cs:~2433,~2470 — computed identically in two scopes. Could be hoisted next to allWorkersDispatched at line 2170.

Test Coverage

No new tests added. Consider:

Test that ExecuteWorkerAsync returns history content when FlushedResponse is empty
Test that completion is accepted when allWorkersAttempted but not allWorkersDispatched
Test that the semaphore is released even when leftover delivery would have hung

Verdict: ⚠️ Request Changes

Core logic is sound and addresses real bugs. Two targeted fixes needed:

Must fix: Add .ToArray() + timestamp filter to history fallback (crash risk)
Must fix: Use SafeFireAndForget() wrapper (convention)
Nice to have: Document or mitigate the leftover delivery race

@worker

Three prompt improvements + structural fallback: 1. Stronger planning prompt: explicit 'NEW request, ignore previous results' instruction when dispatcherOnly=true 2. Stronger nudge: warns about stale results, demands only @worker blocks 3. Forced delegation fallback: when both plan and nudge produce 0 @worker blocks, synthetically create assignments that send the original prompt to all workers. This handles the case where the orchestrator has stale session history and refuses to delegate. Also strengthened the replan prompt with the same fresh-start language. Verified live: orchestrator successfully delegated to 2 workers (10K+ char responses each), produced a 5K char consensus synthesis. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

The Dotnet Skill Validator worker was just describing the methodology from dotnet/skills but never actually running the tool. Updated the system prompt to: - Download skill-validator binary from dotnet/skills nightly releases - Run it against the skill directory with --runs 1 --verbose - Report actual tool output (improvement scores, profile analysis) - Fall back to manual analysis only if the tool can't run Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

WsBridgeIntegrationTests: - WaitForAsync now throws TimeoutException on silent timeout instead of falling through to assertions on stale state - Increased rename test wait to 8s for WebSocket round-trip margin ChatDatabaseResilienceTests: - Increased GC delay (200ms→500ms) and added 3 forced Gen2 collections to reliably trigger UnobservedTaskException finalization under load Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

PureWeen

PR #352 Re-Review — Round 2 (5-Model Consensus)

Models: Claude Opus 4.6 ×2, Claude Sonnet 4.6, Gemini 3 Pro, GPT-5.3 Codex

Previous Findings Status

#	Round 1 Finding	Status	Models
1	🔴 Thread-unsafe `History.LastOrDefault()`	⚠️ STILL OPEN	5/5
2	🟡 Stale history fallback (no timestamp filter)	⚠️ STILL OPEN	5/5
3	🟡 `SafeFireAndForget` violation (`_ = Task.Run(...)`)	⚠️ STILL OPEN	5/5
4	🟡 Race: leftover delivery after semaphore release	✅ FIXED (semaphore hang resolved)	3/5
5	🟢 Duplicate `allWorkersAttempted`	✅ FIXED	5/5

Open Findings

[🔴 CRITICAL] CopilotService.Organization.cs:~1415 — Thread-unsafe History enumeration (5/5 models)

var lastAssistant = histState.Info.History
    .LastOrDefault(m => m.Role == "assistant" && !string.IsNullOrWhiteSpace(m.Content));

History is a plain List<ChatMessage> mutated by SDK event threads via History.Add(). Concurrent enumeration throws InvalidOperationException or returns corrupted data. The safe pattern already exists 220 lines below at line ~1652: History.ToArray().LastOrDefault(...).

Fix: .ToArray().LastOrDefault(...).

[🟡 MODERATE] CopilotService.Organization.cs:~1415 — History fallback has no timestamp guard (5/5 models)

The LastOrDefault picks the most recent assistant message with no time boundary. If the worker session was reused from a prior orchestration iteration, this silently returns a response from a completely different prompt. The correct pattern at line ~1652 gates with m.Timestamp >= dispatchTimeLocal.

Fix: Capture DateTime.UtcNow before dispatch and add && m.Timestamp >= dispatchTime to the filter.

[🟡 MODERATE] CopilotService.Organization.cs:~2616 — Bare _ = Task.Run(...) violates SafeFireAndForget convention (5/5 models)

_ = Task.Run(async () => { foreach ... { try { ... } catch { ... } } });

The inner try/catch handles per-item failures, but exceptions from the outer scaffolding (iterator throw, NullReferenceException, etc.) go unobserved. The codebase convention is SafeFireAndForget() (CopilotService.Utilities.cs:~293).

Fix: SafeFireAndForget(Task.Run(async () => { ... }), "leftover-prompt-delivery");

[🟡 MODERATE] ModelCapabilities.cs:~364 — Skill-validator binary: no integrity check + hardcoded platform (5/5 models)

gh release download skill-validator-nightly --pattern 'skill-validator-osx-arm64.tar.gz' --clobber
tar xzf skill-validator-osx-arm64.tar.gz 2>/dev/null
/tmp/skill-validator <path>

Three issues: (1) osx-arm64 hardcoded — silently wrong binary on Linux/Windows workers. (2) No checksum/hash verification — compromised release → arbitrary code execution. (3) 2>/dev/null silences extraction errors.

Suggested: uname -m for platform selection; SHA256 verification against a pinned hash; remove 2>/dev/null.

Clean / Approved Changes

✅ Force delegation (line ~2341): Correctly scoped to iteration 1 only after plan+nudge both fail
✅ Semaphore fix: Leftover prompts collected synchronously, delivered post-release — no more semaphore hang
✅ WsBridgeIntegrationTests: WaitForAsync now throws TimeoutException — genuine fix for misleading test failures
✅ ChatDatabaseResilienceTests: Triple GC.Collect(2, Forced, blocking: true) is the canonical finalizer pipeline pattern
✅ Duplicate allWorkersAttempted: Now correctly computed inline at both callsites in separate scopes

Verdict: ⚠️ Request Changes

3 unfixed findings from Round 1 remain (thread-safety crash risk + silent stale data corruption + convention violation). The thread-unsafe History.LastOrDefault() is the highest priority — it's a runtime crash risk under concurrent SDK events.

Review by 5-model AI consensus (Round 2)

1. Thread-safe History access: .ToArray() snapshot before .LastOrDefault() to prevent InvalidOperationException from concurrent SDK mutations 2. Timestamp guard on history fallback: only return assistant messages from the current dispatch, not stale content from prior turns 3. SafeFireAndForget wrapper for leftover prompt delivery instead of bare _ = Task.Run(...) — matches codebase convention 4. Platform-aware skill-validator download: uname detection for osx-arm64/osx-x64/linux-arm64/linux-x64 instead of hardcoded macOS Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

PureWeen

PR #352 Re-Review — Round 3 ✅

Commit reviewed: 7ad09691 "Address PR #352 review findings (Round 2)"

All 4 Round 2 Findings: FIXED ✅

#	Finding	Status	Evidence
1	🔴 Thread-unsafe `History.LastOrDefault()`	✅ FIXED	Now uses `History.ToArray().LastOrDefault(...)` — snapshot before enumeration
2	🟡 No timestamp filter on history fallback	✅ FIXED	`var dispatchTime = DateTime.UtcNow` captured before dispatch + `m.Timestamp >= dispatchTime` filter applied
3	🟡 Bare `_ = Task.Run(...)` convention violation	✅ FIXED	Now wrapped with `SafeFireAndForget(Task.Run(...), "leftover-prompt-delivery")`
4	🟡 Skill-validator hardcoded platform + silent errors	✅ FIXED	Full `uname -s`/`uname -m` detection (Darwin-arm64, Darwin-x86_64, Linux-aarch64, Linux-x86_64) with `exit 1` for unsupported. `tar xzf` no longer silences errors.

Design Review (unchanged from Round 2 — all clean)

✅ Force delegation correctly scoped to iteration 1 only (after plan + nudge both fail)
✅ Semaphore fix: leftover prompts collected synchronously in finally, delivered after release with 30s CancellationTokenSource
✅ WaitForAsync now throws TimeoutException — genuine fix for misleading test failures
✅ GC triple-collect with Forced, blocking: true is the canonical finalizer pipeline pattern
✅ allWorkersAttempted correctly computed inline at both callsites in separate scopes

Minor Notes (not blocking)

No SHA256 checksum on skill-validator download — gh release download from authenticated GitHub provides reasonable integrity. Could add in a follow-up.
Leftover delivery theoretically races with a new reflect loop on the same orchestrator session, but the 30s timeout + SafeFireAndForget wrapper limits blast radius. Acceptable tradeoff.

Verdict: ✅ Approve — All findings addressed. Clean fixes, no new issues introduced.

Review by AI consensus (Round 3)

Apply all 6 improvements from the Skill Validator consensus report (6.5/10 → IMPROVE): SKILL.md: - Add code example for INV-3 (generation guard pattern) - Clarify INV-7: explicit guidance for new cross-thread fields vs pre-existing gap - Add 'Diagnosing a Stuck Session' section with symptom table and diagnostic steps - Add symptom-based trigger phrases to description (spinner stuck, hung after tool use) - Add test file pointer: run ProcessingWatchdogTests.cs after changes ModelCapabilities.cs: - Add optional Claude Code live trigger testing to Anthropic evaluator prompt - Handle auth failures gracefully (skip and note, don't retry) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Add 3rd worker (Eval Generator) to the Skill Validator multi-agent preset: - Worker-3 reads SKILL.md and generates tests/eval.yaml with positive/negative scenarios - Updated RoutingContext for 2-phase dispatch (generate evals → run evaluators) - Updated SharedContext with eval.yaml format reference - Updated test assertions for 3 workers Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

@worker

The orchestrator was using task()/bash()/view() tools itself instead of emitting @worker:/@EnD blocks for PolyPilot to dispatch to worker sessions. Changes to RoutingContext: - Added explicit anti-self-work warning at the top - Listed full worker session names (not just 'worker-1') - Added concrete @worker block examples for each phase - Removed 'check if eval.yaml exists' instruction that invited tool use - Added 'NEVER use tools yourself' rule Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

@worker

…urns The Copilot CLI agent has full tool access and ignores 'dispatcher only' instructions, doing all the work itself via task()/bash()/view() before the orchestrator response completes. By the time ParseTaskAssignments runs, the @worker blocks are buried in a multi-minute tool session. Fix: FlushCurrentResponse now checks accumulated FlushedResponse for @worker:...@EnD blocks when EarlyDispatchOnWorkerBlocks is set. If found, it resolves the ResponseCompletion TCS immediately so ParseTaskAssignments can extract and dispatch workers while the orchestrator is still running. The flag is set for all orchestrator planning prompts (reflect, non-reflect, and nudge paths). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Workers now use preset-defined display names (e.g., dotnet-validator, anthropic-validator, eval-generator) instead of generic worker-1/2/3. Updated RoutingContext examples and consensus rules to match. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

After early dispatch resolves the TCS, the orchestrator continues doing tool work. The reflect loop must wait for it to go idle before sending the synthesis prompt, otherwise SendPromptAsync throws 'already processing'. Added WaitForSessionIdleAsync helper used in both reflect and non-reflect flows. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

After early dispatch, the orchestrator may hit the zero-idle bug where the SDK never emits SessionIdleEvent. Instead of waiting the full 300s timeout, check LastEventAtTicks — if no SDK events for 60s, abort the orchestrator and proceed to synthesis immediately. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

@worker

When the reflect loop's iteration > 1 returns 0 assignments, the code used to declare GoalMet immediately. But with 2-phase dispatch, only the eval-generator ran in Phase 1 — the orchestrator's replan response didn't write @worker blocks for the validators. Now we check if all workers have been dispatched and force-dispatch remaining ones if not. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

PureWeen and others added 2 commits March 11, 2026 07:36

PureWeen mentioned this pull request Mar 11, 2026

fix: prevent reflect loop semaphore from getting stuck on cleanup hang #323

Closed

PureWeen commented Mar 11, 2026

View reviewed changes

PureWeen and others added 3 commits March 11, 2026 09:37

PureWeen commented Mar 11, 2026

View reviewed changes

PureWeen and others added 8 commits March 11, 2026 15:13

PureWeen merged commit e5d2bc0 into main Mar 12, 2026

PureWeen deleted the fix/orchestrator-delegation-failures branch March 12, 2026 00:36

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix orchestrator delegation infinite loop and empty worker responses#352

Fix orchestrator delegation infinite loop and empty worker responses#352
PureWeen merged 14 commits intomainfrom
fix/orchestrator-delegation-failures

PureWeen commented Mar 11, 2026

Uh oh!

PureWeen left a comment

Uh oh!

PureWeen left a comment

Uh oh!

PureWeen left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

PureWeen commented Mar 11, 2026

Problem

Root Cause 1: Completion Override Infinite Loop

Root Cause 2: Empty Worker Responses

Fixes

1. Accept completion when all workers were attempted

2. Chat history fallback for empty responses

Testing

Related

Uh oh!

PureWeen left a comment

Choose a reason for hiding this comment

PR #352 Review — Multi-Model Consensus (5 models × 2 Opus + Sonnet + Gemini + GPT)

Consensus Findings (2+ model agreement)

Test Coverage

Verdict: ⚠️ Request Changes

Uh oh!

PureWeen left a comment

Choose a reason for hiding this comment

PR #352 Re-Review — Round 2 (5-Model Consensus)

Previous Findings Status

Open Findings

Clean / Approved Changes

Verdict: ⚠️ Request Changes

Uh oh!

PureWeen left a comment

Choose a reason for hiding this comment

PR #352 Re-Review — Round 3 ✅

All 4 Round 2 Findings: FIXED ✅

Design Review (unchanged from Round 2 — all clean)

Minor Notes (not blocking)

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant