Fix orchestrator delegation infinite loop and empty worker responses#352
Fix orchestrator delegation infinite loop and empty worker responses#352
Conversation
Two fixes for the Evaluate-pr-tests-orchestrator failure pattern: 1. **Completion override loop fix**: When all workers were attempted (even if some failed/returned empty), accept [[GROUP_REFLECT_COMPLETE]] instead of overriding indefinitely. Previously, workers that failed due to SDK bug #299 (missing SessionIdleEvent) kept dispatchedWorkers empty, causing the orchestrator to loop forever with '0 raw assignments'. Changed both the evaluator path and self-eval path to check allWorkersAttempted (not just allWorkersDispatched) as a valid completion condition. 2. **Chat history fallback for empty responses**: When a worker returns empty (common with SDK bug #299 where watchdog completes the session but FlushedResponse/CurrentResponse are empty), extract the last assistant message from chat history as a fallback. The SDK may have streamed content via delta events that ended up in history even though the response buffers were empty. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The inner finally block in SendViaOrchestratorReflectAsync awaited SendPromptAndWaitAsync for leftover queued prompts while still holding the semaphore. If the SDK connection was broken (StreamJsonRpc errors), this await would never complete, permanently blocking all future messages with 'Reflect loop already running'. Fix: Collect leftover prompts synchronously in the finally block, then deliver them in a fire-and-forget task with a 30s timeout AFTER the semaphore is released. The semaphore is now always released promptly regardless of SDK connection health. Also includes PR #352 fixes: - Accept [[GROUP_REFLECT_COMPLETE]] when all workers were attempted (not just succeeded), preventing infinite delegation loops - Chat history fallback for empty worker responses (SDK bug #299) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
PureWeen
left a comment
There was a problem hiding this comment.
PR #352 Review — Multi-Model Consensus (5 models × 2 Opus + Sonnet + Gemini + GPT)
CI Status:
Consensus Findings (2+ model agreement)
🔴 CRITICAL — Thread-unsafe History access (5/5 models)
CopilotService.Organization.cs:~1407
The new chat history fallback reads histState.Info.History.LastOrDefault(...) directly on a background thread. History is a List<ChatMessage> that can be mutated concurrently by SDK event handlers. This risks InvalidOperationException: Collection was modified during enumeration.
The existing code 50 lines below (line 1458) already follows the safe pattern:
// ✅ Existing pattern (line 1458):
var lastAssistant = session.History.ToArray().LastOrDefault(
m => m.Role == "assistant" && m.Timestamp >= dispatchTimeLocal);
// ❌ New code (line 1407):
var lastAssistant = histState.Info.History
.LastOrDefault(m => m.Role == "assistant" && !string.IsNullOrWhiteSpace(m.Content));Fix: Use .ToArray() snapshot and add a timestamp filter to avoid returning stale content from previous turns:
var lastAssistant = histState.Info.History.ToArray()
.LastOrDefault(m => m.Role == "assistant"
&& !string.IsNullOrWhiteSpace(m.Content)
&& m.Timestamp >= dispatchTimeLocal);🟡 MODERATE — SafeFireAndForget convention violation (5/5 models)
CopilotService.Organization.cs:~2603
Leftover prompt delivery uses bare _ = Task.Run(async () => { ... }) instead of the mandated SafeFireAndForget() wrapper. Every other fire-and-forget in the codebase uses this wrapper to ensure unhandled exceptions are observed and logged.
Fix:
SafeFireAndForget(Task.Run(async () => { ... }), "[DISPATCH] leftover-prompts-cleanup");🟡 MODERATE — Race condition with leftover prompt delivery (4/5 models)
CopilotService.Organization.cs:~2603
After loopLock.Release(), a new reflection loop for the same group can immediately acquire the semaphore and begin sending prompts to the same orchestrator session. The fire-and-forget leftover delivery is concurrently calling SendPromptAndWaitAsync on that session, risking interleaved/out-of-order messages.
Suggestion: Either re-acquire the semaphore briefly for delivery (with the 30s CTS), or accept the race as best-effort and document it.
�� MODERATE — Stale history fallback (2/5 models)
CopilotService.Organization.cs:~1407
The history fallback has no timestamp filter — it grabs the last assistant message ever, not from the current dispatch. The existing pattern at line 1458 filters m.Timestamp >= dispatchTimeLocal. (Addressed by the fix suggested in Finding #1.)
🟢 MINOR — Duplicate allWorkersAttempted (3/5 models)
CopilotService.Organization.cs:~2433,~2470 — computed identically in two scopes. Could be hoisted next to allWorkersDispatched at line 2170.
Test Coverage
No new tests added. Consider:
- Test that
ExecuteWorkerAsyncreturns history content whenFlushedResponseis empty - Test that completion is accepted when
allWorkersAttemptedbut notallWorkersDispatched - Test that the semaphore is released even when leftover delivery would have hung
Verdict: ⚠️ Request Changes
Core logic is sound and addresses real bugs. Two targeted fixes needed:
- Must fix: Add
.ToArray()+ timestamp filter to history fallback (crash risk) - Must fix: Use
SafeFireAndForget()wrapper (convention) - Nice to have: Document or mitigate the leftover delivery race
Three prompt improvements + structural fallback: 1. Stronger planning prompt: explicit 'NEW request, ignore previous results' instruction when dispatcherOnly=true 2. Stronger nudge: warns about stale results, demands only @worker blocks 3. Forced delegation fallback: when both plan and nudge produce 0 @worker blocks, synthetically create assignments that send the original prompt to all workers. This handles the case where the orchestrator has stale session history and refuses to delegate. Also strengthened the replan prompt with the same fresh-start language. Verified live: orchestrator successfully delegated to 2 workers (10K+ char responses each), produced a 5K char consensus synthesis. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The Dotnet Skill Validator worker was just describing the methodology from dotnet/skills but never actually running the tool. Updated the system prompt to: - Download skill-validator binary from dotnet/skills nightly releases - Run it against the skill directory with --runs 1 --verbose - Report actual tool output (improvement scores, profile analysis) - Fall back to manual analysis only if the tool can't run Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
WsBridgeIntegrationTests: - WaitForAsync now throws TimeoutException on silent timeout instead of falling through to assertions on stale state - Increased rename test wait to 8s for WebSocket round-trip margin ChatDatabaseResilienceTests: - Increased GC delay (200ms→500ms) and added 3 forced Gen2 collections to reliably trigger UnobservedTaskException finalization under load Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
PureWeen
left a comment
There was a problem hiding this comment.
PR #352 Re-Review — Round 2 (5-Model Consensus)
Models: Claude Opus 4.6 ×2, Claude Sonnet 4.6, Gemini 3 Pro, GPT-5.3 Codex
Previous Findings Status
| # | Round 1 Finding | Status | Models |
|---|---|---|---|
| 1 | 🔴 Thread-unsafe History.LastOrDefault() |
5/5 | |
| 2 | 🟡 Stale history fallback (no timestamp filter) | 5/5 | |
| 3 | 🟡 SafeFireAndForget violation (_ = Task.Run(...)) |
5/5 | |
| 4 | 🟡 Race: leftover delivery after semaphore release | ✅ FIXED (semaphore hang resolved) | 3/5 |
| 5 | 🟢 Duplicate allWorkersAttempted |
✅ FIXED | 5/5 |
Open Findings
[🔴 CRITICAL] CopilotService.Organization.cs:~1415 — Thread-unsafe History enumeration (5/5 models)
var lastAssistant = histState.Info.History
.LastOrDefault(m => m.Role == "assistant" && !string.IsNullOrWhiteSpace(m.Content));History is a plain List<ChatMessage> mutated by SDK event threads via History.Add(). Concurrent enumeration throws InvalidOperationException or returns corrupted data. The safe pattern already exists 220 lines below at line ~1652: History.ToArray().LastOrDefault(...).
Fix: .ToArray().LastOrDefault(...).
[🟡 MODERATE] CopilotService.Organization.cs:~1415 — History fallback has no timestamp guard (5/5 models)
The LastOrDefault picks the most recent assistant message with no time boundary. If the worker session was reused from a prior orchestration iteration, this silently returns a response from a completely different prompt. The correct pattern at line ~1652 gates with m.Timestamp >= dispatchTimeLocal.
Fix: Capture DateTime.UtcNow before dispatch and add && m.Timestamp >= dispatchTime to the filter.
[🟡 MODERATE] CopilotService.Organization.cs:~2616 — Bare _ = Task.Run(...) violates SafeFireAndForget convention (5/5 models)
_ = Task.Run(async () => { foreach ... { try { ... } catch { ... } } });The inner try/catch handles per-item failures, but exceptions from the outer scaffolding (iterator throw, NullReferenceException, etc.) go unobserved. The codebase convention is SafeFireAndForget() (CopilotService.Utilities.cs:~293).
Fix: SafeFireAndForget(Task.Run(async () => { ... }), "leftover-prompt-delivery");
[🟡 MODERATE] ModelCapabilities.cs:~364 — Skill-validator binary: no integrity check + hardcoded platform (5/5 models)
gh release download skill-validator-nightly --pattern 'skill-validator-osx-arm64.tar.gz' --clobber
tar xzf skill-validator-osx-arm64.tar.gz 2>/dev/null
/tmp/skill-validator <path>Three issues: (1) osx-arm64 hardcoded — silently wrong binary on Linux/Windows workers. (2) No checksum/hash verification — compromised release → arbitrary code execution. (3) 2>/dev/null silences extraction errors.
Suggested: uname -m for platform selection; SHA256 verification against a pinned hash; remove 2>/dev/null.
Clean / Approved Changes
- ✅ Force delegation (line ~2341): Correctly scoped to iteration 1 only after plan+nudge both fail
- ✅ Semaphore fix: Leftover prompts collected synchronously, delivered post-release — no more semaphore hang
- ✅ WsBridgeIntegrationTests:
WaitForAsyncnow throwsTimeoutException— genuine fix for misleading test failures - ✅ ChatDatabaseResilienceTests: Triple
GC.Collect(2, Forced, blocking: true)is the canonical finalizer pipeline pattern - ✅ Duplicate
allWorkersAttempted: Now correctly computed inline at both callsites in separate scopes
Verdict: ⚠️ Request Changes
3 unfixed findings from Round 1 remain (thread-safety crash risk + silent stale data corruption + convention violation). The thread-unsafe History.LastOrDefault() is the highest priority — it's a runtime crash risk under concurrent SDK events.
Review by 5-model AI consensus (Round 2)
1. Thread-safe History access: .ToArray() snapshot before .LastOrDefault() to prevent InvalidOperationException from concurrent SDK mutations 2. Timestamp guard on history fallback: only return assistant messages from the current dispatch, not stale content from prior turns 3. SafeFireAndForget wrapper for leftover prompt delivery instead of bare _ = Task.Run(...) — matches codebase convention 4. Platform-aware skill-validator download: uname detection for osx-arm64/osx-x64/linux-arm64/linux-x64 instead of hardcoded macOS Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
PureWeen
left a comment
There was a problem hiding this comment.
PR #352 Re-Review — Round 3 ✅
Commit reviewed: 7ad09691 "Address PR #352 review findings (Round 2)"
All 4 Round 2 Findings: FIXED ✅
| # | Finding | Status | Evidence |
|---|---|---|---|
| 1 | 🔴 Thread-unsafe History.LastOrDefault() |
✅ FIXED | Now uses History.ToArray().LastOrDefault(...) — snapshot before enumeration |
| 2 | 🟡 No timestamp filter on history fallback | ✅ FIXED | var dispatchTime = DateTime.UtcNow captured before dispatch + m.Timestamp >= dispatchTime filter applied |
| 3 | 🟡 Bare _ = Task.Run(...) convention violation |
✅ FIXED | Now wrapped with SafeFireAndForget(Task.Run(...), "leftover-prompt-delivery") |
| 4 | 🟡 Skill-validator hardcoded platform + silent errors | ✅ FIXED | Full uname -s/uname -m detection (Darwin-arm64, Darwin-x86_64, Linux-aarch64, Linux-x86_64) with exit 1 for unsupported. tar xzf no longer silences errors. |
Design Review (unchanged from Round 2 — all clean)
- ✅ Force delegation correctly scoped to iteration 1 only (after plan + nudge both fail)
- ✅ Semaphore fix: leftover prompts collected synchronously in
finally, delivered after release with 30s CancellationTokenSource - ✅
WaitForAsyncnow throwsTimeoutException— genuine fix for misleading test failures - ✅ GC triple-collect with
Forced, blocking: trueis the canonical finalizer pipeline pattern - ✅
allWorkersAttemptedcorrectly computed inline at both callsites in separate scopes
Minor Notes (not blocking)
- No SHA256 checksum on skill-validator download —
gh release downloadfrom authenticated GitHub provides reasonable integrity. Could add in a follow-up. - Leftover delivery theoretically races with a new reflect loop on the same orchestrator session, but the 30s timeout +
SafeFireAndForgetwrapper limits blast radius. Acceptable tradeoff.
Verdict: ✅ Approve — All findings addressed. Clean fixes, no new issues introduced.
Review by AI consensus (Round 3)
Apply all 6 improvements from the Skill Validator consensus report (6.5/10 → IMPROVE): SKILL.md: - Add code example for INV-3 (generation guard pattern) - Clarify INV-7: explicit guidance for new cross-thread fields vs pre-existing gap - Add 'Diagnosing a Stuck Session' section with symptom table and diagnostic steps - Add symptom-based trigger phrases to description (spinner stuck, hung after tool use) - Add test file pointer: run ProcessingWatchdogTests.cs after changes ModelCapabilities.cs: - Add optional Claude Code live trigger testing to Anthropic evaluator prompt - Handle auth failures gracefully (skip and note, don't retry) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add 3rd worker (Eval Generator) to the Skill Validator multi-agent preset: - Worker-3 reads SKILL.md and generates tests/eval.yaml with positive/negative scenarios - Updated RoutingContext for 2-phase dispatch (generate evals → run evaluators) - Updated SharedContext with eval.yaml format reference - Updated test assertions for 3 workers Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The orchestrator was using task()/bash()/view() tools itself instead of emitting @worker:/@EnD blocks for PolyPilot to dispatch to worker sessions. Changes to RoutingContext: - Added explicit anti-self-work warning at the top - Listed full worker session names (not just 'worker-1') - Added concrete @worker block examples for each phase - Removed 'check if eval.yaml exists' instruction that invited tool use - Added 'NEVER use tools yourself' rule Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…urns The Copilot CLI agent has full tool access and ignores 'dispatcher only' instructions, doing all the work itself via task()/bash()/view() before the orchestrator response completes. By the time ParseTaskAssignments runs, the @worker blocks are buried in a multi-minute tool session. Fix: FlushCurrentResponse now checks accumulated FlushedResponse for @worker:...@EnD blocks when EarlyDispatchOnWorkerBlocks is set. If found, it resolves the ResponseCompletion TCS immediately so ParseTaskAssignments can extract and dispatch workers while the orchestrator is still running. The flag is set for all orchestrator planning prompts (reflect, non-reflect, and nudge paths). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Workers now use preset-defined display names (e.g., dotnet-validator, anthropic-validator, eval-generator) instead of generic worker-1/2/3. Updated RoutingContext examples and consensus rules to match. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
After early dispatch resolves the TCS, the orchestrator continues doing tool work. The reflect loop must wait for it to go idle before sending the synthesis prompt, otherwise SendPromptAsync throws 'already processing'. Added WaitForSessionIdleAsync helper used in both reflect and non-reflect flows. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
After early dispatch, the orchestrator may hit the zero-idle bug where the SDK never emits SessionIdleEvent. Instead of waiting the full 300s timeout, check LastEventAtTicks — if no SDK events for 60s, abort the orchestrator and proceed to synthesis immediately. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
When the reflect loop's iteration > 1 returns 0 assignments, the code used to declare GoalMet immediately. But with 2-phase dispatch, only the eval-generator ran in Phase 1 — the orchestrator's replan response didn't write @worker blocks for the validators. Now we check if all workers have been dispatched and force-dispatch remaining ones if not. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Problem
The Evaluate-pr-tests-orchestrator fails to complete with two symptoms:
Root Cause 1: Completion Override Infinite Loop
When workers fail/return empty (SDK bug #299),
dispatchedWorkersstays empty →allWorkersDispatched = false→[[GROUP_REFLECT_COMPLETE]]is overridden → orchestrator re-prompted but says "nothing to delegate" → repeat until MaxIterations.Root Cause 2: Empty Worker Responses
Workers complete but SDK never sends SessionIdleEvent. Watchdog fires and completes the session, but
FlushedResponse/CurrentResponseare empty because content was in delta events that ended up in chat history but not the response buffers.Fixes
1. Accept completion when all workers were attempted
Changed both evaluator and self-eval paths: if all workers were attempted (even if some failed/returned empty), accept
[[GROUP_REFLECT_COMPLETE]]instead of overriding it. UsesallWorkersAttempted(attemptedWorkersset) alongsideallWorkersDispatched(dispatchedWorkersset).2. Chat history fallback for empty responses
When a worker returns empty after completion + revival, extract the last assistant message from chat history as a fallback. This recovers content that was streamed via delta events but lost when the watchdog completed the session with empty response buffers.
Testing
Related