Skip to content

Fix orchestrator delegation infinite loop and empty worker responses#352

Merged
PureWeen merged 14 commits intomainfrom
fix/orchestrator-delegation-failures
Mar 12, 2026
Merged

Fix orchestrator delegation infinite loop and empty worker responses#352
PureWeen merged 14 commits intomainfrom
fix/orchestrator-delegation-failures

Conversation

@PureWeen
Copy link
Owner

Problem

The Evaluate-pr-tests-orchestrator fails to complete with two symptoms:

  1. Orchestrator loops indefinitely with "0 raw assignments"
  2. Workers return empty responses despite having processed content

Root Cause 1: Completion Override Infinite Loop

When workers fail/return empty (SDK bug #299), dispatchedWorkers stays empty → allWorkersDispatched = false[[GROUP_REFLECT_COMPLETE]] is overridden → orchestrator re-prompted but says "nothing to delegate" → repeat until MaxIterations.

Root Cause 2: Empty Worker Responses

Workers complete but SDK never sends SessionIdleEvent. Watchdog fires and completes the session, but FlushedResponse/CurrentResponse are empty because content was in delta events that ended up in chat history but not the response buffers.

Fixes

1. Accept completion when all workers were attempted

Changed both evaluator and self-eval paths: if all workers were attempted (even if some failed/returned empty), accept [[GROUP_REFLECT_COMPLETE]] instead of overriding it. Uses allWorkersAttempted (attemptedWorkers set) alongside allWorkersDispatched (dispatchedWorkers set).

2. Chat history fallback for empty responses

When a worker returns empty after completion + revival, extract the last assistant message from chat history as a fallback. This recovers content that was streamed via delta events but lost when the watchdog completed the session with empty response buffers.

Testing

  • All 2422 tests pass
  • Built and relaunched successfully

Related

PureWeen and others added 2 commits March 11, 2026 07:36
Two fixes for the Evaluate-pr-tests-orchestrator failure pattern:

1. **Completion override loop fix**: When all workers were attempted (even
   if some failed/returned empty), accept [[GROUP_REFLECT_COMPLETE]] instead
   of overriding indefinitely. Previously, workers that failed due to SDK
   bug #299 (missing SessionIdleEvent) kept dispatchedWorkers empty, causing
   the orchestrator to loop forever with '0 raw assignments'.

   Changed both the evaluator path and self-eval path to check
   allWorkersAttempted (not just allWorkersDispatched) as a valid
   completion condition.

2. **Chat history fallback for empty responses**: When a worker returns
   empty (common with SDK bug #299 where watchdog completes the session
   but FlushedResponse/CurrentResponse are empty), extract the last
   assistant message from chat history as a fallback. The SDK may have
   streamed content via delta events that ended up in history even though
   the response buffers were empty.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The inner finally block in SendViaOrchestratorReflectAsync awaited
SendPromptAndWaitAsync for leftover queued prompts while still holding
the semaphore. If the SDK connection was broken (StreamJsonRpc errors),
this await would never complete, permanently blocking all future
messages with 'Reflect loop already running'.

Fix: Collect leftover prompts synchronously in the finally block, then
deliver them in a fire-and-forget task with a 30s timeout AFTER the
semaphore is released. The semaphore is now always released promptly
regardless of SDK connection health.

Also includes PR #352 fixes:
- Accept [[GROUP_REFLECT_COMPLETE]] when all workers were attempted
  (not just succeeded), preventing infinite delegation loops
- Chat history fallback for empty worker responses (SDK bug #299)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copy link
Owner Author

@PureWeen PureWeen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR #352 Review — Multi-Model Consensus (5 models × 2 Opus + Sonnet + Gemini + GPT)

CI Status: ⚠️ No checks reported yet

Consensus Findings (2+ model agreement)

🔴 CRITICAL — Thread-unsafe History access (5/5 models)
CopilotService.Organization.cs:~1407

The new chat history fallback reads histState.Info.History.LastOrDefault(...) directly on a background thread. History is a List<ChatMessage> that can be mutated concurrently by SDK event handlers. This risks InvalidOperationException: Collection was modified during enumeration.

The existing code 50 lines below (line 1458) already follows the safe pattern:

// ✅ Existing pattern (line 1458):
var lastAssistant = session.History.ToArray().LastOrDefault(
    m => m.Role == "assistant" && m.Timestamp >= dispatchTimeLocal);

// ❌ New code (line 1407):
var lastAssistant = histState.Info.History
    .LastOrDefault(m => m.Role == "assistant" && !string.IsNullOrWhiteSpace(m.Content));

Fix: Use .ToArray() snapshot and add a timestamp filter to avoid returning stale content from previous turns:

var lastAssistant = histState.Info.History.ToArray()
    .LastOrDefault(m => m.Role == "assistant" 
        && !string.IsNullOrWhiteSpace(m.Content)
        && m.Timestamp >= dispatchTimeLocal);

🟡 MODERATE — SafeFireAndForget convention violation (5/5 models)
CopilotService.Organization.cs:~2603

Leftover prompt delivery uses bare _ = Task.Run(async () => { ... }) instead of the mandated SafeFireAndForget() wrapper. Every other fire-and-forget in the codebase uses this wrapper to ensure unhandled exceptions are observed and logged.

Fix:

SafeFireAndForget(Task.Run(async () => { ... }), "[DISPATCH] leftover-prompts-cleanup");

🟡 MODERATE — Race condition with leftover prompt delivery (4/5 models)
CopilotService.Organization.cs:~2603

After loopLock.Release(), a new reflection loop for the same group can immediately acquire the semaphore and begin sending prompts to the same orchestrator session. The fire-and-forget leftover delivery is concurrently calling SendPromptAndWaitAsync on that session, risking interleaved/out-of-order messages.

Suggestion: Either re-acquire the semaphore briefly for delivery (with the 30s CTS), or accept the race as best-effort and document it.


�� MODERATE — Stale history fallback (2/5 models)
CopilotService.Organization.cs:~1407

The history fallback has no timestamp filter — it grabs the last assistant message ever, not from the current dispatch. The existing pattern at line 1458 filters m.Timestamp >= dispatchTimeLocal. (Addressed by the fix suggested in Finding #1.)


🟢 MINOR — Duplicate allWorkersAttempted (3/5 models)
CopilotService.Organization.cs:~2433,~2470 — computed identically in two scopes. Could be hoisted next to allWorkersDispatched at line 2170.


Test Coverage

No new tests added. Consider:

  1. Test that ExecuteWorkerAsync returns history content when FlushedResponse is empty
  2. Test that completion is accepted when allWorkersAttempted but not allWorkersDispatched
  3. Test that the semaphore is released even when leftover delivery would have hung

Verdict: ⚠️ Request Changes

Core logic is sound and addresses real bugs. Two targeted fixes needed:

  1. Must fix: Add .ToArray() + timestamp filter to history fallback (crash risk)
  2. Must fix: Use SafeFireAndForget() wrapper (convention)
  3. Nice to have: Document or mitigate the leftover delivery race

PureWeen and others added 3 commits March 11, 2026 09:37
Three prompt improvements + structural fallback:

1. Stronger planning prompt: explicit 'NEW request, ignore previous
   results' instruction when dispatcherOnly=true
2. Stronger nudge: warns about stale results, demands only @worker blocks
3. Forced delegation fallback: when both plan and nudge produce 0
   @worker blocks, synthetically create assignments that send the
   original prompt to all workers. This handles the case where the
   orchestrator has stale session history and refuses to delegate.

Also strengthened the replan prompt with the same fresh-start language.

Verified live: orchestrator successfully delegated to 2 workers
(10K+ char responses each), produced a 5K char consensus synthesis.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The Dotnet Skill Validator worker was just describing the methodology
from dotnet/skills but never actually running the tool. Updated the
system prompt to:

- Download skill-validator binary from dotnet/skills nightly releases
- Run it against the skill directory with --runs 1 --verbose
- Report actual tool output (improvement scores, profile analysis)
- Fall back to manual analysis only if the tool can't run

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
WsBridgeIntegrationTests:
- WaitForAsync now throws TimeoutException on silent timeout instead
  of falling through to assertions on stale state
- Increased rename test wait to 8s for WebSocket round-trip margin

ChatDatabaseResilienceTests:
- Increased GC delay (200ms→500ms) and added 3 forced Gen2 collections
  to reliably trigger UnobservedTaskException finalization under load

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copy link
Owner Author

@PureWeen PureWeen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR #352 Re-Review — Round 2 (5-Model Consensus)

Models: Claude Opus 4.6 ×2, Claude Sonnet 4.6, Gemini 3 Pro, GPT-5.3 Codex

Previous Findings Status

# Round 1 Finding Status Models
1 🔴 Thread-unsafe History.LastOrDefault() ⚠️ STILL OPEN 5/5
2 🟡 Stale history fallback (no timestamp filter) ⚠️ STILL OPEN 5/5
3 🟡 SafeFireAndForget violation (_ = Task.Run(...)) ⚠️ STILL OPEN 5/5
4 🟡 Race: leftover delivery after semaphore release ✅ FIXED (semaphore hang resolved) 3/5
5 🟢 Duplicate allWorkersAttempted ✅ FIXED 5/5

Open Findings

[🔴 CRITICAL] CopilotService.Organization.cs:~1415 — Thread-unsafe History enumeration (5/5 models)

var lastAssistant = histState.Info.History
    .LastOrDefault(m => m.Role == "assistant" && !string.IsNullOrWhiteSpace(m.Content));

History is a plain List<ChatMessage> mutated by SDK event threads via History.Add(). Concurrent enumeration throws InvalidOperationException or returns corrupted data. The safe pattern already exists 220 lines below at line ~1652: History.ToArray().LastOrDefault(...).

Fix: .ToArray().LastOrDefault(...).


[🟡 MODERATE] CopilotService.Organization.cs:~1415 — History fallback has no timestamp guard (5/5 models)

The LastOrDefault picks the most recent assistant message with no time boundary. If the worker session was reused from a prior orchestration iteration, this silently returns a response from a completely different prompt. The correct pattern at line ~1652 gates with m.Timestamp >= dispatchTimeLocal.

Fix: Capture DateTime.UtcNow before dispatch and add && m.Timestamp >= dispatchTime to the filter.


[🟡 MODERATE] CopilotService.Organization.cs:~2616 — Bare _ = Task.Run(...) violates SafeFireAndForget convention (5/5 models)

_ = Task.Run(async () => { foreach ... { try { ... } catch { ... } } });

The inner try/catch handles per-item failures, but exceptions from the outer scaffolding (iterator throw, NullReferenceException, etc.) go unobserved. The codebase convention is SafeFireAndForget() (CopilotService.Utilities.cs:~293).

Fix: SafeFireAndForget(Task.Run(async () => { ... }), "leftover-prompt-delivery");


[🟡 MODERATE] ModelCapabilities.cs:~364 — Skill-validator binary: no integrity check + hardcoded platform (5/5 models)

gh release download skill-validator-nightly --pattern 'skill-validator-osx-arm64.tar.gz' --clobber
tar xzf skill-validator-osx-arm64.tar.gz 2>/dev/null
/tmp/skill-validator <path>

Three issues: (1) osx-arm64 hardcoded — silently wrong binary on Linux/Windows workers. (2) No checksum/hash verification — compromised release → arbitrary code execution. (3) 2>/dev/null silences extraction errors.

Suggested: uname -m for platform selection; SHA256 verification against a pinned hash; remove 2>/dev/null.


Clean / Approved Changes

  • Force delegation (line ~2341): Correctly scoped to iteration 1 only after plan+nudge both fail
  • Semaphore fix: Leftover prompts collected synchronously, delivered post-release — no more semaphore hang
  • WsBridgeIntegrationTests: WaitForAsync now throws TimeoutException — genuine fix for misleading test failures
  • ChatDatabaseResilienceTests: Triple GC.Collect(2, Forced, blocking: true) is the canonical finalizer pipeline pattern
  • Duplicate allWorkersAttempted: Now correctly computed inline at both callsites in separate scopes

Verdict: ⚠️ Request Changes

3 unfixed findings from Round 1 remain (thread-safety crash risk + silent stale data corruption + convention violation). The thread-unsafe History.LastOrDefault() is the highest priority — it's a runtime crash risk under concurrent SDK events.

Review by 5-model AI consensus (Round 2)

1. Thread-safe History access: .ToArray() snapshot before .LastOrDefault()
   to prevent InvalidOperationException from concurrent SDK mutations
2. Timestamp guard on history fallback: only return assistant messages
   from the current dispatch, not stale content from prior turns
3. SafeFireAndForget wrapper for leftover prompt delivery instead of
   bare _ = Task.Run(...) — matches codebase convention
4. Platform-aware skill-validator download: uname detection for
   osx-arm64/osx-x64/linux-arm64/linux-x64 instead of hardcoded macOS

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copy link
Owner Author

@PureWeen PureWeen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR #352 Re-Review — Round 3 ✅

Commit reviewed: 7ad09691 "Address PR #352 review findings (Round 2)"

All 4 Round 2 Findings: FIXED ✅

# Finding Status Evidence
1 🔴 Thread-unsafe History.LastOrDefault() FIXED Now uses History.ToArray().LastOrDefault(...) — snapshot before enumeration
2 🟡 No timestamp filter on history fallback FIXED var dispatchTime = DateTime.UtcNow captured before dispatch + m.Timestamp >= dispatchTime filter applied
3 🟡 Bare _ = Task.Run(...) convention violation FIXED Now wrapped with SafeFireAndForget(Task.Run(...), "leftover-prompt-delivery")
4 🟡 Skill-validator hardcoded platform + silent errors FIXED Full uname -s/uname -m detection (Darwin-arm64, Darwin-x86_64, Linux-aarch64, Linux-x86_64) with exit 1 for unsupported. tar xzf no longer silences errors.

Design Review (unchanged from Round 2 — all clean)

  • ✅ Force delegation correctly scoped to iteration 1 only (after plan + nudge both fail)
  • ✅ Semaphore fix: leftover prompts collected synchronously in finally, delivered after release with 30s CancellationTokenSource
  • WaitForAsync now throws TimeoutException — genuine fix for misleading test failures
  • ✅ GC triple-collect with Forced, blocking: true is the canonical finalizer pipeline pattern
  • allWorkersAttempted correctly computed inline at both callsites in separate scopes

Minor Notes (not blocking)

  • No SHA256 checksum on skill-validator download — gh release download from authenticated GitHub provides reasonable integrity. Could add in a follow-up.
  • Leftover delivery theoretically races with a new reflect loop on the same orchestrator session, but the 30s timeout + SafeFireAndForget wrapper limits blast radius. Acceptable tradeoff.

Verdict: ✅ Approve — All findings addressed. Clean fixes, no new issues introduced.

Review by AI consensus (Round 3)

PureWeen and others added 8 commits March 11, 2026 15:13
Apply all 6 improvements from the Skill Validator consensus report (6.5/10 → IMPROVE):

SKILL.md:
- Add code example for INV-3 (generation guard pattern)
- Clarify INV-7: explicit guidance for new cross-thread fields vs pre-existing gap
- Add 'Diagnosing a Stuck Session' section with symptom table and diagnostic steps
- Add symptom-based trigger phrases to description (spinner stuck, hung after tool use)
- Add test file pointer: run ProcessingWatchdogTests.cs after changes

ModelCapabilities.cs:
- Add optional Claude Code live trigger testing to Anthropic evaluator prompt
- Handle auth failures gracefully (skip and note, don't retry)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add 3rd worker (Eval Generator) to the Skill Validator multi-agent preset:
- Worker-3 reads SKILL.md and generates tests/eval.yaml with positive/negative scenarios
- Updated RoutingContext for 2-phase dispatch (generate evals → run evaluators)
- Updated SharedContext with eval.yaml format reference
- Updated test assertions for 3 workers

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The orchestrator was using task()/bash()/view() tools itself instead of
emitting @worker:/@EnD blocks for PolyPilot to dispatch to worker sessions.

Changes to RoutingContext:
- Added explicit anti-self-work warning at the top
- Listed full worker session names (not just 'worker-1')
- Added concrete @worker block examples for each phase
- Removed 'check if eval.yaml exists' instruction that invited tool use
- Added 'NEVER use tools yourself' rule

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…urns

The Copilot CLI agent has full tool access and ignores 'dispatcher only'
instructions, doing all the work itself via task()/bash()/view() before
the orchestrator response completes. By the time ParseTaskAssignments runs,
the @worker blocks are buried in a multi-minute tool session.

Fix: FlushCurrentResponse now checks accumulated FlushedResponse for
@worker:...@EnD blocks when EarlyDispatchOnWorkerBlocks is set. If found,
it resolves the ResponseCompletion TCS immediately so ParseTaskAssignments
can extract and dispatch workers while the orchestrator is still running.

The flag is set for all orchestrator planning prompts (reflect, non-reflect,
and nudge paths).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Workers now use preset-defined display names (e.g., dotnet-validator,
anthropic-validator, eval-generator) instead of generic worker-1/2/3.
Updated RoutingContext examples and consensus rules to match.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
After early dispatch resolves the TCS, the orchestrator continues doing
tool work. The reflect loop must wait for it to go idle before sending
the synthesis prompt, otherwise SendPromptAsync throws 'already processing'.
Added WaitForSessionIdleAsync helper used in both reflect and non-reflect flows.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
After early dispatch, the orchestrator may hit the zero-idle bug where
the SDK never emits SessionIdleEvent. Instead of waiting the full 300s
timeout, check LastEventAtTicks — if no SDK events for 60s, abort the
orchestrator and proceed to synthesis immediately.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
When the reflect loop's iteration > 1 returns 0 assignments, the code
used to declare GoalMet immediately. But with 2-phase dispatch, only
the eval-generator ran in Phase 1 — the orchestrator's replan response
didn't write @worker blocks for the validators. Now we check if all
workers have been dispatched and force-dispatch remaining ones if not.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@PureWeen PureWeen merged commit e5d2bc0 into main Mar 12, 2026
@PureWeen PureWeen deleted the fix/orchestrator-delegation-failures branch March 12, 2026 00:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant