Orchestrator session fails to restore when session is locked by Copilot CLI #334

@PureWeen

Problem

When a multi-agent orchestrator session's ID is shared with a running Copilot CLI terminal session, PolyPilot cannot restore the orchestrator on restart. The SDK's ResumeSessionAsync reports the session as "corrupted" because the events.jsonl file is locked by the CLI process.

Root Cause

  1. PolyPilot creates an orchestrator session (e.g., Implement & Challenge-orchestrator) via the Copilot SDK
  2. A Copilot CLI terminal session takes over that same session ID (writes to the same events.jsonl)
  3. The CLI process holds a file lock (inuse.<pid>.lock) on the session directory
  4. On PolyPilot restart, RestorePreviousSessionsAsync calls ResumeSessionAsync, which fails because the events file is locked by the CLI process
  5. The SDK error message says "session file is corrupted" — misleading, since the data isn't actually corrupt

Current Behavior

  • The orchestrator session silently disappears from the multi-agent group after restart
  • UI shows "Session data appears corrupted" if user manually tries to resume
  • The multi-agent group is left without an orchestrator, making it non-functional

Expected Behavior

  • PolyPilot should detect the lock conflict and do one of the following:
    • a) Show a clear message: "Session is locked by Copilot CLI (PID XXXX). Close the CLI session to restore." with an option to force-create a new session
    • b) Automatically detect the lock file, check if the owning PID is still alive, and create a fresh orchestrator session if the lock is stale
    • c) Avoid sharing session IDs between PolyPilot-managed sessions and external Copilot CLI sessions entirely
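
Options (a) and (b) can be read as one decision over the lock owners found for a session. The following is a minimal sketch of that decision, not PolyPilot code: `RestoreAction`, `RestorePlan`, and `plan_restore` are hypothetical names, and the function assumes the caller has already mapped each PID from a lock file name to whether that process is alive. Python is used only for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Mapping, Optional


class RestoreAction(Enum):
    RESUME = "resume"              # no lock present: resume as usual
    BLOCKED_BY_CLI = "blocked"     # a live process owns the lock (option a)
    CREATE_FRESH = "create_fresh"  # only stale locks remain (option b)


@dataclass(frozen=True)
class RestorePlan:
    action: RestoreAction
    message: Optional[str] = None


def plan_restore(lock_owners: Mapping[int, bool]) -> RestorePlan:
    """Decide how to restore a session.

    lock_owners maps each PID taken from a lock file name to True if that
    process is still alive (hypothetical input shape).
    """
    live = sorted(pid for pid, alive in lock_owners.items() if alive)
    if live:
        # Message wording follows the issue text, with the PID filled in.
        return RestorePlan(
            RestoreAction.BLOCKED_BY_CLI,
            f"Session is locked by Copilot CLI (PID {live[0]}). "
            "Close the CLI session to restore.",
        )
    if lock_owners:
        # Lock files exist but none of their owners is alive.
        return RestorePlan(
            RestoreAction.CREATE_FRESH,
            "Lock is stale; creating a fresh orchestrator session.",
        )
    return RestorePlan(RestoreAction.RESUME)
```

Keeping the decision separate from file and process access makes it straightforward to unit test, and the force-create option mentioned in (a) could be offered in the UI whenever the plan is `BLOCKED_BY_CLI`.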

Steps to Reproduce

  1. Create a multi-agent team (e.g., "MultiAgentPRetty") with OrchestratorReflect mode
  2. Open a Copilot CLI terminal that uses the orchestrator's session ID
  3. Restart PolyPilot (e.g., via relaunch.ps1)
  4. Observe the orchestrator session is missing from the group

Technical Details

  • Session ID: 74895dba-fd61-4a73-9a5d-4576a146aa0b
  • Lock file: ~/.copilot/session-state/<id>/inuse.<pid>.lock
  • Events file: 15 MB events.jsonl (actively written by the CLI)
  • The IsCorruptSessionError check in SessionSidebar.razor catches this case
  • Current fallback in RestorePreviousSessionsAsync only handles "Session not found", not corruption/lock errors
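
To illustrate the gap in the fallback, here is a hedged sketch of a classifier that separates the three outcomes. It assumes the only available signal is the error text, matched against the substrings quoted in this issue ("Session not found", "corrupted"), plus a flag saying whether a live lock owner was detected. `RestoreFailure` and `classify_restore_failure` are hypothetical names, and the real SDK messages may differ.

```python
from enum import Enum


class RestoreFailure(Enum):
    NOT_FOUND = "not_found"  # the case the current fallback already handles
    LOCKED = "locked"        # reported as "corrupted" while another process holds the lock
    CORRUPT = "corrupt"      # reported as "corrupted" with no live lock owner
    UNKNOWN = "unknown"


def classify_restore_failure(error_message: str, has_live_lock: bool) -> RestoreFailure:
    """Classify a resume failure from its message text (assumed substrings)."""
    text = error_message.lower()
    if "session not found" in text:
        return RestoreFailure.NOT_FOUND
    if "corrupt" in text:
        # The issue reports that a locked events file surfaces as "corrupted",
        # so a live lock owner is treated as the more likely explanation.
        return RestoreFailure.LOCKED if has_live_lock else RestoreFailure.CORRUPT
    return RestoreFailure.UNKNOWN
```

With a distinct `LOCKED` result, the restore path can show an actionable message instead of dropping the orchestrator silently.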

Suggested Approach

  1. In RestorePreviousSessionsAsync, before calling ResumeSessionAsync, check for inuse.*.lock files in the session directory. If a lock exists and the PID is alive, skip resume and log a warning.
  2. For multi-agent orchestrator sessions specifically, consider always creating a fresh session on restore (orchestrators are stateless planners — conversation history isn't critical).
  3. Improve the SDK error message or add PolyPilot-side lock detection to give users actionable information.
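
Step 1 can be sketched as follows. This is an illustration under stated assumptions, not the actual implementation: it assumes lock files are named exactly `inuse.<pid>.lock` inside the session directory, as described above, and it uses a POSIX-only liveness probe (`os.kill(pid, 0)`). Do not use that probe on Windows, where `os.kill` behaves differently; a .NET implementation would need its own process lookup. The function names are hypothetical.

```python
import os
import re
from pathlib import Path
from typing import Callable, List

# Assumed lock file naming, taken from the issue: inuse.<pid>.lock
LOCK_NAME = re.compile(r"^inuse\.(\d+)\.lock$")


def pid_is_alive(pid: int) -> bool:
    """POSIX-only: signal 0 performs an existence check without delivering a signal."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def lock_pids(session_dir: Path) -> List[int]:
    """Return the PIDs encoded in inuse.<pid>.lock names, sorted ascending."""
    pids = []
    for entry in session_dir.glob("inuse.*.lock"):
        match = LOCK_NAME.match(entry.name)
        if match:  # skip names whose middle part is not numeric
            pids.append(int(match.group(1)))
    return sorted(pids)


def live_lock_owners(
    session_dir: Path, is_alive: Callable[[int], bool] = pid_is_alive
) -> List[int]:
    """PIDs that hold a lock on the session and are still running."""
    return [pid for pid in lock_pids(session_dir) if is_alive(pid)]
```

If `live_lock_owners` returns a non-empty list, the restore path would skip the resume call and log a warning; an empty list with lock files present indicates a stale lock. The `is_alive` parameter is injectable so the scan can be tested without real processes.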

Labels: external (Upstream bug or dependency issue)
