Skip to content

CLI crashes during parallel background agent execution (OOM/TypeScript error in long sessions) #2132

@harsha549

Description

@harsha549

Describe the bug

During extended sessions with parallel background agents (launched via the task tool), the CLI crashes abruptly mid-execution with what appears to be a TypeScript/JavaScript runtime error. The session terminates without graceful shutdown, losing any in-flight agent work.

Pattern observed across 3 separate sessions:

  • Session runs for 30-60 minutes with complex multi-step builds
  • 4-7 background agents launched in parallel waves (via task tool with mode: "background")
  • Context window accumulates significantly (large plan + agent results + file verification)
  • CLI suddenly exits with a script/TypeScript error — no clean shutdown, no checkpoint

The crash appears correlated with:

  1. High parallel agent count (4+ simultaneous background agents)
  2. Large context window accumulation (no automatic compaction triggered before crash)
  3. Heavy events.jsonl write activity from multiple concurrent agent completions

Environment

  • OS: Windows 11 (PowerShell 7)
  • CLI Version: GitHub Copilot CLI 1.0.x (latest via winget)
  • Models used: Claude Opus 4.6, Claude Sonnet 4.6 (switched mid-session)
  • Session duration at crash: ~45-60 minutes
  • Parallel agents at time of crash: 4-7 background agents

Steps to reproduce

  1. Start copilot in a large workspace (30+ repos)
  2. Create a complex plan with 30+ todos tracked in SQLite
  3. Launch 4+ background agents simultaneously via task tool (mode: "background")
  4. Allow agents to complete and accumulate results in context
  5. Launch another wave of 3-4 agents without running /compact
  6. CLI crashes within 10-20 minutes of sustained multi-agent activity

Expected behavior

  • CLI should gracefully handle parallel agent load or surface a warning when approaching memory limits
  • If crash is unavoidable, CLI should checkpoint session state before exiting
  • events.jsonl should be flushed/synced before crash to enable clean session resume

Workarounds implemented

We've built a resilience layer to mitigate this:

  • SQLite as state store (not events.jsonl) — todos, agent_log, wave_checkpoints tables survive crash
  • Capped parallel agents at 4 (down from 7) to reduce memory pressure
  • Mandatory /compact after each wave to keep context window manageable
  • JSON checkpoint files written to files/ after each wave for crash recovery
  • Session documentation enabling full state restoration in a new session

Relationship to existing issues

Suggestion

Consider exposing:

  1. A context window usage threshold that triggers automatic /compact before OOM
  2. A max-parallel-agents limit (configurable via config.json)
  3. A pre-crash checkpoint hook that saves session state atomically
  4. WAL mode for events.jsonl writes (or switch to SQLite internally for event storage)

Additional context

Detailed workaround architecture and crash recovery playbook available if helpful to the team. Happy to provide events.jsonl samples or session state artifacts from crashed sessions.

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions