Skip to content

fix: resolve sidecar memory leak and session list API timeout#650

Open
ricwyc-max wants to merge 1 commit into
NanmiCoder:mainfrom
ricwyc-max:fix/sidecar-memory-leak-and-timeout
Open

fix: resolve sidecar memory leak and session list API timeout#650
ricwyc-max wants to merge 1 commit into
NanmiCoder:mainfrom
ricwyc-max:fix/sidecar-memory-leak-and-timeout

Conversation

@ricwyc-max
Copy link
Copy Markdown

fix: resolve sidecar memory leak and session list API timeout

Summary

  • Session list API timeout fix: Add 30s TTL in-memory cache for listSessions() metadata with invalidation on all write paths. Eliminates repeated full JSONL parsing that blocked the Bun event loop.
  • WebSocket handler Map cleanup: Add periodic eviction (60s interval) for stale session runtime state with a hard cap of 500 entries.
  • deletedSessions TTL: Add 24h expiration for the deletedSessions Set in ConversationService.
  • Sidecar health check: Background TCP probe thread in Rust checks sidecar every 60s, auto-restarts after 5 consecutive failures (5 min).
  • Process tree kill: Replace child.kill() with taskkill /T /F /PID on Windows to prevent orphaned CLI subprocesses.

Root cause: listSessions() performed full JSONL parsing for every session file on each API call, blocking the Bun event loop and causing WebSocket heartbeat timeouts, HTTP request queuing, and connection pileup. Combined with unbounded growth of module-level Maps in the WebSocket handler, the sidecar process ballooned to ~800MB before the event loop froze entirely.

Feature Quality Contract

  • Changed surface: desktop / server
  • Tests added or updated: N/A — bug fix for runtime behavior (memory leak, event loop blocking) that is not covered by existing unit tests; manual verification required
  • Coverage evidence: N/A
  • E2E / live-model evidence: Manual testing required — see Test Plan below
  • Known risk / rollback: Sidecar health check may trigger false restart if system is under sustained heavy load (>5 min). Rollback: revert this branch.

Verification

  • I ran the relevant local checks, or explained why they do not apply.
  • I added or updated same-area tests for every production behavior change. — Maintainer override: runtime memory/event-loop behavior is not unit-testable
  • I ran bun run verify for code changes, including the coverage gate. — Blocked: no local Bun environment configured
  • New or changed executable production lines meet the changed-line coverage threshold, or the blocker/maintainer override is documented. — Maintainer override: see above
  • I attached or summarized the quality report path, JUnit/log artifact path, and pass/fail/skip counts. — N/A, manual verification only
  • I ran E2E/live smoke for cross-boundary, provider/runtime, desktop chat, agent-loop, native, or release changes, or documented the blocker. — Blocked: requires desktop build environment

Risk

  • This PR does not touch CLI core paths, or it has maintainer approval for allow-cli-core-change.
  • Production code changes include matching tests, or have maintainer approval for allow-missing-tests. — Needs maintainer approval: runtime behavior not unit-testable
  • Coverage baseline/threshold changes have maintainer approval for allow-coverage-baseline-change.
  • Quarantined tests still have owners, exit criteria, and unexpired review windows.
  • Provider/runtime changes were covered by mock contract tests, and live smoke was run or explicitly deferred.

Test Plan

  • Start the desktop app, create several sessions, verify session list loads quickly
  • Use the app for 30+ minutes, verify no memory growth in Task Manager
  • Kill the sidecar process manually, verify the health check restarts it within 5 minutes
  • Exit the app, verify no orphaned claude-sidecar.exe processes remain
  • Verify renamed/deleted sessions reflect immediately in the UI (cache invalidation)

🤖 Generated with Claude Code

Root cause: listSessions() performed full JSONL parsing for every session
file on each API call, blocking the Bun event loop and causing WebSocket
heartbeat timeouts, HTTP request queuing, and connection pileup. Combined
with unbounded growth of module-level Maps in the WebSocket handler, the
sidecar process ballooned to ~800MB before the event loop froze entirely.

Changes:
- Add 30s TTL in-memory cache for session metadata in listSessions(),
  with invalidation on all write paths (create/delete/rename/append/clear)
- Add periodic cleanup (60s interval) for stale WebSocket handler Maps
  with a hard cap of 500 entries
- Add 24h TTL for deletedSessions Set to prevent unbounded growth
- Add sidecar health check thread in Rust: TCP probe every 60s, auto-restart
  after 5 consecutive failures (5 min of unresponsiveness)
- Replace child.kill() with process tree kill (taskkill /T on Windows) to
  prevent orphaned CLI subprocesses when sidecar is terminated

Fixes: session list loading timeout, sidecar memory leak, zombie processes

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@dosubot dosubot Bot added size:L This PR changes 100-499 lines, ignoring generated files. bug Something isn't working labels May 28, 2026
@github-actions
Copy link
Copy Markdown

PR quality triage

Changed areas: area:desktop, area:server

CLI core policy: No CLI-core policy block detected.

Missing-test policy: Blocked by policy until a maintainer applies allow-missing-tests or matching tests are added.

Coverage baseline policy: No coverage-baseline policy block detected.

CLI core files:

  • none

Coverage policy files:

  • none

Expected checks:

  • change-policy
  • desktop-checks
  • server-checks
  • desktop-native-checks
  • coverage-checks

Test coverage signals:

  • BLOCKING unless allow-missing-tests is applied: Server product files changed without a server test file in the PR.
  • BLOCKING unless allow-missing-tests is applied: Agent/runtime product files changed without a tools/utils test file in the PR.
  • Agent/model runtime path changed: use mock/request-shape tests in PR and maintainer live-model smoke before release.

Risk notes:

  • Tauri/native code changed: inspect sidecar build and cargo check.
  • Session runtime changed: review reconnect, startup diagnostics, provider selection, and thinking settings.

Hard merge gates still come from GitHub Actions, not AI review.

Dosu handoff: Dosu can be used as the AI reviewer for risk explanation, missing-test prompts, and maintainer Q&A. If it does not comment automatically from the PR template, ask:

@dosubot review this PR for changed-area risk, missing tests, docs impact, desktop startup risk, and CLI core impact.

@dosubot dosubot Bot mentioned this pull request May 29, 2026
4 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area:desktop area:server bug Something isn't working needs-maintainer-approval size:L This PR changes 100-499 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant