Skip to content

Conversation

@ThomasK33
Copy link
Member

Summary

Fixes a bug where the TCP transport removed clients without notifying the higher-level server module, causing claudecode.server.init's state.clients mirror to accumulate stale entries over time.

Key changes:

  • Centralize disconnect behavior via tcp._disconnect_client so on_disconnect fires for EOF, read errors, protocol errors, WebSocket CLOSE frames, and keepalive timeouts.
  • Make disconnect handling idempotent to avoid double-callbacks on multi-path teardown.
  • Add regression tests and enhance the TCP mock to simulate read_start callbacks.

Fixes #170.

Testing

  • nix develop -c make check
  • nix develop -c make test

📋 Implementation Plan

Plan: Investigate + fix Issue #170 (client tables out of sync)

Context / Why

GitHub Issue #170 reports that the WebSocket server keeps two Lua tables of connected clients and they drift out of sync, leaving “phantom” clients in server/init.lua and causing unbounded growth over time (memory leak / misleading debug state).

This should be treated as a bug: the two tables are intended to represent the same set of live connections, but the TCP layer doesn’t reliably notify the higher-level server module on abrupt disconnect paths (EOF/errors/timeouts).

Evidence (what I checked)

  • Issue report: [BUG] Two tables with connected clients #170 (two client tables; out-of-sync after Ctrl-C termination).
  • lua/claudecode/server/init.lua
    • Maintains M.state.clients.
    • Populates it only via callbacks:
      • on_connect: M.state.clients[client.id] = client (line ~56)
      • on_disconnect: M.state.clients[client.id] = nil (line ~74)
  • lua/claudecode/server/tcp.lua
    • Owns the transport-level server.clients map.
    • In client_tcp:read_start(...), EOF (not data) and read errors call only M._remove_client(server, client) and do not call server.on_disconnect(...) (lines ~125–136).
    • In the WebSocket close-frame handler it does call server.on_disconnect(...) and then _remove_client(...) (lines ~139–147).
    • Ping timeout path also calls _remove_client(...) without on_disconnect (lines ~283–297).
Why this causes “phantom” clients

server/init.lua is only updated through on_connect/on_disconnect. When the CLI is killed (Ctrl-C), the server usually sees a TCP EOF, not a WebSocket CLOSE frame; the TCP module removes the client from server.clients but never fires on_disconnect, so M.state.clients retains a stale entry forever.

Plan (fix)

1) Make disconnect notification consistent (core fix)

Goal: whenever a client is removed from server.clients, server.on_disconnect(client, code, reason) should be invoked exactly once.

  • In lua/claudecode/server/tcp.lua, introduce a single helper responsible for cleanup + notification, e.g.:
    • M._disconnect_client(server, client, code, reason)
    • or extend M._remove_client(server, client, code, reason) to optionally notify.
  • Ensure the helper is idempotent by guarding on server.clients[client.id] (only notify/remove if still present). This prevents double-calling if a CLOSE frame is processed and later a TCP EOF arrives.
  • Route all removal paths through this helper:
    • TCP read_start error branch
    • TCP EOF branch
    • WebSocket CLOSE frame callback (replace the current “call on_disconnect then _remove_client” with the helper)
    • WebSocket protocol/error callback (on_error branch) since it currently removes without disconnect notification
    • Keepalive timeout branch in start_ping_timer.

Close code/reason conventions (keep simple):

  • EOF / read error / protocol error / keepalive timeout: use WebSocket “abnormal closure” 1006 with a short reason string (e.g., "EOF", "Client read error: ...", "Connection timeout").
  • WebSocket CLOSE frame: preserve provided code and reason.

Defensive checks (per repo preference):

  • Add assert(type(server) == "table"), assert(type(server.on_disconnect) == "function"), assert(type(client) == "table" and type(client.id) == "string") in the helper.

2) Add regression tests (so it can’t come back)

Because the unit test suite runs with a mocked vim.loop, we need to simulate the EOF/error callback.

  • Update tests/mocks/vim.lua TCP stub so read_start(cb) stores cb on the handle (e.g., self._read_cb = cb).
  • Add a new unit test file (e.g., tests/unit/tcp_server_spec.lua) that:
    1. Creates a TCP server via require("claudecode.server.tcp").create_server(...) with spy callbacks.
    2. Manually calls tcp._handle_new_connection(server).
    3. Captures the connected client via on_connect.
    4. Simulates abrupt termination by invoking client.tcp_handle._read_cb(nil, nil).
    5. Asserts on_disconnect was called once and server.clients[client.id] is nil.

Optional (nice-to-have) extra tests:

  • Simulate read_start("some error", nil) to ensure on_disconnect fires on read errors.
  • Stub claudecode.server.client.process_data to force the tcp.lua “protocol error” callback and ensure disconnect notification occurs.

3) Clarify design intent (small docs/comments)

The existence of two tables isn’t inherently wrong: tcp.lua needs an internal map for transport operations, while server/init.lua keeps “server module state”. The bug is that they weren’t kept in sync.

  • Add a short comment near the type annotations in both files clarifying:
    • tcp.lua: server.clients is the canonical transport client registry.
    • server/init.lua: M.state.clients is a mirrored view updated via connect/disconnect callbacks (kept for API/debugging/back-compat).

4) Verify

  • Run make test and make check.
  • Manual sanity check in real Neovim:
    • Start a Claude Code session, then terminate it via Ctrl-C.
    • Confirm both registries report no remaining clients (and no growth over repeated connect/kill cycles).

Generated with mux • Model: openai:gpt-5.2 • Thinking: xhigh

Change-Id: I53259ea2f51407e4bde33258ba0639cf00b2c29a
Signed-off-by: Thomas Kosiewski <tk@coder.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[BUG] Two tables with connected clients

1 participant