fix(controlplane): write FATAL ErrorResponse before closing client conn on worker crash by EDsCODE · Pull Request #540 · PostHog/duckgres

EDsCODE · 2026-05-07T02:03:24Z

Summary

When a worker is reaped via K8s worker unresponsive, deleting pod, OnWorkerCrash already finds affected sessions and closes the client TCP — but doesn't deliver a pgwire FATAL ErrorResponse. psql tolerates the bare close; dbt's psycopg2 adapter (with keepalives_idle: 0) parks PQconsumeInput indefinitely and the dbt process hangs silent forever.

Reproduced

dbt run --target dev_duckgres_eric --threads 8 on prod-us:

18:12:24  dbt starts CTAS for 79M-row model on worker 42109
18:13:35  last dbt log line
18:41:31  worker 42109 reaped (consecutive_failures=3)
18:41:31  Worker crashed, notifying sessions. sessions=1 pids=[1023]
          [ TCP closed, no FATAL written, dbt silent 28+ min ]

Controlled kubectl delete pod mid-query confirms: OnWorkerCrash runs in <100µs and closes the conn, but no WriteErrorResponse is on the path.

Changes

controlplane/session_mgr.go:

New SessionConn interface (io.Writer + io.Closer); *tls.Conn already satisfies it.
OnWorkerCrash calls server.WriteErrorResponse(conn, "FATAL", "08006", "worker N became unresponsive and was reaped") before Close().
SetConnCloser kept as deprecated shim so callers compile unchanged.

Concurrency note

*tls.Conn serializes Writes internally — no byte-interleaving with the message loop's bufio.Writer. A FATAL written mid-DataRow may malform an in-flight message, but that still surfaces a clean error on the client vs. the current silent hang. The wedge case (message loop blocked on Flight RPC) has nothing in flight to corrupt.

Tests

New TestOnWorkerCrash_WritesFATALBeforeClose asserts wire bytes start with E, contain FATAL, and contain the worker id.
go test ./controlplane/ passes; go vet/go build clean.

Out of scope (still open from same repro)

Credential-refresh "already activated for org" loop (PR fix(worker): rotate STS secret on a side conn, off the activation lock #516 deadlock fix didn't close this state-machine path).
Health-check 3×~3s threshold too sharp for long DDL.

🤖 Generated with Claude Code

…nn on worker crash When a worker is reaped (3 consecutive health-check failures → "K8s worker unresponsive, deleting pod"), OnWorkerCrash already finds the affected sessions, marks executors dead, and closes the client's TCP. But it didn't deliver a pgwire FATAL ErrorResponse — clients got a bare TCP close. psql tolerates this (its read loop returns, prints "connection lost", exits). But dbt's psycopg2/libpq adapter — combined with the keepalives_idle: 0 setting in dbt's profile — does not. PQconsumeInput parks waiting for a response that never comes; the dbt CLI loop never wakes; the user sees a process that's "alive but doing nothing" forever, with no error and no exit. Reproduced today on the prod-us cluster running a dbt run with --threads 8 against the dev_duckgres_eric target: 18:12:24 dbt starts CTAS for billing_finalized_line_items_delta_usage (79M-row model) on worker 42109 18:13:35 last dbt log line — DAG bottleneck, 7/8 threads idle waiting on the long model 18:35:12 worker 42109: "Failed to refresh worker S3 credentials" (separate bug, not addressed here) 18:41:31 worker 42109 reaped: consecutive_failures=3 18:41:31+ no FATAL delivered, dbt remains alive and silent for the next 28+ minutes until manually killed Then a controlled repro (kubectl delete pod on a worker mid-query) showed the same shape: psql exits cleanly with "connection lost"; CP logs show OnWorkerCrash firing and the TCP being closed; but no WriteErrorResponse was on the path. This change: - Adds a SessionConn interface (io.Writer + io.Closer) to the ManagedSession struct, replacing the bare io.Closer field. *tls.Conn from the pgwire handshake satisfies it without any caller change. - Adds a SetSessionConn entrypoint matching the new type. SetConnCloser is kept as a deprecated shim so existing callers compile unchanged (it upcasts to SessionConn when the closer is also a Writer, which *tls.Conn is, and falls back to a discarding writer otherwise). - In OnWorkerCrash, calls server.WriteErrorResponse with severity FATAL and SQLSTATE 08006 (connection_failure) before closing the TCP. The message names the worker ID that died so it can be correlated with CP logs. Concurrency note baked into the comment: the message loop also writes to this conn via its own bufio.Writer. *tls.Conn / net.Conn serialize Write calls internally so we don't byte-interleave, but a FATAL packet written mid-DataRow will malform the in-flight message. That's still strictly better than the bare-close behavior — the client surfaces an error cleanly instead of hanging — and the typical case (CTAS in flight, message loop blocked on Flight RPC) has nothing in flight on the wire to corrupt. Tests: - New TestOnWorkerCrash_WritesFATALBeforeClose asserts the wire bytes contain 'E' (ErrorResponse opcode), 'FATAL', and the worker id. - mockCloser now also implements io.Writer with byte capture so other existing tests are unaffected; field references updated from connCloser to conn. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

EDsCODE and others added 2 commits May 6, 2026 22:02

fix(controlplane): avoid session lock during crash notification

660b78b

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(controlplane): write FATAL ErrorResponse before closing client conn on worker crash#540

fix(controlplane): write FATAL ErrorResponse before closing client conn on worker crash#540
EDsCODE wants to merge 2 commits intomainfrom
fix/notify-sessions-on-worker-reap

EDsCODE commented May 7, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

EDsCODE commented May 7, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Reproduced

Changes

Concurrency note

Tests

Out of scope (still open from same repro)

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

EDsCODE commented May 7, 2026 •

edited

Loading