Skip to content

fix(controlplane): write FATAL ErrorResponse before closing client conn on worker crash#540

Open
EDsCODE wants to merge 2 commits intomainfrom
fix/notify-sessions-on-worker-reap
Open

fix(controlplane): write FATAL ErrorResponse before closing client conn on worker crash#540
EDsCODE wants to merge 2 commits intomainfrom
fix/notify-sessions-on-worker-reap

Conversation

@EDsCODE
Copy link
Copy Markdown
Contributor

@EDsCODE EDsCODE commented May 7, 2026

Summary

When a worker is reaped via K8s worker unresponsive, deleting pod, OnWorkerCrash already finds affected sessions and closes the client TCP — but doesn't deliver a pgwire FATAL ErrorResponse. psql tolerates the bare close; dbt's psycopg2 adapter (with keepalives_idle: 0) parks PQconsumeInput indefinitely and the dbt process hangs silent forever.

Reproduced

dbt run --target dev_duckgres_eric --threads 8 on prod-us:

18:12:24  dbt starts CTAS for 79M-row model on worker 42109
18:13:35  last dbt log line
18:41:31  worker 42109 reaped (consecutive_failures=3)
18:41:31  Worker crashed, notifying sessions. sessions=1 pids=[1023]
          [ TCP closed, no FATAL written, dbt silent 28+ min ]

Controlled kubectl delete pod mid-query confirms: OnWorkerCrash runs in <100µs and closes the conn, but no WriteErrorResponse is on the path.

Changes

controlplane/session_mgr.go:

  • New SessionConn interface (io.Writer + io.Closer); *tls.Conn already satisfies it.
  • OnWorkerCrash calls server.WriteErrorResponse(conn, "FATAL", "08006", "worker N became unresponsive and was reaped") before Close().
  • SetConnCloser kept as deprecated shim so callers compile unchanged.

Concurrency note

*tls.Conn serializes Writes internally — no byte-interleaving with the message loop's bufio.Writer. A FATAL written mid-DataRow may malform an in-flight message, but that still surfaces a clean error on the client vs. the current silent hang. The wedge case (message loop blocked on Flight RPC) has nothing in flight to corrupt.

Tests

  • New TestOnWorkerCrash_WritesFATALBeforeClose asserts wire bytes start with E, contain FATAL, and contain the worker id.
  • go test ./controlplane/ passes; go vet/go build clean.

Out of scope (still open from same repro)

🤖 Generated with Claude Code

EDsCODE and others added 2 commits May 6, 2026 22:02
…nn on worker crash

When a worker is reaped (3 consecutive health-check failures →
"K8s worker unresponsive, deleting pod"), OnWorkerCrash already finds
the affected sessions, marks executors dead, and closes the client's
TCP. But it didn't deliver a pgwire FATAL ErrorResponse — clients got
a bare TCP close.

psql tolerates this (its read loop returns, prints "connection lost",
exits). But dbt's psycopg2/libpq adapter — combined with the
keepalives_idle: 0 setting in dbt's profile — does not. PQconsumeInput
parks waiting for a response that never comes; the dbt CLI loop never
wakes; the user sees a process that's "alive but doing nothing"
forever, with no error and no exit.

Reproduced today on the prod-us cluster running a dbt run with
--threads 8 against the dev_duckgres_eric target:
  18:12:24  dbt starts CTAS for billing_finalized_line_items_delta_usage
            (79M-row model) on worker 42109
  18:13:35  last dbt log line — DAG bottleneck, 7/8 threads idle
            waiting on the long model
  18:35:12  worker 42109: "Failed to refresh worker S3 credentials"
            (separate bug, not addressed here)
  18:41:31  worker 42109 reaped: consecutive_failures=3
  18:41:31+ no FATAL delivered, dbt remains alive and silent
            for the next 28+ minutes until manually killed

Then a controlled repro (kubectl delete pod on a worker mid-query)
showed the same shape: psql exits cleanly with "connection lost"; CP
logs show OnWorkerCrash firing and the TCP being closed; but no
WriteErrorResponse was on the path.

This change:
- Adds a SessionConn interface (io.Writer + io.Closer) to the
  ManagedSession struct, replacing the bare io.Closer field. *tls.Conn
  from the pgwire handshake satisfies it without any caller change.
- Adds a SetSessionConn entrypoint matching the new type. SetConnCloser
  is kept as a deprecated shim so existing callers compile unchanged
  (it upcasts to SessionConn when the closer is also a Writer, which
  *tls.Conn is, and falls back to a discarding writer otherwise).
- In OnWorkerCrash, calls server.WriteErrorResponse with severity FATAL
  and SQLSTATE 08006 (connection_failure) before closing the TCP. The
  message names the worker ID that died so it can be correlated with
  CP logs.

Concurrency note baked into the comment: the message loop also writes
to this conn via its own bufio.Writer. *tls.Conn / net.Conn serialize
Write calls internally so we don't byte-interleave, but a FATAL packet
written mid-DataRow will malform the in-flight message. That's still
strictly better than the bare-close behavior — the client surfaces an
error cleanly instead of hanging — and the typical case (CTAS in
flight, message loop blocked on Flight RPC) has nothing in flight on
the wire to corrupt.

Tests:
- New TestOnWorkerCrash_WritesFATALBeforeClose asserts the wire bytes
  contain 'E' (ErrorResponse opcode), 'FATAL', and the worker id.
- mockCloser now also implements io.Writer with byte capture so other
  existing tests are unaffected; field references updated from
  connCloser to conn.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant