fix(controlplane): write FATAL ErrorResponse before closing client conn on worker crash#540
Open
fix(controlplane): write FATAL ErrorResponse before closing client conn on worker crash#540
Conversation
…nn on worker crash
When a worker is reaped (3 consecutive health-check failures →
"K8s worker unresponsive, deleting pod"), OnWorkerCrash already finds
the affected sessions, marks executors dead, and closes the client's
TCP. But it didn't deliver a pgwire FATAL ErrorResponse — clients got
a bare TCP close.
psql tolerates this (its read loop returns, prints "connection lost",
exits). But dbt's psycopg2/libpq adapter — combined with the
keepalives_idle: 0 setting in dbt's profile — does not. PQconsumeInput
parks waiting for a response that never comes; the dbt CLI loop never
wakes; the user sees a process that's "alive but doing nothing"
forever, with no error and no exit.
Reproduced today on the prod-us cluster running a dbt run with
--threads 8 against the dev_duckgres_eric target:
18:12:24 dbt starts CTAS for billing_finalized_line_items_delta_usage
(79M-row model) on worker 42109
18:13:35 last dbt log line — DAG bottleneck, 7/8 threads idle
waiting on the long model
18:35:12 worker 42109: "Failed to refresh worker S3 credentials"
(separate bug, not addressed here)
18:41:31 worker 42109 reaped: consecutive_failures=3
18:41:31+ no FATAL delivered, dbt remains alive and silent
for the next 28+ minutes until manually killed
Then a controlled repro (kubectl delete pod on a worker mid-query)
showed the same shape: psql exits cleanly with "connection lost"; CP
logs show OnWorkerCrash firing and the TCP being closed; but no
WriteErrorResponse was on the path.
This change:
- Adds a SessionConn interface (io.Writer + io.Closer) to the
ManagedSession struct, replacing the bare io.Closer field. *tls.Conn
from the pgwire handshake satisfies it without any caller change.
- Adds a SetSessionConn entrypoint matching the new type. SetConnCloser
is kept as a deprecated shim so existing callers compile unchanged
(it upcasts to SessionConn when the closer is also a Writer, which
*tls.Conn is, and falls back to a discarding writer otherwise).
- In OnWorkerCrash, calls server.WriteErrorResponse with severity FATAL
and SQLSTATE 08006 (connection_failure) before closing the TCP. The
message names the worker ID that died so it can be correlated with
CP logs.
Concurrency note baked into the comment: the message loop also writes
to this conn via its own bufio.Writer. *tls.Conn / net.Conn serialize
Write calls internally so we don't byte-interleave, but a FATAL packet
written mid-DataRow will malform the in-flight message. That's still
strictly better than the bare-close behavior — the client surfaces an
error cleanly instead of hanging — and the typical case (CTAS in
flight, message loop blocked on Flight RPC) has nothing in flight on
the wire to corrupt.
Tests:
- New TestOnWorkerCrash_WritesFATALBeforeClose asserts the wire bytes
contain 'E' (ErrorResponse opcode), 'FATAL', and the worker id.
- mockCloser now also implements io.Writer with byte capture so other
existing tests are unaffected; field references updated from
connCloser to conn.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
When a worker is reaped via
K8s worker unresponsive, deleting pod,OnWorkerCrashalready finds affected sessions and closes the client TCP — but doesn't deliver a pgwire FATAL ErrorResponse. psql tolerates the bare close; dbt's psycopg2 adapter (withkeepalives_idle: 0) parksPQconsumeInputindefinitely and the dbt process hangs silent forever.Reproduced
dbt run --target dev_duckgres_eric --threads 8on prod-us:Controlled
kubectl delete podmid-query confirms:OnWorkerCrashruns in <100µs and closes the conn, but noWriteErrorResponseis on the path.Changes
controlplane/session_mgr.go:SessionConninterface (io.Writer + io.Closer);*tls.Connalready satisfies it.OnWorkerCrashcallsserver.WriteErrorResponse(conn, "FATAL", "08006", "worker N became unresponsive and was reaped")beforeClose().SetConnCloserkept as deprecated shim so callers compile unchanged.Concurrency note
*tls.Connserializes Writes internally — no byte-interleaving with the message loop'sbufio.Writer. A FATAL written mid-DataRow may malform an in-flight message, but that still surfaces a clean error on the client vs. the current silent hang. The wedge case (message loop blocked on Flight RPC) has nothing in flight to corrupt.Tests
TestOnWorkerCrash_WritesFATALBeforeCloseasserts wire bytes start withE, containFATAL, and contain the worker id.go test ./controlplane/passes;go vet/go buildclean.Out of scope (still open from same repro)
🤖 Generated with Claude Code