Add disconnect_and_retry #753

greg-rychlewski · 2025-11-12T04:07:13Z

Some disclaimers and explanations:

I opted to do the :disconnect -> :disconnect_and_retry transformation closer to the top level than deeper in the protocol messages because I think it's easier to control things and put them in the right place. There's a lot of protcol handling functions that are shared where I'm not sure it's appropriate to retry. For example on the various reload functions.
I opted to add this only to handle_prepare and handle_begin. For the other ones I had these reasons for leaving them out:
- handle_commit: no transaction to commit after connection drops
- handle_close: no prepared statement to close after connection drops
- handle_deallocate: no cursor to clean up after connection drops
- handle_declare: no prepared statement to use for cursor after connection drops
- handle_execute: no prepared statement to use for execution after connection drops
- handle_fetch: no cursor to fetch from after connection drops
- handle_rollback: no transaction to rollback after connection drops
- handle_status: no transaction to get status of after connection drops
- ping: ping doesn't have command to run it will just reconnect the connection by virtue of pinging it so don't have to retry

greg-rychlewski · 2025-11-12T11:00:18Z

ok got the test working :)

josevalim · 2025-11-12T11:16:25Z

Beautiful!

handle_close: no prepared statement to close after connection drops

it has been a while but my understanding is that we synchronize prepared queries across connections by storing queries in the ETS table and then deleting them on close. So if connection A cannot close, we still want to try to close it on B, so we delete it from the table. You could say that our close is then pointless, as we don't close across all and we should only remove from the table, but we need to keep it working for transactions and other cases. In other words, I think we should retry. Alternatively, we don't retry, but we delete from the table sooner, but that may have other caveats.

handle_declare: no prepared statement to use for cursor after connection drops

I know we need a transaction for declare to work. But is that Postgrex requirements or a DBConnection one? 🤔 If DBConnection, you are right, there is no need to support retries on declare.

josevalim · 2025-11-12T11:18:48Z

I know we need a transaction for declare to work. But is that Postgrex requirements or a DBConnection one? 🤔 If DBConnection, you are right, there is no need to support retries on declare.

It is a DBConnection requirement, so I changed it so we don't support retry on handle_declare.

test/query_test.exs

greg-rychlewski · 2025-11-12T11:33:10Z

Beautiful!

handle_close: no prepared statement to close after connection drops

it has been a while but my understanding is that we synchronize prepared queries across connections by storing queries in the ETS table and then deleting them on close. So if connection A cannot close, we still want to try to close it on B, so we delete it from the table. You could say that our close is then pointless, as we don't close across all and we should only remove from the table, but we need to keep it working for transactions and other cases. In other words, I think we should retry. Alternatively, we don't retry, but we delete from the table sooner, but that may have other caveats.

You’re absolutely right! I’ll add it

greg-rychlewski · 2025-11-12T12:06:25Z

@josevalim after this pr is merged would you be alright if I try to submit another one handling bind messages? I was thinking about it more and probably a large portion of failures will happen from people trying to execute already cached queries. So trying to catch the closure at the bind phase is probably worth it?

josevalim · 2025-11-12T13:11:04Z

@greg-rychlewski yes, we 100% should. Feel free to merge this in or push it as part of this PR. I will make sure DBConnection allows retry on execute.

josevalim · 2025-11-12T13:12:00Z

will make sure DBConnection allows retry on execute.

We do! 🥳

lib/postgrex/protocol.ex

greg-rychlewski · 2025-11-12T19:24:21Z

ok it looks like it's all working now :). i left a couple comments where to explain a couple choices

lib/postgrex/protocol.ex

josevalim · 2025-11-13T07:01:20Z

Hi @greg-rychlewski! I have dropped some comments. I am wondering if we should move the handle_disconnect_retry inside on all operations? For example, in handle_prepare, we could change parse_describe_flush and do recv_parse_describe(s, status, query, buffer) |> retryable()? Note though we do not need to do it on savepoint mode branches. I think this will at least be safer when it comes to execute, as there is no change we accidentally rerun the whole query?

josevalim · 2025-11-13T07:06:32Z

@greg-rychlewski gah, nevermind. I played with the idea above a bit and there are issues for things like reload_prepare, which is used for fetch too, which cannot be tried.

So I think we need to mix and match approaches after all.

josevalim · 2025-11-13T07:13:09Z

Ok, given everything above, here are the final changes I propose:

diff --git a/lib/postgrex/protocol.ex b/lib/postgrex/protocol.ex
index 883de6f..a71685e 100644
--- a/lib/postgrex/protocol.ex
+++ b/lib/postgrex/protocol.ex
@@ -433,8 +433,7 @@ defmodule Postgrex.Protocol do
         handle_execute_copy(query, params, opts, s)

       false ->
-        result = handle_execute_result(query, params, opts, s)
-        handle_disconnect_retry(result)
+        handle_execute_result(query, params, opts, s)
     end
   end

@@ -2127,7 +2126,7 @@ defmodule Postgrex.Protocol do
     ]

     with :ok <- msg_send(%{s | buffer: nil}, msgs, buffer),
-         {:ok, s, buffer} <- recv_bind(s, status, buffer),
+         {:ok, s, buffer} <- recv_bind(s, status, buffer) |> handle_disconnect_retry(),
          {:ok, result, s, buffer} <- recv_execute(s, status, query, buffer),
          {:ok, s, buffer} <- recv_close(s, status, buffer),
          {:ok, s} <- recv_ready(s, status, buffer) do
@@ -2137,7 +2136,7 @@ defmodule Postgrex.Protocol do
         error_ready(s, status, err, buffer)
         |> maybe_disconnect()

-      {:disconnect, _err, _s} = disconnect ->
+      {_disconnect_or_retry, _err, _s} = disconnect ->
         disconnect
     end
   end
@@ -2164,7 +2163,7 @@ defmodule Postgrex.Protocol do
     ]

     with :ok <- msg_send(%{s | buffer: nil}, msgs, buffer),
-         {:ok, s, buffer} <- recv_bind(s, status, buffer),
+         {:ok, s, buffer} <- recv_bind(s, status, buffer) |> handle_disconnect_retry(),
          {:ok, result, s, buffer} <- recv_execute(s, status, query, buffer),
          {:ok, s} <- recv_ready(s, status, buffer) do
       {:ok, query, result, s}
@@ -2175,7 +2174,7 @@ defmodule Postgrex.Protocol do
         error_ready(s, status, err, buffer)
         |> maybe_disconnect()

-      {:disconnect, _err, _s} = disconnect ->
+      {_disconnect_or_retry, _err, _s} = disconnect ->
         disconnect
     end
   end
@@ -2317,9 +2316,7 @@ defmodule Postgrex.Protocol do
         recv_execute(s, status, query, rows, buffer)

       {:disconnect, _, _} = dis ->
-        with {_, %{reason: :closed} = err, s} <- dis do
-          {:disconnect, %{err | reason: :execute_closed}, s}
-        end
+        dis
     end
   end

greg-rychlewski · 2025-11-13T08:53:52Z

@josevalim Thanks for the feedback. Going to apply your suggestions but something crossed my mind after reading what you said. I think any of these things can be part of a transaction (even if :mode != :savepoint), in which case if I'm remembering the DBConnection changes correctly we would retry the entire transaction if a connection drops in the middle.

Am I reading too much into it or do you think we would potentially cause problems if people are putting side effects in their transactions. They can always change the retry option to 0 but I was wondering if it changes silently on them all of a sudden should we consider making it opt-in.

greg-rychlewski · 2025-11-13T08:57:32Z

@josevalim Btw I also think we might be able to move everything lower. I had the same reservation about the reload stuff when I first made the PR but I just looked at it again and I think reload_fetch here actually means "fetch the types again when I try to prepare and don't recognize an oid" rather than anything about cursors.

greg-rychlewski · 2025-11-13T09:19:19Z

I guess we have the :postgres value in the state. So if we move everything lower level we can not retry if its value is :transaction. Just have to be a bit careful I think because sometimes its a tuple with the last prepared query reference, I believe.

greg-rychlewski · 2025-11-13T10:19:16Z

Ok I made some changes that I believe are improvements based on all the things you said:

Moved the retry handling lower. I believe it's fine because the fetch in these situations was referring to fetching type info and not cursors
Check if the status is :transaction and don't retry if that's the case. So basically will not retry if something fails mid transaction because we don't know what all else will be retried
For handle_execute retry on both receiving bind and sending the execution message. Because most of the time it will fail on sending the message before it receives any data back I believe we need that too

lib/postgrex/protocol.ex

josevalim · 2025-11-13T11:14:11Z

@greg-rychlewski if the connection is checked out or in a transaction, DBConnection converts disconnect_and_retry back to disconnect, so that's not a concern.

In any case, I posted my review. I think we can either:

Ship this version once my comments are addressed
Push the previous version with my patch applied

I am fine either way, your call!

greg-rychlewski · 2025-11-13T13:11:52Z

@josevalim Thanks! I reverted back and applied your changes. The only thing I added was the retry handling to msg_send in addition to recv_bind in the execute handling. The reason is because if the connection has been dropped for a while it will fail there.

lib/postgrex/protocol.ex

Co-authored-by: José Valim <jose.valim@gmail.com>

greg-rychlewski added 5 commits November 11, 2025 22:59

disonnect_and_retry

dc851b8

update typespecs

88783cc

add test

9112d61

fix test

7f87735

fix typespec and format

4b0c15f

greg-rychlewski added 3 commits November 12, 2025 06:02

format

a0d38cc

format

dc48b50

format

30f019a

josevalim reviewed Nov 12, 2025

View reviewed changes

test/query_test.exs Outdated Show resolved Hide resolved

josevalim approved these changes Nov 12, 2025

View reviewed changes

add more cases

1217862

greg-rychlewski commented Nov 12, 2025

View reviewed changes