Skip to content

Conversation

@greg-rychlewski
Copy link
Member

@greg-rychlewski greg-rychlewski commented Nov 12, 2025

Some disclaimers and explanations:

  • I opted to do the :disconnect -> :disconnect_and_retry transformation closer to the top level than deeper in the protocol messages because I think it's easier to control things and put them in the right place. There's a lot of protcol handling functions that are shared where I'm not sure it's appropriate to retry. For example on the various reload functions.

  • I opted to add this only to handle_prepare and handle_begin. For the other ones I had these reasons for leaving them out:

    • handle_commit: no transaction to commit after connection drops
    • handle_close: no prepared statement to close after connection drops
    • handle_deallocate: no cursor to clean up after connection drops
    • handle_declare: no prepared statement to use for cursor after connection drops
    • handle_execute: no prepared statement to use for execution after connection drops
    • handle_fetch: no cursor to fetch from after connection drops
    • handle_rollback: no transaction to rollback after connection drops
    • handle_status: no transaction to get status of after connection drops
    • ping: ping doesn't have command to run it will just reconnect the connection by virtue of pinging it so don't have to retry

@greg-rychlewski
Copy link
Member Author

ok got the test working :)

@josevalim
Copy link
Member

Beautiful!

handle_close: no prepared statement to close after connection drops

it has been a while but my understanding is that we synchronize prepared queries across connections by storing queries in the ETS table and then deleting them on close. So if connection A cannot close, we still want to try to close it on B, so we delete it from the table. You could say that our close is then pointless, as we don't close across all and we should only remove from the table, but we need to keep it working for transactions and other cases. In other words, I think we should retry. Alternatively, we don't retry, but we delete from the table sooner, but that may have other caveats.

handle_declare: no prepared statement to use for cursor after connection drops

I know we need a transaction for declare to work. But is that Postgrex requirements or a DBConnection one? 🤔 If DBConnection, you are right, there is no need to support retries on declare.

@josevalim
Copy link
Member

I know we need a transaction for declare to work. But is that Postgrex requirements or a DBConnection one? 🤔 If DBConnection, you are right, there is no need to support retries on declare.

It is a DBConnection requirement, so I changed it so we don't support retry on handle_declare.

@greg-rychlewski
Copy link
Member Author

Beautiful!

handle_close: no prepared statement to close after connection drops

it has been a while but my understanding is that we synchronize prepared queries across connections by storing queries in the ETS table and then deleting them on close. So if connection A cannot close, we still want to try to close it on B, so we delete it from the table. You could say that our close is then pointless, as we don't close across all and we should only remove from the table, but we need to keep it working for transactions and other cases. In other words, I think we should retry. Alternatively, we don't retry, but we delete from the table sooner, but that may have other caveats.

You’re absolutely right! I’ll add it

@greg-rychlewski
Copy link
Member Author

@josevalim after this pr is merged would you be alright if I try to submit another one handling bind messages? I was thinking about it more and probably a large portion of failures will happen from people trying to execute already cached queries. So trying to catch the closure at the bind phase is probably worth it?

@josevalim
Copy link
Member

@greg-rychlewski yes, we 100% should. Feel free to merge this in or push it as part of this PR. I will make sure DBConnection allows retry on execute.

@josevalim
Copy link
Member

will make sure DBConnection allows retry on execute.

We do! 🥳

@greg-rychlewski
Copy link
Member Author

ok it looks like it's all working now :). i left a couple comments where to explain a couple choices

@josevalim
Copy link
Member

Hi @greg-rychlewski! I have dropped some comments. I am wondering if we should move the handle_disconnect_retry inside on all operations? For example, in handle_prepare, we could change parse_describe_flush and do recv_parse_describe(s, status, query, buffer) |> retryable()? Note though we do not need to do it on savepoint mode branches. I think this will at least be safer when it comes to execute, as there is no change we accidentally rerun the whole query?

@josevalim
Copy link
Member

@greg-rychlewski gah, nevermind. I played with the idea above a bit and there are issues for things like reload_prepare, which is used for fetch too, which cannot be tried.

So I think we need to mix and match approaches after all.

@josevalim
Copy link
Member

Ok, given everything above, here are the final changes I propose:

diff --git a/lib/postgrex/protocol.ex b/lib/postgrex/protocol.ex
index 883de6f..a71685e 100644
--- a/lib/postgrex/protocol.ex
+++ b/lib/postgrex/protocol.ex
@@ -433,8 +433,7 @@ defmodule Postgrex.Protocol do
         handle_execute_copy(query, params, opts, s)

       false ->
-        result = handle_execute_result(query, params, opts, s)
-        handle_disconnect_retry(result)
+        handle_execute_result(query, params, opts, s)
     end
   end

@@ -2127,7 +2126,7 @@ defmodule Postgrex.Protocol do
     ]

     with :ok <- msg_send(%{s | buffer: nil}, msgs, buffer),
-         {:ok, s, buffer} <- recv_bind(s, status, buffer),
+         {:ok, s, buffer} <- recv_bind(s, status, buffer) |> handle_disconnect_retry(),
          {:ok, result, s, buffer} <- recv_execute(s, status, query, buffer),
          {:ok, s, buffer} <- recv_close(s, status, buffer),
          {:ok, s} <- recv_ready(s, status, buffer) do
@@ -2137,7 +2136,7 @@ defmodule Postgrex.Protocol do
         error_ready(s, status, err, buffer)
         |> maybe_disconnect()

-      {:disconnect, _err, _s} = disconnect ->
+      {_disconnect_or_retry, _err, _s} = disconnect ->
         disconnect
     end
   end
@@ -2164,7 +2163,7 @@ defmodule Postgrex.Protocol do
     ]

     with :ok <- msg_send(%{s | buffer: nil}, msgs, buffer),
-         {:ok, s, buffer} <- recv_bind(s, status, buffer),
+         {:ok, s, buffer} <- recv_bind(s, status, buffer) |> handle_disconnect_retry(),
          {:ok, result, s, buffer} <- recv_execute(s, status, query, buffer),
          {:ok, s} <- recv_ready(s, status, buffer) do
       {:ok, query, result, s}
@@ -2175,7 +2174,7 @@ defmodule Postgrex.Protocol do
         error_ready(s, status, err, buffer)
         |> maybe_disconnect()

-      {:disconnect, _err, _s} = disconnect ->
+      {_disconnect_or_retry, _err, _s} = disconnect ->
         disconnect
     end
   end
@@ -2317,9 +2316,7 @@ defmodule Postgrex.Protocol do
         recv_execute(s, status, query, rows, buffer)

       {:disconnect, _, _} = dis ->
-        with {_, %{reason: :closed} = err, s} <- dis do
-          {:disconnect, %{err | reason: :execute_closed}, s}
-        end
+        dis
     end
   end

@greg-rychlewski
Copy link
Member Author

@josevalim Thanks for the feedback. Going to apply your suggestions but something crossed my mind after reading what you said. I think any of these things can be part of a transaction (even if :mode != :savepoint), in which case if I'm remembering the DBConnection changes correctly we would retry the entire transaction if a connection drops in the middle.

Am I reading too much into it or do you think we would potentially cause problems if people are putting side effects in their transactions. They can always change the retry option to 0 but I was wondering if it changes silently on them all of a sudden should we consider making it opt-in.

@greg-rychlewski
Copy link
Member Author

@josevalim Btw I also think we might be able to move everything lower. I had the same reservation about the reload stuff when I first made the PR but I just looked at it again and I think reload_fetch here actually means "fetch the types again when I try to prepare and don't recognize an oid" rather than anything about cursors.

@greg-rychlewski
Copy link
Member Author

I guess we have the :postgres value in the state. So if we move everything lower level we can not retry if its value is :transaction. Just have to be a bit careful I think because sometimes its a tuple with the last prepared query reference, I believe.

@greg-rychlewski
Copy link
Member Author

greg-rychlewski commented Nov 13, 2025

Ok I made some changes that I believe are improvements based on all the things you said:

  1. Moved the retry handling lower. I believe it's fine because the fetch in these situations was referring to fetching type info and not cursors
  2. Check if the status is :transaction and don't retry if that's the case. So basically will not retry if something fails mid transaction because we don't know what all else will be retried
  3. For handle_execute retry on both receiving bind and sending the execution message. Because most of the time it will fail on sending the message before it receives any data back I believe we need that too

@josevalim
Copy link
Member

@greg-rychlewski if the connection is checked out or in a transaction, DBConnection converts disconnect_and_retry back to disconnect, so that's not a concern.

In any case, I posted my review. I think we can either:

  1. Ship this version once my comments are addressed
  2. Push the previous version with my patch applied

I am fine either way, your call!

@greg-rychlewski
Copy link
Member Author

greg-rychlewski commented Nov 13, 2025

@josevalim Thanks! I reverted back and applied your changes. The only thing I added was the retry handling to msg_send in addition to recv_bind in the execute handling. The reason is because if the connection has been dropped for a while it will fail there.

Co-authored-by: José Valim <jose.valim@gmail.com>
@greg-rychlewski greg-rychlewski merged commit 5ce8c28 into elixir-ecto:master Nov 13, 2025
9 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants