Description
Summary:
When the SlotMessageStore buffer is full, MessageHandler.put_messages/3 blocks SlotProcessorServer in a Process.sleep retry loop (15 retries, exponential backoff, ~1.6s). This creates a deadlock: ConsumerProducer needs SlotProcessorServer to be responsive in order to drain the buffer, but SlotProcessorServer is sleeping. The buffer never drains, every retry fails, ReorderBuffer reschedules the same batch, and the cycle repeats indefinitely. No LSN acks reach Postgres, so the replication slot grows without bound (we have seen 50-73 GB).
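To make the suspected cycle easier to reason about, here is a toy model in Python. It is not Sequin code: the function names, the mailbox layout, and the buffer capacity are all hypothetical, and it assumes that the rescheduled batch is handled before the pending drain request. It only illustrates the pattern described above, where a handler that retries in place can never observe a drained buffer, while a handler that yields and retries later can.

```python
from collections import deque

MAX_RETRIES = 15  # retry count quoted in this report


def try_put(buffer: deque, capacity: int) -> bool:
    """Append one message if the buffer has room; report success."""
    if len(buffer) < capacity:
        buffer.append("new")
        return True
    return False


def run(blocking: bool, capacity: int = 2, max_steps: int = 50) -> dict:
    """Simulate a single-mailbox process handling one message per step.

    blocking=True  : retry inside the handler (models sleep-and-retry).
    blocking=False : try once, then requeue the batch behind pending messages.
    """
    buffer = deque(["m1", "m2"])         # hypothetical buffer, already full
    mailbox = deque(["batch", "drain"])  # the drain request waits behind the batch
    stats = {"delivered": 0, "failed_attempts": 0, "drained": 0}

    for _ in range(max_steps):
        if not mailbox:
            break
        msg = mailbox.popleft()

        if msg == "drain":
            # Only reachable when the process is free to read its mailbox.
            stats["drained"] += len(buffer)
            buffer.clear()
            continue

        # msg == "batch": nothing else runs between these attempts, so the
        # buffer cannot change while the handler is retrying.
        attempts = MAX_RETRIES if blocking else 1
        delivered = False
        for _ in range(attempts):
            if try_put(buffer, capacity):
                delivered = True
                break
            stats["failed_attempts"] += 1

        if delivered:
            stats["delivered"] += 1
        elif blocking:
            mailbox.appendleft("batch")  # assumption: same batch is retried first
        else:
            mailbox.append("batch")      # yield so the drain request runs first

    return stats


if __name__ == "__main__":
    print("blocking:    ", run(blocking=True))
    print("non-blocking:", run(blocking=False))
```

In this model the blocking variant uses all 50 steps without delivering or draining anything, while the non-blocking variant fails once, drains the two buffered messages, and then delivers the batch. Whether the real pipeline behaves the same way depends on details of SlotProcessorServer that the model does not capture.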
Reproduction:
High-throughput WAL ingestion that fills SlotMessageStore beyond max_memory_bytes. Once payload_size_limit_exceeded fires, the system never recovers without a manual restart.
Workaround:
Clicking "Restart" on the database page in the UI breaks the deadlock by restarting the pipeline with a fresh, empty buffer, but the buffer fills up again quite quickly.
I come from a Go background and do not yet understand the Elixir code well, since this project is a little complex for me, so I used AI to help me navigate the code and debug this issue. The issue itself is real, though: we have production systems that have had more than 100 GB of lag for the past 3 months.