Deadlock in MessageHandler.put_messages causes unbounded replication slot growth #2129

Summary: 

When the `SlotMessageStore` buffer is full, `MessageHandler.put_messages/3` blocks `SlotProcessorServer` in a `Process.sleep` retry loop (15 retries, exponential backoff, ~1.6s). This creates a deadlock: `ConsumerProducer` needs `SlotProcessorServer` to be responsive in order to drain the buffer, but `SlotProcessorServer` is sleeping. The buffer never drains, every retry fails, `ReorderBuffer` reschedules the same batch, and the cycle repeats forever. No LSN acks reach Postgres, so the replication slot grows without bound (we've seen 50-73GB).
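To make the cycle described above easier to follow, here is a toy, single-threaded Python model of it. It is not the project's code: the class `SlotProcessor`, the `busy` flag, and the functions `consumer`, `run_blocking`, and `run_nonblocking` are all hypothetical names, and the model simply assumes that a drain request can only be served while the processor is not inside its retry loop. The non-blocking variant is included only as a contrast, not as a proposed patch.

```python
from collections import deque


class SlotProcessor:
    """Toy stand-in for SlotProcessorServer plus its message buffer (hypothetical)."""

    def __init__(self, capacity, max_retries=15):
        self.capacity = capacity
        self.max_retries = max_retries
        self.buffer = deque()
        self.busy = False   # True while the processor is stuck in its retry loop
        self.acked = 0      # messages drained, i.e. progress that could be acked

    def try_put(self, msg):
        """Single attempt: refuse the message if the buffer is full."""
        if len(self.buffer) >= self.capacity:
            return False
        self.buffer.append(msg)
        return True

    def handle_drain_request(self):
        """Assumption of this model: draining needs a responsive processor."""
        if self.busy or not self.buffer:
            return False
        self.buffer.popleft()
        self.acked += 1
        return True

    def put_blocking(self, msg, consumer):
        """Retry inside the processor; the consumer polls during each 'sleep'."""
        self.busy = True
        try:
            for _ in range(self.max_retries):
                if self.try_put(msg):
                    return True
                consumer(self)  # refused every time, because busy is True
            return False
        finally:
            self.busy = False


def consumer(proc):
    """Toy stand-in for ConsumerProducer asking the processor to drain."""
    return proc.handle_drain_request()


def run_blocking(rounds):
    proc = SlotProcessor(capacity=2)
    proc.buffer.extend(["a", "b"])  # start with a full buffer
    delivered = 0
    for _ in range(rounds):  # each round models the same batch being rescheduled
        if proc.put_blocking("c", consumer):
            delivered += 1
    return delivered, proc.acked


def run_nonblocking(rounds):
    proc = SlotProcessor(capacity=2)
    proc.buffer.extend(["a", "b"])
    delivered = 0
    for _ in range(rounds):
        if proc.try_put("c"):
            delivered += 1
        else:
            consumer(proc)  # processor is idle here, so the drain is served
    return delivered, proc.acked


if __name__ == "__main__":
    print("blocking:", run_blocking(5))
    print("non-blocking:", run_nonblocking(5))
```

In this model the blocking variant never delivers or drains anything no matter how many rounds run, while the variant that returns control between attempts makes steady progress. Whether the real pipeline behaves the same way depends on details of the actual Elixir processes that this sketch does not capture.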

Reproduction: 

Run high-throughput WAL ingestion that fills `SlotMessageStore` beyond `max_memory_bytes`. Once `payload_size_limit_exceeded` fires, the system never recovers without a manual restart.

Workaround: 

Clicking "Restart" on the database page in the UI breaks the deadlock by restarting the pipeline with a fresh, empty buffer, but the buffer fills up again quite quickly.

I come from a Go background and don't understand the Elixir code very well, since this project is a little complex for me, so I used AI to help me navigate the code and debug this issue. The issue itself is real, though: we have production systems that have had more than 100GB of lag for the past 3 months.
