Deadlock in MessageHandler.put_messages causes unbounded replication slot growth #2129

@g14a

Description

Summary:

When the SlotMessageStore buffer is full, MessageHandler.put_messages/3 blocks SlotProcessorServer in a Process.sleep retry loop (15 retries, exponential backoff, ~1.6s). This creates a deadlock: ConsumerProducer needs SlotProcessorServer to be responsive in order to drain the buffer, but SlotProcessorServer is sleeping. The buffer never drains, so every retry fails, ReorderBuffer reschedules the same batch, and the cycle repeats indefinitely. No LSN acks reach Postgres, so the replication slot grows without bound (we've seen 50-73GB).
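To make the cycle described above easier to reason about, here is a toy model in Python. It is not Sequin's code: `ToyPipeline`, `run`, and the "one turn per round" scheduling are hypothetical simplifications, and the only thing borrowed from the summary is the idea that a processor stuck in a retry loop cannot serve the consumer that would free up the buffer.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class ToyPipeline:
    """Hypothetical stand-in for the buffer between processor and consumer."""
    capacity: int                      # plays the role of max_memory_bytes
    buffer: deque = field(default_factory=deque)
    acked: int = 0                     # messages drained and acknowledged

    def try_put(self, batch):
        """Accept the batch only if it fits, otherwise reject it."""
        if len(self.buffer) + len(batch) > self.capacity:
            return False
        self.buffer.extend(batch)
        return True

    def drain(self, n):
        """Consumer side: remove up to n messages and count them as acked."""
        n = min(n, len(self.buffer))
        for _ in range(n):
            self.buffer.popleft()
        self.acked += n
        return n


def run(pipeline, batch, rounds, blocking, retries=15):
    """Simulate `rounds` turns of a single-threaded processor.

    blocking=True:  a full buffer makes the processor spend its whole turn
                    retrying, so the consumer's drain request is never served.
    blocking=False: the processor gives up immediately on a full buffer and
                    uses the rest of its turn to serve the consumer.
    Returns True if the batch was eventually accepted.
    """
    for _ in range(rounds):
        if blocking:
            # Nothing else runs while the processor "sleeps", so the buffer
            # cannot change between attempts and every retry sees it full.
            if any(pipeline.try_put(batch) for _ in range(retries)):
                return True
            # All retries failed: the same batch is rescheduled next round.
        else:
            if pipeline.try_put(batch):
                return True
            pipeline.drain(len(batch))  # processor is free to serve the consumer
    return False


stuck_pipeline = ToyPipeline(capacity=4, buffer=deque(range(4)))
stuck = run(stuck_pipeline, batch=[10, 11], rounds=100, blocking=True)

free_pipeline = ToyPipeline(capacity=4, buffer=deque(range(4)))
recovered = run(free_pipeline, batch=[10, 11], rounds=100, blocking=False)

print("blocking:", stuck, "acked:", stuck_pipeline.acked)
print("non-blocking:", recovered, "acked:", free_pipeline.acked)
```

In this model the blocking variant never accepts the batch and never acks anything, no matter how many rounds it runs, while the non-blocking variant recovers as soon as the consumer gets a turn. Whether the real fix should be returning backpressure to the caller or something else is for the maintainers to judge; the sketch only illustrates why retrying inside the processor cannot succeed on its own.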

Reproduction:

Run high-throughput WAL ingestion until SlotMessageStore fills beyond max_memory_bytes. Once payload_size_limit_exceeded fires, the system never recovers without a manual restart.
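One way to confirm the symptom while reproducing is to watch how far the slot's acknowledged position trails the current WAL position. The sketch below assumes you fetch `confirmed_flush_lsn` from the `pg_replication_slots` view and the current position from `pg_current_wal_lsn()` yourself (the query string is included for reference only and is not executed here). The helper names and the sample LSNs are illustrative, not taken from our systems.

```python
# Query to run against the source database (shown for reference only).
LAG_QUERY = """
SELECT slot_name,
       confirmed_flush_lsn,
       pg_current_wal_lsn() AS current_lsn
FROM pg_replication_slots
WHERE slot_type = 'logical';
"""


def lsn_to_int(lsn: str) -> int:
    """Convert a Postgres LSN such as '16/B374D848' to a byte position.

    An LSN is printed as two hexadecimal numbers separated by a slash:
    the high 32 bits and the low 32 bits of a 64-bit WAL position.
    """
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def slot_lag_bytes(current_lsn: str, confirmed_flush_lsn: str) -> int:
    """Bytes of WAL written since the consumer's last acknowledged position."""
    return lsn_to_int(current_lsn) - lsn_to_int(confirmed_flush_lsn)


# Illustrative values: the slot is exactly one 4 GiB "high word" behind.
lag = slot_lag_bytes("1/0", "0/0")
print(f"lag: {lag / 1024**3:.1f} GiB")
```

If the deadlock is in effect, this number should keep growing between samples while the pipeline is stuck, and only drop after a restart.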

Workaround:

Clicking "Restart" on the database page in the UI breaks the deadlock by restarting the pipeline with a fresh, empty buffer, but the buffer fills up again quite quickly.

I come from a Go background and don't understand the Elixir code very well, since this project is a little complex for me, so I used AI to help me navigate the code and debug this issue. The issue itself is real, though: we have production systems that have been carrying more than 100GB of lag for the past 3 months.
