[WIP] Fix client can OOM when there are some bookies slow #4556
Open
dao-jun wants to merge 1 commit into apache:master from
Conversation
dlg99 (Contributor) reviewed on Mar 31, 2025 and left a comment:
I don't think this is a good approach. There is client backpressure (see the PRs that you linked) that should address the problem.
Quarantined bookies aren't necessarily dead. It's a soft state in which we try not to choose the bookie for requests unless there are no other options. E.g. the bookie could be in a long GC pause and will come back even though a request timed out. Disconnecting channels means a longer reconnection process later.
Related to:
apache/pulsar#12169
apache/pulsar#9562
apache/pulsar#10439
#3139
apache/pulsar#14861
and others.
Background
Our customer runs a cluster with 12 bookies and 12 brokers.
Pulsar version: 2.6.3
Bookkeeper: 4.11.1
They enabled the BookKeeper client addEntryTimeout feature and set `addEntryTimeoutSec` to 30.
At first, their EWA (ensemble size, write quorum, ack quorum) was 332, and they encountered a broker OOM exception.
Following apache/pulsar#12169, we recommended that they set EWA to 222 and observe for a period of time.
After a few days, they encountered a broker OOM exception again.
So we suspected that the broker might have a memory leak and had them enable the Netty ByteBuf leak detector (add `-Dpulsar.allocator.leak_detection=Paranoid` to the broker JVM args and restart).
But when we searched for the `LEAK` keyword in their broker logs, there were no matching entries, which means there was no memory leak in the broker.
We found log entries like `New ensemble: [aaa,bbb] is not adhering to Placement Policy. quarantinedBookies: [xxx]` in their logs, and `quarantinedBookies` was always the same.
We checked the monitoring for this bookie and found that it had received no traffic for a long time (weeks), so we tried to restart it, but it would not shut down for a long time, until we ran `kill -9`. This suggests the bookie had run into blocked threads or something similar, so it could not respond to requests.
After we restarted the bookie, no more broker OOMs occurred and the brokers ran well.
When I analyzed the broker heap dump, I found that some Netty channels held a large amount of direct memory, and all of these channels were connected to that quarantined bookie:

Six channels each retained over 100 MB of direct memory.
Because our customer enabled the `addEntryTimeout` feature, broker backpressure won't work in this case.
Enabling `failfast` can prevent the situation from escalating, but it will not solve the root cause.
If we set EWA to 332 and one bookie is slow or hanging, OOM can still happen.
If we set EWA to 222 and disable `addEntryTimeout`, and one bookie is slow or hanging, the broker may be unable to serve requests.
The key point is that if a bookie is slow or hanging and we don't enable `failfast`, the client keeps sending data to it even though the data cannot be sent out. All of that data backs up in the client.
Motivation
Fix the BookKeeper client OOM that can occur when a bookie in the ensemble is slow or hanging.
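To make the backlog argument concrete, here is a toy model of the two EWA settings discussed in the background. It is not BookKeeper code: the class `SlowBookieBacklogModel`, its `simulate` method, and the simplification that every entry goes to the same `writeQuorum` bookies (no striping, no timeouts, no ensemble change) are all assumptions made for illustration.

```java
/** Toy model (hypothetical, not BookKeeper code): bytes queued per bookie when one bookie hangs. */
class SlowBookieBacklogModel {

    /** Outcome of a simulated run of add-entry requests. */
    static final class Result {
        final int ackedEntries;    // entries that reached the ack quorum
        final long[] pendingBytes; // bytes still buffered in the client, per bookie

        Result(int ackedEntries, long[] pendingBytes) {
            this.ackedEntries = ackedEntries;
            this.pendingBytes = pendingBytes;
        }
    }

    /**
     * Sends {@code entries} entries of {@code entrySize} bytes to {@code writeQuorum} bookies.
     * The bookie at index {@code hungBookie} never responds, so its copy of every
     * entry stays in the client's outbound buffer.
     */
    static Result simulate(int writeQuorum, int ackQuorum, int hungBookie,
                           int entries, int entrySize) {
        long[] pending = new long[writeQuorum];
        int acked = 0;
        for (int e = 0; e < entries; e++) {
            int responses = 0;
            for (int b = 0; b < writeQuorum; b++) {
                if (b == hungBookie) {
                    pending[b] += entrySize; // never flushed, never acknowledged
                } else {
                    responses++;             // healthy bookie acknowledges at once
                }
            }
            if (responses >= ackQuorum) {
                acked++;                     // the write completes for the caller
            }
        }
        return new Result(acked, pending);
    }

    public static void main(String[] args) {
        // W=3, A=2: writes keep succeeding, so the producer keeps going and the backlog grows.
        Result r332 = simulate(3, 2, 2, 1000, 1 << 20);
        System.out.println("332 acked=" + r332.ackedEntries
                + " pendingToHung=" + r332.pendingBytes[2]);

        // W=2, A=2: no write can reach the ack quorum while one bookie hangs.
        Result r222 = simulate(2, 2, 1, 1000, 1 << 20);
        System.out.println("222 acked=" + r222.ackedEntries
                + " pendingToHung=" + r222.pendingBytes[1]);
    }
}
```

In this model the 332 case is the one that accumulates memory without the caller noticing, because every write still completes; the 222 case stalls instead. The real client adds timeouts and ensemble changes on top of this, which the sketch deliberately leaves out.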
Changes
Close all channels connected to a quarantined bookie to release their memory.
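The idea behind this change can be sketched as follows. This is a simplified stand-in, not the code in this PR: `QuarantineChannelCloser`, `FakeChannel`, and `onBookieQuarantined` are hypothetical names, and a real implementation would close Netty channels owned by the BookKeeper client rather than plain objects.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: drop all client channels to a bookie once it is quarantined. */
class QuarantineChannelCloser {

    /** Stand-in for a client connection; real code would wrap a Netty Channel. */
    static final class FakeChannel {
        private long bufferedBytes;
        private boolean open = true;

        FakeChannel(long bufferedBytes) {
            this.bufferedBytes = bufferedBytes;
        }

        /** Closes the channel and returns how many buffered bytes were released. */
        long close() {
            long released = bufferedBytes;
            bufferedBytes = 0;
            open = false;
            return released;
        }

        boolean isOpen() {
            return open;
        }
    }

    private final Map<String, List<FakeChannel>> channelsByBookie = new HashMap<>();

    void register(String bookieId, FakeChannel channel) {
        channelsByBookie.computeIfAbsent(bookieId, k -> new ArrayList<>()).add(channel);
    }

    /** Closes every channel to the quarantined bookie and returns the bytes released. */
    long onBookieQuarantined(String bookieId) {
        List<FakeChannel> channels = channelsByBookie.remove(bookieId);
        if (channels == null) {
            return 0L; // no open channels to this bookie
        }
        long released = 0L;
        for (FakeChannel channel : channels) {
            released += channel.close();
        }
        return released;
    }

    int openChannelCount(String bookieId) {
        List<FakeChannel> channels = channelsByBookie.get(bookieId);
        return channels == null ? 0 : channels.size();
    }
}
```

The trade-off raised in review applies to this sketch as well: a quarantined bookie may recover (for example after a long GC pause), and channels closed here have to be re-established before it can be used again.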