-
Notifications
You must be signed in to change notification settings - Fork 3.8k
Adjust cost-based autoscaler algorithm #18936
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
0fce534 to
2a79747
Compare
|
|
||
| private static final int MAX_INCREASE_IN_PARTITIONS_PER_TASK = 2; | ||
| private static final int MAX_DECREASE_IN_PARTITIONS_PER_TASK = MAX_INCREASE_IN_PARTITIONS_PER_TASK * 2; | ||
| private static final int LAG_STEP = 100_000; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's a lot of magic-looking constants here. Please add javadocs to these constants describing what they do, and consider whether any of them should be configurable to allow for experimentation without changing the code. (Of course, ideally, users do not need to configure anything.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will add javadocs; I am following common sense here in terms of specific numbers.
Of course, ideally, users do not need to configure anything.
That's what I'm looking for.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed: e97ce36
...java/org/apache/druid/indexing/seekablestream/supervisor/autoscaler/CostBasedAutoScaler.java
Outdated
Show resolved
Hide resolved
.../main/java/org/apache/druid/indexing/seekablestream/supervisor/SeekableStreamSupervisor.java
Outdated
Show resolved
Hide resolved
capistrant
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My biggest concerns echo Gian's thoughts on the constants and their ability to be universally (or near universally applicable) versus generally sane for most, but configurable for those select few who need it. I do also think opening this up to too much configurability could also be self defeating in that it may be difficult for the average operator to tune such a tool.
In lieu of opening up these inputs for configuration, documenting the why behind them as Gian suggested would be a good place to start. I also wonder if a more in depth technical writeup once this feature is stabilized is in order. Something that explains the method to the madness and a bit of the math. Perhaps in a brief blog post or docs page within the apache Druid website? This would be to try and head off my main longer term concern, which is the cost function becoming mysterious and unwieldy as folks from the community come and go over time.
|
|
||
| /** | ||
| * Computes extra allowed increase in partitions-per-task in scenarios when the average | ||
| * per-partition lag is relatively high. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
relatively high compared to what? I think I'm failing to grasp how the constants for lag thresholds can be so generally applicable that we don't need to open them up to configuration. I think this idea of allowing for a burst in PPT scale factor is reasonable, but is this specific impl coming from tuning for too specific of a case? I think this goes along with Gian's question on the constants above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Eventually, we may make those constants configurable, but now they are baked only by my observations as 'good' defaults.
I thought even more and I make some constants configurable (like lag threshold) even in the current PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ideally nothing needs to be configured although I also feel that it may be useful for now to allow the lag threshold to be configured. The reason is that it's expressed in number of messages, but in practice, processing time of a message can vary quite a lot. Messages can be 100 bytes and processed very quickly, or they can be 10KB and processed more slowly. The difference between these two extremes is 100x so one number does not fit all. A lag of 100,000 messages could represent less than a second of processing, or could represent a minute of processing.
Ultimately I think we want the decision-making to be based on the system's assessment of how many messages it can process per unit of time. But that doesn't need to be done now. Probably for now we can set the initial thresholds assuming some medium rate of processing, like assuming we can process 5,000 or 10,000 messages per second per task.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ultimately I think we want the decision-making to be based on the system's assessment of how many messages it can process per unit of time
That's an ideal scenario, and actually what I am looking for too in the long run.
Now I want to have some good defaults to start from, and your assumptions are close to what is in the codebase already: assuming that we're processing ~10.000 messages from ~10 partitions (typical for mid-load Kafka topic).
Lag of 100.000 messages per partition is already inappropriate and may cause serious inconvenience to Druid users.
| { | ||
| private static final Logger log = new Logger(WeightedCostFunction.class); | ||
|
|
||
| private static final double HIHG_LAG_SCALE_FACTOR = 100_000.0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this supposed to be HIGH or am I unfamiliar with what HIHG may stand for in math terms? I tried to google it in case it is an abbreviation I don't know, but didn't find anything so I tend to think maybe just misspelled?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is HIGH, just typo.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed: e97ce36
| /** | ||
| * Estimates the idle ratio for a given task count using a capacity-based linear model. | ||
| * Estimates the idle ratio for a proposed task count. | ||
| * Includes lag-based adjustment to eliminate high lag and |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think here again I am wondering about how generally applicable a constant for per partition lag in the real world. Same/similar questions as from the PPT scale limits computed in CostBasedAutoScaler.
Also, in addition to the above, I think adding in this lag consideration does add some complexity here. Mainly it generally starts us down the path of making the cost function harder to easily and quickly understand for a newcomer, IMO. If this added complexity is considered a negative or "cost", the positive of improved behavior should outweigh it. So I guess that begs the question, how did we or are we going to measure the improvement that this additional logic/computation provides?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, in addition to the above, I think adding in this lag consideration does add some complexity here. Mainly it generally starts us down the path of making the cost function harder to easily and quickly understand for a newcomer, IMO.
Sometimes you want to have complex things in the project, because they make some things work slightly better. A good example is query planner / query optimizer, which we have from the Calcite side. It's not easy to enter, hard to master, but with complexity it brings a good framework to start using SQL for your database.
Same here: in order to make supervisor autoscaling work well, we need to introduce a level of complexity baked by math (the formulas are described here: #18819). During the testing, I realized it is too conservative in the high lag scenarios, and it is an attempt to tweak it a bit.
I hook up your question from general comment:
I also wonder if a more in depth technical writeup once this feature is stabilized is in order. Something that explains the method to the madness and a bit of the math. Perhaps in a brief blog post or docs page within the apache Druid website?
We must do it, but the feature is not finally stabilized; anyway, it already has a decent base.
So I guess that begs the question, how did we or are we going to measure the improvement that this additional logic/computation provides?
That's a very good question, and I would answer in the following manner: the less time we spend scaling supervisors manually / fine-tuning an autoscaler, the better the result we will receive.
|
@capistrant thanks for the review. |
thanks ya I was looking through that PR description to get a grasp on things. plus your javadocs are informative which is good. I guess lets table docs/blog talk until future, I got ahead of myself there. Math stresses me out too easily 😆 |
Core ideas around the change
taskCountMaxand avoid negative headroom effects.Key changes
rawExtrawhen avg lag exceeds thresholds, plus a step function for higher lag.max(0, 1 - currentTasks/maxTasks)to reduce aggressiveness near max and avoid negative scaling.by a ramped multiplier (1.0x at 50K to 2.0x at 500K) to reduce predicted idle more aggressively during high lag.
Expected behavior (for example, 30 partitions,
taskCountMax=30)taskCountMax.Note: examples assume lag-dominant conditions (low/mide processing rate, low idle).
Open questions / risks
aggregateLag / partitionCountcan under-react if a few partitions are far behind. Consider max/percentile lag if available.This PR has: