
Conversation

@Fly-Style Fly-Style commented Jan 21, 2026

Core ideas around the change

  • Scale up more aggressively when per-partition lag is meaningful.
  • Relax the PPT (partitions-per-task) increase limit based on lag severity and headroom.
  • Keep behavior conservative near taskCountMax and avoid negative headroom effects.

Key changes

  • Lag-proportional PPT relaxation: add rawExtra when avg lag exceeds thresholds, plus a step function for higher lag.
  • Headroom-aware scaling: multiply by max(0, 1 - currentTasks/maxTasks) to reduce aggressiveness near max and avoid negative scaling.
  • Lag-amplified idle decay: when lag/partition exceeds the activation threshold, the lag-based busy factor is multiplied
    by a ramped multiplier (1.0x at 50K to 2.0x at 500K) to reduce predicted idle more aggressively during high lag
    (see the sketch after this list).
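
A minimal sketch of how these two adjustments could compose, for readers who prefer code to prose. The class and method names, the exact thresholds, and the formulas below are illustrative assumptions, not the PR's actual implementation:

// Illustrative sketch only; names, thresholds, and formulas are assumptions.
final class PptRelaxationSketch
{
  // Hypothetical thresholds, loosely matching the numbers quoted in this description.
  private static final long LAG_ACTIVATION_THRESHOLD = 50_000L;
  private static final long LAG_STEP = 100_000L; // same value as the LAG_STEP constant in the diff; its exact use there may differ
  private static final long HIGH_LAG = 500_000L;

  // Extra allowed increase in partitions-per-task, scaled down by remaining task
  // headroom so behavior stays conservative near taskCountMax.
  static int extraPartitionsPerTask(long avgLagPerPartition, int currentTasks, int maxTasks)
  {
    if (avgLagPerPartition < LAG_ACTIVATION_THRESHOLD) {
      return 0;
    }
    // Step function: one extra unit, plus another per full LAG_STEP of average per-partition lag.
    int rawExtra = 1 + (int) (avgLagPerPartition / LAG_STEP);
    // Headroom factor in [0, 1]; it reaches 0 at taskCountMax, so scaling never goes negative.
    double headroom = Math.max(0.0, 1.0 - (double) currentTasks / maxTasks);
    return (int) Math.floor(rawExtra * headroom);
  }

  // Multiplier applied to the lag-based busy factor: ramps linearly from 1.0x at the
  // activation threshold to 2.0x at HIGH_LAG, reducing predicted idle under high lag.
  static double idleDecayMultiplier(long lagPerPartition)
  {
    if (lagPerPartition <= LAG_ACTIVATION_THRESHOLD) {
      return 1.0;
    }
    double ramp = (double) (lagPerPartition - LAG_ACTIVATION_THRESHOLD)
                  / (HIGH_LAG - LAG_ACTIVATION_THRESHOLD);
    return 1.0 + Math.min(1.0, ramp);
  }
}

With taskCountMax = 30, the headroom factor in this sketch is 0.9 at 3 running tasks and 0 at 30, which is what keeps behavior conservative near the maximum.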

Expected behavior (for example, 30 partitions, taskCountMax=30)

  • 3 tasks: 50K lag/partition → 8 tasks; 300K → 15 tasks; 500K+ → 30 tasks.
  • 10 tasks: 100K → 15; 300K → 30; 500K → 30.
  • 20/25 tasks (PPT=1): scale-up capped only by taskCountMax.
    Note: examples assume lag-dominant conditions (low/mid processing rate, low idle).

Open questions / risks

  • Lag skew: using aggregateLag / partitionCount can under-react if a few partitions are far behind. Consider max/percentile lag if available.
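
A small illustration of the skew concern (the per-partition lags below are hypothetical):

import java.util.Arrays;

// Shows how average lag can hide one badly lagging partition.
final class LagSkewExample
{
  public static void main(String[] args)
  {
    // One hot partition, the rest nearly caught up.
    long[] partitionLags = {900_000, 1_000, 1_000, 1_000, 1_000, 1_000};

    long aggregateLag = Arrays.stream(partitionLags).sum();
    double avgLag = (double) aggregateLag / partitionLags.length;  // ~151K: looks moderate
    long maxLag = Arrays.stream(partitionLags).max().orElse(0);    // 900K: clearly severe

    System.out.printf("avg=%.0f max=%d%n", avgLag, maxLag);
    // Thresholds keyed off avgLag react mildly here; keying off maxLag (or a high
    // percentile) would treat this as a high-lag scenario.
  }
}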

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added comments explaining the "why" and the intent of the code wherever it would not be obvious to an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • been tested in a test Druid cluster.

@Fly-Style Fly-Style force-pushed the cost-autoscaler-adjustments branch from 0fce534 to 2a79747 (January 21, 2026 14:18)

private static final int MAX_INCREASE_IN_PARTITIONS_PER_TASK = 2;
private static final int MAX_DECREASE_IN_PARTITIONS_PER_TASK = MAX_INCREASE_IN_PARTITIONS_PER_TASK * 2;
private static final int LAG_STEP = 100_000;

Contributor

There are a lot of magic-looking constants here. Please add Javadocs to these constants describing what they do, and consider whether any of them should be configurable to allow for experimentation without changing the code. (Of course, ideally, users do not need to configure anything.)

@Fly-Style Fly-Style commented Jan 23, 2026

Will add Javadocs; I am following common sense here for the specific numbers.

Of course, ideally, users do not need to configure anything.

That's what I'm looking for.

Contributor Author

Fixed: e97ce36

@capistrant capistrant left a comment


My biggest concerns echo Gian's thoughts on the constants: can they be universally (or near-universally) applicable, or should they be generally sane for most but configurable for the select few who need it? I also think opening this up to too much configurability could be self-defeating, in that it may be difficult for the average operator to tune such a tool.

In lieu of opening up these inputs for configuration, documenting the why behind them, as Gian suggested, would be a good place to start. I also wonder if a more in-depth technical writeup is in order once this feature is stabilized: something that explains the method to the madness and a bit of the math, perhaps in a brief blog post or docs page on the Apache Druid website. This would help head off my main longer-term concern, which is the cost function becoming mysterious and unwieldy as folks from the community come and go over time.


/**
* Computes extra allowed increase in partitions-per-task in scenarios when the average
* per-partition lag is relatively high.

Contributor

Relatively high compared to what? I think I'm failing to grasp how the constants for lag thresholds can be so generally applicable that we don't need to open them up to configuration. I think this idea of allowing for a burst in the PPT scale factor is reasonable, but is this specific impl coming from tuning for too specific a case? I think this goes along with Gian's question on the constants above.

@Fly-Style Fly-Style commented Jan 23, 2026

Eventually, we may make those constants configurable, but for now they are backed only by my observations as 'good' defaults.

I thought about it some more, and I am making some constants configurable (like the lag threshold) even in the current PR.

Contributor

Ideally nothing needs to be configured although I also feel that it may be useful for now to allow the lag threshold to be configured. The reason is that it's expressed in number of messages, but in practice, processing time of a message can vary quite a lot. Messages can be 100 bytes and processed very quickly, or they can be 10KB and processed more slowly. The difference between these two extremes is 100x so one number does not fit all. A lag of 100,000 messages could represent less than a second of processing, or could represent a minute of processing.

Ultimately I think we want the decision-making to be based on the system's assessment of how many messages it can process per unit of time. But that doesn't need to be done now. Probably for now we can set the initial thresholds assuming some medium rate of processing, like assuming we can process 5,000 or 10,000 messages per second per task.
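
As a rough back-of-the-envelope illustration of that spread (the throughput numbers below are assumptions, chosen only to show the ~100x difference):

// Back-of-the-envelope only: per-task throughput numbers are assumptions.
final class LagToTimeExample
{
  public static void main(String[] args)
  {
    long lagMessages = 100_000;

    double fastRateMsgPerSec = 150_000;  // tiny ~100-byte records, processed very quickly
    double slowRateMsgPerSec = 1_500;    // ~10KB records, processed far more slowly

    System.out.printf("fast: %.1f s, slow: %.1f s%n",
                      lagMessages / fastRateMsgPerSec,
                      lagMessages / slowRateMsgPerSec);
    // Prints roughly "fast: 0.7 s, slow: 66.7 s": under a second versus about a minute
    // of catch-up time, which is why a single message-count threshold cannot fit every workload.
  }
}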

@Fly-Style Fly-Style commented Jan 23, 2026

Ultimately I think we want the decision-making to be based on the system's assessment of how many messages it can process per unit of time

That's an ideal scenario, and actually what I am looking for too in the long run.

For now I want to have some good defaults to start from, and your assumptions are close to what is in the codebase already: assuming that we're processing ~10,000 messages from ~10 partitions (typical for a mid-load Kafka topic).
A lag of 100,000 messages per partition is already inappropriate and may cause serious inconvenience to Druid users.

{
private static final Logger log = new Logger(WeightedCostFunction.class);

private static final double HIHG_LAG_SCALE_FACTOR = 100_000.0;

Contributor

Is this supposed to be HIGH, or am I unfamiliar with what HIHG may stand for in math terms? I tried to google it in case it is an abbreviation I don't know, but didn't find anything, so I tend to think it's maybe just misspelled?

Contributor Author

It is HIGH, just a typo.

Contributor Author

Fixed: e97ce36

/**
* Estimates the idle ratio for a given task count using a capacity-based linear model.
* Estimates the idle ratio for a proposed task count.
* Includes lag-based adjustment to eliminate high lag and

Contributor

I think here again I am wondering how generally applicable a constant for per-partition lag is in the real world. Same/similar questions as for the PPT scale limits computed in CostBasedAutoScaler.

Also, in addition to the above, I think adding in this lag consideration does add some complexity here. Mainly, it starts us down the path of making the cost function harder for a newcomer to quickly understand, IMO. If this added complexity is considered a negative or "cost", the positive of improved behavior should outweigh it. So I guess that begs the question: how did we, or are we going to, measure the improvement that this additional logic/computation provides?

Contributor Author

Also, in addition to the above, I think adding in this lag consideration does add some complexity here. Mainly, it starts us down the path of making the cost function harder for a newcomer to quickly understand, IMO.

Sometimes you want to have complex things in the project, because they make some things work slightly better. A good example is the query planner / query optimizer that we get from the Calcite side. It's not easy to get into and hard to master, but with its complexity it brings a good framework for using SQL with your database.
Same here: in order to make supervisor autoscaling work well, we need to introduce a level of complexity backed by math (the formulas are described here: #18819). During testing, I realized it is too conservative in high-lag scenarios, and this is an attempt to tweak it a bit.

To tie in your question from the general comment:

I also wonder if a more in-depth technical writeup is in order once this feature is stabilized: something that explains the method to the madness and a bit of the math, perhaps in a brief blog post or docs page on the Apache Druid website.

We must do it, but the feature is not fully stabilized yet; it already has a decent base, though.

So I guess that begs the question: how did we, or are we going to, measure the improvement that this additional logic/computation provides?

That's a very good question, and I would answer it this way: the less time we spend manually scaling supervisors or fine-tuning the autoscaler, the better the result.

@Fly-Style Fly-Style (Contributor Author)

@capistrant thanks for the review.
About the documentation: check #18819, where the algorithm is described; this patch introduces tweaks to an autoscaler that was too conservative in high-lag scenarios.

@capistrant (Contributor)

@capistrant thanks for the review. About the documentation: check #18819, where the algorithm is described; this patch introduces tweaks to an autoscaler that was too conservative in high-lag scenarios.

Thanks, yeah, I was looking through that PR description to get a grasp on things. Plus your Javadocs are informative, which is good. I guess let's table the docs/blog talk until the future; I got ahead of myself there. Math stresses me out too easily 😆

