
Conversation

@Fly-Style Fly-Style commented Jan 21, 2026

Core ideas around the change

  • Scale up more aggressively when per-partition lag is meaningful.
  • Relax the PPT (partitions-per-task) increase limit based on lag severity and headroom.
  • Keep behavior conservative near taskCountMax and avoid negative headroom effects.

Key changes

  • Lag-proportional PPT relaxation: add rawExtra when avg lag exceeds thresholds, plus a step function for higher lag.
  • Headroom-aware scaling: multiply by max(0, 1 - currentTasks/maxTasks) to reduce aggressiveness near max and avoid negative scaling.
  • Lag-amplified idle decay: when lag/partition exceeds the activation threshold, the lag-based busy factor is multiplied
    by a ramped multiplier (1.0x at 50K to 2.0x at 500K) to reduce predicted idle more aggressively during high lag
    (see the sketch after this list).
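
A minimal sketch of how these two adjustments could compose, for readers who prefer code to prose. The class and method names, the exact thresholds, and the formulas below are illustrative assumptions, not the PR's actual implementation:

// Illustrative sketch only; names, thresholds, and formulas are assumptions.
final class PptRelaxationSketch
{
  // Hypothetical thresholds, loosely matching the numbers quoted in this description.
  private static final long LAG_ACTIVATION_THRESHOLD = 50_000L;
  private static final long LAG_STEP = 100_000L; // same value as the LAG_STEP constant in the diff; its exact use there may differ
  private static final long HIGH_LAG = 500_000L;

  // Extra allowed increase in partitions-per-task, scaled down by remaining task
  // headroom so behavior stays conservative near taskCountMax.
  static int extraPartitionsPerTask(long avgLagPerPartition, int currentTasks, int maxTasks)
  {
    if (avgLagPerPartition < LAG_ACTIVATION_THRESHOLD) {
      return 0;
    }
    // Step function: one extra unit, plus another per full LAG_STEP of average per-partition lag.
    int rawExtra = 1 + (int) (avgLagPerPartition / LAG_STEP);
    // Headroom factor in [0, 1]; it reaches 0 at taskCountMax, so scaling never goes negative.
    double headroom = Math.max(0.0, 1.0 - (double) currentTasks / maxTasks);
    return (int) Math.floor(rawExtra * headroom);
  }

  // Multiplier applied to the lag-based busy factor: ramps linearly from 1.0x at the
  // activation threshold to 2.0x at HIGH_LAG, reducing predicted idle under high lag.
  static double idleDecayMultiplier(long lagPerPartition)
  {
    if (lagPerPartition <= LAG_ACTIVATION_THRESHOLD) {
      return 1.0;
    }
    double ramp = (double) (lagPerPartition - LAG_ACTIVATION_THRESHOLD)
                  / (HIGH_LAG - LAG_ACTIVATION_THRESHOLD);
    return 1.0 + Math.min(1.0, ramp);
  }
}

With taskCountMax = 30, the headroom factor in this sketch is 0.9 at 3 running tasks and 0 at 30, which is what keeps behavior conservative near the maximum.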

Expected behavior (for example, 30 partitions, taskCountMax=30)

  • 3 tasks: 50K lag/partition → 8 tasks; 300K → 15 tasks; 500K+ → 30 tasks.
  • 10 tasks: 100K → 15; 300K → 30; 500K → 30.
  • 20/25 tasks (PPT=1): scale-up capped only by taskCountMax.
    Note: examples assume lag-dominant conditions (low/mid processing rate, low idle).

Open questions / risks

  • Lag skew: using aggregateLag / partitionCount can under-react if a few partitions are far behind. Consider max/percentile lag if available.
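
A small illustration of the skew concern (the per-partition lags below are hypothetical):

import java.util.Arrays;

// Shows how average lag can hide one badly lagging partition.
final class LagSkewExample
{
  public static void main(String[] args)
  {
    // One hot partition, the rest nearly caught up.
    long[] partitionLags = {900_000, 1_000, 1_000, 1_000, 1_000, 1_000};

    long aggregateLag = Arrays.stream(partitionLags).sum();
    double avgLag = (double) aggregateLag / partitionLags.length;  // ~151K: looks moderate
    long maxLag = Arrays.stream(partitionLags).max().orElse(0);    // 900K: clearly severe

    System.out.printf("avg=%.0f max=%d%n", avgLag, maxLag);
    // Thresholds keyed off avgLag react mildly here; keying off maxLag (or a high
    // percentile) would treat this as a high-lag scenario.
  }
}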

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added comments explaining the "why" and the intent of the code wherever it would not be obvious to an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • been tested in a test Druid cluster.

@Fly-Style Fly-Style force-pushed the cost-autoscaler-adjustments branch from 0fce534 to 2a79747 (January 21, 2026 14:18)

private static final int MAX_INCREASE_IN_PARTITIONS_PER_TASK = 2;
private static final int MAX_DECREASE_IN_PARTITIONS_PER_TASK = MAX_INCREASE_IN_PARTITIONS_PER_TASK * 2;
private static final int LAG_STEP = 100_000;

Contributor

There are a lot of magic-looking constants here. Please add Javadocs to these constants describing what they do, and consider whether any of them should be configurable to allow for experimentation without changing the code. (Of course, ideally, users do not need to configure anything.)

@Fly-Style Fly-Style commented Jan 23, 2026

Will add Javadocs; I am following common sense here for the specific numbers.

Of course, ideally, users do not need to configure anything.

That's what I'm looking for.

Contributor Author

Fixed: e97ce36

@capistrant capistrant left a comment


My biggest concerns echo Gian's thoughts on the constants: can they be universally (or near-universally) applicable, or should they be generally sane for most but configurable for the select few who need it? I also think opening this up to too much configurability could be self-defeating, in that it may be difficult for the average operator to tune such a tool.

In lieu of opening up these inputs for configuration, documenting the why behind them, as Gian suggested, would be a good place to start. I also wonder if a more in-depth technical writeup is in order once this feature is stabilized: something that explains the method to the madness and a bit of the math, perhaps in a brief blog post or docs page on the Apache Druid website. This would help head off my main longer-term concern, which is the cost function becoming mysterious and unwieldy as folks from the community come and go over time.


/**
* Computes extra allowed increase in partitions-per-task in scenarios when the average
* per-partition lag is relatively high.

Contributor

Relatively high compared to what? I think I'm failing to grasp how the constants for lag thresholds can be so generally applicable that we don't need to open them up to configuration. I think this idea of allowing for a burst in the PPT scale factor is reasonable, but is this specific impl coming from tuning for too specific a case? I think this goes along with Gian's question on the constants above.

@Fly-Style Fly-Style commented Jan 23, 2026

Eventually, we may make those constants configurable, but for now they are backed only by my observations as 'good' defaults.

I thought about it some more, and I am making some constants configurable (like the lag threshold) even in the current PR.

Contributor

Ideally nothing needs to be configured although I also feel that it may be useful for now to allow the lag threshold to be configured. The reason is that it's expressed in number of messages, but in practice, processing time of a message can vary quite a lot. Messages can be 100 bytes and processed very quickly, or they can be 10KB and processed more slowly. The difference between these two extremes is 100x so one number does not fit all. A lag of 100,000 messages could represent less than a second of processing, or could represent a minute of processing.

Ultimately I think we want the decision-making to be based on the system's assessment of how many messages it can process per unit of time. But that doesn't need to be done now. Probably for now we can set the initial thresholds assuming some medium rate of processing, like assuming we can process 5,000 or 10,000 messages per second per task.
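
As a rough back-of-the-envelope illustration of that spread (the throughput numbers below are assumptions, chosen only to show the ~100x difference):

// Back-of-the-envelope only: per-task throughput numbers are assumptions.
final class LagToTimeExample
{
  public static void main(String[] args)
  {
    long lagMessages = 100_000;

    double fastRateMsgPerSec = 150_000;  // tiny ~100-byte records, processed very quickly
    double slowRateMsgPerSec = 1_500;    // ~10KB records, processed far more slowly

    System.out.printf("fast: %.1f s, slow: %.1f s%n",
                      lagMessages / fastRateMsgPerSec,
                      lagMessages / slowRateMsgPerSec);
    // Prints roughly "fast: 0.7 s, slow: 66.7 s": under a second versus about a minute
    // of catch-up time, which is why a single message-count threshold cannot fit every workload.
  }
}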

@Fly-Style Fly-Style commented Jan 23, 2026

Ultimately I think we want the decision-making to be based on the system's assessment of how many messages it can process per unit of time

That's an ideal scenario, and actually what I am looking for too in the long run.

For now I want to have some good defaults to start from, and your assumptions are close to what is in the codebase already: assuming that we're processing ~10,000 messages from ~10 partitions (typical for a mid-load Kafka topic).
A lag of 100,000 messages per partition is already inappropriate and may cause serious inconvenience to Druid users.

{
private static final Logger log = new Logger(WeightedCostFunction.class);

private static final double HIHG_LAG_SCALE_FACTOR = 100_000.0;

Contributor

Is this supposed to be HIGH, or am I unfamiliar with what HIHG may stand for in math terms? I tried to google it in case it is an abbreviation I don't know, but didn't find anything, so I tend to think it's maybe just misspelled?

Contributor Author

It is HIGH, just a typo.

Contributor Author

Fixed: e97ce36

/**
* Estimates the idle ratio for a given task count using a capacity-based linear model.
* Estimates the idle ratio for a proposed task count.
* Includes lag-based adjustment to eliminate high lag and

Contributor

I think here again I am wondering how generally applicable a constant for per-partition lag is in the real world. Same/similar questions as for the PPT scale limits computed in CostBasedAutoScaler.

Also, in addition to the above, I think adding in this lag consideration does add some complexity here. Mainly, it starts us down the path of making the cost function harder for a newcomer to quickly understand, IMO. If this added complexity is considered a negative or "cost", the positive of improved behavior should outweigh it. So I guess that begs the question: how did we, or are we going to, measure the improvement that this additional logic/computation provides?

Contributor Author

Also, in addition to the above, I think adding in this lag consideration does add some complexity here. Mainly, it starts us down the path of making the cost function harder for a newcomer to quickly understand, IMO.

Sometimes you want to have complex things in the project, because they make some things work slightly better. A good example is the query planner / query optimizer that we get from the Calcite side. It's not easy to get into and hard to master, but with its complexity it brings a good framework for using SQL with your database.
Same here: in order to make supervisor autoscaling work well, we need to introduce a level of complexity backed by math (the formulas are described here: #18819). During testing, I realized it is too conservative in high-lag scenarios, and this is an attempt to tweak it a bit.

To tie in your question from the general comment:

I also wonder if a more in-depth technical writeup is in order once this feature is stabilized: something that explains the method to the madness and a bit of the math, perhaps in a brief blog post or docs page on the Apache Druid website.

We must do it, but the feature is not fully stabilized yet; it already has a decent base, though.

So I guess that begs the question: how did we, or are we going to, measure the improvement that this additional logic/computation provides?

That's a very good question, and I would answer it this way: the less time we spend manually scaling supervisors or fine-tuning the autoscaler, the better the result.

@Fly-Style Fly-Style (Contributor Author)

@capistrant thanks for the review.
About the documentation: check #18819, where the algorithm is described; this patch introduces tweaks to an autoscaler that was too conservative in high-lag scenarios.

@capistrant (Contributor)

@capistrant thanks for the review. About the documentation: check #18819, where the algorithm is described; this patch introduces tweaks to an autoscaler that was too conservative in high-lag scenarios.

Thanks, yeah, I was looking through that PR description to get a grasp on things. Plus your Javadocs are informative, which is good. I guess let's table the docs/blog talk until the future; I got ahead of myself there. Math stresses me out too easily 😆

