Skip to content

feat: MSQ supervisors using mostFragmentedFirst policy and minor compaction can scale taskCount down for minor compactions#19412

Open
capistrant wants to merge 5 commits intoapache:masterfrom
capistrant:minor-compact-task-scaling
Open

feat: MSQ supervisors using mostFragmentedFirst policy and minor compaction can scale taskCount down for minor compactions#19412
capistrant wants to merge 5 commits intoapache:masterfrom
capistrant:minor-compact-task-scaling

Conversation

@capistrant
Copy link
Copy Markdown
Contributor

@capistrant capistrant commented May 5, 2026

Description

Minor compactions my not require the same parallelism that a full compaction does for a datasource. This change allows the compaction supervisor to add a taskContext parameter minorCompactionTaskPercent that scales maxTaskCount downwards for minor compaction tasks. For example

...
  "taskContext": {
    "maxTaskCount": 100,
    "minorCompactionTaskPercent": 20
  },
...

full compactions will use 100 tasks and minor compactions will use 20.

by default this behavior is disabled and each supervisor who wants to use it must be using the MSQ compaction engine + opt in by modifying minorCompactionTaskPercent explicitly.

Release note

If you are using MSQ task engine with compaction supervisors for automatic compaction plus MostFragmentedFirst policy with minor compactions enabled, you can now set a per supervisor configuration to use fewer MSQ workers for minor compactions compared to full compactions by setting "minorCompactionTaskPercent" to an int between 1 and 99. minor compactions will use FLOOR(2, maxTaskCount * minorCompactionTaskPercent) for minor compaction task parallelism.


Key changed/added classes in this PR
  • CompactSegments.java

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

Comment on lines +174 to +184
final DataSegment segment = new DataSegment(
DATA_SOURCE,
Intervals.of("2024-01-01/2024-01-02"),
"v1",
null,
ImmutableList.of(),
ImmutableList.of(),
new NumberedShardSpec(0, 1),
0,
100L
);
Copy link
Copy Markdown
Member

@FrankChen021 FrankChen021 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Severity Findings
P0 0
P1 0
P2 1
P3 0
Total 1

This is an automated review by Codex GPT-5

return;
}

final int percent = QueryContext.of(context).getInt(
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[P2] Validate minorCompactionTaskPercent when accepting the compaction config

The new context key is only parsed and range-checked while creating a minor MSQ compaction task. Supervisor config validation still accepts values like 0, 200, or a non-numeric string, so the API can persist an invalid supervisor and it will later fail job creation repeatedly once a minor compaction candidate appears. Please add validation alongside the existing MSQ maxNumTasks validation paths, including CascadingReindexingTemplate if applicable, so bad configs are rejected before scheduling.

Copy link
Copy Markdown
Contributor

@cecemei cecemei left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i felt having a percent variable is not really necessary, especially we're just trying to multiple two variables to get a task count for minor compaction, this mental math is not really necessary.

what do you think if we just add another minNumTasks?

@capistrant
Copy link
Copy Markdown
Contributor Author

i felt having a percent variable is not really necessary, especially we're just trying to multiple two variables to get a task count for minor compaction, this mental math is not really necessary.

what do you think if we just add another minNumTasks?

how would this new variable be used?

I choose a percent so someone can just make an estimate and then not maintain it as they change their max task counts for full compaction, they will just keep the same ratio for minor:full when it comes to task counts

Copy link
Copy Markdown
Member

@FrankChen021 FrankChen021 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have reviewed the code for correctness, edge cases, concurrency, and integration risks; no issues found.


This is an automated review by Codex GPT-5

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants