Reindexing rule providers with cascading interval based reindexing #18939
Conversation
Improvements and bugfixes:
- Fix compaction status after rebasing
- Fix missing import after rebase
- Fix checkstyle issues
- Fill out javadocs
- Address Claude code review comments
- Add isReady concept to compaction rule provider and gate task creation on provider being ready
- Fix an issue in AbstractRuleProvider when it comes to variable-length periods like month and year
- Implement a composing rule provider for chaining multiple rule providers
Using 1 row and creating 0-row segments makes the test fail for the native compaction runner. I cannot reproduce it in Docker to figure out how the test is misconfigured.
… issue with range dim and all rows filtered out
```java
return new Builder()
    .forDataSource(this.dataSource)
    .withTaskPriority(this.taskPriority)
    .withInputSegmentSizeBytes(this.inputSegmentSizeBytes)
    .withMaxRowsPerSegment(this.maxRowsPerSegment)
```
Check notice — Code scanning / CodeQL: Deprecated method or constructor invocation: `Builder.withMaxRowsPerSegment` (server/src/test/java/org/apache/druid/server/compaction/CompactionStatusTest.java). Fixed.
… is going to be a bad time
So, in the future, there's no compaction term? However, I have a different view. IMO, compaction and re-indexing should be separated from each other; they should serve completely different purposes. Compaction should only perform the merge of small segments without any schema changes (query granularity, segment granularity). Compaction should be performed eagerly and aggressively, especially for Kafka ingestion, to reduce the number of segments. There are many problems/limitations around this feature that have not been solved. For example, compaction currently operates on a whole interval; if there are many segments, it takes a very long time (and is sometimes not realistic to complete) to finish the job. This kind of compaction was originally named "major compaction", while a "minor compaction" exists to allow us to compact given segments, but it's buggy now: even if we give it 2 segments, for example, the task will still fetch all segments in that interval. Another problem is that minor compaction only accepts consecutive segments. These problems are stated in: #9712, #9768, #9571
I appreciate the thoughts @FrankChen021. In general, this push away from using the term compaction for everything that re-processes existing Druid segments is long needed. But I do agree that pure compaction like you spec out with minor compaction does still warrant being called "compaction". Whether that be as a subset of the "reindexing" space or its own separate concept entirely, I don't know. Overall, we already have lots of robust, production-ready code for "compaction" that I could not justify re-building for "reindexing" specifically. That is the genesis of trying to generalize the name, as I do think it makes more sense to call pure compaction "reindexing" than it does to call activity that changes the underlying data definition "compaction". I do want to work towards a naming scheme and code base that is logical and reasonable, though, so I am open to considering how we can best navigate to a world where only stuff that is legitimately compaction is called compaction.
Follow-up to #18844 ... at least in terms of the quest to begin the transition from the term compaction to reindexing. More info can be found in that PR's description about the naming change and the new centralized indexing state storage that the supervisor uses to determine whether segments need to be reindexed (a replacement for lastCompactionState stored per segment). In this PR I will use the term reindexing whenever possible. When the term compaction is used, it will only be to refer to an actual Java class that is yet to be refactored.
Description
Extend reindexing supervisors (AKA compaction supervisors) to allow a single Druid datasource to apply different reindexing configurations to different segments, depending on which "reindexing rules" defined for the datasource apply to the time interval of the segment being reindexed.
Value Proposition
Timeseries data often provides value in different ways over time. Data for the last 7 days is often interacted with differently than data for the last 30 days, and differently again for data from some number of years ago. In Druid, we should give data owners the ability to use reindexing supervisors (AKA compaction supervisors) to change the state of a single datasource as the data ages. Operators should have the ability to define a data lifecycle of sorts for their datasources and allow reindexing supervisors to dynamically apply that definition to the underlying segments as they age. A great, and simple, example is query granularity; a sketch follows below.
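As a minimal sketch of that query-granularity example, here is an excerpt of the inline rule syntax introduced later in this description (the rule `id`s are illustrative, not from this PR): keep minute-level query granularity for recent data, then roll older data up to hourly.

```json
"reindexingGranularityRules": [
  {
    "id": "recent",
    "period": "P7D",
    "granularityConfig": { "segmentGranularity": "DAY", "queryGranularity": "MINUTE", "rollup": true }
  },
  {
    "id": "older",
    "period": "P1M",
    "granularityConfig": { "segmentGranularity": "MONTH", "queryGranularity": "HOUR", "rollup": true }
  }
]
```

Because granularity rules are non-additive (covered under Design below), only one of these applies to any given interval.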
Design
CompactionConfigBasedJobTemplate underpins all of this
The existing `CompactionConfigBasedJobTemplate` underpins all the new functionality. At the end of the day, the cascading reindexing template is creating these objects under the hood after dynamically coming up with the config. It uses the existing functionality in this template to take that config, find segments that need reindexing, and create jobs for them.

ReindexingConfigFinalizer
This is a concept introduced to allow us to optimize the final config used to create the underlying tasks without leaking rule implementation details into the existing job template. It is the mechanism we use to optimize the set of filter rules that are needed for an underlying `CompactionConfig` before jobs are created for the candidate.
Reindexing Rule
A reindexing rule defines what should be done to segments being reindexed. There are rule implementations for all components of the existing `CompactionState` construct.

rule period
All rules have an associated Joda time period (e.g. `P7D` --> 7 days). This is used to define when a rule should begin being applied. The core idea is that this is a time in the past relative to "now", with "now" being the point in time that the reindexing supervisor runs to create tasks to reindex the underlying data.
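As a small worked example (the rule `id` and dates are illustrative, not from this PR), consider a rule fragment with a 7-day period:

```json
{
  "id": "older-than-a-week",
  "period": "P7D"
}
```

If the supervisor runs at 2025-06-30T00:00Z, this rule begins applying to data for intervals before 2025-06-23T00:00Z, assuming periods are anchored at the supervisor's run time as described above.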
additive vs non additive rules

Some rule types are additive, in that it logically makes sense to apply N rules of a single type to a single segment being reindexed. Then there are others where that is either illogical or physically impossible.

For example, consider a granularity rule that sets segment granularity. Druid cannot have a segment with multiple segment granularities, so such a rule must be non-additive; in reindex tasks where multiple granularity rules technically apply, only one can be selected.
To the contrary, a filter rule is logically additive. It makes sense that an operator may want to filter out rows matching `dim=foo` for data older than 30 days and `dim=bar` for data older than 90 days. For data older than 90 days, we don't want to just filter out `dim=bar`; rather, we want to filter out `dim=foo OR dim=bar`. Thus filter rules are additive.

Filter Rules
Some explicit structure is being added around the filtering that exists in the current compaction config transform spec. The driving force behind this design decision is that we do not want to apply filter rules to a `CompactionCandidate` that have already been applied to all segments in the candidate. To achieve this we need a deterministic pattern. That pattern is that each filter rule is part of a `NOT(X OR Y OR Z...)` transform-spec filter in the underlying reindexing task, where X, Y, and Z are individual rules. Doing this allows us to easily identify and specify only unapplied rules for the tasks we are creating.
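As a sketch of that pattern (dimension names taken from the additive example above; the exact generated shape is illustrative, not confirmed by this PR), the transform-spec filter produced for data older than 90 days would combine both rules:

```json
{
  "type": "not",
  "field": {
    "type": "or",
    "fields": [
      { "type": "selector", "dimension": "dim", "value": "foo" },
      { "type": "selector", "dimension": "dim", "value": "bar" }
    ]
  }
}
```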
A conceptual example using inline rule syntax:

The following rule would remove all rows that have the dimension `isRobot` matching `true`:

```json
{
  "id": "no-robots",
  "period": "P6M",
  "filter": {
    "type": "selector",
    "dimension": "isRobot",
    "value": "true"
  }
}
```

Non-additive rule selection
For non-additive rule types, the rule provider implementation defines how rules are selected if an interval matches more than one. For the inline provider, we select the rule that is "older": given `P7D` vs `P1M`, we select `P1M`, and so on.

Reindexing Rule Provider
Rule providers are what supply defined rules to the reindexing supervisor at runtime. The reindexing supervisor collects the applicable rules for `CompactionCandidate`s and creates `DataSourceCompactionConfig` configurations that are fed into the existing `CompactionConfigBasedJobTemplate` to generate underlying Druid tasks to reindex data.

This PR adds the `ReindexingRuleProvider` interface as well as a basic inline rule provider that can be used to define period-based reindexing rules in the reindexing supervisor spec itself. Also provided in this PR is a composing rule provider that can be used to chain rule providers. This rule provider concept is meant to be easily extensible. It is possible (likely?) that core Druid will supply more robust rule providers in the future. It is also reasonable to assume that community extensions can and will be created to add rule providers that extend the capabilities of core Druid reindexing.
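As a purely illustrative sketch of chaining (the `composing` type name and `providers` field are assumptions, not taken from this PR), a composing provider wrapping the inline provider might look like:

```json
"ruleProvider": {
  "type": "composing",
  "providers": [
    { "type": "inline", "reindexingFilterRules": [ ... ] },
    { "type": "someOtherProvider" }
  ]
}
```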
Inline Rule Provider

Below is a stripped-down supervisor spec that demonstrates the spirit of the inline rule provider. It is built against the standard wikipedia schema used across Druid for tutorials and testing.
Highlights:
- Filters out rows where `isRobot=true` for data older than six months (`P6M`)
- Applies `DAY`/`MINUTE` granularity after 7 days and `MONTH`/`HOUR` granularity after 1 month
- Applies a tuning rule with range partitioning on `countryName` after 7 days

```json
{
  "type": "autocompact",
  "spec": {
    "type": "reindexCascade",
    "dataSource": "wikipedia",
    "ruleProvider": {
      "type": "inline",
      "reindexingFilterRules": [
        {
          "id": "no-robots",
          "period": "P6M",
          "filter": {
            "type": "selector",
            "dimension": "isRobot",
            "value": "true"
          }
        }
      ],
      "reindexingMetricsRules": [ ... ],
      "reindexingDimensionsRules": [ ... ],
      "reindexingIOConfigRules": [],
      "reindexingProjectionRules": [ ... ],
      "reindexingGranularityRules": [
        {
          "id": "day",
          "description": null,
          "period": "P7D",
          "granularityConfig": {
            "segmentGranularity": "DAY",
            "queryGranularity": "MINUTE",
            "rollup": true
          }
        },
        {
          "id": "month",
          "description": null,
          "period": "P1M",
          "granularityConfig": {
            "segmentGranularity": "MONTH",
            "queryGranularity": "HOUR",
            "rollup": true
          }
        }
      ],
      "reindexingTuningConfigRules": [
        {
          "id": "tuning",
          "description": "testing tuning rule",
          "period": "P7D",
          "tuningConfig": {
            ...
            "partitionsSpec": {
              "type": "range",
              "targetRowsPerSegment": null,
              "maxRowsPerSegment": 10000000,
              "partitionDimensions": [ "countryName" ],
              "assumeGrouped": false
            },
            ...
          }
        }
      ]
    }
  },
  "suspended": false
}
```

Miscellaneous Notes
Supervisor Only Support
As we did in #18844, we only support this new functionality for reindexing supervisors (aka compaction supervisors) that run on the overlord. This is a conscious choice because we are moving Druid away from the legacy compaction duty for automatic compaction, in favor of these supervisors.
Follow Ups
Release note
Key changed/added classes in this PR
- `CascadingReindexingTemplate`
- `ReindexingConfigFinalizer`
- `ReindexingRuleProvider`
- `InlineReindexingRuleProvider`
- `ComposingReindexingRuleProvider`
- `ReindexingRule` + `AbstractReindexingRule`
- `ReindexingDimensionsRule`
- `ReindexingMetricsRule`
- `ReindexingGranularityRule`
- `ReindexingIOConfigRule`
- `ReindexingProjectionRule`
- `ReindexingFilterRule`

This PR has: