@capistrant commented Jan 21, 2026

Follow-up to #18844, at least in terms of the quest to begin the transition from the term compaction to reindexing. More information about the naming change, and about the new centralized indexing state storage that the supervisor uses to determine whether segments need to be reindexed (a replacement for the lastCompactionState stored per segment), can be found in that PR's description. In this PR I use the term reindexing wherever possible; the term compaction appears only when referring to an actual Java class that has yet to be refactored.

Description

Extend reindexing supervisors (AKA compaction supervisors) so that a single Druid datasource can apply different reindexing configurations to different segments, depending on which "reindexing rules" defined for the datasource apply to the time interval of the segment being reindexed.

Value Proposition

Timeseries data often provides value in different ways over time. Data from the last 7 days is often interacted with differently than data from the last 30 days, and differently again than data from some number of years ago. In Druid, we should give data owners the ability to use reindexing supervisors (AKA compaction supervisors) to change the state of a single datasource as the data ages. Operators should be able to define a data lifecycle of sorts for their datasources and let reindexing supervisors dynamically apply that definition to the underlying segments as they age. A great, and simple, example is query granularity.

Realtime use cases often call for fine-grained query granularity as data flows into Druid: I may want to analyze metrics by minute or even by second. As that data ages to days or months old, I may still want to check trends over time, but it is highly unlikely that I care about second-to-second analysis for data past a certain age. Beyond that, there may be tiers of value: for instance, second granularity for the last 24 hours, minute granularity for everything in the past week, and hour granularity for everything older than that. The new functionality in this PR intends to make achieving this as simple as defining a few rules for a datasource using reindexing supervisors, and letting Druid reindex segments as they age out of one period and into another that calls for a coarser query granularity.

Design

CompactionConfigBasedJobTemplate underpins all of this

The existing CompactionConfigBasedJobTemplate underpins all the new functionality. At the end of the day, the cascading reindexing template dynamically computes a config and then creates these objects under the hood. It relies on the template's existing functionality to take that config, find segments that need reindexing, and create jobs for them.

ReindexingConfigFinalizer

This concept is introduced so that we can optimize the final config used to create the underlying tasks without leaking rule implementation details into the existing job template. It is the mechanism we use to trim the set of filter rules needed in an underlying CompactionConfig before jobs are created for the candidate.

Reindexing Rule

A reindexing rule defines what should be done to segments being reindexed. There are rule implementations for all components of the existing CompactionState construct.

rule period

All rules have an associated Joda-time period (e.g. P7D, meaning 7 days). This defines when a rule should begin being applied. The core idea is that the period marks a point in the past relative to "now", where "now" is the point in time at which the reindexing supervisor runs to create tasks that reindex the underlying data.
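The period-to-cutoff logic above can be sketched as follows. This is a conceptual illustration only, not the actual Druid implementation (which is Java and uses real Joda periods); `rule_applies` and the day-based period are hypothetical simplifications.

```python
from datetime import datetime, timedelta, timezone

def rule_cutoff(now, period_days):
    # A rule's period marks a point in the past relative to "now"
    # (the supervisor run time); the rule applies to data older than this.
    return now - timedelta(days=period_days)

def rule_applies(segment_end, now, period_days):
    # The rule applies when the segment's interval ends at or before the cutoff.
    return segment_end <= rule_cutoff(now, period_days)

now = datetime(2026, 1, 21, tzinfo=timezone.utc)
print(rule_applies(datetime(2026, 1, 1, tzinfo=timezone.utc), now, 7))   # True: older than P7D
print(rule_applies(datetime(2026, 1, 20, tzinfo=timezone.utc), now, 7))  # False: within the last 7 days
```

Note that since "now" moves forward with each supervisor run, segments naturally age into rules over time without any config change.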

additive vs non additive rules

Some rule types are additive, in that it logically makes sense to apply N rules of a single type to a single segment being reindexed. For other types, that is either illogical or physically impossible.

Consider, for example, a granularity rule that sets segment granularity. Druid cannot have a segment with multiple segment granularities, so such a rule must be non-additive: when multiple granularity rules technically apply to a reindexing task, only one can be selected.

  • the same can be said about tuning config, metrics, dimensions, and ioconfig rules

To the contrary, a filter rule is logically additive. It makes sense that an operator may want to filter out rows matching dim=foo for data older than 30 days and dim=bar for data older than 90 days. For data older than 90 days, we don't want to filter out only dim=bar; we want to filter out dim=foo OR dim=bar. Thus, filter rules are additive.

  • the same can be said about projection rules
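The additive composition described above can be sketched as follows. This is a minimal conceptual sketch, not Druid code; the rule dicts and `combined_filter` helper are hypothetical stand-ins for the Java rule classes and Druid's native filter JSON.

```python
# Hypothetical rule shapes; the real rules are Java classes in the Druid codebase.
filter_rules = [
    {"id": "no-foo", "period": "P30D",
     "filter": {"type": "selector", "dimension": "dim", "value": "foo"}},
    {"id": "no-bar", "period": "P90D",
     "filter": {"type": "selector", "dimension": "dim", "value": "bar"}},
]

def combined_filter(applicable_rules):
    # Additive filter rules compose: rows matching ANY applicable rule's
    # filter are dropped, i.e. the rows kept satisfy NOT(X OR Y OR ...).
    return {
        "type": "not",
        "field": {"type": "or",
                  "fields": [r["filter"] for r in applicable_rules]},
    }
```

For data older than 90 days both rules apply, so the composed filter drops rows where dim=foo OR dim=bar, matching the example in the text.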
Filter Rules

Some explicit structure is being added around the filtering that exists in the current compaction config transform spec. The driving force behind this design decision is that we do not want to apply filter rules to a CompactionCandidate when those rules have already been applied to all segments in the candidate. To achieve this we need a deterministic pattern: each filter rule is one part of a NOT(X OR Y OR Z...) transformSpec filter in the underlying reindexing task, where X, Y, and Z are individual rules. Doing this allows us to easily identify and specify only the unapplied rules for the tasks we are creating.

A conceptual example using inline rule syntax:

The following rule would remove all rows that have the dimension isRobot matching true

{
  "id": "no-robots",
  "period": "P6M",
  "filter": {
    "type": "selector",
    "dimension": "isRobot",
    "value": "true"
  }
}
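Because each rule is exactly one disjunct inside NOT(X OR Y OR ...), diffing the applicable rules against what has already been applied to a segment becomes a set operation. A conceptual sketch, assuming filters are compared structurally (the `canonical` helper is a hypothetical stand-in for however the real implementation compares filter specs):

```python
import json

def canonical(f):
    # Stable string form so structurally equal filter specs compare equal.
    return json.dumps(f, sort_keys=True)

def unapplied_filters(applicable, already_applied):
    # Keep only the applicable rule filters not yet present in the
    # segment's stored indexing state.
    applied = {canonical(f) for f in already_applied}
    return [f for f in applicable if canonical(f) not in applied]
```

If a candidate's segments already carry the "no-robots" filter, only the newly applicable filters end up in the next task's NOT(OR) spec, avoiding redundant reindexing work.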
Non-additive rule selection

For non-additive rule types, the rule provider implementation defines how rules are selected when an interval matches more than one. For the inline provider, we select the "older" rule: given P7D vs P1M, we select P1M, and so on.
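The oldest-period selection can be sketched as below. This is a hedged illustration: the day counts for months and years are rough approximations (the real implementation handles variable-length Joda periods properly, per a commit note later in this PR), and the rule dicts are hypothetical.

```python
import re

APPROX_DAYS = {"D": 1, "W": 7, "M": 30, "Y": 365}  # rough; months/years vary in length

def approx_days(period):
    # e.g. "P7D" -> 7, "P1M" -> ~30. Simple single-unit periods only.
    m = re.fullmatch(r"P(\d+)([DWMY])", period)
    return int(m.group(1)) * APPROX_DAYS[m.group(2)]

def select_non_additive(rules):
    # When several non-additive rules match an interval, the rule with the
    # longer (older) period wins: given P7D vs P1M, P1M is selected.
    return max(rules, key=lambda r: approx_days(r["period"]))
```

Intuitively, the oldest matching rule wins because a segment old enough to match P1M has already aged through the P7D tier.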

Reindexing Rule Provider

Rule providers supply defined rules to the reindexing supervisor at runtime. The supervisor gathers the applicable rules for each CompactionCandidate and creates DataSourceCompactionConfig configurations that are fed into the existing CompactionConfigBasedJobTemplate to generate the underlying Druid tasks that reindex data.

This PR adds the ReindexingRuleProvider interface as well as a basic inline rule provider that can be used to define period-based reindexing rules in the reindexing supervisor spec itself. Also provided in this PR is a composing rule provider that can be used to chain rule providers. The rule provider concept is meant to be easily extensible: it is possible (likely?) that core Druid will supply more robust rule providers in the future, and it is reasonable to assume that community extensions can and will add rule providers that extend the capabilities of core Druid reindexing.
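The provider/composition shape can be sketched as a Python analogue. This is not the actual Java interface; the method names and signatures here are hypothetical simplifications of ReindexingRuleProvider, InlineReindexingRuleProvider, and ComposingReindexingRuleProvider, including the isReady gate mentioned in the commit log.

```python
from abc import ABC, abstractmethod

class RuleProvider(ABC):
    """Hypothetical analogue of the ReindexingRuleProvider interface."""

    @abstractmethod
    def rules_for(self, datasource):
        ...

    def is_ready(self):
        # Mirrors the isReady concept: task creation is gated on readiness.
        return True

class InlineRuleProvider(RuleProvider):
    # Rules are defined directly in the supervisor spec.
    def __init__(self, rules):
        self._rules = rules

    def rules_for(self, datasource):
        return list(self._rules)

class ComposingRuleProvider(RuleProvider):
    # Chains providers, concatenating the rules each one supplies.
    def __init__(self, providers):
        self._providers = providers

    def is_ready(self):
        return all(p.is_ready() for p in self._providers)

    def rules_for(self, datasource):
        rules = []
        for p in self._providers:
            rules.extend(p.rules_for(datasource))
        return rules
```

An extension could implement the same interface to source rules from, say, an external catalog, and be chained behind the inline provider via the composing provider.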

Inline Rule Provider

Below is a stripped down supervisor spec that demonstrates the spirit of the inline rule provider. It is built against the standard wikipedia schema used across Druid for tutorials and testing.

Highlights:

  • data older than 6M, remove rows where isRobot=true
  • data older than 7 days, reindex to DAY segment granularity and MINUTE query granularity
  • data older than 1 month, reindex to MONTH segment granularity and HOUR query granularity
{
  "type": "autocompact",
  "spec": {
    "type": "reindexCascade",
    "dataSource": "wikipedia",
    "ruleProvider": {
      "type": "inline",
      "reindexingFilterRules": [
        {
          "id": "no-robots",
          "period": "P6M",
          "filter": {
            "type": "selector",
            "dimension": "isRobot",
            "value": "true"
          }
        }
      ],
      "reindexingMetricsRules": [ ... ],
      "reindexingDimensionsRules": [ ... ],
      "reindexingIOConfigRules": [],
      "reindexingProjectionRules": [ ... ],
      "reindexingGranularityRules": [
        {
          "id": "day",
          "description": null,
          "period": "P7D",
          "granularityConfig": {
            "segmentGranularity": "DAY",
            "queryGranularity": "MINUTE",
            "rollup": true
          }
        },
        {
          "id": "month",
          "description": null,
          "period": "P1M",
          "granularityConfig": {
            "segmentGranularity": "MONTH",
            "queryGranularity": "HOUR",
            "rollup": true
          }
        }
      ],
      "reindexingTuningConfigRules": [
        {
          "id": "tuning",
          "description": "testing tuning rule",
          "period": "P7D",
          "tuningConfig": {
            ...
            "partitionsSpec": {
              "type": "range",
              "targetRowsPerSegment": null,
              "maxRowsPerSegment": 10000000,
              "partitionDimensions": [
                "countryName"
              ],
              "assumeGrouped": false
            },
            ...
          }
        }
      ]
    }
  },
  "suspended": false
}

Miscellaneous Notes

Supervisor Only Support

As we did in #18844, we only support this new functionality for reindexing supervisors (AKA compaction supervisors) that run on the Overlord. This is a conscious choice: we are moving Druid away from the legacy compaction duty for automatic compaction, in favor of these supervisors.

Follow Ups

  • Good documentation as an experimental feature in the automatic-compaction docs. Not sure whether that belongs in this PR or a separate one, as the volume of content may be large.
  • future data handling?
    • Right now we don't support applying rules to future data OR having rule periods define any component with a negative value

Release note


Key changed/added classes in this PR
  • CascadingReindexingTemplate
  • ReindexingConfigFinalizer
  • ReindexingRuleProvider
    • InlineReindexingRuleProvider
    • ComposingReindexingRuleProvider
  • ReindexingRule + AbstractReindexingRule
    • ReindexingDimensionsRule
    • ReindexingMetricsRule
    • ReindexingGranularityRule
    • ReindexingIOConfigRule
    • ReindexingProjectionRule
    • ReindexingFilterRule

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

Improvements and bugfixes

Fix compaction status after rebasing

Fix missing import after rebase

fix checkstyle issues

fill out javadocs

address claude code review comments

Add isReady concept to compaction rule provider and gate task creation on provider being ready

Fix an issue in AbstractRuleProvider when it comes to variable length periods like month and year

Implement a composing rule provider for chaining multiple rule providers
Using 1 row and creating 0-row segments makes the test fail for the native compaction runner. I cannot reproduce in Docker to figure out how the test is misconfigured
… issue with range dim and all rows filtered out
Comment on lines +285 to +289

return new Builder()
    .forDataSource(this.dataSource)
    .withTaskPriority(this.taskPriority)
    .withInputSegmentSizeBytes(this.inputSegmentSizeBytes)
    .withMaxRowsPerSegment(this.maxRowsPerSegment)

Code scanning / CodeQL notice: Invoking Builder.withMaxRowsPerSegment should be avoided because it has been deprecated.
@capistrant capistrant changed the title [WIP] Reindexing rule providers with cascading reindexing [WIP] Reindexing rule providers with cascading interval based reindexing Jan 22, 2026
@capistrant capistrant changed the title [WIP] Reindexing rule providers with cascading interval based reindexing Reindexing rule providers with cascading interval based reindexing Jan 23, 2026
@capistrant capistrant removed the WIP label Jan 23, 2026
@FrankChen021 (Member)

So, in the future, there's no compaction term?

However, I have a different view. IMO, compaction and re-indexing should be separated from each other; they should serve completely different purposes.

Compaction should only perform the merging of small segments, without any schema changes (query granularity, segment granularity). Compaction should run eagerly and aggressively, especially for Kafka ingestion, to reduce the number of segments. There are many unsolved problems/limitations around this feature. For example, compaction currently operates on a whole interval; if there are many segments, it takes a very long time (and sometimes it is not realistic) to finish the job. This kind of compaction was originally named 'major compaction', while 'minor compaction' exists to let us compact given segments, but it is buggy now: even if we give it 2 segments, for example, the task will still fetch all segments in that interval. Another problem is that minor compaction only accepts consecutive segments. These problems are stated in #9712, #9768, #9571.
However, these problems remain unsolved, and we have long experienced large numbers of small segments.
Applying the re-indexing term to this use case, I think the term itself does not reflect the feature but introduces confusion.

@capistrant (Contributor, Author)


I appreciate the thoughts @FrankChen021. In general, this push away from using the term compaction for everything that re-processes existing Druid segments is long overdue. But I do agree that pure compaction, as you describe with minor compaction, still warrants being called "compaction", whether as a subset of the "reindexing" space or as its own separate concept entirely; I don't know yet. Overall, we already have lots of robust, production-ready code for "compaction", and I could not justify re-building anything specifically for "reindexing". That is the genesis of generalizing the name: I think it makes more sense to call pure compaction "reindexing" than to call activity that changes the underlying data definition "compaction". I do want to work towards a naming scheme and codebase that is logical and reasonable, so I am open to considering how we can best navigate to a world where only work that is legitimately compaction is called compaction.
