Reindexing rule providers with cascading interval based reindexing #18939
Conversation
Improvements and bugfixes:
- Fix compaction status after rebasing
- Fix missing import after rebase
- Fix checkstyle issues
- Fill out javadocs
- Address Claude code review comments
- Add isReady concept to compaction rule provider and gate task creation on provider being ready
- Fix an issue in AbstractRuleProvider when it comes to variable-length periods like month and year
- Implement a composing rule provider for chaining multiple rule providers
Using 1 row and creating 0-row segments makes the test fail for the native compaction runner. I cannot reproduce it in Docker to figure out how the test is misconfigured.
… issue with range dim and all rows filtered out
```java
return new Builder()
    .forDataSource(this.dataSource)
    .withTaskPriority(this.taskPriority)
    .withInputSegmentSizeBytes(this.inputSegmentSizeBytes)
    .withMaxRowsPerSegment(this.maxRowsPerSegment)
```
Check notice — Code scanning / CodeQL: Deprecated method or constructor invocation: `Builder.withMaxRowsPerSegment` (server/src/test/java/org/apache/druid/server/compaction/CompactionStatusTest.java). Fixed.
… is going to be a bad time
So, in the future, there's no compaction term? However, I have a different view. IMO, compaction and re-indexing should be separated from each other; they should serve completely different purposes. Compaction should only perform the merge of small segments without any schema changes (query granularity, segment granularity). Compaction should be performed eagerly and aggressively, especially for Kafka ingestion, to reduce the number of segments. There are many problems/limitations around this feature that have not been solved. For example, compaction currently operates on a whole interval; if there are many segments, it takes a very long time (and is sometimes not realistic to complete) to finish the job. This kind of compaction was originally named "major compaction", while a "minor compaction" exists to allow us to compact given segments, but it's buggy now: even if we give it 2 segments, for example, the task will still fetch all segments in that interval. Another problem is that minor compaction only accepts consecutive segments. These problems are stated in: #9712, #9768, #9571
I appreciate the thoughts @FrankChen021. In general, this push away from using the term compaction for everything that re-processes existing Druid segments is long needed. But I do agree that pure compaction like you spec out with minor compaction does still warrant being called "compaction". Whether that be as a subset of the "reindexing" space or its own separate concept entirely, I don't know. Overall, we already have lots of robust, production-ready code for "compaction" that I could not justify re-building for "reindexing" specifically. That is the genesis of trying to generalize the name, as I do think it makes more sense to call pure compaction "reindexing" than it does to call activity that changes the underlying data definition "compaction". I do want to work towards a naming scheme and code base that is logical and reasonable, though, so I am open to considering how we can best navigate to a world where only stuff that is legitimately compaction is called compaction.
Follow-up to #18844 ... at least in terms of the quest to begin the transition from the term compaction to reindexing. More info can be found in that PR's description about the naming change and the new centralized indexing state storage that the supervisor uses to determine whether segments need to be reindexed (a replacement for lastCompactionState stored per segment). In this PR I will use the term reindexing whenever possible. When the term compaction is used, it will only be to refer to an actual Java class that is yet to be refactored.
Description
Extend reindexing supervisors (AKA compaction supervisors) to allow a single Druid datasource to apply different reindexing configurations to different segments, depending on which "reindexing rules" defined for the datasource apply to the time interval of the segment being reindexed.
Value Proposition
Timeseries data often provides value in different ways over time. Data for the last 7 days is often interacted with differently than data for the last 30 days, and differently again for data from some number of years ago. In Druid, we should give data owners the ability to use reindexing supervisors (AKA compaction supervisors) to change the state of a single datasource as the data ages. Operators should have the ability to define a data lifecycle of sorts for their datasources and allow reindexing supervisors to dynamically apply that definition to the underlying segments as they age. A great, and simple, example is query granularity; a sketch follows below.
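As a minimal sketch of that query-granularity example, here is an excerpt of the inline rule syntax introduced later in this description (the rule `id`s are illustrative, not from this PR): keep minute-level query granularity for recent data, then roll older data up to hourly.

```json
"reindexingGranularityRules": [
  {
    "id": "recent",
    "period": "P7D",
    "granularityConfig": { "segmentGranularity": "DAY", "queryGranularity": "MINUTE", "rollup": true }
  },
  {
    "id": "older",
    "period": "P1M",
    "granularityConfig": { "segmentGranularity": "MONTH", "queryGranularity": "HOUR", "rollup": true }
  }
]
```

Because granularity rules are non-additive (covered under Design below), only one of these applies to any given interval.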
Design
CompactionConfigBasedJobTemplate underpins all of this
The existing `CompactionConfigBasedJobTemplate` underpins all the new functionality. At the end of the day, the cascading reindexing template is creating these objects under the hood after dynamically coming up with the config. It uses the existing functionality in this template to take that config, find segments that need reindexing, and create jobs for them.

ReindexingConfigFinalizer
This is a concept introduced to allow us to optimize the final config used to create the underlying tasks without leaking rule implementation details into the existing job template. It is the mechanism we use to optimize the set of filter rules that are needed for an underlying `CompactionConfig` before jobs are created for the candidate.
Reindexing Rule
A reindexing rule defines what should be done to segments being reindexed. There are rule implementations for all components of the existing `CompactionState` construct.

rule period
All rules have an associated Joda time period (e.g. `P7D` --> 7 days). This is used to define when a rule should begin being applied. The core idea is that this is a time in the past relative to "now", with "now" being the point in time that the reindexing supervisor runs to create tasks to reindex the underlying data.
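As a small worked example (the rule `id` and dates are illustrative, not from this PR), consider a rule fragment with a 7-day period:

```json
{
  "id": "older-than-a-week",
  "period": "P7D"
}
```

If the supervisor runs at 2025-06-30T00:00Z, this rule begins applying to data for intervals before 2025-06-23T00:00Z, assuming periods are anchored at the supervisor's run time as described above.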
additive vs non additive rules

Some rule types are additive, in that it logically makes sense to apply N rules of a single type to a single segment being reindexed. Then there are others where that is either illogical or physically impossible.

For example, consider a granularity rule that sets segment granularity. Druid cannot have a segment with multiple segment granularities, so such a rule must be non-additive; in reindex tasks where multiple granularity rules technically apply, only one can be selected.
To the contrary, a filter rule is logically additive. It makes sense that an operator may want to filter out rows matching `dim=foo` for data older than 30 days and `dim=bar` for data older than 90 days. For data older than 90 days, we don't want to just filter out `dim=bar`; rather, we want to filter out `dim=foo OR dim=bar`. Thus filter rules are additive.

Filter Rules
Some explicit structure is being added around the filtering that exists in the current compaction config transform spec. The driving force behind this design decision is that we do not want to apply filter rules to a `CompactionCandidate` that have already been applied to all segments in the candidate. To achieve this we need a deterministic pattern. That pattern is that each filter rule is part of a `NOT(X OR Y OR Z...)` transform-spec filter in the underlying reindexing task, where X, Y, and Z are individual rules. Doing this allows us to easily identify and specify only unapplied rules for the tasks we are creating.
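As a sketch of that pattern (dimension names taken from the additive example above; the exact generated shape is illustrative, not confirmed by this PR), the transform-spec filter produced for data older than 90 days would combine both rules:

```json
{
  "type": "not",
  "field": {
    "type": "or",
    "fields": [
      { "type": "selector", "dimension": "dim", "value": "foo" },
      { "type": "selector", "dimension": "dim", "value": "bar" }
    ]
  }
}
```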
A conceptual example using inline rule syntax:

The following rule would remove all rows that have the dimension `isRobot` matching `true`:

```json
{
  "id": "no-robots",
  "period": "P6M",
  "filter": {
    "type": "selector",
    "dimension": "isRobot",
    "value": "true"
  }
}
```

Non-additive rule selection
For non-additive rule types, the rule provider implementation defines how rules are selected if an interval matches more than one. For the inline provider, we select the rule that is "older": given `P7D` vs `P1M`, we select `P1M`, and so on.

Reindexing Rule Provider
Rule providers are what supply defined rules to the reindexing supervisor at runtime. The reindexing supervisor collects the applicable rules for `CompactionCandidate`s and creates `DataSourceCompactionConfig` configurations that are fed into the existing `CompactionConfigBasedJobTemplate` to generate underlying Druid tasks to reindex data.

This PR adds the `ReindexingRuleProvider` interface as well as a basic inline rule provider that can be used to define period-based reindexing rules in the reindexing supervisor spec itself. Also provided in this PR is a composing rule provider that can be used to chain rule providers. This rule provider concept is meant to be easily extensible. It is possible (likely?) that core Druid will supply more robust rule providers in the future. It is also reasonable to assume that community extensions can and will be created to add rule providers that extend the capabilities of core Druid reindexing.
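As a purely illustrative sketch of chaining (the `composing` type name and `providers` field are assumptions, not taken from this PR), a composing provider wrapping the inline provider might look like:

```json
"ruleProvider": {
  "type": "composing",
  "providers": [
    { "type": "inline", "reindexingFilterRules": [ ... ] },
    { "type": "someOtherProvider" }
  ]
}
```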
Inline Rule Provider

Below is a stripped-down supervisor spec that demonstrates the spirit of the inline rule provider. It is built against the standard wikipedia schema used across Druid for tutorials and testing.
Highlights:
- Filters out rows where `isRobot=true` for data older than six months (`P6M`)
- Applies `DAY`/`MINUTE` granularity after 7 days and `MONTH`/`HOUR` granularity after 1 month
- Applies a tuning rule with range partitioning on `countryName` after 7 days

```json
{
  "type": "autocompact",
  "spec": {
    "type": "reindexCascade",
    "dataSource": "wikipedia",
    "ruleProvider": {
      "type": "inline",
      "reindexingFilterRules": [
        {
          "id": "no-robots",
          "period": "P6M",
          "filter": {
            "type": "selector",
            "dimension": "isRobot",
            "value": "true"
          }
        }
      ],
      "reindexingMetricsRules": [ ... ],
      "reindexingDimensionsRules": [ ... ],
      "reindexingIOConfigRules": [],
      "reindexingProjectionRules": [ ... ],
      "reindexingGranularityRules": [
        {
          "id": "day",
          "description": null,
          "period": "P7D",
          "granularityConfig": {
            "segmentGranularity": "DAY",
            "queryGranularity": "MINUTE",
            "rollup": true
          }
        },
        {
          "id": "month",
          "description": null,
          "period": "P1M",
          "granularityConfig": {
            "segmentGranularity": "MONTH",
            "queryGranularity": "HOUR",
            "rollup": true
          }
        }
      ],
      "reindexingTuningConfigRules": [
        {
          "id": "tuning",
          "description": "testing tuning rule",
          "period": "P7D",
          "tuningConfig": {
            ...
            "partitionsSpec": {
              "type": "range",
              "targetRowsPerSegment": null,
              "maxRowsPerSegment": 10000000,
              "partitionDimensions": [ "countryName" ],
              "assumeGrouped": false
            },
            ...
          }
        }
      ]
    }
  },
  "suspended": false
}
```

Miscellaneous Notes
Supervisor Only Support
As we did in #18844, we only support this new functionality for reindexing supervisors (aka compaction supervisors) that run on the overlord. This is a conscious choice because we are moving Druid away from the legacy compaction duty for automatic compaction, in favor of these supervisors.
Follow Ups
Release note
Key changed/added classes in this PR
- `CascadingReindexingTemplate`
- `ReindexingConfigFinalizer`
- `ReindexingRuleProvider`
- `InlineReindexingRuleProvider`
- `ComposingReindexingRuleProvider`
- `ReindexingRule` + `AbstractReindexingRule`
- `ReindexingDimensionsRule`
- `ReindexingMetricsRule`
- `ReindexingGranularityRule`
- `ReindexingIOConfigRule`
- `ReindexingProjectionRule`
- `ReindexingFilterRule`

This PR has: