HIVE-29424: CBO plans should use histogram statistics for range predicates with a CAST #6293
base: master
Conversation
if (range == null)
  return cast;
if (range.minValue == null || Double.isNaN(range.minValue.doubleValue()))
  return cast;
if (range.maxValue == null || Double.isNaN(range.maxValue.doubleValue()))
  return cast;
I'll fix the "'if' construct must use '{}'s" Checkstyle warnings when updating the commit.
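For reference, a sketch of the braced form that the Checkstyle rule expects (same checks as the quoted hunk, only braces added):

if (range == null) {
  return cast;
}
if (range.minValue == null || Double.isNaN(range.minValue.doubleValue())) {
  return cast;
}
if (range.maxValue == null || Double.isNaN(range.maxValue.doubleValue())) {
  return cast;
}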
float[] boundaries = new float[] { Float.NEGATIVE_INFINITY, Float.POSITIVE_INFINITY };
boolean[] inclusive = new boolean[] { true, true };
I'll fix the "'{' is followed by whitespace" Checkstyle warnings when updating the PR.
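For reference, a sketch of one way the whitespace warnings could be resolved (same initializers, the spaces inside the braces removed):

float[] boundaries = new float[] {Float.NEGATIVE_INFINITY, Float.POSITIVE_INFINITY};
boolean[] inclusive = new boolean[] {true, true};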
useFieldWithValues("f_numeric", VALUES2, KLL2);
float total = VALUES2.length;
{
Checkstyle warns about nested blocks; I'll refactor this when updating the PR.
zabetak left a comment:
Thanks for the PR @thomasrebele, the proposal is very promising.
One general question that came to mind while reviewing the PR is whether the CAST removal is relevant only for range predicates and histograms, or whether it can have a positive impact on other expressions as well. For example, is there any benefit in attempting to remove a CAST from the following expressions?
IS NOT NULL(CAST($1):BIGINT)
=(CAST($1):DOUBLE, 1)
IN(CAST($1):TINYINT, 10, 20, 30)
int inputRefOpIndex = 1 - literalOpIdx;
RexNode node = operands.get(inputRefOpIndex);
if (node.getKind().equals(SqlKind.CAST)) {
  node = removeCastIfPossible((RexCall) node, scan, boundaries, inclusive);
Even when the CAST is not removed, the boundaries may change; is this intentional?
@Test
public void testComputeRangePredicateSelectivityWithCast() {
  useFieldWithValues("f_numeric", VALUES, KLL);
  checkSelectivity(3 / 13.f, castAndCompare(TINYINT, GE, int5));
The test would be easier to read if the expression were created as:
ge(cast("f_numeric", TINYINT), 5)

// swap equation, e.g., col < 5 becomes 5 > col; selectivity stays the same
RexCall call = (RexCall) filter;
SqlOperator operator = ((RexCall) filter).getOperator();
SqlOperator swappedOp;
if (operator == LE) {
  swappedOp = GE;
} else if (operator == LT) {
  swappedOp = GT;
} else if (operator == GE) {
  swappedOp = LE;
} else if (operator == GT) {
  swappedOp = LT;
} else if (operator == BETWEEN) {
  // BETWEEN cannot be swapped
  return;
} else {
  throw new UnsupportedOperationException();
}
RexNode swapped = REX_BUILDER.makeCall(swappedOp, call.getOperands().get(1), call.getOperands().get(0));
Assert.assertEquals(filter.toString(), expectedSelectivity, estimator.estimateSelectivity(swapped), DELTA);
}
What's the point of swapping if we are already testing the inverse operation explicitly in the test itself? I think it's better to keep the tests explicit and drop this swapping logic.
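For illustration, a sketch of what the explicit form could look like if the two suggestions above were combined; the ge/le/cast builder helpers and inputRefFor are hypothetical names (relying on the test class's existing REX_BUILDER, TINYINT and int5), not code from the PR:

// Hypothetical test helpers around the existing REX_BUILDER, shown only to
// illustrate explicit, readable range-predicate tests in both operand orders.
private RexNode cast(String field, RelDataType type) {
  return REX_BUILDER.makeCast(type, inputRefFor(field)); // inputRefFor: assumed lookup of the column's RexInputRef
}

private RexNode ge(RexNode left, RexNode right) {
  return REX_BUILDER.makeCall(SqlStdOperatorTable.GREATER_THAN_OR_EQUAL, left, right);
}

private RexNode le(RexNode left, RexNode right) {
  return REX_BUILDER.makeCall(SqlStdOperatorTable.LESS_THAN_OR_EQUAL, left, right);
}

@Test
public void testComputeRangePredicateSelectivityWithCast() {
  useFieldWithValues("f_numeric", VALUES, KLL);
  // Both operand orders spelled out instead of being derived by a generic swap.
  checkSelectivity(3 / 13.f, ge(cast("f_numeric", TINYINT), int5));
  checkSelectivity(3 / 13.f, le(int5, cast("f_numeric", TINYINT)));
}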
public static final RelDataType TINYINT = REX_BUILDER.getTypeFactory().createSqlType(SqlTypeName.TINYINT);
public static final RelDataType INTEGER = REX_BUILDER.getTypeFactory().createSqlType(SqlTypeName.INTEGER);
public static final RelDataType BIGINT = REX_BUILDER.getTypeFactory().createSqlType(SqlTypeName.BIGINT);
public static final RelDataType DECIMAL_2_1 = createDecimalType(2, 1);
public static final RelDataType DECIMAL_3_1 = createDecimalType(3, 1);
public static final RelDataType DECIMAL_4_1 = createDecimalType(4, 1);
public static final RelDataType DECIMAL_7_1 = createDecimalType(7, 1);
public static final RelDataType DECIMAL_38_25 = createDecimalType(38, 25);
public static final RelDataType FLOAT = REX_BUILDER.getTypeFactory().createSqlType(SqlTypeName.FLOAT);
public static final RelDataType DOUBLE = REX_BUILDER.getTypeFactory().createSqlType(SqlTypeName.DOUBLE);
Types are not that expensive to create, so we could simply inline the calls where necessary. Moreover, there is no point in making them public because it wouldn't be a good idea to let other classes/tests use them; it would create unnecessary coupling and poor encapsulation without obvious benefit.
In terms of brevity we could rename the methods, e.g., createDecimalType(2, 1) would become decimal(2, 1).

* @param val2 upper bound (exclusive)
* @return the selectivity of "val1 <= column < val2"
*/
public static double rangedSelectivity(KllFloatsSketch kll, float val1, float val2) {
Do we need to increase visibility to public?
// rawSelectivity does not account for null values, so we multiply by the number of non-null values (getN)
// and divide by the total (non-null + null values) to get the overall selectivity.
//
// Example: consider a filter "col < 3" and the following table rows:
//  ______
// | col  |
// |______|
// | 1    |
// | null |
// | null |
// | 3    |
// | 4    |
//  ------
// kll.getN() would be 3, rawSelectivity 1/3, scan.getTable().getRowCount() 5,
// so the final result would be 3 * 1/3 / 5 = 1/5, as expected.
return kll.getN() * rawSelectivity / scan.getTable().getRowCount();
Since we perform this computation multiple times, maybe it's worth putting it in a private method and keeping all the documentation/examples there, instead of copying the comments or putting links to other methods.
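A minimal sketch of such a helper, built only from the calls already visible in the quoted code; the method name is hypothetical:

/**
 * Scales a selectivity computed over the non-null values seen by the KLL sketch
 * to the whole table, which may also contain NULL rows.
 *
 * Example: filter "col < 3" over the rows {1, null, null, 3, 4}:
 * kll.getN() = 3, rawSelectivity = 1/3, row count = 5, result = 3 * 1/3 / 5 = 1/5.
 */
private double accountForNulls(double rawSelectivity, KllFloatsSketch kll, HiveTableScan scan) {
  // getN() counts only the non-null values fed to the sketch; getRowCount() counts all rows.
  return kll.getN() * rawSelectivity / scan.getTable().getRowCount();
}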
* @param inclusive whether the respective boundary is inclusive or exclusive.
* @return the operand if the cast can be removed, otherwise the cast itself
*/
private RexNode removeCastIfPossible(RexCall cast, HiveTableScan tableScan, float[] boundaries, boolean[] inclusive) {
The logic in this method is similar to org.apache.calcite.rex.RexUtil#isLosslessCast(org.apache.calcite.rex.RexNode). Since the method here has access to the actual ranges and stats, it may be more effective for CASTs that narrow the data type. However, adjusting the boundaries and handling the DECIMAL types adds some complexity that we may not necessarily need at this stage.
Would it be feasible to defer the more complex CAST removal solution to a follow-up and use isLosslessCast for this first iteration? How much do we gain from the special handling of the DECIMAL types?
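For comparison, a minimal sketch of the simpler first iteration suggested here, using Calcite's RexUtil.isLosslessCast and leaving the boundaries untouched (no DECIMAL-specific adjustment); it reuses the node variable from the hunk quoted earlier:

// Strip the CAST only when it is provably lossless; narrowing CASTs and
// DECIMAL boundary adjustments would be deferred to a follow-up.
if (node.getKind() == SqlKind.CAST && RexUtil.isLosslessCast(node)) {
  node = ((RexCall) node).getOperands().get(0);
}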
double min;
double max;
switch (type.toLowerCase()) {
This class mostly uses Calcite APIs, so since we have the SqlTypeName readily available, wouldn't it be better to use that instead?
In addition, there is org.apache.calcite.sql.type.SqlTypeName#getLimit, which might be relevant and could potentially replace this switch statement.
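A sketch of what that could look like; castType is a hypothetical RelDataType variable, and it assumes (to be verified) that getLimit returns a java.math.BigDecimal for the numeric type families:

// Derive the target type's min/max from Calcite instead of switching on a type-name string.
SqlTypeName typeName = castType.getSqlTypeName();
BigDecimal typeMin = (BigDecimal) typeName.getLimit(false, SqlTypeName.Limit.OVERFLOW, false);
BigDecimal typeMax = (BigDecimal) typeName.getLimit(true, SqlTypeName.Limit.OVERFLOW, false);
double min = typeMin.doubleValue();
double max = typeMax.doubleValue();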
* See {@link #removeCastIfPossible(RexCall, HiveTableScan, float[], boolean[])}
* for an explanation of the parameters.
*/
private static void adjustBoundariesForDecimal(RexCall cast, float[] boundaries, boolean[] inclusive) {
I haven't reviewed this part yet; let's first decide whether we are going to include it or not.
* @param boundaries indexes 0 and 1 are the boundaries of the range predicate;
*                   indexes 2 and 3, if they exist, will be set to the boundaries of the type range
* @param inclusive whether the respective boundary is inclusive or exclusive.
If we decide to proceed with this implementation then it may be cleaner and more readable to have a dedicated private static class Boundaries instead of passing around arrays and trying to decipher what the indexes mean.
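A sketch of such a holder; the field names are illustrative, not taken from the PR:

/** Boundaries of a range predicate plus, optionally, the boundaries of the CAST's target type. */
private static class Boundaries {
  float predicateMin = Float.NEGATIVE_INFINITY;
  float predicateMax = Float.POSITIVE_INFINITY;
  boolean minInclusive = true;
  boolean maxInclusive = true;
  // Set only when the CAST narrows the type and the type range matters.
  Float typeMin;
  Float typeMax;
}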
See HIVE-29424.
What changes were proposed in this pull request?
This PR adapts FilterSelectivityEstimator so that histogram statistics are used for range predicates with a cast.
I added many test cases to cover some corner cases. To get the ground truth, I executed queries with the predicates; see the resulting q.out file.
Why are the changes needed?
This PR allows the CBO planner to use histogram statistics for range predicates that contain a CAST around the input column.
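For illustration, the shape of predicate this targets, expressed with the test's own vocabulary (castAndCompare, TINYINT, GE and int5 all appear in the quoted test; the explanatory comments are mine):

// A range predicate whose column is wrapped in a CAST, e.g. CAST(f_numeric AS TINYINT) >= 5.
// Previously such a CAST kept the estimator from consulting the column's KLL histogram;
// with this change the histogram-based selectivity (3/13 in the quoted test data) is used.
checkSelectivity(3 / 13.f, castAndCompare(TINYINT, GE, int5));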
Does this PR introduce any user-facing change?
No
How was this patch tested?
Unit tests were added.