Skip to content

Conversation

@thomasrebele
Copy link
Contributor

See HIVE-29424.

What changes were proposed in this pull request?

This PR adapts FilterSelectivityEstimator so that histogram statistics are used for range predicates with a cast.
I added many test cases to some cover corner cases. To get the ground truth, I executed queries with the predicates, see the resulting q.out file.

Why are the changes needed?

This PR allows the CBO planner to use histogram statistics for range predicates that contain a CAST around the input column.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Unit tests were added.

@sonarqubecloud
Copy link

sonarqubecloud bot commented Feb 4, 2026

Comment on lines +228 to +233
if (range == null)
return cast;
if (range.minValue == null || Double.isNaN(range.minValue.doubleValue()))
return cast;
if (range.maxValue == null || Double.isNaN(range.maxValue.doubleValue()))
return cast;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll fix the 'if' construct must use '{}'s checkstyle warnings when updating the commit.

Comment on lines +344 to +345
float[] boundaries = new float[] { Float.NEGATIVE_INFINITY, Float.POSITIVE_INFINITY };
boolean[] inclusive = new boolean[] { true, true };
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll fix the '{' is followed by whitespace warnings when updating the PR.

useFieldWithValues("f_numeric", VALUES2, KLL2);
float total = VALUES2.length;

{
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Checkstyle warns about nested blocks, I'll refactor this when updating the PR.

@thomasrebele thomasrebele marked this pull request as ready for review February 4, 2026 08:53
Copy link
Member

@zabetak zabetak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR @thomasrebele , the proposal is very promising.

One general question that came to mind while I was reviewing the PR is if the CAST removal is relevant only for range predicates and histograms or if it can have a positive impact on other expressions. For example, is there any benefit in attempting to remove a CAST from the following expressions:

  • IS NOT NULL(CAST($1):BIGINT)
  • =(CAST($1):DOUBLE, 1)
  • IN(CAST($1):TINYINT, 10, 20, 30)

int inputRefOpIndex = 1 - literalOpIdx;
RexNode node = operands.get(inputRefOpIndex);
if (node.getKind().equals(SqlKind.CAST)) {
node = removeCastIfPossible((RexCall) node, scan, boundaries, inclusive);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even when the CAST is not removed the boundaries may change is this intentional?

@Test
public void testComputeRangePredicateSelectivityWithCast() {
useFieldWithValues("f_numeric", VALUES, KLL);
checkSelectivity(3 / 13.f, castAndCompare(TINYINT, GE, int5));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test would be easier to read if expression was created as:

ge(cast("f_numeric", TINYINT), 5)

Comment on lines +736 to +756
// swap equation, e.g., col < 5 becomes 5 > col; selectivity stays the same
RexCall call = (RexCall) filter;
SqlOperator operator = ((RexCall) filter).getOperator();
SqlOperator swappedOp;
if (operator == LE) {
swappedOp = GE;
} else if (operator == LT) {
swappedOp = GT;
} else if (operator == GE) {
swappedOp = LE;
} else if (operator == GT) {
swappedOp = LT;
} else if (operator == BETWEEN) {
// BETWEEN cannot be swapped
return;
} else {
throw new UnsupportedOperationException();
}
RexNode swapped = REX_BUILDER.makeCall(swappedOp, call.getOperands().get(1), call.getOperands().get(0));
Assert.assertEquals(filter.toString(), expectedSelectivity, estimator.estimateSelectivity(swapped), DELTA);
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the point of swapping if we are already testing explicitly the inverse operation in the test itself? I think its better to keep the tests explicit and drop this swapping logic.

Comment on lines +119 to +129

public static final RelDataType TINYINT = REX_BUILDER.getTypeFactory().createSqlType(SqlTypeName.TINYINT);
public static final RelDataType INTEGER = REX_BUILDER.getTypeFactory().createSqlType(SqlTypeName.INTEGER);
public static final RelDataType BIGINT = REX_BUILDER.getTypeFactory().createSqlType(SqlTypeName.BIGINT);
public static final RelDataType DECIMAL_2_1 = createDecimalType(2, 1);
public static final RelDataType DECIMAL_3_1 = createDecimalType(3, 1);
public static final RelDataType DECIMAL_4_1 = createDecimalType(4, 1);
public static final RelDataType DECIMAL_7_1 = createDecimalType(7, 1);
public static final RelDataType DECIMAL_38_25 = createDecimalType(38, 25);
public static final RelDataType FLOAT = REX_BUILDER.getTypeFactory().createSqlType(SqlTypeName.FLOAT);
public static final RelDataType DOUBLE = REX_BUILDER.getTypeFactory().createSqlType(SqlTypeName.DOUBLE);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Types are not that expensive to create so we could simply inline the calls where necessary. Moreover, there is no point in making them public cause it wouldn't be a good idea to let other classes/tests use them; it would create unnecessary coupling and poor encapsulation without obvious benefit.

In terms of brevity we could rename the methods:

createDecimalType(2, 1)
decimal(2, 1)

* @param val2 upper bound (exclusive)
* @return the selectivity of "val1 &lt;= column &lt; val2"
*/
public static double rangedSelectivity(KllFloatsSketch kll, float val1, float val2) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to increase visibility to public?

Comment on lines +378 to +393
// rawSelectivity does not account for null values, we multiply for the number of non-null values (getN)
// and we divide by the total (non-null + null values) to get the overall rawSelectivity.
//
// Example: consider a filter "col < 3", and the following table rows:
// _____
// | col |
// |_____|
// |1 |
// |null |
// |null |
// |3 |
// |4 |
// -------
// kll.getN() would be 3, rawSelectivity 1/3, scan.getTable().getRowCount() 5
// so the final result would be 3 * 1/3 / 5 = 1/5, as expected.
return kll.getN() * rawSelectivity / scan.getTable().getRowCount();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we perform this computation multiple times maybe its worth putting it in private method and put all documentation/examples there instead of copying the comments or putting links to other methods.

* @param inclusive whether the respective boundary is inclusive or exclusive.
* @return the operand if the cast can be removed, otherwise the cast itself
*/
private RexNode removeCastIfPossible(RexCall cast, HiveTableScan tableScan, float[] boundaries, boolean[] inclusive) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The logic in this method is similar to org.apache.calcite.rex.RexUtil#isLosslessCast(org.apache.calcite.rex.RexNode). Since the method here has access to actual ranges and stats it may be more effective for CAST that narrow the data type. However, adjusting the boundaries and handling the DECIMAL types adds some complexity that we may not necessarily need at this stage.

Would it be feasible to postpone/defer the more complex CAST removal solution in a follow-up and use isLosslessCast for this first iteration? How much do we gain by the special handling of the DECIMAL types?


double min;
double max;
switch (type.toLowerCase()) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This class is mostly using Calcite APIs so since we have the SqlTypeName readily available wouldn't be better to use that instead?

In addition there is org.apache.calcite.sql.type.SqlTypeName#getLimit which might be relevant and could potentially replace this switch statement.

* See {@link #removeCastIfPossible(RexCall, HiveTableScan, float[], boolean[])}
* for an explanation of the parameters.
*/
private static void adjustBoundariesForDecimal(RexCall cast, float[] boundaries, boolean[] inclusive) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not reviewed yet this part till we decide if we are going to include this part or not.

Comment on lines +208 to +210
* @param boundaries indexes 0 and 1 are the boundaries of the range predicate;
* indexes 2 and 3, if they exist, will be set to the boundaries of the type range
* @param inclusive whether the respective boundary is inclusive or exclusive.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we decide to proceed with this implementation then it may be cleaner and more readable to have a dedicated private static class Boundaries instead of passing around arrays and trying to decipher what the indexes mean.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants