HIVE-29424: CBO plans should use histogram statistics for range predicates with a CAST #6293
base: master
Conversation
if (range == null)
  return cast;
if (range.minValue == null || Double.isNaN(range.minValue.doubleValue()))
  return cast;
if (range.maxValue == null || Double.isNaN(range.maxValue.doubleValue()))
  return cast;
I'll fix the "'if' construct must use '{}'s" Checkstyle warnings when updating the commit.
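For reference, a sketch of the braced form that the Checkstyle rule expects (same checks as the quoted hunk, only braces added):

if (range == null) {
  return cast;
}
if (range.minValue == null || Double.isNaN(range.minValue.doubleValue())) {
  return cast;
}
if (range.maxValue == null || Double.isNaN(range.maxValue.doubleValue())) {
  return cast;
}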
float[] boundaries = new float[] { Float.NEGATIVE_INFINITY, Float.POSITIVE_INFINITY };
boolean[] inclusive = new boolean[] { true, true };
I'll fix the "'{' is followed by whitespace" Checkstyle warnings when updating the PR.
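For reference, a sketch of one way the whitespace warnings could be resolved (same initializers, the spaces inside the braces removed):

float[] boundaries = new float[] {Float.NEGATIVE_INFINITY, Float.POSITIVE_INFINITY};
boolean[] inclusive = new boolean[] {true, true};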
useFieldWithValues("f_numeric", VALUES2, KLL2);
float total = VALUES2.length;
{
Checkstyle warns about nested blocks; I'll refactor this when updating the PR.
zabetak left a comment:
Thanks for the PR @thomasrebele, the proposal is very promising.
One general question that came to mind while reviewing the PR is whether the CAST removal is relevant only for range predicates and histograms, or whether it can have a positive impact on other expressions as well. For example, is there any benefit in attempting to remove a CAST from the following expressions?
IS NOT NULL(CAST($1):BIGINT)
=(CAST($1):DOUBLE, 1)
IN(CAST($1):TINYINT, 10, 20, 30)
int inputRefOpIndex = 1 - literalOpIdx;
RexNode node = operands.get(inputRefOpIndex);
if (node.getKind().equals(SqlKind.CAST)) {
  node = removeCastIfPossible((RexCall) node, scan, boundaries, inclusive);
Even when the CAST is not removed, the boundaries may change; is this intentional?
@Test
public void testComputeRangePredicateSelectivityWithCast() {
  useFieldWithValues("f_numeric", VALUES, KLL);
  checkSelectivity(3 / 13.f, castAndCompare(TINYINT, GE, int5));
The test would be easier to read if the expression were created as:
ge(cast("f_numeric", TINYINT), 5)

// swap equation, e.g., col < 5 becomes 5 > col; selectivity stays the same
RexCall call = (RexCall) filter;
SqlOperator operator = ((RexCall) filter).getOperator();
SqlOperator swappedOp;
if (operator == LE) {
  swappedOp = GE;
} else if (operator == LT) {
  swappedOp = GT;
} else if (operator == GE) {
  swappedOp = LE;
} else if (operator == GT) {
  swappedOp = LT;
} else if (operator == BETWEEN) {
  // BETWEEN cannot be swapped
  return;
} else {
  throw new UnsupportedOperationException();
}
RexNode swapped = REX_BUILDER.makeCall(swappedOp, call.getOperands().get(1), call.getOperands().get(0));
Assert.assertEquals(filter.toString(), expectedSelectivity, estimator.estimateSelectivity(swapped), DELTA);
}
What's the point of swapping if we are already testing the inverse operation explicitly in the test itself? I think it's better to keep the tests explicit and drop this swapping logic.
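For illustration, a sketch of what the explicit form could look like if the two suggestions above were combined; the ge/le/cast builder helpers and inputRefFor are hypothetical names (relying on the test class's existing REX_BUILDER, TINYINT and int5), not code from the PR:

// Hypothetical test helpers around the existing REX_BUILDER, shown only to
// illustrate explicit, readable range-predicate tests in both operand orders.
private RexNode cast(String field, RelDataType type) {
  return REX_BUILDER.makeCast(type, inputRefFor(field)); // inputRefFor: assumed lookup of the column's RexInputRef
}

private RexNode ge(RexNode left, RexNode right) {
  return REX_BUILDER.makeCall(SqlStdOperatorTable.GREATER_THAN_OR_EQUAL, left, right);
}

private RexNode le(RexNode left, RexNode right) {
  return REX_BUILDER.makeCall(SqlStdOperatorTable.LESS_THAN_OR_EQUAL, left, right);
}

@Test
public void testComputeRangePredicateSelectivityWithCast() {
  useFieldWithValues("f_numeric", VALUES, KLL);
  // Both operand orders spelled out instead of being derived by a generic swap.
  checkSelectivity(3 / 13.f, ge(cast("f_numeric", TINYINT), int5));
  checkSelectivity(3 / 13.f, le(int5, cast("f_numeric", TINYINT)));
}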
public static final RelDataType TINYINT = REX_BUILDER.getTypeFactory().createSqlType(SqlTypeName.TINYINT);
public static final RelDataType INTEGER = REX_BUILDER.getTypeFactory().createSqlType(SqlTypeName.INTEGER);
public static final RelDataType BIGINT = REX_BUILDER.getTypeFactory().createSqlType(SqlTypeName.BIGINT);
public static final RelDataType DECIMAL_2_1 = createDecimalType(2, 1);
public static final RelDataType DECIMAL_3_1 = createDecimalType(3, 1);
public static final RelDataType DECIMAL_4_1 = createDecimalType(4, 1);
public static final RelDataType DECIMAL_7_1 = createDecimalType(7, 1);
public static final RelDataType DECIMAL_38_25 = createDecimalType(38, 25);
public static final RelDataType FLOAT = REX_BUILDER.getTypeFactory().createSqlType(SqlTypeName.FLOAT);
public static final RelDataType DOUBLE = REX_BUILDER.getTypeFactory().createSqlType(SqlTypeName.DOUBLE);
Types are not that expensive to create, so we could simply inline the calls where necessary. Moreover, there is no point in making them public because it wouldn't be a good idea to let other classes/tests use them; it would create unnecessary coupling and poor encapsulation without obvious benefit.
In terms of brevity we could rename the methods, e.g., createDecimalType(2, 1) would become decimal(2, 1).

* @param val2 upper bound (exclusive)
* @return the selectivity of "val1 <= column < val2"
*/
public static double rangedSelectivity(KllFloatsSketch kll, float val1, float val2) {
Do we need to increase visibility to public?
// rawSelectivity does not account for null values, so we multiply by the number of non-null values (getN)
// and divide by the total (non-null + null values) to get the overall selectivity.
//
// Example: consider a filter "col < 3" and the following table rows:
//  ______
// | col  |
// |______|
// | 1    |
// | null |
// | null |
// | 3    |
// | 4    |
//  ------
// kll.getN() would be 3, rawSelectivity 1/3, scan.getTable().getRowCount() 5,
// so the final result would be 3 * 1/3 / 5 = 1/5, as expected.
return kll.getN() * rawSelectivity / scan.getTable().getRowCount();
Since we perform this computation multiple times, maybe it's worth putting it in a private method and keeping all the documentation/examples there, instead of copying the comments or putting links to other methods.
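A minimal sketch of such a helper, built only from the calls already visible in the quoted code; the method name is hypothetical:

/**
 * Scales a selectivity computed over the non-null values seen by the KLL sketch
 * to the whole table, which may also contain NULL rows.
 *
 * Example: filter "col < 3" over the rows {1, null, null, 3, 4}:
 * kll.getN() = 3, rawSelectivity = 1/3, row count = 5, result = 3 * 1/3 / 5 = 1/5.
 */
private double accountForNulls(double rawSelectivity, KllFloatsSketch kll, HiveTableScan scan) {
  // getN() counts only the non-null values fed to the sketch; getRowCount() counts all rows.
  return kll.getN() * rawSelectivity / scan.getTable().getRowCount();
}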
* @param inclusive whether the respective boundary is inclusive or exclusive.
* @return the operand if the cast can be removed, otherwise the cast itself
*/
private RexNode removeCastIfPossible(RexCall cast, HiveTableScan tableScan, float[] boundaries, boolean[] inclusive) {
The logic in this method is similar to org.apache.calcite.rex.RexUtil#isLosslessCast(org.apache.calcite.rex.RexNode). Since the method here has access to the actual ranges and stats, it may be more effective for CASTs that narrow the data type. However, adjusting the boundaries and handling the DECIMAL types adds some complexity that we may not necessarily need at this stage.
Would it be feasible to defer the more complex CAST removal solution to a follow-up and use isLosslessCast for this first iteration? How much do we gain from the special handling of the DECIMAL types?
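For comparison, a minimal sketch of the simpler first iteration suggested here, using Calcite's RexUtil.isLosslessCast and leaving the boundaries untouched (no DECIMAL-specific adjustment); it reuses the node variable from the hunk quoted earlier:

// Strip the CAST only when it is provably lossless; narrowing CASTs and
// DECIMAL boundary adjustments would be deferred to a follow-up.
if (node.getKind() == SqlKind.CAST && RexUtil.isLosslessCast(node)) {
  node = ((RexCall) node).getOperands().get(0);
}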
double min;
double max;
switch (type.toLowerCase()) {
This class mostly uses Calcite APIs, so since we have the SqlTypeName readily available, wouldn't it be better to use that instead?
In addition, there is org.apache.calcite.sql.type.SqlTypeName#getLimit, which might be relevant and could potentially replace this switch statement.
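A sketch of what that could look like; castType is a hypothetical RelDataType variable, and it assumes (to be verified) that getLimit returns a java.math.BigDecimal for the numeric type families:

// Derive the target type's min/max from Calcite instead of switching on a type-name string.
SqlTypeName typeName = castType.getSqlTypeName();
BigDecimal typeMin = (BigDecimal) typeName.getLimit(false, SqlTypeName.Limit.OVERFLOW, false);
BigDecimal typeMax = (BigDecimal) typeName.getLimit(true, SqlTypeName.Limit.OVERFLOW, false);
double min = typeMin.doubleValue();
double max = typeMax.doubleValue();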
* See {@link #removeCastIfPossible(RexCall, HiveTableScan, float[], boolean[])}
* for an explanation of the parameters.
*/
private static void adjustBoundariesForDecimal(RexCall cast, float[] boundaries, boolean[] inclusive) {
I haven't reviewed this part yet; let's first decide whether we are going to include it or not.
* @param boundaries indexes 0 and 1 are the boundaries of the range predicate;
*                   indexes 2 and 3, if they exist, will be set to the boundaries of the type range
* @param inclusive whether the respective boundary is inclusive or exclusive.
If we decide to proceed with this implementation then it may be cleaner and more readable to have a dedicated private static class Boundaries instead of passing around arrays and trying to decipher what the indexes mean.
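A sketch of such a holder; the field names are illustrative, not taken from the PR:

/** Boundaries of a range predicate plus, optionally, the boundaries of the CAST's target type. */
private static class Boundaries {
  float predicateMin = Float.NEGATIVE_INFINITY;
  float predicateMax = Float.POSITIVE_INFINITY;
  boolean minInclusive = true;
  boolean maxInclusive = true;
  // Set only when the CAST narrows the type and the type range matters.
  Float typeMin;
  Float typeMax;
}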
See HIVE-29424.
What changes were proposed in this pull request?
This PR adapts FilterSelectivityEstimator so that histogram statistics are used for range predicates with a cast.
I added many test cases to cover some corner cases. To get the ground truth, I executed queries with the predicates; see the resulting q.out file.
Why are the changes needed?
This PR allows the CBO planner to use histogram statistics for range predicates that contain a CAST around the input column.
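For illustration, the shape of predicate this targets, expressed with the test's own vocabulary (castAndCompare, TINYINT, GE and int5 all appear in the quoted test; the explanatory comments are mine):

// A range predicate whose column is wrapped in a CAST, e.g. CAST(f_numeric AS TINYINT) >= 5.
// Previously such a CAST kept the estimator from consulting the column's KLL histogram;
// with this change the histogram-based selectivity (3/13 in the quoted test data) is used.
checkSelectivity(3 / 13.f, castAndCompare(TINYINT, GE, int5));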
Does this PR introduce any user-facing change?
No
How was this patch tested?
Unit tests were added.