Skip to content

fix(isthmus): std_dev, variance function mappings#780

Open
nielspardon wants to merge 1 commit intosubstrait-io:mainfrom
nielspardon:par-stddev
Open

fix(isthmus): std_dev, variance function mappings#780
nielspardon wants to merge 1 commit intosubstrait-io:mainfrom
nielspardon:par-stddev

Conversation

@nielspardon
Copy link
Copy Markdown
Member

@nielspardon nielspardon commented Mar 26, 2026

Disclaimers upfront: I have been assisted by AI generating parts of the code, documentation and PR description.

This PR still uses the to be deprecated function signatures using options. I will migrate the code to enum arguments in a follow-up.

Since this PR also casts the input arguments to FP64 if necessary, we do not need the integer based function signatures proposed in substrait-io/substrait#1012


This PR fixes the bidirectional conversion between Calcite and Substrait for statistical aggregate functions (standard deviation and variance). Previously, these functions were not properly mapped, preventing SQL queries using STDDEV_POP, STDDEV_SAMP, VAR_POP, and VAR_SAMP from being correctly converted to Substrait and back.

Problem

Substrait uses a single function name for both population and sample variants of statistical functions:

  • std_dev for both STDDEV_POP and STDDEV_SAMP
  • variance for both VAR_POP and VAR_SAMP

The distinction between population (n denominator) and sample (n-1 denominator) variants is made via a distribution function option (POPULATION or SAMPLE), but this mechanism was not implemented in the Isthmus converter. The tests did pass prior to this change since the existing TPC-DS testcases using these functions were incorrectly mapped to the AVG function since Calcite maps these statistical functions to SqlAvgAggFunction.

Solution

Calcite → Substrait Direction:

  • Added function mappings in FunctionMappings for all four statistical operators
  • Enhanced AggregateFunctionConverter.generateBinding() to automatically inject the distribution option based on the Calcite SqlKind:
    • STDDEV_SAMP, VAR_SAMPdistribution=SAMPLE
    • STDDEV_POP, VAR_POPdistribution=POPULATION

Substrait → Calcite Direction:

  • Extended FunctionConverter.getSqlOperatorFromSubstraitFunc() to accept function options as a parameter
  • Implemented filterByFunctionOptions() to disambiguate operators using the distribution option
  • Updated SubstraitRelVisitor to pass function options during conversion

DSL Support:

  • Added convenience methods in SubstraitBuilder: stddevPopulation(), stddevSample(), variancePopulation(), varianceSample()
  • These methods automatically configure the appropriate distribution option

Signed-off-by: Niels Pardon <par@zurich.ibm.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant