
chore: enable Corr #3892

Open
kazuyukitanimura wants to merge 4 commits into apache:main from kazuyukitanimura:fix-2646

Conversation

@kazuyukitanimura
Contributor

@kazuyukitanimura kazuyukitanimura commented Apr 3, 2026

Which issue does this PR close?

Closes #2646

Rationale for this change

This is a workaround for the behavior described in #2646 (comment)

What changes are included in this PR?

When both inputs to Corr are NaN, return NULL
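The workaround can be sketched in plain Rust as follows. This is a standalone simplification, not the actual Comet accumulator: the struct `CorrAcc` and its field names are illustrative, and it assumes the `nullOnDivideByZero` semantics shown in the DataFusion/PostgreSQL outputs below (NULL for fewer than two pairs or a zero denominator).

```rust
// Simplified Pearson correlation accumulator that drops rows where both
// inputs are NaN, so an all-NaN group ends with n = 0 and evaluates to
// None (SQL NULL) instead of NaN.
#[derive(Default)]
struct CorrAcc {
    n: f64,
    sum_x: f64,
    sum_y: f64,
    sum_xx: f64,
    sum_yy: f64,
    sum_xy: f64,
}

impl CorrAcc {
    fn update(&mut self, x: Option<f64>, y: Option<f64>) {
        // NULL inputs are ignored, as in standard SQL aggregates.
        let (Some(x), Some(y)) = (x, y) else { return };
        if x.is_nan() && y.is_nan() {
            return; // the workaround: skip (NaN, NaN) pairs entirely
        }
        self.n += 1.0;
        self.sum_x += x;
        self.sum_y += y;
        self.sum_xx += x * x;
        self.sum_yy += y * y;
        self.sum_xy += x * y;
    }

    fn evaluate(&self) -> Option<f64> {
        if self.n < 2.0 {
            return None; // fewer than two pairs: NULL
        }
        let cov = self.sum_xy - self.sum_x * self.sum_y / self.n;
        let var_x = self.sum_xx - self.sum_x * self.sum_x / self.n;
        let var_y = self.sum_yy - self.sum_y * self.sum_y / self.n;
        let denom = (var_x * var_y).sqrt();
        if denom == 0.0 { None } else { Some(cov / denom) }
    }
}

fn main() {
    let mut acc = CorrAcc::default();
    acc.update(Some(f64::NAN), Some(f64::NAN)); // skipped
    println!("{:?}", acc.evaluate()); // None -> NULL for the all-NaN group
}
```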

How are these changes tested?

Added tests

@kazuyukitanimura kazuyukitanimura marked this pull request as ready for review April 3, 2026 18:02
@comphead
Contributor

What exactly is the query that failed in Spark? I checked that DataFusion's corr and PostgreSQL's corr work the same.

> CREATE TABLE test_corr_nan(x double, y double, grp string);
0 row(s) fetched. 
Elapsed 0.025 seconds.

> INSERT INTO test_corr_nan VALUES (cast('NaN' as double), cast('NaN' as double), 'both_nan'), (cast('NaN' as double), 1.0, 'nan_val'), (1.0, cast('NaN' as double), 'val_nan'), (NULL, cast('NaN' as double), 'null_nan'), (cast('NaN' as double), NULL, 'nan_null'), (NULL, NULL, 'both_null'), (NULL, 1.0, 'null_val'), (1.0, NULL, 'val_null');
+-------+
| count |
+-------+
| 8     |
+-------+
1 row(s) fetched. 
Elapsed 0.016 seconds.

> SELECT grp, corr(x, y) FROM test_corr_nan GROUP BY grp ORDER BY grp;
+-----------+---------------------------------------+
| grp       | corr(test_corr_nan.x,test_corr_nan.y) |
+-----------+---------------------------------------+
| both_nan  | NaN                                   |
| both_null | NULL                                  |
| nan_null  | NULL                                  |
| nan_val   | NULL                                  |
| null_nan  | NULL                                  |
| null_val  | NULL                                  |
| val_nan   | NULL                                  |
| val_null  | NULL                                  |
+-----------+---------------------------------------+
8 row(s) fetched. 
Elapsed 0.036 seconds.

PGSQL

CREATE TABLE test_corr_nan(x float, y float, grp varchar);

INSERT INTO test_corr_nan VALUES (
cast('NaN' as float), cast('NaN' as float), 'both_nan'), (
cast('NaN' as float), 1.0, 'nan_val'), 
(1.0, cast('NaN' as float), 'val_nan'), 
(NULL, cast('NaN' as float), 'null_nan'), 
(cast('NaN' as float), NULL, 'nan_null'), 
(NULL, NULL, 'both_null'), (NULL, 1.0, 'null_val'), (1.0, NULL, 'val_null');


SELECT grp, corr(x, y) FROM test_corr_nan GROUP BY grp ORDER BY grp;

    grp    | corr 
-----------+------
 both_nan  |  NaN
 both_null |     
 nan_null  |     
 nan_val   |     
 null_nan  |     
 null_val  |     
 val_nan   |     
 val_null  |     
(8 rows)

@kazuyukitanimura
Contributor Author

What exactly is the query that failed in Spark?

Thanks @comphead. Just to double-check: you haven't enabled Comet for Spark, have you?
The original issue is
#2646 (comment)

@parthchandra parthchandra changed the title chor: enable Corr chore: enable Corr Apr 11, 2026
@comphead
Contributor

I have a feeling Comet is not using the DF-based corr and instead uses its own implementation:

impl AggregateUDFImpl for Correlation 

A more correct behavior would be to delegate this to DF, as is done for count:

AggregateExprBuilder::new(count_udaf(), children)

Contributor

@comphead comphead left a comment


Thanks @kazuyukitanimura. I think we need to try to delegate corr to DF's correlation::corr_udaf() and remove the old Comet implementation.

Copy link
Copy Markdown
Contributor

@parthchandra parthchandra left a comment


lgtm. Suggestions are non-blocking

CREATE TABLE test_corr_nan(x double, y double, grp string) USING parquet

statement
INSERT INTO test_corr_nan VALUES (cast('NaN' as double), cast('NaN' as double), 'both_nan'), (cast('NaN' as double), 1.0, 'nan_val'), (1.0, cast('NaN' as double), 'val_nan'), (NULL, cast('NaN' as double), 'null_nan'), (cast('NaN' as double), NULL, 'nan_null'), (NULL, NULL, 'both_null'), (NULL, 1.0, 'null_val'), (1.0, NULL, 'val_null')
Contributor


Maybe add a group with mixed NaN and valid rows, e.g. [(NaN, NaN), (1.0, 2.0), (3.0, 4.0)].


object CometCorr extends CometAggregateExpressionSerde[Corr] {

override def getSupportLevel(expr: Corr): SupportLevel =
Contributor


Claude flagged some edge cases we can document:

1. Legacy mode: when spark.sql.legacy.statisticalAggregate=true, nullOnDivideByZero is false and Spark returns NaN for the n=1 case. With this workaround, Comet would return NULL instead (because the NaN row gets skipped → n=0). Should we add a getSupportLevel guard that returns Incompatible when corr.nullOnDivideByZero is false? Or at least document this?
2. Mixed groups: for a group containing (NaN, NaN) alongside valid pairs like (1.0, 2.0), Spark returns NaN (NaN contaminates the accumulator), while this workaround would skip the NaN row and compute a valid correlation over the remaining rows. Is that a known limitation we're OK with?
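The mixed-group divergence can be made concrete with a small standalone Rust sketch (the helper `pearson` is hypothetical, not Comet code): with the workaround's skip behavior, the group [(NaN, NaN), (1.0, 2.0), (1.0, 2.0), ...] produces a valid correlation over the remaining rows, while letting the NaN pair flow into the sums, as Spark does, contaminates the whole result.

```rust
// Pearson correlation over a group, with a flag selecting between the
// workaround's behavior (skip (NaN, NaN) pairs) and Spark's behavior
// (NaN contaminates the running sums).
fn pearson(pairs: &[(f64, f64)], skip_nan_pairs: bool) -> Option<f64> {
    let kept: Vec<(f64, f64)> = pairs
        .iter()
        .copied()
        .filter(|(x, y)| !skip_nan_pairs || !(x.is_nan() && y.is_nan()))
        .collect();
    let n = kept.len() as f64;
    if n < 2.0 {
        return None;
    }
    let (sx, sy) = kept.iter().fold((0.0, 0.0), |(a, b), &(x, y)| (a + x, b + y));
    let (mx, my) = (sx / n, sy / n);
    let (mut cov, mut vx, mut vy) = (0.0, 0.0, 0.0);
    for &(x, y) in &kept {
        cov += (x - mx) * (y - my);
        vx += (x - mx) * (x - mx);
        vy += (y - my) * (y - my);
    }
    let denom = (vx * vy).sqrt();
    if denom == 0.0 { None } else { Some(cov / denom) }
}

fn main() {
    let group = [(f64::NAN, f64::NAN), (1.0, 2.0), (3.0, 4.0)];
    // Workaround: NaN pair skipped, correlation over the valid rows is 1.0.
    println!("skip: {:?}", pearson(&group, true));
    // Spark-like: NaN propagates through the sums, so the result is NaN.
    println!("contaminate: {:?}", pearson(&group, false));
}
```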



Development

Successfully merging this pull request may close these issues.

fuzz test failure: corr null vs Nan

3 participants