
chore: enable Corr #3892

Open
kazuyukitanimura wants to merge 4 commits into apache:main from kazuyukitanimura:fix-2646

Conversation

@kazuyukitanimura
Contributor

@kazuyukitanimura kazuyukitanimura commented Apr 3, 2026

Which issue does this PR close?

Closes #2646

Rationale for this change

This is a workaround for the behavior described in #2646 (comment)

What changes are included in this PR?

When both inputs to Corr are NaN, return NULL
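The workaround can be sketched in plain Rust as follows. This is a standalone simplification, not the actual Comet accumulator: the struct `CorrAcc` and its field names are illustrative, and it assumes the `nullOnDivideByZero` semantics shown in the DataFusion/PostgreSQL outputs below (NULL for fewer than two pairs or a zero denominator).

```rust
// Simplified Pearson correlation accumulator that drops rows where both
// inputs are NaN, so an all-NaN group ends with n = 0 and evaluates to
// None (SQL NULL) instead of NaN.
#[derive(Default)]
struct CorrAcc {
    n: f64,
    sum_x: f64,
    sum_y: f64,
    sum_xx: f64,
    sum_yy: f64,
    sum_xy: f64,
}

impl CorrAcc {
    fn update(&mut self, x: Option<f64>, y: Option<f64>) {
        // NULL inputs are ignored, as in standard SQL aggregates.
        let (Some(x), Some(y)) = (x, y) else { return };
        if x.is_nan() && y.is_nan() {
            return; // the workaround: skip (NaN, NaN) pairs entirely
        }
        self.n += 1.0;
        self.sum_x += x;
        self.sum_y += y;
        self.sum_xx += x * x;
        self.sum_yy += y * y;
        self.sum_xy += x * y;
    }

    fn evaluate(&self) -> Option<f64> {
        if self.n < 2.0 {
            return None; // fewer than two pairs: NULL
        }
        let cov = self.sum_xy - self.sum_x * self.sum_y / self.n;
        let var_x = self.sum_xx - self.sum_x * self.sum_x / self.n;
        let var_y = self.sum_yy - self.sum_y * self.sum_y / self.n;
        let denom = (var_x * var_y).sqrt();
        if denom == 0.0 { None } else { Some(cov / denom) }
    }
}

fn main() {
    let mut acc = CorrAcc::default();
    acc.update(Some(f64::NAN), Some(f64::NAN)); // skipped
    println!("{:?}", acc.evaluate()); // None -> NULL for the all-NaN group
}
```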

How are these changes tested?

Added tests

@kazuyukitanimura kazuyukitanimura marked this pull request as ready for review April 3, 2026 18:02
@comphead
Contributor

What exactly is the query that failed in Spark? I checked that DataFusion's corr and PostgreSQL's corr work the same.

> CREATE TABLE test_corr_nan(x double, y double, grp string);
0 row(s) fetched. 
Elapsed 0.025 seconds.

> INSERT INTO test_corr_nan VALUES (cast('NaN' as double), cast('NaN' as double), 'both_nan'), (cast('NaN' as double), 1.0, 'nan_val'), (1.0, cast('NaN' as double), 'val_nan'), (NULL, cast('NaN' as double), 'null_nan'), (cast('NaN' as double), NULL, 'nan_null'), (NULL, NULL, 'both_null'), (NULL, 1.0, 'null_val'), (1.0, NULL, 'val_null');
+-------+
| count |
+-------+
| 8     |
+-------+
1 row(s) fetched. 
Elapsed 0.016 seconds.

> SELECT grp, corr(x, y) FROM test_corr_nan GROUP BY grp ORDER BY grp;
+-----------+---------------------------------------+
| grp       | corr(test_corr_nan.x,test_corr_nan.y) |
+-----------+---------------------------------------+
| both_nan  | NaN                                   |
| both_null | NULL                                  |
| nan_null  | NULL                                  |
| nan_val   | NULL                                  |
| null_nan  | NULL                                  |
| null_val  | NULL                                  |
| val_nan   | NULL                                  |
| val_null  | NULL                                  |
+-----------+---------------------------------------+
8 row(s) fetched. 
Elapsed 0.036 seconds.

PGSQL

CREATE TABLE test_corr_nan(x float, y float, grp varchar);

INSERT INTO test_corr_nan VALUES (
cast('NaN' as float), cast('NaN' as float), 'both_nan'), (
cast('NaN' as float), 1.0, 'nan_val'), 
(1.0, cast('NaN' as float), 'val_nan'), 
(NULL, cast('NaN' as float), 'null_nan'), 
(cast('NaN' as float), NULL, 'nan_null'), 
(NULL, NULL, 'both_null'), (NULL, 1.0, 'null_val'), (1.0, NULL, 'val_null');


SELECT grp, corr(x, y) FROM test_corr_nan GROUP BY grp ORDER BY grp;

    grp    | corr 
-----------+------
 both_nan  |  NaN
 both_null |     
 nan_null  |     
 nan_val   |     
 null_nan  |     
 null_val  |     
 val_nan   |     
 val_null  |     
(8 rows)

@kazuyukitanimura
Contributor Author

What exactly is the query that failed in Spark?

Thanks @comphead. Just to double-check: you haven't enabled Comet for Spark, have you?
The original issue is
#2646 (comment)

@parthchandra parthchandra changed the title chor: enable Corr chore: enable Corr Apr 11, 2026
@comphead
Contributor

I have a feeling Comet is not using the DF-based corr and instead uses its own implementation:

impl AggregateUDFImpl for Correlation 

A more correct behavior would be to delegate this to DF, as is done for count:

AggregateExprBuilder::new(count_udaf(), children)

Contributor

@comphead comphead left a comment


Thanks @kazuyukitanimura. I think we need to try to delegate corr to DF's correlation::corr_udaf() and remove the old Comet implementation.

Copy link
Copy Markdown
Contributor

@parthchandra parthchandra left a comment


lgtm. Suggestions are non-blocking

CREATE TABLE test_corr_nan(x double, y double, grp string) USING parquet

statement
INSERT INTO test_corr_nan VALUES (cast('NaN' as double), cast('NaN' as double), 'both_nan'), (cast('NaN' as double), 1.0, 'nan_val'), (1.0, cast('NaN' as double), 'val_nan'), (NULL, cast('NaN' as double), 'null_nan'), (cast('NaN' as double), NULL, 'nan_null'), (NULL, NULL, 'both_null'), (NULL, 1.0, 'null_val'), (1.0, NULL, 'val_null')
Contributor


Maybe add a group with mixed NaN and valid rows, e.g. [(NaN, NaN), (1.0, 2.0), (3.0, 4.0)].


object CometCorr extends CometAggregateExpressionSerde[Corr] {

override def getSupportLevel(expr: Corr): SupportLevel =
Contributor


Claude flagged some edge cases we can document:

1. Legacy mode: when spark.sql.legacy.statisticalAggregate=true, nullOnDivideByZero is false and Spark returns NaN for the n=1 case. With this workaround, Comet would return NULL instead (because the NaN row gets skipped → n=0). Should we add a getSupportLevel guard that returns Incompatible when corr.nullOnDivideByZero is false? Or at least document this?
2. Mixed groups: for a group containing (NaN, NaN) alongside valid pairs like (1.0, 2.0), Spark returns NaN (NaN contaminates the accumulator), while this workaround would skip the NaN row and compute a valid correlation over the remaining rows. Is that a known limitation we're OK with?
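The mixed-group divergence can be made concrete with a small standalone Rust sketch (the helper `pearson` is hypothetical, not Comet code): with the workaround's skip behavior, the group [(NaN, NaN), (1.0, 2.0), (1.0, 2.0), ...] produces a valid correlation over the remaining rows, while letting the NaN pair flow into the sums, as Spark does, contaminates the whole result.

```rust
// Pearson correlation over a group, with a flag selecting between the
// workaround's behavior (skip (NaN, NaN) pairs) and Spark's behavior
// (NaN contaminates the running sums).
fn pearson(pairs: &[(f64, f64)], skip_nan_pairs: bool) -> Option<f64> {
    let kept: Vec<(f64, f64)> = pairs
        .iter()
        .copied()
        .filter(|(x, y)| !skip_nan_pairs || !(x.is_nan() && y.is_nan()))
        .collect();
    let n = kept.len() as f64;
    if n < 2.0 {
        return None;
    }
    let (sx, sy) = kept.iter().fold((0.0, 0.0), |(a, b), &(x, y)| (a + x, b + y));
    let (mx, my) = (sx / n, sy / n);
    let (mut cov, mut vx, mut vy) = (0.0, 0.0, 0.0);
    for &(x, y) in &kept {
        cov += (x - mx) * (y - my);
        vx += (x - mx) * (x - mx);
        vy += (y - my) * (y - my);
    }
    let denom = (vx * vy).sqrt();
    if denom == 0.0 { None } else { Some(cov / denom) }
}

fn main() {
    let group = [(f64::NAN, f64::NAN), (1.0, 2.0), (3.0, 4.0)];
    // Workaround: NaN pair skipped, correlation over the valid rows is 1.0.
    println!("skip: {:?}", pearson(&group, true));
    // Spark-like: NaN propagates through the sums, so the result is NaN.
    println!("contaminate: {:?}", pearson(&group, false));
}
```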



Development

Successfully merging this pull request may close these issues.

fuzz test failure: corr null vs Nan

3 participants