feat: Add multi-column support for null-aware anti joins #19857
viirya wants to merge 4 commits into apache:main
Thanks @viirya, I'll check it soon. AFAIK there are no TPC*-like queries that cover multi-column null-aware anti joins, so it would probably be nice to have a benchmark in the future to make sure no performance regressions are introduced by future PRs.
This commit extends null-aware anti join functionality to support
multiple columns, enabling queries like:
SELECT * FROM t1 WHERE (a, b) NOT IN (SELECT x, y FROM t2);
and correlated multi-column NOT IN subqueries:
SELECT * FROM t1 WHERE (c2, c3) NOT IN (
SELECT c2, c3 FROM t2 WHERE t1.c1 = t2.c1
);
Changes:
Physical Execution Layer:
- Remove single-column validation restriction in HashJoinExec
- Extend NULL detection in probe phase to check ANY column for NULLs
- Extend NULL filtering in final phase to filter rows with ANY NULL column
- Add comprehensive unit tests for 2-column and 3-column joins
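
For readers who want the "ANY column" rule in concrete form, here is a minimal Rust sketch of the behavior the bullets above describe. It is a simplified model over in-memory rows, not the code in HashJoinExec: the names `Key`, `has_null`, and `null_aware_anti_join` are hypothetical, `None` stands in for SQL NULL, and the mapping of "probe phase" to the subquery side is an assumption. It applies the coarse rule from the bullets and does not attempt a finer per-column three-valued comparison.

```rust
/// A join key made of several nullable columns; `None` models SQL NULL.
/// (Hypothetical type, introduced only for this sketch.)
type Key = Vec<Option<i64>>;

/// True if ANY column of the key is NULL.
fn has_null(key: &[Option<i64>]) -> bool {
    key.iter().any(|c| c.is_none())
}

/// Simplified model of `left_key NOT IN (subquery)` over multi-column keys.
/// Assumption: this follows the commit message bullets, not DataFusion's code.
fn null_aware_anti_join(left: &[Key], subquery: &[Key]) -> Vec<Key> {
    // Empty subquery: NOT IN holds for every left row, NULLs included.
    if subquery.is_empty() {
        return left.to_vec();
    }
    // "Probe phase": a NULL in ANY column of ANY subquery row empties the result.
    if subquery.iter().any(|k| has_null(k)) {
        return Vec::new();
    }
    // "Final phase": skip left rows with ANY NULL column, then skip matched rows.
    let mut out = Vec::new();
    for key in left {
        if has_null(key) || subquery.contains(key) {
            continue;
        }
        out.push(key.clone());
    }
    out
}
```

With left rows `(1, 2)`, `(3, NULL)`, `(5, 6)` and a subquery returning only `(1, 2)`, this sketch keeps just `(5, 6)`: the first row matches and the second has a NULL column.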
SQL Planning Layer:
- Allow tuple expressions in parse_in_subquery()
- Add validation for tuple field count matching
Query Optimization Layer:
- Update InSubquery validation to allow struct expressions
- Skip type coercion for struct expressions (handled in decorrelation)
- Implement struct decomposition in decorrelate_predicate_subquery
- Decompose struct(a, b) into individual join conditions a = x AND b = y
- Handle both correlated and non-correlated multi-column subqueries
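
The struct decomposition step can be pictured with a toy expression tree. This is only a sketch under stated assumptions: the `Expr` enum, `col`, `decompose_struct`, and `render` below are hypothetical and far smaller than DataFusion's real types, and the error messages are made up. It shows the idea of pairing tuple fields with subquery columns and AND-ing the equalities, with a field-count check up front.

```rust
/// Toy expression tree (hypothetical; not DataFusion's `Expr`).
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Column(String),
    Eq(Box<Expr>, Box<Expr>),
    And(Box<Expr>, Box<Expr>),
}

/// Convenience constructor for a column reference.
fn col(name: &str) -> Expr {
    Expr::Column(name.to_string())
}

/// Turn `(l1, l2, ..) IN (SELECT r1, r2, ..)` into `l1 = r1 AND l2 = r2 AND ..`.
/// Fails when the field counts differ or the tuple is empty.
fn decompose_struct(left: &[Expr], right: &[Expr]) -> Result<Expr, String> {
    if left.len() != right.len() {
        return Err(format!(
            "tuple has {} fields but subquery returns {} columns",
            left.len(),
            right.len()
        ));
    }
    left.iter()
        .zip(right)
        // One equality per (tuple field, subquery column) pair.
        .map(|(l, r)| Expr::Eq(Box::new(l.clone()), Box::new(r.clone())))
        // Fold the equalities into a left-deep chain of ANDs.
        .reduce(|acc, eq| Expr::And(Box::new(acc), Box::new(eq)))
        .ok_or_else(|| "empty tuple".to_string())
}

/// Render an expression as SQL-like text for inspection.
fn render(e: &Expr) -> String {
    match e {
        Expr::Column(name) => name.clone(),
        Expr::Eq(l, r) => format!("{} = {}", render(l), render(r)),
        Expr::And(l, r) => format!("{} AND {}", render(l), render(r)),
    }
}
```

Calling `decompose_struct(&[col("a"), col("b")], &[col("x"), col("y")])` and rendering the result gives `a = x AND b = y`, the same shape as the join condition named in the bullet above.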
Test Coverage:
- Add 7 new SQL logic test cases (Tests 19-25)
- Add 3 unit test functions with 15 test variants (5 batch sizes each)
- Cover 2-column, 3-column, empty subquery, and NULL patterns
- Include correlated multi-column NOT IN from issue apache#10583
Test Results:
- 31/31 null-aware anti join tests passing
- 369/369 total hash join tests passing
- All optimizer tests passing
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Add test coverage for multi-column IN subqueries to verify that the struct expression support works correctly for both negated (NOT IN) and non-negated (IN) cases.
Tests added to subquery.slt:
- Test 1: Basic two-column IN
- Test 2: Multi-column IN with no matches
- Test 3: Multi-column IN with NULL values (verifies non-null-aware behavior)
- Test 4: Three-column IN
- Test 5: Correlated multi-column IN
- Test 6: Verify logical plan shows LeftSemi with multiple join conditions
- Test 7: Multi-column IN with empty subquery
- Test 8: Multi-column IN with WHERE clause in subquery
These tests complement the multi-column NOT IN tests in null_aware_anti_join.slt and verify that struct decomposition (converting `(a, b) IN (SELECT x, y ...)` into `a = x AND b = y`) works correctly for LeftSemi joins.
Key differences from NOT IN:
- IN uses LeftSemi join (not null-aware)
- IN does not use CollectLeft partition mode
- NULL values don't match in regular semi joins (two-valued logic)
Related to multi-column null-aware anti join implementation.
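
To illustrate the last difference noted in this commit (NULL values don't match in regular semi joins), here is a small Rust sketch of a semi join over multi-column keys. It is a nested-loop model written for clarity, not how the hash join is implemented; `keys_match` and `left_semi_join` are hypothetical names and `None` stands in for SQL NULL.

```rust
/// Equality on multi-column keys where a NULL (`None`) on either side never matches.
fn keys_match(l: &[Option<i64>], r: &[Option<i64>]) -> bool {
    l.len() == r.len()
        && l.iter()
            .zip(r)
            .all(|(a, b)| matches!((a, b), (Some(x), Some(y)) if x == y))
}

/// Simplified left semi join: keep each left row that has at least one matching right row.
/// Assumption: nested loops for readability; a real engine would use a hash table.
fn left_semi_join(
    left: &[Vec<Option<i64>>],
    right: &[Vec<Option<i64>>],
) -> Vec<Vec<Option<i64>>> {
    left.iter()
        .filter(|l| right.iter().any(|r| keys_match(l, r)))
        .cloned()
        .collect()
}
```

In this model a left row `(1, NULL)` is dropped even when the right side also contains `(1, NULL)`, and an empty right side yields an empty result, in contrast to the NOT IN case where an empty subquery keeps every left row.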
- Collapse nested if statement in invariants.rs (clippy::collapsible_if)
- Collapse nested if statement in hash_join/exec.rs (clippy::collapsible_if)
- Use unwrap_or_else instead of unwrap_or for function calls in decorrelate_predicate_subquery.rs (clippy::or_fun_call)
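
As background on the `clippy::or_fun_call` item, the sketch below shows why the lint prefers `unwrap_or_else` when the fallback is a function call. It is a generic illustration, not the code changed in decorrelate_predicate_subquery.rs; `expensive_default`, `eager`, and `lazy` are hypothetical names, and the `Cell` counter exists only to make the evaluation order observable.

```rust
use std::cell::Cell;

/// Stand-in for a costly fallback; bumps `calls` every time it runs.
fn expensive_default(calls: &Cell<u32>) -> String {
    calls.set(calls.get() + 1);
    String::from("default")
}

/// Eager: the argument is evaluated before `unwrap_or` runs, even for `Some`.
fn eager(value: Option<String>, calls: &Cell<u32>) -> String {
    value.unwrap_or(expensive_default(calls))
}

/// Lazy: the closure runs only when `value` is `None`.
fn lazy(value: Option<String>, calls: &Cell<u32>) -> String {
    value.unwrap_or_else(|| expensive_default(calls))
}
```

Passing `Some(..)` to `eager` still increments the counter, while passing `Some(..)` to `lazy` leaves it unchanged, which is the wasted work the lint flags.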
run benchmark tpcds
🤖 Criterion benchmark running (GKE)
Benchmark for this request failed.
run benchmark tpcds
Not sure if tpcds contains NA Anti Join, but it at least contains many join types
🤖 Benchmark running (GKE)
🤖 Benchmark completed (GKE)
Upstream changed EXPLAIN to show both logical and physical plans by default. Update the multi-column IN test to match the new output format.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Which issue does this PR close?
Rationale for this change
What changes are included in this PR?
This commit extends null-aware anti join functionality to support multiple columns, covering both non-correlated and correlated multi-column NOT IN subqueries. The changes span the physical execution layer, the SQL planning layer, and the query optimization layer. The added tests verify that struct expression support works correctly for both negated (NOT IN) and non-negated (IN) cases.
Are these changes tested?
Are there any user-facing changes?