[SPARK-55411][SQL][4.0] SPJ may throw ArrayIndexOutOfBoundsException when join keys are less than cluster keys by pan3793 · Pull Request #54260 · apache/spark

pan3793 · 2026-02-11T02:37:44Z

Backport #54182 to branch-4.0

What changes were proposed in this pull request?

Fix a java.lang.ArrayIndexOutOfBoundsException that occurs when spark.sql.sources.v2.bucketing.allowJoinKeysSubsetOfPartitionKeys.enabled=true, by correcting the expression passed to KeyGroupedPartitioning#project: the full partition expression should be passed instead of the projected one.
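To illustrate the failure mode, here is a minimal sketch in plain Scala. It is not the Spark implementation: `ProjectSketch`, its simplified `project` helper, and the expression strings are hypothetical stand-ins. It assumes only that projection selects expressions by join key position, which is what the stack trace below suggests. Indexing an already projected list with positions that refer to the full list goes out of bounds.

```scala
import scala.util.Try

object ProjectSketch {
  // Hypothetical, simplified stand-in: pick the expressions at the given positions.
  def project[A](expressions: Seq[A], positions: Seq[Int]): Seq[A] =
    positions.map(i => expressions(i))

  // Full partition expressions of a table clustered by two keys (illustrative names).
  val fullExpressions: Vector[String] = Vector("bucket(4, id)", "days(ts)")

  // The join uses only the second cluster key, so its position in the full list is 1.
  val joinKeyPositions: Vector[Int] = Vector(1)

  // Correct pattern: the positions index into the full expression list.
  val ok: Seq[String] = project(fullExpressions, joinKeyPositions)

  // Buggy pattern: the same positions index into the already projected list,
  // which has length 1, so position 1 is out of bounds.
  val broken: Try[Seq[String]] = Try(project(ok, joinKeyPositions))
}
```

In this sketch the failure is an `IndexOutOfBoundsException` from `Vector`. In the reported trace the backing collection is an `ArraySeq`, which surfaces as `ArrayIndexOutOfBoundsException`.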

Also fix an issue in test code: change the result computed by BucketTransform in InMemoryBaseTable.scala to match BucketFunctions in transformFunctions.scala (thanks peter-toth for pointing this out!).
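As a rough illustration of why the two bucket calculations need to agree, the sketch below compares two hypothetical bucket functions. Neither is copied from InMemoryBaseTable.scala or transformFunctions.scala; the names, the formulas, and the salt term are assumptions made for the example. If the table assigns rows to buckets with one formula while the function used in planning computes another, the same key can map to different buckets on the two sides.

```scala
object BucketSketch {
  // Hypothetical "function" side: plain modulo on the value.
  def bucketByModulo(numBuckets: Int, value: Long): Int =
    (value % numBuckets).toInt

  // Hypothetical "table" side: hash-based, with an extra constant term (assumed).
  def bucketByHash(numBuckets: Int, value: Long): Int =
    ((value.hashCode + 31 * 7) & Int.MaxValue) % numBuckets

  // Values for which the two definitions assign different buckets.
  def disagreements(numBuckets: Int, values: Seq[Long]): Seq[Long] =
    values.filter(v => bucketByModulo(numBuckets, v) != bucketByHash(numBuckets, v))
}
```

Calling `BucketSketch.disagreements(4, Seq(0L, 1L, 2L, 3L))` lists the keys on which the two definitions diverge. Aligning the formulas makes that list empty.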

Why are the changes needed?

It's a bug fix.

Does this PR introduce any user-facing change?

Some queries that previously failed when spark.sql.sources.v2.bucketing.allowJoinKeysSubsetOfPartitionKeys.enabled=true now run normally.
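For readers who want to reproduce the affected setup, the following sketch collects the relevant settings and renders them as `--conf` arguments for spark-submit. Only the subset-of-partition-keys flag comes from this PR. Including `spark.sql.sources.v2.bucketing.enabled` is an assumption about a typical storage-partitioned join setup, and other SPJ-related settings may also be required depending on the Spark version. `SpjConfSketch` and `toSubmitArgs` are hypothetical names.

```scala
object SpjConfSketch {
  // Flag named in this PR.
  val subsetKeysFlag: String =
    "spark.sql.sources.v2.bucketing.allowJoinKeysSubsetOfPartitionKeys.enabled"

  // Assumed minimal settings; consult the Spark documentation for your version.
  val settings: Map[String, String] = Map(
    "spark.sql.sources.v2.bucketing.enabled" -> "true",
    subsetKeysFlag -> "true"
  )

  // Render the settings as spark-submit arguments: --conf key=value
  def toSubmitArgs(conf: Map[String, String]): Seq[String] =
    conf.toSeq.sortBy(_._1).flatMap { case (k, v) => Seq("--conf", s"$k=$v") }
}
```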

How was this patch tested?

A new unit test is added. It previously failed with ArrayIndexOutOfBoundsException and now passes.

$ build/sbt "sql/testOnly *KeyGroupedPartitioningSuite -- -z SPARK=55411"
...
[info] - bug *** FAILED *** (1 second, 884 milliseconds)
[info]   java.lang.ArrayIndexOutOfBoundsException: Index 1 out of bounds for length 1
[info]   at scala.collection.immutable.ArraySeq$ofRef.apply(ArraySeq.scala:331)
[info]   at org.apache.spark.sql.catalyst.plans.physical.KeyGroupedPartitioning$.$anonfun$project$1(partitioning.scala:471)
[info]   at org.apache.spark.sql.catalyst.plans.physical.KeyGroupedPartitioning$.$anonfun$project$1$adapted(partitioning.scala:471)
[info]   at scala.collection.immutable.ArraySeq.map(ArraySeq.scala:75)
[info]   at scala.collection.immutable.ArraySeq.map(ArraySeq.scala:35)
[info]   at org.apache.spark.sql.catalyst.plans.physical.KeyGroupedPartitioning$.project(partitioning.scala:471)
[info]   at org.apache.spark.sql.execution.KeyGroupedPartitionedScan.$anonfun$getOutputKeyGroupedPartitioning$5(KeyGroupedPartitionedScan.scala:58)
...

Unit tests affected by the change to the bucket() calculation logic have been adjusted.

Was this patch authored or co-authored using generative AI tooling?

No.

pan3793 · 2026-02-11T05:00:53Z

The Python UDF failures are likely unrelated to this change; #54263 attempts to fix them.

…when join keys are less than cluster keys

Closes apache#54182 from pan3793/spj-subset-joinkey-bug.

Authored-by: Cheng Pan <chengpan@apache.org>
Signed-off-by: Peter Toth <peter.toth@gmail.com>

peter-toth

LGTM, pending CI.

…when join keys are less than cluster keys

Backport #54182 to branch-4.0

Closes #54260 from pan3793/SPARK-55411-4.0.

Authored-by: Cheng Pan <chengpan@apache.org>
Signed-off-by: Peter Toth <peter.toth@gmail.com>

peter-toth · 2026-02-11T13:57:21Z

Thank you @pan3793 and @szehon-ho.

Merged to branch-4.0 (4.0.3)

pan3793 mentioned this pull request Feb 11, 2026

[SPARK-55411][SQL] SPJ may throw ArrayIndexOutOfBoundsException when join keys are less than cluster keys #54182

Closed

pan3793 force-pushed the SPARK-55411-4.0 branch from fad1523 to c1996bb Compare February 11, 2026 06:56

szehon-ho approved these changes Feb 11, 2026


peter-toth approved these changes Feb 11, 2026


peter-toth closed this Feb 11, 2026

pan3793 wants to merge 1 commit into apache:branch-4.0 from pan3793:SPARK-55411-4.0
