[SPARK-57038][SQL] Use `Expression.references` in SPJ planning by peter-toth · Pull Request #56088 · apache/spark

peter-toth · 2026-05-24T11:34:05Z

[SPARK-57038][SQL] Use Expression.references in SPJ planning

What changes were proposed in this pull request?

Document AttributeSet's iteration-order contract on the class scaladoc: iteration via iterator / foreach / flatMap returns elements in insertion order (driven by the underlying LinkedHashSet). toSeq is called out as the explicit exception — it sorts by (name, exprId.id) for codegen stability (SPARK-18394).
Migrate seven SPJ-related uses of _.collectLeaves() in partitioning.scala and EnsureRequirements.scala to _.references / AttributeSet.fromAttributeSets(...). Drops the now-redundant .map(_.asInstanceOf[Attribute]) cast at the EnsureRequirements:89 site.
Update EnsureRequirementsSuite synthetic fixtures (exprA..D) from Literal(1..4) to AttributeReferences. The literals were stand-ins for columns; under the migration's _.references-based attribute extraction, literal children produce empty AttributeSets and trip the planner's "exactly one attribute per partition expression" assertions. Real partitionings can't reach those assertions with literal-only transforms because KeyedPartitioning.supportsExpressions's isReference check filters them out — so this is a fixture-only update that preserves test intent.

Why are the changes needed?

TreeNode.collectLeaves() returns every node in the tree where children.isEmpty, including Literals. SPJ planning has always wanted attributes only, but with the existing partition expression layout (TransformExpression.children = [col], parameters carried in a sidecar numBucketsOpt: Option[Int] field), the difference didn't surface.

Follow-up work (e.g. SPARK-50593 / #55885) that puts literal parameters directly into TransformExpression.children (bucket(Literal(numBuckets), col), truncate(col, Literal(width))) would otherwise force TransformExpression to override collectLeaves to filter literals, breaking the universal TreeNode.collectLeaves contract for one expression type.

Expression.references already returns attributes only (filtering literals and other non-attribute leaves), and its insertion-ordered iteration is exactly what positional binding (RowOrdering.create, reorder, attributes.zip(clustering)) requires. The per-partition-expression single-column rule (enforced by KeyedPartitioning.supportsExpressions) ensures within-expression dedup never matters here. Documenting the iteration-order contract lets these call sites rely on the order without implicit dependency on implementation detail.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Covered by existing test suites that exercise the migrated call sites: KeyGroupedPartitioningSuite, EnsureRequirementsSuite, ProjectedOrderingAndPartitioningSuite.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Opus 4.7

dongjoon-hyun

+1, LGTM.

peter-toth · 2026-05-25T10:28:41Z

Thanks @dongjoon-hyun for the review. I've iterated on the PR as I think we can document the "iteration in insertion order" property of AttributeSet and so we don't need the new helper.

cc @sunchao, @szehon-ho

### What changes were proposed in this pull request? - Document `AttributeSet`'s iteration-order contract on the class scaladoc: iteration via `iterator` / `foreach` / `flatMap` returns elements in insertion order (driven by the underlying `LinkedHashSet`). `toSeq` is called out as the explicit exception — it sorts by `(name, exprId.id)` for codegen stability (SPARK-18394). - Migrate seven SPJ-related uses of `_.collectLeaves()` in `partitioning.scala` and `EnsureRequirements.scala` to `_.references` / `AttributeSet.fromAttributeSets(...)`. Drops the now-redundant `.map(_.asInstanceOf[Attribute])` cast at the `EnsureRequirements:89` site. - Update `EnsureRequirementsSuite` synthetic fixtures (`exprA..D`) from `Literal(1..4)` to `AttributeReference`s. The literals were stand-ins for columns; under the migration's `_.references`-based attribute extraction, literal children produce empty `AttributeSet`s and trip the planner's "exactly one attribute per partition expression" assertions. Real partitionings can't reach those assertions with literal-only transforms because `KeyedPartitioning.supportsExpressions`'s `isReference` check filters them out — so this is a fixture-only update that preserves test intent. ### Why are the changes needed? `TreeNode.collectLeaves()` returns every node in the tree where `children.isEmpty`, including `Literal`s. SPJ planning has always wanted attributes only, but with the existing partition expression layout (`TransformExpression.children = [col]`, parameters carried in a sidecar `numBucketsOpt: Option[Int]` field), the difference didn't surface. Follow-up work (e.g. SPARK-50593 / apache#55885) that puts literal parameters directly into `TransformExpression.children` (`bucket(Literal(numBuckets), col)`, `truncate(col, Literal(width))`) would otherwise force `TransformExpression` to override `collectLeaves` to filter literals, breaking the universal `TreeNode.collectLeaves` contract for one expression type. `Expression.references` already returns attributes only (filtering literals and other non-attribute leaves), and its insertion-ordered iteration is exactly what positional binding (`RowOrdering.create`, `reorder`, `attributes.zip(clustering)`) requires. The per-partition-expression single-column rule (enforced by `KeyedPartitioning.supportsExpressions`) ensures within-expression dedup never matters here. Documenting the iteration-order contract lets these call sites rely on the order without implicit dependency on implementation detail. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Covered by existing test suites that exercise the migrated call sites: `KeyGroupedPartitioningSuite`, `EnsureRequirementsSuite`, `ProjectedOrderingAndPartitioningSuite`. ### Was this patch authored or co-authored using generative AI tooling? Generated-by: Claude Opus 4.7

peter-toth mentioned this pull request May 24, 2026

[SPARK-50593][SQL] SPJ: Support truncate transform via generalized ReducibleFunction API #55885

Open

peter-toth force-pushed the SPARK-57038-revisit-collectleaves branch from 6772c12 to ae2c119 Compare May 24, 2026 14:15

dongjoon-hyun approved these changes May 24, 2026

View reviewed changes

peter-toth force-pushed the SPARK-57038-revisit-collectleaves branch from ae2c119 to cdbcc6a Compare May 25, 2026 10:18

peter-toth changed the title ~~[SPARK-57038][SQL] Use references/collectAttributes in SPJ planning~~ [SPARK-57038][SQL] Use Expression.references in SPJ planning May 25, 2026

peter-toth force-pushed the SPARK-57038-revisit-collectleaves branch from cdbcc6a to a63baef Compare May 25, 2026 10:31

peter-toth force-pushed the SPARK-57038-revisit-collectleaves branch from a63baef to 359a1a0 Compare May 25, 2026 14:46

dongjoon-hyun approved these changes May 25, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[SPARK-57038][SQL] Use `Expression.references` in SPJ planning#56088

[SPARK-57038][SQL] Use `Expression.references` in SPJ planning#56088
peter-toth wants to merge 1 commit into
apache:masterfrom
peter-toth:SPARK-57038-revisit-collectleaves

peter-toth commented May 24, 2026 •

edited

Loading

Uh oh!

dongjoon-hyun left a comment

Uh oh!

peter-toth commented May 25, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

peter-toth commented May 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What changes were proposed in this pull request?

Why are the changes needed?

Does this PR introduce any user-facing change?

How was this patch tested?

Was this patch authored or co-authored using generative AI tooling?

Uh oh!

dongjoon-hyun left a comment

Choose a reason for hiding this comment

Uh oh!

peter-toth commented May 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

peter-toth commented May 24, 2026 •

edited

Loading

peter-toth commented May 25, 2026 •

edited

Loading