Skip to content

[SPARK-57038][SQL] Use Expression.references in SPJ planning#56088

Open
peter-toth wants to merge 1 commit into
apache:masterfrom
peter-toth:SPARK-57038-revisit-collectleaves
Open

[SPARK-57038][SQL] Use Expression.references in SPJ planning#56088
peter-toth wants to merge 1 commit into
apache:masterfrom
peter-toth:SPARK-57038-revisit-collectleaves

Conversation

@peter-toth
Copy link
Copy Markdown
Contributor

@peter-toth peter-toth commented May 24, 2026

[SPARK-57038][SQL] Use Expression.references in SPJ planning

What changes were proposed in this pull request?

  • Document AttributeSet's iteration-order contract on the class scaladoc: iteration via iterator / foreach / flatMap returns elements in insertion order (driven by the underlying LinkedHashSet). toSeq is called out as the explicit exception — it sorts by (name, exprId.id) for codegen stability (SPARK-18394).
  • Migrate seven SPJ-related uses of _.collectLeaves() in partitioning.scala and EnsureRequirements.scala to _.references / AttributeSet.fromAttributeSets(...). Drops the now-redundant .map(_.asInstanceOf[Attribute]) cast at the EnsureRequirements:89 site.
  • Update EnsureRequirementsSuite synthetic fixtures (exprA..D) from Literal(1..4) to AttributeReferences. The literals were stand-ins for columns; under the migration's _.references-based attribute extraction, literal children produce empty AttributeSets and trip the planner's "exactly one attribute per partition expression" assertions. Real partitionings can't reach those assertions with literal-only transforms because KeyedPartitioning.supportsExpressions's isReference check filters them out — so this is a fixture-only update that preserves test intent.

Why are the changes needed?

TreeNode.collectLeaves() returns every node in the tree where children.isEmpty, including Literals. SPJ planning has always wanted attributes only, but with the existing partition expression layout (TransformExpression.children = [col], parameters carried in a sidecar numBucketsOpt: Option[Int] field), the difference didn't surface.

Follow-up work (e.g. SPARK-50593 / #55885) that puts literal parameters directly into TransformExpression.children (bucket(Literal(numBuckets), col), truncate(col, Literal(width))) would otherwise force TransformExpression to override collectLeaves to filter literals, breaking the universal TreeNode.collectLeaves contract for one expression type.

Expression.references already returns attributes only (filtering literals and other non-attribute leaves), and its insertion-ordered iteration is exactly what positional binding (RowOrdering.create, reorder, attributes.zip(clustering)) requires. The per-partition-expression single-column rule (enforced by KeyedPartitioning.supportsExpressions) ensures within-expression dedup never matters here. Documenting the iteration-order contract lets these call sites rely on the order without implicit dependency on implementation detail.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Covered by existing test suites that exercise the migrated call sites: KeyGroupedPartitioningSuite, EnsureRequirementsSuite, ProjectedOrderingAndPartitioningSuite.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Opus 4.7

Copy link
Copy Markdown
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, LGTM.

@peter-toth peter-toth force-pushed the SPARK-57038-revisit-collectleaves branch from ae2c119 to cdbcc6a Compare May 25, 2026 10:18
@peter-toth peter-toth changed the title [SPARK-57038][SQL] Use references/collectAttributes in SPJ planning [SPARK-57038][SQL] Use Expression.references in SPJ planning May 25, 2026
@peter-toth
Copy link
Copy Markdown
Contributor Author

peter-toth commented May 25, 2026

Thanks @dongjoon-hyun for the review. I've iterated on the PR as I think we can document the "iteration in insertion order" property of AttributeSet and so we don't need the new helper.

cc @sunchao, @szehon-ho

@peter-toth peter-toth force-pushed the SPARK-57038-revisit-collectleaves branch from cdbcc6a to a63baef Compare May 25, 2026 10:31
### What changes were proposed in this pull request?

- Document `AttributeSet`'s iteration-order contract on the class scaladoc: iteration via `iterator` / `foreach` / `flatMap` returns elements in insertion order (driven by the underlying `LinkedHashSet`). `toSeq` is called out as the explicit exception — it sorts by `(name, exprId.id)` for codegen stability (SPARK-18394).
- Migrate seven SPJ-related uses of `_.collectLeaves()` in `partitioning.scala` and `EnsureRequirements.scala` to `_.references` / `AttributeSet.fromAttributeSets(...)`. Drops the now-redundant `.map(_.asInstanceOf[Attribute])` cast at the `EnsureRequirements:89` site.
- Update `EnsureRequirementsSuite` synthetic fixtures (`exprA..D`) from `Literal(1..4)` to `AttributeReference`s. The literals were stand-ins for columns; under the migration's `_.references`-based attribute extraction, literal children produce empty `AttributeSet`s and trip the planner's "exactly one attribute per partition expression" assertions. Real partitionings can't reach those assertions with literal-only transforms because `KeyedPartitioning.supportsExpressions`'s `isReference` check filters them out — so this is a fixture-only update that preserves test intent.

### Why are the changes needed?

`TreeNode.collectLeaves()` returns every node in the tree where `children.isEmpty`, including `Literal`s. SPJ planning has always wanted attributes only, but with the existing partition expression layout (`TransformExpression.children = [col]`, parameters carried in a sidecar `numBucketsOpt: Option[Int]` field), the difference didn't surface.

Follow-up work (e.g. SPARK-50593 / apache#55885) that puts literal parameters directly into `TransformExpression.children` (`bucket(Literal(numBuckets), col)`, `truncate(col, Literal(width))`) would otherwise force `TransformExpression` to override `collectLeaves` to filter literals, breaking the universal `TreeNode.collectLeaves` contract for one expression type.

`Expression.references` already returns attributes only (filtering literals and other non-attribute leaves), and its insertion-ordered iteration is exactly what positional binding (`RowOrdering.create`, `reorder`, `attributes.zip(clustering)`) requires. The per-partition-expression single-column rule (enforced by `KeyedPartitioning.supportsExpressions`) ensures within-expression dedup never matters here. Documenting the iteration-order contract lets these call sites rely on the order without implicit dependency on implementation detail.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Covered by existing test suites that exercise the migrated call sites: `KeyGroupedPartitioningSuite`, `EnsureRequirementsSuite`, `ProjectedOrderingAndPartitioningSuite`.

### Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Opus 4.7
@peter-toth peter-toth force-pushed the SPARK-57038-revisit-collectleaves branch from a63baef to 359a1a0 Compare May 25, 2026 14:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants