[SPARK-57038][SQL] Use Expression.references in SPJ planning#56088
Open
peter-toth wants to merge 1 commit into
Open
[SPARK-57038][SQL] Use Expression.references in SPJ planning#56088peter-toth wants to merge 1 commit into
Expression.references in SPJ planning#56088peter-toth wants to merge 1 commit into
Conversation
6772c12 to
ae2c119
Compare
ae2c119 to
cdbcc6a
Compare
Expression.references in SPJ planning
Contributor
Author
|
Thanks @dongjoon-hyun for the review. I've iterated on the PR as I think we can document the "iteration in insertion order" property of cc @sunchao, @szehon-ho |
cdbcc6a to
a63baef
Compare
### What changes were proposed in this pull request? - Document `AttributeSet`'s iteration-order contract on the class scaladoc: iteration via `iterator` / `foreach` / `flatMap` returns elements in insertion order (driven by the underlying `LinkedHashSet`). `toSeq` is called out as the explicit exception — it sorts by `(name, exprId.id)` for codegen stability (SPARK-18394). - Migrate seven SPJ-related uses of `_.collectLeaves()` in `partitioning.scala` and `EnsureRequirements.scala` to `_.references` / `AttributeSet.fromAttributeSets(...)`. Drops the now-redundant `.map(_.asInstanceOf[Attribute])` cast at the `EnsureRequirements:89` site. - Update `EnsureRequirementsSuite` synthetic fixtures (`exprA..D`) from `Literal(1..4)` to `AttributeReference`s. The literals were stand-ins for columns; under the migration's `_.references`-based attribute extraction, literal children produce empty `AttributeSet`s and trip the planner's "exactly one attribute per partition expression" assertions. Real partitionings can't reach those assertions with literal-only transforms because `KeyedPartitioning.supportsExpressions`'s `isReference` check filters them out — so this is a fixture-only update that preserves test intent. ### Why are the changes needed? `TreeNode.collectLeaves()` returns every node in the tree where `children.isEmpty`, including `Literal`s. SPJ planning has always wanted attributes only, but with the existing partition expression layout (`TransformExpression.children = [col]`, parameters carried in a sidecar `numBucketsOpt: Option[Int]` field), the difference didn't surface. Follow-up work (e.g. SPARK-50593 / apache#55885) that puts literal parameters directly into `TransformExpression.children` (`bucket(Literal(numBuckets), col)`, `truncate(col, Literal(width))`) would otherwise force `TransformExpression` to override `collectLeaves` to filter literals, breaking the universal `TreeNode.collectLeaves` contract for one expression type. `Expression.references` already returns attributes only (filtering literals and other non-attribute leaves), and its insertion-ordered iteration is exactly what positional binding (`RowOrdering.create`, `reorder`, `attributes.zip(clustering)`) requires. The per-partition-expression single-column rule (enforced by `KeyedPartitioning.supportsExpressions`) ensures within-expression dedup never matters here. Documenting the iteration-order contract lets these call sites rely on the order without implicit dependency on implementation detail. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Covered by existing test suites that exercise the migrated call sites: `KeyGroupedPartitioningSuite`, `EnsureRequirementsSuite`, `ProjectedOrderingAndPartitioningSuite`. ### Was this patch authored or co-authored using generative AI tooling? Generated-by: Claude Opus 4.7
a63baef to
359a1a0
Compare
dongjoon-hyun
approved these changes
May 25, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
[SPARK-57038][SQL] Use
Expression.referencesin SPJ planningWhat changes were proposed in this pull request?
AttributeSet's iteration-order contract on the class scaladoc: iteration viaiterator/foreach/flatMapreturns elements in insertion order (driven by the underlyingLinkedHashSet).toSeqis called out as the explicit exception — it sorts by(name, exprId.id)for codegen stability (SPARK-18394)._.collectLeaves()inpartitioning.scalaandEnsureRequirements.scalato_.references/AttributeSet.fromAttributeSets(...). Drops the now-redundant.map(_.asInstanceOf[Attribute])cast at theEnsureRequirements:89site.EnsureRequirementsSuitesynthetic fixtures (exprA..D) fromLiteral(1..4)toAttributeReferences. The literals were stand-ins for columns; under the migration's_.references-based attribute extraction, literal children produce emptyAttributeSets and trip the planner's "exactly one attribute per partition expression" assertions. Real partitionings can't reach those assertions with literal-only transforms becauseKeyedPartitioning.supportsExpressions'sisReferencecheck filters them out — so this is a fixture-only update that preserves test intent.Why are the changes needed?
TreeNode.collectLeaves()returns every node in the tree wherechildren.isEmpty, includingLiterals. SPJ planning has always wanted attributes only, but with the existing partition expression layout (TransformExpression.children = [col], parameters carried in a sidecarnumBucketsOpt: Option[Int]field), the difference didn't surface.Follow-up work (e.g. SPARK-50593 / #55885) that puts literal parameters directly into
TransformExpression.children(bucket(Literal(numBuckets), col),truncate(col, Literal(width))) would otherwise forceTransformExpressionto overridecollectLeavesto filter literals, breaking the universalTreeNode.collectLeavescontract for one expression type.Expression.referencesalready returns attributes only (filtering literals and other non-attribute leaves), and its insertion-ordered iteration is exactly what positional binding (RowOrdering.create,reorder,attributes.zip(clustering)) requires. The per-partition-expression single-column rule (enforced byKeyedPartitioning.supportsExpressions) ensures within-expression dedup never matters here. Documenting the iteration-order contract lets these call sites rely on the order without implicit dependency on implementation detail.Does this PR introduce any user-facing change?
No.
How was this patch tested?
Covered by existing test suites that exercise the migrated call sites:
KeyGroupedPartitioningSuite,EnsureRequirementsSuite,ProjectedOrderingAndPartitioningSuite.Was this patch authored or co-authored using generative AI tooling?
Generated-by: Claude Opus 4.7