Skip to content

[CORE] Preserve fallback tag for nodes without logicalLink in RemoveFallbackTagRule#12028

Merged
liujiayi771 merged 1 commit intoapache:mainfrom
liujiayi771:fallback-reason
May 9, 2026
Merged

[CORE] Preserve fallback tag for nodes without logicalLink in RemoveFallbackTagRule#12028
liujiayi771 merged 1 commit intoapache:mainfrom
liujiayi771:fallback-reason

Conversation

@liujiayi771
Copy link
Copy Markdown
Contributor

@liujiayi771 liujiayi771 commented May 3, 2026

Summary

For SparkPlan nodes that lack a logicalLink (e.g., SortExec / BroadcastExchange generated by EnsureRequirements), GlutenFallbackReporter cannot copy their FallbackTag onto a logical plan, so the subsequent RemoveFallbackTagRule removed their tag without persisting it anywhere. This caused the fallback reason to be lost when GlutenQueryExecutionListener re-walks the finalized plan at SQL execution end, resulting in a generic "Gluten does not touch it or does not support it" reason instead of the actual fallback reason.

In #11853, the test assertion for "get correct fallback reason on nodes without logicalLink" was changed from forall to exists, which hid this issue — among the fallbackReasons, one event's reason had degraded to the generic message because the physical plan tag was removed by RemoveFallbackTagRule, but exists only required some reasons to be correct, masking the fact that others were wrong.

Fix: In RemoveFallbackTagRule, skip untagging for nodes that have a FallbackTag but no logicalLink. Their tag stays on the physical node and is read by GlutenExplainUtils.handleVanillaSparkPlan via FallbackTags.getOption(p) (the existing fallback branch) in both the per-stage GlutenFallbackReporter event and the end-of-SQL plan walk. The tag is regenerated by Gluten validation on every AQE stage's postStageCreationRules, so it survives across AQE re-planning iterations as well.

With this fix, the test assertion is changed back to forall, verifying that all fallback events now report the correct reason.

Why not synthesize a logicalLink?

An earlier attempt synthesized a dummy LeafNode (FallbackLogicalPlan) and attached it via SparkPlan.LOGICAL_PLAN_TAG so the existing p.logicalLink.flatMap(FallbackTags.getOption) path would find it. That approach poisoned AdaptiveSparkPlanExec.setLogicalLinkForNewQueryStage, which intentionally relies on EnsureRequirements-generated nodes having no logicalLink in order to walk down to the real logical node. With a synthetic link in place, the stage's logicalLink was set to the dummy, which never appeared in the original logical plan; AQE's replaceWithQueryStages then could not insert the LogicalQueryStage wrapper, breaking AQE rules that match LogicalQueryStage(_, stage: ...QueryStageExec) such as AQEPropagateEmptyRelation.
Keeping the tag on the physical node sidesteps the AQE contract entirely.

Test plan

  • Existing test "get correct fallback reason on nodes without logicalLink" now uses forall and verifies all fallback reasons are correct

@github-actions github-actions Bot added CORE works for Gluten Core VELOX labels May 3, 2026
@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 3, 2026

Run Gluten Clickhouse CI on x86

@liujiayi771 liujiayi771 marked this pull request as draft May 4, 2026 05:42
@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 4, 2026

Run Gluten Clickhouse CI on x86

@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 4, 2026

Run Gluten Clickhouse CI on x86

@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 4, 2026

Run Gluten Clickhouse CI on x86

1 similar comment
@liujiayi771
Copy link
Copy Markdown
Contributor Author

Run Gluten Clickhouse CI on x86

@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 5, 2026

Run Gluten Clickhouse CI on x86

@liujiayi771 liujiayi771 requested a review from zhztheplayer May 5, 2026 10:01
@liujiayi771
Copy link
Copy Markdown
Contributor Author

Hi @zhztheplayer @Zouxxyy. After #11853, GlutenQueryExecutionListener.onSQLExecutionEnd re-walks qe.executedPlan and posts a fresh GlutenPlanFallbackEvent, which overwrites the per-stage events in kvstore (last write wins). By that time every AQE stage has already gone through RemoveFallbackTagRule, so for nodes without a logicalLink (e.g. SortExec / BroadcastExchange from EnsureRequirements) the FallbackTag is gone. End result: the regression #11295 tried to fix is back, just hidden by the test being relaxed from forall to exists in #11853.

This PR takes the simplest fix: don't untag physical nodes that have no logicalLink in RemoveFallbackTagRule. Any downside you can think of to leaving the FallbackTag on the physical node in this case?

@liujiayi771 liujiayi771 marked this pull request as ready for review May 5, 2026 10:06
@zhztheplayer
Copy link
Copy Markdown
Member

For SparkPlan nodes that lack a logicalLink (e.g., SortExec / BroadcastExchange generated by EnsureRequirements), GlutenFallbackReporter cannot copy their FallbackTag onto a logical plan,

Just curious, does this behavior differ across Spark versions? I checked latest Spark code and the rule adopts transformUp which should've copied logical links to the sort / exchange nodes: https://github.com/apache/spark/blob/484bb5b2cd718d27872235a30c1af69d5d4d34d7/sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/EnsureRequirements.scala#L891

@liujiayi771
Copy link
Copy Markdown
Contributor Author

@zhztheplayer You're right that EnsureRequirements.apply uses transformUp, but transformUp only carries over tags (including logicalLink) to the node being directly replaced via copyTagsFrom. It does NOT set logicalLink on newly constructed child nodes.

In ensureDistributionAndOrdering, the new SortExec(...) / ShuffleExchangeExec(...) / BroadcastExchangeExec(...) are created directly without any setLogicalLink call. The transformUp rule matches the parent operator (e.g., SMJ) and calls reordered.withNewChildren(newChildren), which preserves the parent's logicalLink but does not propagate it to the newly inserted children.

Copy link
Copy Markdown
Member

@zhztheplayer zhztheplayer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are fallback tags really as helpful as expected? If they just represent the fallback reasons that a simple logging mechanism can handle. Not meant to question the base design, but what I am seeing is the tagging logic is getting complicated. PR LGTM by the way.

…allbackTagRule

For SparkPlan nodes that lack a logicalLink (e.g., SortExec / BroadcastExchange
generated by EnsureRequirements), GlutenFallbackReporter cannot copy the fallback
reason onto a logical plan, so removing the FallbackTag on the physical node makes
GlutenQueryExecutionListener fall back to the generic 'Gluten does not touch it
or does not support it' reason at SQL execution end.

Fix: in RemoveFallbackTagRule, skip untagging for such nodes. Their tag stays on
the physical node and is read by GlutenExplainUtils.handleVanillaSparkPlan via
FallbackTags.getOption(p) in both the per-stage GlutenFallbackReporter event and
the end-of-SQL plan walk.

Note: an earlier attempt synthesized a dummy LeafNode and attached it via
SparkPlan.LOGICAL_PLAN_TAG. That poisoned AdaptiveSparkPlanExec.setLogicalLinkFor-
NewQueryStage, which relies on EnsureRequirements-generated nodes having no
logicalLink in order to walk down to the real logical node, and broke AQE-driven
join/empty-relation rewrites (SPARK-34533, SPARK-34781, SPARK-39551).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 9, 2026

Run Gluten Clickhouse CI on x86

@liujiayi771 liujiayi771 merged commit eae60a5 into apache:main May 9, 2026
60 checks passed
@liujiayi771 liujiayi771 deleted the fallback-reason branch May 9, 2026 07:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CORE works for Gluten Core VELOX

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants