HIVE-29613: [Optimizer] Cross-product join falls back to single-reducer shuffle merge when small-side row estimate marginally exceeds hive.xprod.mapjoin.small.table.rows #6484

Open
armitage420 wants to merge 1 commit into apache:master from armitage420:HIVE-29613

Contributor

@armitage420 armitage420 commented May 14, 2026

What changes were proposed in this pull request?

Widen the cross-product gate in ConvertJoinMapJoin#getMapJoinConversion: when the smaller (broadcast-candidate) side's row count exceeds hive.xprod.mapjoin.small.table.rows but its onlineDataSize still fits hive.auto.convert.join.noconditionaltask.size, allow conversion to a broadcast map-join via a new helper crossProductBuildSideWithinBroadcastBudgetAfterRowCheck.
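
The decision described above can be sketched roughly as follows. This is a minimal illustration, not the code in the patch: the class name, method names, and the numeric values in `main` are hypothetical, and the sketch models only the cross-product row gate, not any other check that `getMapJoinConversion` performs.

```java
// Hypothetical sketch of the gate logic; names and values are illustrative
// assumptions and are not taken from the patch.
public final class CrossProductGateSketch {

    private CrossProductGateSketch() {}

    /** Behaviour before the change, as described: only the row count is consulted. */
    public static boolean rowCheckOnly(long smallSideRows, long maxSmallTableRows) {
        return smallSideRows <= maxSmallTableRows;
    }

    /**
     * Widened gate: if the row check fails, still allow the broadcast when the
     * small side's estimated in-memory size fits the broadcast byte budget.
     */
    public static boolean rowCheckWithByteFallback(long smallSideRows,
                                                   long maxSmallTableRows,
                                                   long smallSideBytes,
                                                   long broadcastBudgetBytes) {
        if (smallSideRows <= maxSmallTableRows) {
            return true; // passes the original row gate
        }
        // Row estimate overshoots; fall back to the byte budget.
        return smallSideBytes <= broadcastBudgetBytes;
    }

    public static void main(String[] args) {
        long rowLimit = 1L;              // assumed value for hive.xprod.mapjoin.small.table.rows
        long budgetBytes = 10_000_000L;  // assumed value for hive.auto.convert.join.noconditionaltask.size

        // 3 estimated rows, 300 bytes: rejected by the row-only gate, accepted by the widened one.
        System.out.println(rowCheckOnly(3, rowLimit));                                     // false
        System.out.println(rowCheckWithByteFallback(3, rowLimit, 300L, budgetBytes));      // true

        // 3 estimated rows but 50 MB: still rejected, since it does not fit the budget.
        System.out.println(rowCheckWithByteFallback(3, rowLimit, 50_000_000L, budgetBytes)); // false
    }
}
```

The byte check runs only after the row check fails, so joins that already passed the row gate behave as before under this sketch.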

Why are the changes needed?

The current gate consults only the row count of the smaller (broadcast-candidate) side. NDV-driven filter selectivity routinely estimates tiny lookups at 2–3 rows even when their actual byte footprint is only a few hundred bytes, well within the broadcast budget. The gate rejects these safe broadcasts, the join falls back to a MERGEJOIN with XPROD_EDGE inputs, and the keyless shuffle collapses onto a single reducer that materialises the entire Cartesian product. See HIVE-29613 for the full analysis.

Does this PR introduce any user-facing change?

Query results are unchanged. However, for cross-product joins where the small side overshoots the row threshold but still fits the broadcast byte budget, EXPLAIN now shows MAPJOIN with BROADCAST_EDGE instead of MERGEJOIN with XPROD_EDGE on a Reducer.

How was this patch tested?

Added five new unit tests to the TestConvertJoinMapJoin class.

armitage420 changed the title from "[WIP] HIVE-29613: Cross-product join falls back to single-reducer shuffle merge when small-side row estimate marginally exceeds hive.xprod.mapjoin.small.table.rows" to "[WIP] HIVE-29613: [Optimizer] Cross-product join falls back to single-reducer shuffle merge when small-side row estimate marginally exceeds hive.xprod.mapjoin.small.table.rows" on May 14, 2026
…erge when small-side row estimate marginally exceeds hive.xprod.mapjoin.small.table.rows
armitage420 changed the title from "[WIP] HIVE-29613: [Optimizer] Cross-product join falls back to single-reducer shuffle merge when small-side row estimate marginally exceeds hive.xprod.mapjoin.small.table.rows" to "HIVE-29613: [Optimizer] Cross-product join falls back to single-reducer shuffle merge when small-side row estimate marginally exceeds hive.xprod.mapjoin.small.table.rows" on May 14, 2026
kravii commented May 15, 2026

LGTM +1
Well-targeted fix for the NDV-estimation overshoot issue in cross-product joins.


3 participants