HIVE-29598: Fix vectorized outer join wrong results due to stale scratch column values #6486

Open
ryukobayashi wants to merge 1 commit into apache:master from ryukobayashi:HIVE-29598

@ryukobayashi
Contributor

What changes were proposed in this pull request?

In vectorized outer join, generateOuterNulls() and generateOuterNullsRepeatedAll() set isNull[i] = true on scratch columns but leave vector[i] untouched. When hive.vectorized.reuse.scratch.columns=true (the default), a scratch column slot freed after an expression evaluation (e.g. CastStringToLong) can be reused for the outer join's null-marking column. After reset() clears isNull[], the expression overwrites vector[i] with a fresh value (e.g. 2025). Later, generateOuterNulls() sets isNull[i] = true without clearing vector[i], leaving a stale non-zero value.

Downstream operators such as ColOrCol read vector[i] directly to distinguish "false" (== 0) from "null" (!= 0). The stale value causes null rows to be misinterpreted as "true", producing wrong OR/AND/CASE WHEN results.
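The sequence above can be illustrated with a minimal, self-contained sketch. It does not use Hive's classes: LongCol, markNullFlagOnly() and readsAsTrue() are hypothetical stand-ins that only mimic the behavior described in this PR (reset() clears the null flags but not the values, and the consumer looks at the raw value).

```java
class StaleScratchDemo {
    /** Hypothetical stand-in for a long column vector; NOT Hive's LongColumnVector. */
    public static final class LongCol {
        public final long[] vector;
        public final boolean[] isNull;
        public boolean noNulls = true;

        public LongCol(int size) {
            vector = new long[size];
            isNull = new boolean[size];
        }

        /** Clears the null flags only; the values in vector[] are left as they were. */
        public void reset() {
            java.util.Arrays.fill(isNull, false);
            noNulls = true;
        }
    }

    /** Marks row i as null by setting the flag only, as the unfixed path is described to do. */
    public static void markNullFlagOnly(LongCol col, int i) {
        col.isNull[i] = true;
        col.noNulls = false;
    }

    /** Hypothetical consumer that treats any non-zero raw value as true. */
    public static boolean readsAsTrue(LongCol col, int i) {
        return col.vector[i] != 0;
    }

    public static void main(String[] args) {
        LongCol scratch = new LongCol(1);
        scratch.reset();              // slot is handed out for reuse
        scratch.vector[0] = 2025L;    // an expression such as a cast writes its result
        markNullFlagOnly(scratch, 0); // outer join later marks the row null, value stays
        System.out.println("isNull=" + scratch.isNull[0]
            + ", raw value read as true=" + readsAsTrue(scratch, 0));
    }
}
```

In this sketch the row is flagged null while its raw value is still 2025, so a consumer that inspects the raw value sees a non-zero entry.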

The fix adds clearVectorValue(), called whenever isNull[i] is set to true in the outer join null-marking paths, zeroing vector[i] for all supported column vector types (LongColumnVector, DoubleColumnVector, BytesColumnVector, TimestampColumnVector, IntervalDayTimeColumnVector).
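As a rough illustration of the idea, the sketch below zeroes the value slot in the same step that sets the null flag. It is not the code from this PR: Col, LongCol, DoubleCol, BytesCol and markNull() are hypothetical stand-ins, only three of the five column types are modeled, and how the real clearVectorValue() handles each Hive type is an assumption here.

```java
class ClearVectorValueSketch {
    /** Hypothetical base type holding the null flags; NOT Hive's ColumnVector. */
    public abstract static class Col {
        public final boolean[] isNull;
        public boolean noNulls = true;

        protected Col(int size) {
            isNull = new boolean[size];
        }
    }

    public static final class LongCol extends Col {
        public final long[] vector;
        public LongCol(int size) { super(size); vector = new long[size]; }
    }

    public static final class DoubleCol extends Col {
        public final double[] vector;
        public DoubleCol(int size) { super(size); vector = new double[size]; }
    }

    public static final class BytesCol extends Col {
        public final byte[][] vector;
        public final int[] start;
        public final int[] length;
        public BytesCol(int size) {
            super(size);
            vector = new byte[size][];
            start = new int[size];
            length = new int[size];
        }
    }

    /** Resets the value stored for row i to a neutral value, dispatching on the column type. */
    public static void clearVectorValue(Col col, int i) {
        if (col instanceof LongCol) {
            ((LongCol) col).vector[i] = 0L;
        } else if (col instanceof DoubleCol) {
            ((DoubleCol) col).vector[i] = 0.0;
        } else if (col instanceof BytesCol) {
            BytesCol bytes = (BytesCol) col;
            bytes.vector[i] = null;
            bytes.start[i] = 0;
            bytes.length[i] = 0;
        }
        // Any other column type is left untouched in this sketch.
    }

    /** Marks row i null and clears its value in the same step. */
    public static void markNull(Col col, int i) {
        col.isNull[i] = true;
        col.noNulls = false;
        clearVectorValue(col, i);
    }
}
```

Pairing the two writes in one helper means a row can never be flagged null while still carrying a value from an earlier use of the slot.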

Why are the changes needed?

Without the fix, vectorized outer joins silently return wrong results when scratch column reuse is enabled (the default). The bug is non-obvious because it only triggers when two conditions are met together: a type-casting expression allocates a scratch column that is later reused for the outer join's null-marking column, and the join result is consumed by a boolean operator that reads the raw vector value for null discrimination. Users have no indication that results are wrong. The only workarounds are to disable vectorization entirely (hive.vectorized.execution.enabled=false) or to disable scratch column reuse (hive.vectorized.reuse.scratch.columns=false), both of which carry a significant performance cost.

Does this PR introduce any user-facing change?

No

How was this patch tested?

The existing TestMapJoinOperator suite (17 tests) passes without regression. The bug can also be verified manually with the minimal SQL reproducer below. With the fix applied, the result matches the expected output (r 3 new, s 3 new), which was previously only obtainable by disabling vectorization.

  CREATE TABLE src (k STRING, v STRING);
  INSERT INTO src VALUES
    ('p','1'),('p','2'),('p','3'),
    ('q','2'),('q','3'),
    ('r','3'),
    ('s','3');

  WITH base AS (
    SELECT k, v FROM src GROUP BY k, v
  ),
  classified AS (
    SELECT t1.k, t1.v,
           CASE WHEN COALESCE(t2.k,'') = '' THEN 'new'
                WHEN COALESCE(t3.k,'') = '' THEN 'two_step'
                ELSE 'three_step' END AS status
    FROM base t1
    LEFT JOIN base t2
      ON  t1.k = t2.k
      AND CAST(t1.v AS INT) - 1 = CAST(t2.v AS INT)
    LEFT JOIN base t3
      ON  t1.k = t3.k
      AND CAST(t1.v AS INT) - 2 = CAST(t3.v AS INT)
    WHERE CAST(t1.v AS INT) >= 3
    GROUP BY t1.k, t1.v,
             CASE WHEN COALESCE(t2.k,'') = '' THEN 'new'
                  WHEN COALESCE(t3.k,'') = '' THEN 'two_step'
                  ELSE 'three_step' END
  )
  SELECT * FROM classified WHERE status = 'new';
