Skip to content

[SPARK-56745][SQL] Cache foldable ZoneId in ConvertTimezone to avoid per-row lookup#55714

Open
TongWei1105 wants to merge 1 commit intoapache:masterfrom
TongWei1105:SPARK-convert-timezone-cache
Open

[SPARK-56745][SQL] Cache foldable ZoneId in ConvertTimezone to avoid per-row lookup#55714
TongWei1105 wants to merge 1 commit intoapache:masterfrom
TongWei1105:SPARK-convert-timezone-cache

Conversation

@TongWei1105
Copy link
Copy Markdown
Contributor

What changes were proposed in this pull request?

ConvertTimezone now resolves foldable source/target timezone arguments once (in @transient lazy val for the interpreted path, in addMutableState for the codegen path) instead of calling DateTimeUtils.getZoneId on every row. A new ZoneId-typed overload of DateTimeUtils.convertTimestampNtzToAnotherTz is added so the cached zones can be passed through directly.

Why are the changes needed?

The typical call site is convert_timezone('UTC', 'America/Los_Angeles', ts_col) where both timezone arguments are constant literals. Previously every row went through getZoneId twice -- zone-id normalization plus a ZoneRulesProvider map lookup, which is not free even when the resulting ZoneId is cached. The codegen paths of sibling expressions FromUTCTimestamp / ToUTCTimestamp already cache the foldable ZoneId via addMutableState; this brings ConvertTimezone in line.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

DateExpressionsSuite (which exercises both interpreted and codegen paths via checkEvaluation, plus checkConsistencyBetweenInterpretedAndCodegen across all foldable timezone pairs) -- all 76 tests pass. A new test SPARK-56745: convert_timezone with non-foldable timezone arguments covers the fallback path where source/target timezones are resolved per row, complementing the foldable-cache path covered by SPARK-37552.

Was this patch authored or co-authored using generative AI tooling?

Yes, drafted with assistance from Claude Code (Anthropic).

…per-row lookup

### What changes were proposed in this pull request?

`ConvertTimezone` now resolves foldable source/target timezone arguments once
(in `@transient lazy val` for the interpreted path, in `addMutableState` for
the codegen path) instead of calling `DateTimeUtils.getZoneId` on every row.
A new `ZoneId`-typed overload of `DateTimeUtils.convertTimestampNtzToAnotherTz`
is added so the cached zones can be passed through directly.

### Why are the changes needed?

The typical call site is `convert_timezone('UTC', 'America/Los_Angeles', ts_col)`
where both timezone arguments are constant literals. Previously every row went
through `getZoneId` twice -- zone-id normalization plus a `ZoneRulesProvider`
map lookup, which is not free even when the resulting `ZoneId` is cached. The
codegen paths of sibling expressions `FromUTCTimestamp` / `ToUTCTimestamp`
already cache the foldable `ZoneId` via `addMutableState`; this brings
`ConvertTimezone` in line.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

`DateExpressionsSuite` (which exercises both interpreted and codegen paths via
`checkEvaluation`, plus `checkConsistencyBetweenInterpretedAndCodegen` across
all foldable timezone pairs) -- all 76 tests pass. A new test
`SPARK-56745: convert_timezone with non-foldable timezone arguments` covers
the fallback path where source/target timezones are resolved per row,
complementing the foldable-cache path covered by SPARK-37552.

### Was this patch authored or co-authored using generative AI tooling?

Yes, drafted with assistance from Claude Code (Anthropic).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant