Skip to content

[SPARK-56968][SS] Force offset log VERSION_2 when streaming source evolution is enabled#56015

Open
ericm-db wants to merge 7 commits into
apache:masterfrom
ericm-db:spark-source-evolution-offset-log-v2
Open

[SPARK-56968][SS] Force offset log VERSION_2 when streaming source evolution is enabled#56015
ericm-db wants to merge 7 commits into
apache:masterfrom
ericm-db:spark-source-evolution-offset-log-v2

Conversation

@ericm-db
Copy link
Copy Markdown
Contributor

What changes were proposed in this pull request?

When the streaming source evolution flag (spark.sql.streaming.queryEvolution.enableSourceEvolution) is set to true, force the offset log format to VERSION_2 for new streaming queries.

In MicroBatchExecution.initializeExecution, the offset log format version selection now takes max(STREAMING_OFFSET_LOG_FORMAT_VERSION, minRequiredVersion), where minRequiredVersion is VERSION_2 when source evolution is enabled and VERSION_1 otherwise. Existing queries continue to use whatever version is already written in their offset log (read from latestStartedBatch), so this only affects new queries.

The testWithSourceEvolution helper in StreamingSourceEvolutionSuite was updated to no longer set the offset log version explicitly, since it is now selected automatically.

Why are the changes needed?

Streaming source evolution relies on the OffsetMap (sourceId -> offset) format, which is only available in offset log VERSION_2. Previously, users had to remember to set spark.sql.streaming.offsetLog.formatVersion=2 alongside enabling source evolution; otherwise the format would default to VERSION_1 (sequence-based) and the named-source tracking required by source evolution would not function properly. Coupling the two configs eliminates a footgun.

Does this PR introduce any user-facing change?

No.

The change only affects new streaming queries that explicitly enable the internal spark.sql.streaming.queryEvolution.enableSourceEvolution flag. For such queries, the offset log will now use VERSION_2 automatically. Users who manually set the offset log version remain in control: the final version is max(configuredVersion, minRequiredVersion), so a user-configured VERSION_2 keeps working unchanged.

How was this patch tested?

  • Added offset log uses VERSION_2 when source evolution is enabled test in StreamingSourceEvolutionSuite.
  • Existing StreamingSourceEvolutionSuite tests pass after dropping the explicit offset log version from testWithSourceEvolution (19/19).
  • OffsetSeqLogSuite continues to pass (19/19).

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code (claude-opus-4-7)

ericm-db added 3 commits May 20, 2026 20:56
…nnect client test helper

The streaming source evolution flag now forces offset log VERSION_2
automatically, so tests that enabled source evolution alongside an
explicit offsetLog.formatVersion=2 no longer need to set both.
@ericm-db ericm-db requested a review from anishshri-db May 21, 2026 21:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants