[SPARK-55416][SS][PYTHON] Streaming Python Data Source memory leak when end-offset is not updated #54237

vinodkc · 2026-02-09T22:48:33Z

What changes were proposed in this pull request?

In _SimpleStreamReaderWrapper.latestOffset(), validate that a data source's implementation of SimpleDataSourceStreamReader.read() does not return a non-empty batch with end == start. If it does, raise PySparkException with error class SIMPLE_STREAM_READER_OFFSET_DID_NOT_ADVANCE before appending the batch to the cache. Empty batches with end == start remain allowed.

Why are the changes needed?

The problem occurs when a user implements read(start) incorrectly so that it returns both of the following:

The same offset for start and end: end == start (e.g. both {"offset": 0}).
A non-empty iterator (e.g. 2 rows).

If a reader returns end == start with data (e.g. return (it, {"offset": start_idx})), the wrapper appends to its prefetch cache on every trigger, while commit(end) never trims entries because the first matching index is always 0. The cache grows without bound and driver (non-JVM) memory increases until OOM. Validating and raising an error before appending stops this growth and fails fast with a clear message.
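The growth described above can be reproduced with a small toy model. This is only a sketch of the behavior as described in this PR, not the code in datasource_internal.py: the names buggy_read, correct_read, and cache_size_after are hypothetical, and the cache is modeled as a plain list of (start, end, rows) tuples with a commit step that drops entries before the first one whose end offset matches.

```python
from typing import Callable, Dict, List, Tuple

Offset = Dict[str, int]
Rows = List[Tuple[int, ...]]
ReadFn = Callable[[Offset], Tuple[Rows, Offset]]


def buggy_read(start: Offset) -> Tuple[Rows, Offset]:
    # Hypothetical buggy reader: returns two rows but reports end == start.
    idx = start["offset"]
    return [(idx,), (idx + 1,)], dict(start)


def correct_read(start: Offset) -> Tuple[Rows, Offset]:
    # Hypothetical correct reader: returns two rows and advances the offset.
    idx = start["offset"]
    return [(idx,), (idx + 1,)], {"offset": idx + 2}


def cache_size_after(read: ReadFn, triggers: int) -> int:
    """Toy model (an assumption, not Spark code) of the prefetch cache."""
    cache: List[Tuple[Offset, Offset, Rows]] = []
    current: Offset = {"offset": 0}
    for _ in range(triggers):
        # Prefetch step: read from the current offset and cache the batch.
        rows, end = read(current)
        cache.append((current, end, rows))
        current = end
        # Commit step: drop entries before the first one whose end matches.
        first_match = next(
            (i for i, (_, e, _) in enumerate(cache) if e == end), -1
        )
        if first_match > 0:
            cache = cache[first_match:]
    return len(cache)


print(cache_size_after(buggy_read, 5))    # 5: one entry retained per trigger
print(cache_size_after(correct_read, 5))  # 1: older entries are trimmed
```

With the buggy reader every cached entry has the same end offset, so the first match is always index 0 and nothing is ever dropped. With the correct reader only the most recent entry survives each commit.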

Empty batches with end == start remain allowed, so a Python data source can still signal that there is no data to read.
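The rule itself (reject a non-empty batch whose end offset equals its start, accept an empty one) can be sketched in plain Python. This is an illustration under stated assumptions rather than the patch: OffsetDidNotAdvanceError is a hypothetical stand-in for PySparkException with error class SIMPLE_STREAM_READER_OFFSET_DID_NOT_ADVANCE, checked_batch is a hypothetical helper, and materializing the iterator into a list and comparing offsets with dict equality are choices made for the sketch.

```python
from typing import Any, Dict, Iterable, List, Tuple

Offset = Dict[str, Any]
Row = Tuple[Any, ...]


class OffsetDidNotAdvanceError(RuntimeError):
    """Hypothetical stand-in for the PySparkException raised by the patch."""


def checked_batch(start: Offset, rows: Iterable[Row], end: Offset) -> List[Row]:
    """Return the batch as a list, or raise if it has data but end == start."""
    # Assumption for this sketch: materialize the rows to test for emptiness.
    materialized = list(rows)
    if materialized and end == start:
        raise OffsetDidNotAdvanceError(
            f"read() returned {len(materialized)} row(s) but the end offset "
            f"{end} equals the start offset {start}"
        )
    return materialized


# An empty batch with end == start is allowed: it means "no new data".
print(checked_batch({"offset": 0}, iter([]), {"offset": 0}))

# A non-empty batch that advances the offset is allowed.
print(checked_batch({"offset": 0}, iter([(0,), (1,)]), {"offset": 2}))

# A non-empty batch with end == start is rejected before it could be cached.
try:
    checked_batch({"offset": 0}, iter([(0,), (1,)]), {"offset": 0})
except OffsetDidNotAdvanceError as exc:
    print("rejected:", exc)
```

Running the check before the batch is appended means a misbehaving reader fails on its first offending trigger instead of slowly exhausting driver memory.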

Does this PR introduce any user-facing change?

Yes. Implementations that return end == start with a non-empty iterator now get a PySparkException instead of unbounded memory growth. The behavior for empty batches with end == start is unchanged.

How was this patch tested?

Added a unit test in test_python_streaming_datasource.py.

Was this patch authored or co-authored using generative AI tooling?

No.

HeartSaVioR

Looks good overall; only minor comments and nits. Nice finding!

python/docs/source/tutorial/sql/python_data_source.rst

python/pyspark/errors/error-conditions.json

python/pyspark/sql/datasource_internal.py

python/pyspark/sql/tests/test_python_streaming_datasource.py

vinodkc changed the title ~~[SPARK-55416][PYTHON][SS]Streaming Python Data Source memory leak when end-offset is not updated~~ [SPARK-55416][PYTHON][SS] Streaming Python Data Source memory leak when end-offset is not updated Feb 9, 2026

HyukjinKwon requested a review from HeartSaVioR February 9, 2026 23:10

HeartSaVioR reviewed Feb 10, 2026

vinodkc added 6 commits February 10, 2026 11:40

Validate SimpleDataSourceStreamReader end offset advances

e54702b

Cache only when start != end

78c3563

Add test for empty iter and start == end offset

3644539

Rename error class

71b4e1f

Change order of ERROR class

28cdcdc

Fix review comments

3c0b205

vinodkc force-pushed the br_SPARK-55416 branch from 81b2fd3 to 3c0b205 Compare February 10, 2026 19:41

Fix test failure

a290875

vinodkc changed the title ~~[SPARK-55416][PYTHON][SS] Streaming Python Data Source memory leak when end-offset is not updated~~ [SPARK-55416][SS][PYTHON] Streaming Python Data Source memory leak when end-offset is not updated Feb 12, 2026
