[SPARK-56854][PYTHON] Filter None values in DataFrame[Stream]Reader/Writer .option(s) #55867
Open
lavanv11 wants to merge 3 commits into
Conversation
…riter .option(s) Aligns Classic PySpark with the Spark Connect Python client (SPARK-49263) and OptionUtils._set_opts.
HyukjinKwon
approved these changes
May 13, 2026
Yicong-Huang
approved these changes
May 14, 2026
Member
Seems like one test is related. Could we take a quick look and fix?
What changes were proposed in this pull request?
Filter `None` values in Classic PySpark's `DataFrameReader`, `DataFrameWriter`, `DataFrameWriterV2`, `DataStreamReader`, and `DataStreamWriter` `.option(key, value)` and `.options(**kwargs)` methods. After this change, `option(key, None)` is a no-op and `options(**{key: None, ...})` drops the `None` entries before forwarding to the JVM. The loop-style methods mirror the shape of `OptionUtils._set_opts` at `python/pyspark/sql/readwriter.py:41-53`: `for k, v in options.items(): if v is not None: ...`.

Why are the changes needed?
Classic and Spark Connect Python currently disagree on what `option(key, None)` means. Classic forwards Python `None` to the JVM as Java `null`, which several data sources interpret differently from "unset". For example, with `spark.read.options(nullValue=None).schema("a STRING, b STRING").csv(path)` and a row `"",val`, Classic produces `[Row(a='', b='val')]` while Connect produces `[Row(a=None, b='val')]` because Connect drops the `None`, the default `nullValue` of `""` stays in effect, and the quoted empty cell matches it. This PR aligns Classic with Connect (which has filtered `None` since SPARK-49263) and with the long-standing `OptionUtils._set_opts` convention.

Does this PR introduce any user-facing change?
Yes.
`option(k, None)` and `options(**{k: None})` were previously forwarded to the JVM as `null`; they are now no-ops. A migration-guide entry under "Upgrading from PySpark 4.1 to 4.2" documents the change. To set an option to its default, omit it or pass `None`; to set it to an empty string, pass `""` explicitly.

How was this patch tested?
New parity test `test_option_none_is_filtered` in `ReadwriterTestsMixin` pins the CSV `nullValue=None` case to `[Row(a=None, b="val")]` for both `.option` and `.options`. Because `ReadwriterParityTests` inherits the mixin, the regression test runs on Classic and on Spark Connect, giving cross-backend coverage automatically.

Additional defensive smoke tests guard the writer / V2 writer / streaming reader / streaming writer API contracts:
- `test_writer_option_none_chains_safely`
- `test_v2_writer_option_none_chains_safely`
- `test_stream_reader_option_none_chains_safely`
- `test_stream_writer_option_none_chains_safely`

Was this patch authored or co-authored using generative AI tooling?
Partially Generated-by: Claude Opus 4.7
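For illustration, the None-filtering behavior this PR describes (mirroring the shape of `OptionUtils._set_opts`) can be sketched with a toy builder class. The `Reader` class below is a hypothetical stand-in, not the actual PySpark implementation:

```python
# Minimal sketch of the None-filtering described in this PR.
# "Reader" is a toy stand-in for DataFrame[Stream]Reader/Writer,
# not the real PySpark class.

class Reader:
    def __init__(self):
        self._options = {}

    def option(self, key, value):
        # option(key, None) is a no-op instead of forwarding null to the JVM
        if value is not None:
            self._options[key] = str(value)
        return self  # keep the builder chainable

    def options(self, **kwargs):
        # Drop None entries before storing, mirroring OptionUtils._set_opts:
        #   for k, v in options.items(): if v is not None: ...
        for k, v in kwargs.items():
            if v is not None:
                self._options[k] = str(v)
        return self


r = Reader().option("nullValue", None).options(header=True, sep=None)
print(r._options)  # → {'header': 'True'}
```

Chaining `.option(key, None)` thus leaves the option map untouched, which is the "no-op" contract the new smoke tests pin down for the writer and streaming variants.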