
[SPARK-56854][PYTHON] Filter None values in DataFrame[Stream]Reader/Writer .option(s) #55867

Open

lavanv11 wants to merge 3 commits into apache:master from lavanv11:pyspark_reader_inconsistencies

Conversation

@lavanv11

What changes were proposed in this pull request?

Filter None values in Classic PySpark's DataFrameReader, DataFrameWriter, DataFrameWriterV2, DataStreamReader, and DataStreamWriter .option(key, value) and .options(**kwargs) methods. After this change, option(key, None) is a no-op and options(**{key: None, ...}) drops the None entries before forwarding to the JVM. The loop-style methods mirror the shape of OptionUtils._set_opts at python/pyspark/sql/readwriter.py:41-53: for k, v in options.items(): if v is not None: ....
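For illustration, a minimal sketch of that filtering shape, using a local stand-in for PySpark's option-value stringifier; the actual diff touches the real reader/writer classes and their JVM handles:

```python
# Sketch only: illustrates the None-filtering shape, not the literal PR diff.
def _to_str(value):
    # minimal stand-in for PySpark's option-value stringifier
    return str(value).lower() if isinstance(value, bool) else str(value)


class DataFrameReaderSketch:
    def __init__(self, jreader):
        self._jreader = jreader  # JVM-side reader handle

    def option(self, key, value):
        if value is not None:  # None now means "leave this option unset"
            self._jreader = self._jreader.option(key, _to_str(value))
        return self

    def options(self, **options):
        for k, v in options.items():
            if v is not None:  # drop None entries before forwarding to the JVM
                self._jreader = self._jreader.option(k, _to_str(v))
        return self
```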

Why are the changes needed?

Classic and Spark Connect Python currently disagree on what option(key, None) means. Classic forwards Python None to the JVM as Java null, which several data sources interpret differently from "unset". For example, with spark.read.options(nullValue=None).schema("a STRING, b STRING").csv(path) and a row "",val, Classic produces [Row(a='', b='val')] while Connect produces [Row(a=None, b='val')] because Connect drops the None, the default nullValue of "" stays in effect, and the quoted empty cell matches it. This PR aligns Classic with Connect (which has filtered None since SPARK-49263) and with the long-standing OptionUtils._set_opts convention.
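For concreteness, a hedged reproduction of that divergence (the path and data are illustrative):

```python
# `path` points to a CSV file whose single line is: "",val
df = spark.read.options(nullValue=None).schema("a STRING, b STRING").csv(path)
df.collect()
# Classic before this PR:           [Row(a='', b='val')]   (None forwarded as JVM null)
# Connect / Classic after this PR:  [Row(a=None, b='val')] (None dropped; default nullValue ""
#                                                           matches the quoted empty cell)
```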

Does this PR introduce any user-facing change?

Yes. option(k, None) and options(**{k: None}) were previously forwarded to the JVM as null; they are now no-ops. A migration-guide entry under "Upgrading from PySpark 4.1 to 4.2" documents the change. To set an option to its default, omit it or pass None; to set it to an empty string, pass "" explicitly.
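Illustrative migration examples (the option name is an example only):

```python
# Passing None now leaves the option unset (a no-op), same as omitting it:
spark.read.option("nullValue", None).csv(path)

# Previously this forwarded a JVM null; to get an actual empty-string value,
# pass "" explicitly instead of None:
spark.read.option("nullValue", "").csv(path)
```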

How was this patch tested?

New parity test test_option_none_is_filtered in ReadwriterTestsMixin pins the CSV nullValue=None case to [Row(a=None, b="val")] for both .option and .options. Because ReadwriterParityTests inherits the mixin, the regression test runs on Classic and on Spark Connect, giving cross-backend coverage automatically.
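A hedged sketch of what that parity test asserts (the method name comes from the description above; fixture details are assumptions, and self.spark is provided by the test harness):

```python
import os
import tempfile

from pyspark.sql import Row


def test_option_none_is_filtered(self):
    # Runs inside ReadwriterTestsMixin, so it executes on both Classic and Connect.
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "data.csv")
        with open(path, "w") as f:
            f.write('"",val\n')
        expected = [Row(a=None, b="val")]
        # .option form: None is dropped, so the default nullValue "" applies
        df = self.spark.read.option("nullValue", None).schema("a STRING, b STRING").csv(path)
        self.assertEqual(df.collect(), expected)
        # .options form: None entries are filtered the same way
        df = self.spark.read.options(nullValue=None).schema("a STRING, b STRING").csv(path)
        self.assertEqual(df.collect(), expected)
```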

Additional defensive smoke tests guard the writer / V2 writer / streaming reader / streaming writer API contracts (a rough sketch of the writer case follows the list):

  • test_writer_option_none_chains_safely
  • test_v2_writer_option_none_chains_safely
  • test_stream_reader_option_none_chains_safely
  • test_stream_writer_option_none_chains_safely
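
For instance, the writer smoke test could look roughly like this (a sketch; the actual assertions may be stricter):

```python
from pyspark.sql.readwriter import DataFrameWriter


def test_writer_option_none_chains_safely(self):
    # option(key, None) should be a no-op that still returns the writer,
    # so chaining keeps working after a None-valued option.
    writer = (
        self.spark.range(1).write
        .option("compression", None)
        .option("header", True)
    )
    self.assertIsInstance(writer, DataFrameWriter)
```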

Was this patch authored or co-authored using generative AI tooling?

Partially. Generated-by: Claude Opus 4.7

…riter .option(s)

Aligns Classic PySpark with the Spark Connect Python client (SPARK-49263)
and OptionUtils._set_opts.
@HyukjinKwon
Member

Seems like one test is related. Could we take a quick look and fix it?
