Conversation
This reverts commit 8c34512.
tswast left a comment:

Thanks! Looks good to me once the typing and presubmit failures are addressed.
bigframes/core/blocks.py
Outdated
```python
for col in itertools.chain(self.value_columns, self.index_columns):
    dtype = self.expr.get_column_type(col)
    if bigframes.dtypes.contains_db_dtypes_json_dtype(dtype):
        # Due to a limitation in Apache Arrow (#45262), JSON columns are not
        # natively supported by the to_pandas_batches() method, which is
        # used by the anywidget backend.
        # Workaround for https://github.com/googleapis/python-bigquery-dataframes/issues/1273
        # PyArrow doesn't support creating an empty array with
        # db_dtypes.JSONArrowType, especially when nested.
        # Create with string type and then cast.

        # MyPy doesn't automatically narrow the type of 'dtype' here,
        # so we add an explicit check.
        if isinstance(dtype, pd.ArrowDtype):
            safe_pa_type = bigframes.dtypes._replace_json_arrow_with_string(
                dtype.pyarrow_dtype
            )
            safe_dtype = pd.ArrowDtype(safe_pa_type)
            series_map[col] = pd.Series([], dtype=safe_dtype).astype(dtype)
        else:
            # This branch should ideally not be reached if
            # contains_db_dtypes_json_dtype is accurate,
            # but it's here for MyPy's sake.
            series_map[col] = pd.Series([], dtype=dtype)
```
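The recursive dtype rewrite that `_replace_json_arrow_with_string` performs can be sketched on a toy type tree. The function and tuple encoding below are illustrative stand-ins, not the real bigframes or pyarrow API, so the sketch runs with the standard library alone:

```python
# Hypothetical sketch: recursively swap "json" leaves for "string" in a
# nested type tree, mirroring how a JSON Arrow extension type must be
# replaced inside list<...> and struct<...> wrappers before an empty
# array can be created and cast back.

def replace_json_with_string(t):
    """Return a copy of the toy type tree with every 'json' leaf replaced."""
    kind = t[0]
    if kind == "json":
        return ("string",)
    if kind == "list":
        return ("list", replace_json_with_string(t[1]))
    if kind == "struct":
        return ("struct", [(name, replace_json_with_string(f)) for name, f in t[1]])
    return t  # primitive leaf: unchanged

nested = ("list", ("struct", [("key", ("json",)), ("n", ("int64",))]))
print(replace_json_with_string(nested))
# ('list', ('struct', [('key', ('string',)), ('n', ('int64',))]))
```

The real helper walks `pa.DataType` objects the same way; only the leaf test and the constructors differ.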
@chelsea-lin I assume we have similar code that does this, right? Maybe there's something that could be reused here?
Yeah, we have something similar in the loader component but they're slightly different.
Also, I agree that we can simplify the logic a little bit, for example:

```python
dtype = pd.ArrowDtype(pa.list_(pa.struct([("key", db_dtypes.JSONArrowType())])))
try:
    s = pd.Series([], dtype=dtype)
except pa.ArrowNotImplementedError:
    s = pd.Series(
        [], dtype=pd.ArrowDtype(_replace_json_arrow_with_string(dtype.pyarrow_dtype))
    ).astype(dtype)
```
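The create-then-cast fallback can be mimicked without pandas or pyarrow installed. In this sketch, `make_series` and `NotImplementedForType` are hypothetical stand-ins for `pd.Series(..., dtype=...)` and `pa.ArrowNotImplementedError`, used only to show the EAFP shape:

```python
# Toy stand-ins: dtypes are plain strings, and "json" anywhere in the
# dtype triggers the failure, like the Arrow bug for JSON extension types.

class NotImplementedForType(Exception):
    pass

def make_series(dtype):
    # Pretend nested JSON dtypes are unsupported.
    if "json" in dtype:
        raise NotImplementedForType(dtype)
    return f"empty series[{dtype}]"

def make_empty_series(dtype):
    try:
        # Fast path: construct directly with the requested dtype.
        return make_series(dtype)
    except NotImplementedForType:
        # Fallback: build with a safe dtype, then record the intended cast.
        safe = dtype.replace("json", "string")
        return make_series(safe) + f" as {dtype}"

print(make_empty_series("list<struct<key: json>>"))
# empty series[list<struct<key: string>>] as list<struct<key: json>>
```

Trying the direct construction first means the fallback only runs on pyarrow versions that actually exhibit the bug.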
The logic has been simplified.
Thanks! The new logic looks even better!
Nit: I renamed the PR to be a little more user-oriented. Users don't care as much about the internal limitations. What changed from their perspective is that they can read `to_pandas_batches()` results with JSON subfields.
bigframes/dtypes.py
Outdated
```python
return contains_db_dtypes_json_arrow_type(dtype.pyarrow_dtype)


def _replace_json_arrow_with_string(pa_type: pa.DataType) -> pa.DataType:
```
This function may be similar to the following two methods. Can you help remove the one in loader.py?
Since I removed this function, this code refactor is no longer relevant to this PR. I will start a new PR (#2221) for this code refactor.
Removed the unused function from bigframes/dtypes.py, and replaced construction of the empty DataFrame in bigframes/core/blocks.py with a more robust try...except block that leverages to_pyarrow() and empty_table().
bigframes/core/blocks.py
Outdated
```python
try:
    empty_arrow_table = self.expr.schema.to_pyarrow().empty_table()
except pa.ArrowNotImplementedError:
    # Bug with some pyarrow versions: empty_table only supports base
    # storage types, not extension types.
```
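The overall shape of this version, attempt the whole-table path first and only fall back to per-column construction on failure, can be sketched in plain Python. Every name here (the exception class, `empty_table_from_schema`, the string dtypes) is a toy stand-in rather than the real pyarrow or bigframes API:

```python
# Toy model: a schema is a list of (name, dtype-string) pairs, and any
# "json" dtype trips the simulated pyarrow limitation.

class ArrowNotImplementedError(Exception):
    pass

def empty_table_from_schema(schema):
    # Mimic schema.to_pyarrow().empty_table(): fails on extension types.
    for _name, dtype in schema:
        if "json" in dtype:
            raise ArrowNotImplementedError(dtype)
    return {name: ("empty", dtype) for name, dtype in schema}

def build_empty_frame(schema):
    try:
        # Fast path: one call builds every column at once.
        return empty_table_from_schema(schema)
    except ArrowNotImplementedError:
        # Fallback: build columns one at a time with sanitized dtypes,
        # remembering the originally intended dtype for the later cast.
        return {
            name: ("empty", dtype.replace("json", "string"), "cast-to", dtype)
            for name, dtype in schema
        }

result = build_empty_frame([("id", "int64"), ("payload", "json")])
```

The point of the structure is that healthy pyarrow installs never pay for the workaround; only the buggy versions take the slower per-column branch.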
Nit: can you please add the bug ID in the docs?
🤖 I have created a release *beep* *boop*

---

## [2.29.0](v2.28.0...v2.29.0) (2025-11-10)

### Features

* Add bigframes.bigquery.st_regionstats to join raster data from Earth Engine ([#2228](#2228)) ([10ec52f](10ec52f))
* Add DataFrame.resample and Series.resample ([#2213](#2213)) ([c9ca02c](c9ca02c))
* SQL Cell no longer escapes formatted string values ([#2245](#2245)) ([d2d38f9](d2d38f9))
* Support left_index and right_index for merge ([#2220](#2220)) ([da9ba26](da9ba26))

### Bug Fixes

* Correctly iterate over null struct values in ManagedArrowTable ([#2209](#2209)) ([12e04d5](12e04d5))
* Simplify UnsupportedTypeError message ([#2212](#2212)) ([6c9a18d](6c9a18d))
* Support results with STRUCT and ARRAY columns containing JSON subfields in `to_pandas_batches()` ([#2216](#2216)) ([3d8b17f](3d8b17f))

### Documentation

* Switch API reference docs to pydata theme ([#2237](#2237)) ([9b86dcf](9b86dcf))
* Update notebook for JSON subfields support in to_pandas_batches() ([#2138](#2138)) ([5663d2a](5663d2a))

---

This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please).

Co-authored-by: release-please[bot] <55107282+release-please[bot]@users.noreply.github.com>
This commit addresses an issue where creating empty DataFrames with nested JSON columns would fail due to PyArrow's inability to create empty arrays with db_dtypes.JSONArrowType (Apache Arrow issue #45262).
Changes:

* This workaround is specifically needed for the anywidget backend, which uses `to_pandas_batches()`.
Fixes #<456577463> 🦕