Conversation
This reverts commit 8c34512.
tswast left a comment:

Thanks! Looks good to me once the typing and presubmit failures are addressed.
bigframes/core/blocks.py
Outdated
```python
for col in itertools.chain(self.value_columns, self.index_columns):
    dtype = self.expr.get_column_type(col)
    if bigframes.dtypes.contains_db_dtypes_json_dtype(dtype):
        # Due to a limitation in Apache Arrow (#45262), JSON columns are not
        # natively supported by the to_pandas_batches() method, which is
        # used by the anywidget backend.
        # Workaround for https://github.com/googleapis/python-bigquery-dataframes/issues/1273
        # PyArrow doesn't support creating an empty array with
        # db_dtypes.JSONArrowType, especially when nested.
        # Create with string type and then cast.

        # MyPy doesn't automatically narrow the type of 'dtype' here,
        # so we add an explicit check.
        if isinstance(dtype, pd.ArrowDtype):
            safe_pa_type = bigframes.dtypes._replace_json_arrow_with_string(
                dtype.pyarrow_dtype
            )
            safe_dtype = pd.ArrowDtype(safe_pa_type)
            series_map[col] = pd.Series([], dtype=safe_dtype).astype(dtype)
        else:
            # This branch should ideally not be reached if
            # contains_db_dtypes_json_dtype is accurate,
            # but it's here for MyPy's sake.
            series_map[col] = pd.Series([], dtype=dtype)
```
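The recursive dtype rewrite that `_replace_json_arrow_with_string` performs can be sketched on a toy type tree. The function and tuple encoding below are illustrative stand-ins, not the real bigframes or pyarrow API, so the sketch runs with the standard library alone:

```python
# Hypothetical sketch: recursively swap "json" leaves for "string" in a
# nested type tree, mirroring how a JSON Arrow extension type must be
# replaced inside list<...> and struct<...> wrappers before an empty
# array can be created and cast back.

def replace_json_with_string(t):
    """Return a copy of the toy type tree with every 'json' leaf replaced."""
    kind = t[0]
    if kind == "json":
        return ("string",)
    if kind == "list":
        return ("list", replace_json_with_string(t[1]))
    if kind == "struct":
        return ("struct", [(name, replace_json_with_string(f)) for name, f in t[1]])
    return t  # primitive leaf: unchanged

nested = ("list", ("struct", [("key", ("json",)), ("n", ("int64",))]))
print(replace_json_with_string(nested))
# ('list', ('struct', [('key', ('string',)), ('n', ('int64',))]))
```

The real helper walks `pa.DataType` objects the same way; only the leaf test and the constructors differ.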
@chelsea-lin I assume we have similar code that does this, right? Maybe there's something that could be reused here?
Yeah, we have something similar in the loader component but they're slightly different.
Also, I agree that we can simplify the logic a little bit, for example:

```python
dtype = pd.ArrowDtype(pa.list_(pa.struct([("key", db_dtypes.JSONArrowType())])))
try:
    s = pd.Series([], dtype=dtype)
except pa.ArrowNotImplementedError:
    s = pd.Series(
        [], dtype=pd.ArrowDtype(_replace_json_arrow_with_string(dtype.pyarrow_dtype))
    ).astype(dtype)
```
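The create-then-cast fallback can be mimicked without pandas or pyarrow installed. In this sketch, `make_series` and `NotImplementedForType` are hypothetical stand-ins for `pd.Series(..., dtype=...)` and `pa.ArrowNotImplementedError`, used only to show the EAFP shape:

```python
# Toy stand-ins: dtypes are plain strings, and "json" anywhere in the
# dtype triggers the failure, like the Arrow bug for JSON extension types.

class NotImplementedForType(Exception):
    pass

def make_series(dtype):
    # Pretend nested JSON dtypes are unsupported.
    if "json" in dtype:
        raise NotImplementedForType(dtype)
    return f"empty series[{dtype}]"

def make_empty_series(dtype):
    try:
        # Fast path: construct directly with the requested dtype.
        return make_series(dtype)
    except NotImplementedForType:
        # Fallback: build with a safe dtype, then record the intended cast.
        safe = dtype.replace("json", "string")
        return make_series(safe) + f" as {dtype}"

print(make_empty_series("list<struct<key: json>>"))
# empty series[list<struct<key: string>>] as list<struct<key: json>>
```

Trying the direct construction first means the fallback only runs on pyarrow versions that actually exhibit the bug.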
The logic has been simplified.
Thanks! The new logic looks even better!
Nit: I renamed the PR to be a little more user-oriented. Users don't care as much about the internal limitations. What changed from their perspective is that they can read `to_pandas_batches()` results with JSON subfields.
bigframes/dtypes.py
Outdated
```python
return contains_db_dtypes_json_arrow_type(dtype.pyarrow_dtype)


def _replace_json_arrow_with_string(pa_type: pa.DataType) -> pa.DataType:
```
This function may be similar to the following two methods. Can you help remove the one in loader.py?
Since I removed this function, this code refactor is no longer relevant to this PR. I will start a new PR (#2221) for this code refactor.
Removed the unused function from bigframes/dtypes.py, and replaced construction of the empty DataFrame in bigframes/core/blocks.py with a more robust try...except block that leverages to_pyarrow() and empty_table().
bigframes/core/blocks.py
Outdated
```python
try:
    empty_arrow_table = self.expr.schema.to_pyarrow().empty_table()
except pa.ArrowNotImplementedError:
    # Bug with some pyarrow versions: empty_table only supports base
    # storage types, not extension types.
```
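The overall shape of this version, attempt the whole-table path first and only fall back to per-column construction on failure, can be sketched in plain Python. Every name here (the exception class, `empty_table_from_schema`, the string dtypes) is a toy stand-in rather than the real pyarrow or bigframes API:

```python
# Toy model: a schema is a list of (name, dtype-string) pairs, and any
# "json" dtype trips the simulated pyarrow limitation.

class ArrowNotImplementedError(Exception):
    pass

def empty_table_from_schema(schema):
    # Mimic schema.to_pyarrow().empty_table(): fails on extension types.
    for _name, dtype in schema:
        if "json" in dtype:
            raise ArrowNotImplementedError(dtype)
    return {name: ("empty", dtype) for name, dtype in schema}

def build_empty_frame(schema):
    try:
        # Fast path: one call builds every column at once.
        return empty_table_from_schema(schema)
    except ArrowNotImplementedError:
        # Fallback: build columns one at a time with sanitized dtypes,
        # remembering the originally intended dtype for the later cast.
        return {
            name: ("empty", dtype.replace("json", "string"), "cast-to", dtype)
            for name, dtype in schema
        }

result = build_empty_frame([("id", "int64"), ("payload", "json")])
```

The point of the structure is that healthy pyarrow installs never pay for the workaround; only the buggy versions take the slower per-column branch.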
Nit: can you please add the bug ID in the docs?
🤖 I have created a release *beep* *boop*

---

## [2.29.0](v2.28.0...v2.29.0) (2025-11-10)

### Features

* Add bigframes.bigquery.st_regionstats to join raster data from Earth Engine ([#2228](#2228)) ([10ec52f](10ec52f))
* Add DataFrame.resample and Series.resample ([#2213](#2213)) ([c9ca02c](c9ca02c))
* SQL Cell no longer escapes formatted string values ([#2245](#2245)) ([d2d38f9](d2d38f9))
* Support left_index and right_index for merge ([#2220](#2220)) ([da9ba26](da9ba26))

### Bug Fixes

* Correctly iterate over null struct values in ManagedArrowTable ([#2209](#2209)) ([12e04d5](12e04d5))
* Simplify UnsupportedTypeError message ([#2212](#2212)) ([6c9a18d](6c9a18d))
* Support results with STRUCT and ARRAY columns containing JSON subfields in `to_pandas_batches()` ([#2216](#2216)) ([3d8b17f](3d8b17f))

### Documentation

* Switch API reference docs to pydata theme ([#2237](#2237)) ([9b86dcf](9b86dcf))
* Update notebook for JSON subfields support in to_pandas_batches() ([#2138](#2138)) ([5663d2a](5663d2a))

---

This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please).

Co-authored-by: release-please[bot] <55107282+release-please[bot]@users.noreply.github.com>
This commit addresses an issue where creating empty DataFrames with nested JSON columns would fail due to PyArrow's inability to create empty arrays with db_dtypes.JSONArrowType (Apache Arrow issue #45262).
Changes:

* This workaround is specifically needed for the anywidget backend, which uses `to_pandas_batches()`.
Fixes #<456577463> 🦕