feat: Auto-plan complex reduction expressions#2298
Conversation
tswast
left a comment
There was a problem hiding this comment.
Neat! My main concern is around making sure we're all on the same page regarding terminology and intention for some of these internal classes and methods.
bigframes/core/array_value.py
Outdated
| *, | ||
| dropna: bool = False, | ||
| ): | ||
| # Warning: this function does not check if the expression is a valid reduction, and may fail spectacularly on invalid inputs |
There was a problem hiding this comment.
I wonder if there is anything we can do to get the type system to help out here?
There was a problem hiding this comment.
I don't think the python system is sufficiently powerful to do this, instead, will just have to settle for runtime checks. In this case, will get a ValueError if expression is not an aggregating one.
bigframes/core/blocks.py
Outdated
| index_labels=self._index_labels, | ||
| ) | ||
|
|
||
| # This is a new experimental version of the aggregate that supports mixing analytic and scalar expressions\ |
There was a problem hiding this comment.
Would be good to clarify how experimental. Are there particular combinations of inputs that aren't as well tested?
There was a problem hiding this comment.
Eh, just going to remove this line. Its tested well over the domains its been applied to so far.
| _MAX_INLINE_COMPLEXITY = 10 | ||
|
|
||
|
|
||
| def plan_general_col_exprs( |
There was a problem hiding this comment.
Like with compute, it might be good to document somewhere what we mean by "plan". Perhaps we should start a bigframes glossary?
There was a problem hiding this comment.
Changing name to apply_agg_exprs_to_plan, where the plan is the BFET and the exprs are the column expressions describing the full aggregation.
| """ | ||
| Applies arbitrary aggregation expressions to the block, optionally grouped by keys. | ||
|
|
||
| This method handles reduction operations (e.g., sum, mean, count) that collapse |
There was a problem hiding this comment.
How do we handle methods like APPROX_TOP_COUNT that can't be used in windows?
SELECT
pickup_datetime,
trip_distance,
APPROX_TOP_COUNT(trip_distance, 10) OVER (ORDER BY pickup_datetime ROWS BETWEEN 3 PRECEDING AND CURRENT ROW ) AS rolling_sum_trip_distance_4_trips
FROM
`bigquery-public-data`.`new_york`.`tlc_green_trips_2014`
ORDER BY
pickup_datetime;Likewise, what about methods like PERCENTILE_CONT that require a window with an order by clause?
There was a problem hiding this comment.
Presumably, these ops are handled at a diiferent layer to avoid such cases? It'd be good to be explicit in the docs here that ops provided need to work as both analytical and aggregate functions. Or if we can use such functions here, what to do to avoid windows / force group by.
There was a problem hiding this comment.
yeah, best to catch these early. adding a windowizable prop to ops, and will throw if try to windowize non-windowizable op
bigframes/core/block_transforms.py
Outdated
|
|
||
| block, agg_ids = block.aggregate( | ||
| by_column_ids=grouping_column_ids, aggregations=aggregations | ||
| block, _ = block.reduce_general( |
There was a problem hiding this comment.
It's be worth a comment explaining why it's safe to ignore the second element of the returned tuple.
There was a problem hiding this comment.
The second element is mostly not used, so just removing it from the method entirely
bigframes/core/block_transforms.py
Outdated
| kurt_exprs.append(kurt_expr) | ||
|
|
||
| block = block.select_columns(kurt_ids).with_column_labels(column_labels) | ||
| block, _ = block.reduce_general( |
bigframes/core/blocks.py
Outdated
| dropna: bool = True, | ||
| ) -> typing.Tuple[Block, typing.Sequence[str]]: | ||
| """ | ||
| Version of the aggregate that supports mixing analytic and scalar expressions. |
There was a problem hiding this comment.
Since we still have aggregate, aggregate_size, and aggregate_all_and_stack let's say when to use them over this.
Alternatively, put an internal deprecated comment in the docstrings of those and file an issue for ourselves to cleanup uses of them.
There was a problem hiding this comment.
good point, removing aggregate and aggregate_size in this PR now.
Thank you for opening a Pull Request! Before submitting your PR, there are a few things you can do to make sure it goes smoothly:
Fixes #<issue_number_goes_here> 🦕