feat(Clickhouse)!: Add support for multiple-suffix combined aggregate functions in Clickhouse dialect #7109
Conversation
```python
].assert_is(exp.ParameterizedAgg)


def test_agg_functions_multiple_suffixes(self):
```
I recommend starting from here as you review -- this outlines the new behaviors
```python
# Regression test: single-suffix
self.validate_identity("SELECT uniqExactIf(x, y) FROM t").selects[0].assert_is(
    exp.CombinedAggFunc
)
```
I decided against adding any 0-suffix tests because those are already covered by, e.g., the approx_top_sum tests above.
```python
# Until we are able to identify a 1- or 0-suffix aggregate function by name,
# repeatedly strip and stack suffixes (longer suffixes first, see comment on
# AGG_FUNCTIONS_SUFFIXES_SORTED). This loop only runs for 2 or more suffixes,
# as AGG_FUNC_MAPPING memoizes all 0- and 1-suffix names.
accumulated_suffix_stack: t.List[str] = []
while (parts := cls.AGG_FUNC_MAPPING.get(name)) is None:
```
Note that this "repeated stripping" only kicks in on the previously-unhandled cases. All 0- or 1-suffix aggregators (which were already supported via the memoized AGG_FUNC_MAPPING) continue to be handled the same way. There's a tradeoff judgment here between "how big to make AGG_FUNC_MAPPING" and "how fast to make _resolve_clickhouse_agg" -- exponential space usage scares me, so I decided not to further extend the memoization introduced in #2734 (comment)
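For a sense of the tradeoff, the strip-and-stack loop can be sketched in isolation. The suffix list, base-function set, and `resolve` helper below are illustrative stand-ins, not sqlglot's actual tables or internals:

```python
# Standalone sketch of the "repeatedly strip and stack suffixes" idea.
# AGG_FUNCTIONS and SUFFIXES_SORTED are tiny illustrative stand-ins for
# sqlglot's real tables, not the actual ones.
from typing import List, Optional, Tuple

AGG_FUNCTIONS = {"sum", "min", "uniqExact", "count"}
# Longer suffixes first, so a longer match like "OrNull" wins over a shorter one.
SUFFIXES_SORTED = sorted(["If", "Merge", "State", "OrNull"], key=len, reverse=True)

def resolve(name: str) -> Optional[Tuple[str, List[str]]]:
    stack: List[str] = []
    while name not in AGG_FUNCTIONS:
        for suffix in SUFFIXES_SORTED:
            if name.endswith(suffix):
                stack.append(suffix)  # the outermost suffix is stripped first
                name = name[: -len(suffix)]
                break
        else:
            return None  # no known suffix left: not a combined aggregate
    return name, stack[::-1]  # restore source order, e.g. If before Merge

print(resolve("sumIfMerge"))  # -> ('sum', ['If', 'Merge'])
print(resolve("foobar"))      # -> None
```

Space stays constant (no new mapping entries), at the cost of a few `endswith` calls per stacked suffix.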
I see the point. The current approach looks very costly (we should optimize). Can you benchmark for combinators that need a lot of stripping?
Is benchmarking merited for this? The suffix stacks we're talking about will almost never be more than 2 deep -- you can contrive a 3- or 4-suffix aggregator (as I did in a unit test), but this is O(n*k), where n is "probably 2, probably not >4" and k is 15.
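For what it's worth, a throwaway micro-benchmark along these lines would settle it; the resolver here is a simplified stand-in for the real `_resolve_clickhouse_agg`, with made-up suffix and base tables:

```python
# Throwaway micro-benchmark sketch; SUFFIXES and BASES are illustrative
# stand-ins, not sqlglot's real tables.
import timeit

SUFFIXES = sorted(["If", "Merge", "State", "OrNull"], key=len, reverse=True)
BASES = {"min", "sum"}

def strip_all(name):
    while name not in BASES:
        for suffix in SUFFIXES:
            if name.endswith(suffix):
                name = name[: -len(suffix)]
                break
        else:
            return None
    return name

# A contrived 4-suffix name: the worst case the thread considers realistic.
elapsed = timeit.timeit(lambda: strip_all("minIfStateOrNullMerge"), number=10_000)
print(f"10k resolutions of a 4-suffix name: {elapsed:.4f}s")
```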
```diff
         "expressions": anon_func.expressions,
     }
-    if parts[1]:
+    if len(parts[1]) > 0:
```
Honestly, it's kind of a shame that these suffixes aren't injected into the AST anywhere, but again, that seemed out of scope for this PR. Ref: #4814
```python
last_seen_suffix = None
for suffix in reversed(accumulated_suffix_stack):
    if suffix == last_seen_suffix:
        return None  # reject: there are adjacent duplicate combinators
    suffixes_list.append(suffix)
    last_seen_suffix = suffix
```
If we assume that the input is always correct (so we won't have something like `sumMergeMerge`), we just need to reverse here and make a list, right? SQLGlot always assumes that the input is correct.
Correct. We could even use a deque and avoid the reverse, but I didn't want to overengineer too aggressively for an initial PR. I'm not familiar with SQLGlot's "assume the input is always correct" philosophy, but I modelled this behavior after the `parse_one("foobar(x)").assert_is(exp.Anonymous)` assertion. As verified by one of the new unit tests, this doesn't fail parsing; it just causes the expression to parse as an `exp.Anonymous`. Would it be preferable to parse this as some variant of aggregate function? If so, should the `foobar(x)` case also be handled similarly?
Notwithstanding the above (I just want to ensure all the information I based this implementation on is accounted for in a final decision), I'm happy to follow whatever semantic guidance existing contributors find appropriate.
Add support for additional aggregate functions (motivated by `countIfMerge` -- happy to open a bug issue if necessary). Note: while implementing this I discovered that not all "standard" clickhouse aggregators are named in this parser. I'm not sure how it'd be possible to do so given that the list changes as clickhouse evolves, but ¯\_(ツ)_/¯ I decided to omit handling those from the scope of this PR (as far as I'm aware, the only way to know for sure which aggregators are available is by querying system tables at runtime). I've sort of self-justified this as "it looks like all the canonical names are supported, and few if any of the aliases".
There is still one deviation from clickhouse `24.12.1.1332` (the version I happen to have installed): when a function name collides with a duplicated combinator (eg, `minMergeMerge`), the dialect in this PR accepts it (as `minMerge` + `[Merge]`), but the real clickhouse rejects it. However, I think this is actually a clickhouse bug, and it's not blocking my usecase, so I've left it as-is.
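The deviation falls out of the memoization boundary: 1-suffix names like `minMerge` are already in the mapping, so stripping a single `Merge` finds a hit before any duplicate is visible. A toy reproduction (the tables and helper here are made up for illustration, not sqlglot's code):

```python
# Toy reproduction of the minMergeMerge acceptance described above.
# KNOWN stands in for the memoized AGG_FUNC_MAPPING; all names illustrative.
KNOWN = {"min": ("min", []), "minMerge": ("min", ["Merge"])}
SUFFIXES = ["OrNull", "Merge", "State", "If"]  # longest first

def resolve(name):
    stacked = []
    while name not in KNOWN:
        for suffix in SUFFIXES:
            if name.endswith(suffix):
                stacked.append(suffix)
                name = name[: -len(suffix)]
                break
        else:
            return None
    base, memoized = KNOWN[name]
    return base, memoized + stacked[::-1]

# One "Merge" is stripped, then "minMerge" hits the memoized mapping, so the
# duplicated combinator is accepted (minMerge + [Merge]) rather than rejected.
print(resolve("minMergeMerge"))  # -> ('min', ['Merge', 'Merge'])
```

Because only one suffix ever lands on the stripping stack here, the adjacent-duplicate check never sees the two `Merge`s next to each other.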