Skip to content

Comments

feat(Clickhouse)!: Add support for multiple-suffix combined aggregate functions in Clickhouse dialect#7109

Open
emanb29 wants to merge 1 commit intotobymao:mainfrom
hydrolix:bugfix/clickhouse-recognize-aggregate-funcs
Open

feat(Clickhouse)!: Add support for multiple-suffix combined aggregate functions in Clickhouse dialect#7109
emanb29 wants to merge 1 commit intotobymao:mainfrom
hydrolix:bugfix/clickhouse-recognize-aggregate-funcs

Conversation

@emanb29
Copy link

@emanb29 emanb29 commented Feb 19, 2026

Add support for additional aggregate functions (motivated by countIfMerge -- happy to open a bug issue if necessary).

Note: while implementing this I discovered that not all "standard" clickhouse aggregators are named in this parser. Not sure how it'd be possible to do so given that list changes as clickhouse evolves, but ¯_(ツ)_/¯ I decided to omit handling those from the scope of this PR (as far as I'm aware, the only way to know for sure which aggregators are available is by querying system tables at runtime). I've sort of self-justified this as "looks like all the canonical names are supported, and few if any of the aliases"

There is still one deviation from clickhouse 24.12.1.1332 (the version I happen to have installed): when a function name collides with a duplicated combinator (eg, minMergeMerge), the dialect in this PR accepts it (as minMerge + [Merge]), but the real clickhouse rejects it. However, I think this is actually a clickhouse bug, and it's not blocking my usecase, so I've left it as-is.

…house dialect.

Note: while implementing this I discovered that not all "standard" clickhouse aggregators are named in this parser. Not sure how it'd be possible to do so given that list changes as clickhouse evolves, but ¯\_(ツ)_/¯

One deviation from clickhouse `24.12.1.1332` (the version I happen to have installed): when a function name collides with a duplicated combinator (eg, `minMergeMerge`), the dialect accepts it (as minMerge + [Merge]), but the real clickhouse rejects it. However, I think this is actually a clickhouse bug so I'm leaving it as-is.
0
].assert_is(exp.ParameterizedAgg)

def test_agg_functions_multiple_suffixes(self):
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I recommend starting from here as you review -- this outlines the new behaviors

Comment on lines +1360 to +1363
# Regression test: single-suffix
self.validate_identity("SELECT uniqExactIf(x, y) FROM t").selects[0].assert_is(
exp.CombinedAggFunc
)
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I decided against adding any 0-suffix tests because those are already covered by eg approx_top_sum tests above

@emanb29 emanb29 changed the title Add support for multiple-suffix combined aggregate functions in Clickhouse dialect feat(Clickhouse)!: Add support for multiple-suffix combined aggregate functions in Clickhouse dialect Feb 19, 2026
Comment on lines +588 to +593
# Until we are able to identify a 1- or 0-suffix aggregate function by name,
# repeatedly strip and stack suffixes (longer suffixes first, see comment on
# AGG_FUNCTIONS_SUFFIXES_SORTED). This loop only runs for 2 or more suffixes,
# as AGG_FUNC_MAPPING memoizes all 0- and 1-suffix
accumulated_suffix_stack: t.List[str] = []
while (parts := cls.AGG_FUNC_MAPPING.get(name)) is None:
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that this "repeated stripping" only kicks in on the previously-unhandled cases. All 0- or 1-suffix aggregators (which were already supported via the memoized AGG_FUNC_MAPPING) continue to be implemented the same way. There's a tradeoff judgement here between "how big to make AGG_FUNC_MAPPING" vs "how fast to make _resolve_clickhouse_agg" -- Exponential space usage scares me, so I decided not to further extend the memoization introduced in #2734 (comment)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see the point. The current approach looks very costly (we should optimize). Can you benchmark for combinators that need a lot of stripping?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is benchmarking merited for this? The suffix stacks we're talking about are almost never going to be more than 2 -- you can contrive a 3- or 4-suffix aggregator (as I did in a unit test), but this is O(n*k) where n is "probably 2, probably not >4" and k is 15.

"expressions": anon_func.expressions,
}
if parts[1]:
if len(parts[1]) > 0:
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Honestly, kind of a shame that these suffixes aren't injected into the AST anywhere, but again, seemed out of scope for this PR. Ref: #4814

Comment on lines +612 to +617
last_seen_suffix = None
for suffix in reversed(accumulated_suffix_stack):
if suffix == last_seen_suffix:
return None # reject: there are adjacent duplicate combinators
suffixes_list.append(suffix)
last_seen_suffix = suffix
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we assume that the input is always correct (so we wont' have something like ssumMergeMerge). We just need to reverse here and make a list right ?

SQLGlot always assumes that the input is always correct.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct. We could even use a dequeue and avoid the reverse, but I didn't want to overengineer too aggressively for an initial PR. I'm not familiar with the "assume the input is always correct" philosophy of SQLGlot, but I modelled this behavior after the parse_one("foobar(x)").assert_is(exp.Anonymous) assertion. As verified by one of the new unit tests, this doesn't fail parsing, it just causes the expression to parse as an exp.Anonymous. Would it be preferable to parse this as some variant of AggregateFunction? If so, should the foobar(x) case also be handled similarly?

Notwithstanding the above (just aiming to ensure all the information I based this implementation on is accounted for in a final decision), I'm happy to follow whatever semantic guidance existing contributors find appropriate,

Comment on lines +588 to +593
# Until we are able to identify a 1- or 0-suffix aggregate function by name,
# repeatedly strip and stack suffixes (longer suffixes first, see comment on
# AGG_FUNCTIONS_SUFFIXES_SORTED). This loop only runs for 2 or more suffixes,
# as AGG_FUNC_MAPPING memoizes all 0- and 1-suffix
accumulated_suffix_stack: t.List[str] = []
while (parts := cls.AGG_FUNC_MAPPING.get(name)) is None:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see the point. The current approach looks very costly (we should optimize). Can you benchmark for combinators that need a lot of stripping?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants