fix(duckdb): fix for Snowflake transpilation issue related to PIVOT and string literal column names#7660
Conversation
|
@fivetran-kwoodbeck I think this is working as intended: Snowflake does produce delimited columns that contain the literal quotes in their name: I'm not sure if this is the right way to solve the problem. One idea that comes to mind is forcing the per-dialect pivot column names as outputs by using named aliases ( |
SQLGlot Integration Test ResultsComparing:
By Dialect
Overallmain: 113247 total, 112163 passed (pass rate: 99.0%), sqlglot version: sqlglot:transpile/snowflake-pivot-string-literal-column-names: 101037 total, 101037 passed (pass rate: 100.0%), sqlglot version: Transitions: Dialect pair changes: 0 previous results not found, 3 current results not found ✅ 67 test(s) passed |
You're right, that does seem more correct. I'll give it a try. |
1b4bc8b to
32a14cd
Compare
|
I checked the approach and thought about it a little more. I'm still skeptical, after all, and not sure if this is the best way to address the problem at hand. My gut instinct tells me that it is not.
Since every dialect's For example, this is my attempt at emulating Snowflake's naming scheme in DuckDB, without emitting a table alias ( To fix the original problem, we must respect Snowflake's (or whichever source dialect's) naming rules. Otherwise, enforcing DuckDB's naming and only fixing the projections may result in invalid queries, because |
|
@georgesittas One thing that might be an issue with As an alternative, updating the current Does that work? |
Based on our For transpilation purposes, I think we're only interested in the "normal"
I want to avoid an approach that messes up the existing roundtrip. The condition should be: if the columns that the pivot produces are different from the ones stored in Also, the above approach won't work when targeting BigQuery, right? Because despite dropping the alias columns, hence producing valid syntax, you'll still miss the proper column names that may be needed for the query's columns to be resolvable. |
32a14cd to
ce2b1be
Compare
|
Good points. You're right, multi field I implemented a new iteration in the base Let me know if this is closer to what you were thinking. |
| elif getattr( | ||
| getattr(self.dialect, "parser_class", None), "IDENTIFY_PIVOT_STRINGS", False | ||
| ): |
There was a problem hiding this comment.
Why are we using getattr here?
| # Only inject when SELECT * has already been expanded by qualify() | ||
| parent_select = expression.find_ancestor(exp.Select) | ||
| if not parent_select or any(isinstance(e, exp.Star) for e in parent_select.expressions): | ||
| return None |
There was a problem hiding this comment.
I don't think we want to do this. The expanded columns will still differ in the source and target engines if you don't rename them during transpilation, so you're just pushing the problem further downstream.
Also, for future reference, you can generally do is_star on queries.
| if not isinstance(e, (exp.Literal, exp.PivotAlias)): | ||
| return None |
There was a problem hiding this comment.
I don't think we should return here, what if some values have aliases but others don't?
|
|
||
| agg_aliases = [agg.alias for agg in expression.expressions if agg.alias] | ||
| step = len(agg_aliases) or 1 | ||
| suffix = ("_" + agg_aliases[0]) if agg_aliases else "" |
There was a problem hiding this comment.
Why is this picking only the first alias? You can have multiple aggregations, each with a different alias.
| # name would differ from the stored name recorded by the source dialect's parser. | ||
| new_field_exprs = self._pivot_in_value_aliases(expression) | ||
| if new_field_exprs is not None: | ||
| expression = expression.copy() |
There was a problem hiding this comment.
We shouldn't copy here, right? The generator preemptively copies all ASTs it converts into SQL.
| # For IDENTIFY_PIVOT_STRINGS targets (e.g. Snowflake), string literals in the | ||
| # IN-list already produce the correct column names natively, so strip any alias | ||
| # that _pivot_in_value_aliases may have injected for other dialects. | ||
| return self.sql(expression, "this") |
There was a problem hiding this comment.
I'm not sure I understand this comment, can you give an example for why this is needed? I'd expect a simple "do source/target names differ?" check to suffice for transpilation purposes.
| if i >= len(stored) or (suffix and not stored[i].name.endswith(suffix)): | ||
| return None | ||
| stored_name = stored[i].name[: -len(suffix)] if suffix else stored[i].name | ||
| if not isinstance(e, exp.PivotAlias) and stored_name.lower() != e.alias_or_name.lower(): |
There was a problem hiding this comment.
Note that some dialects don't use the aggregation alias as a suffix, but rather as a prefix. See here.
| new_exprs.append( | ||
| exp.PivotAlias(this=e, alias=exp.to_identifier(stored_name, quoted=True)) | ||
| ) | ||
| modified = True |
There was a problem hiding this comment.
Why is quoted set to True preemptively?
|
|
||
| return f" {tablesample_keyword or self.TABLESAMPLE_KEYWORDS} {method}{expr}{seed}" | ||
|
|
||
| def _pivot_in_value_aliases(self, expression: exp.Pivot) -> list[exp.Expression] | None: |
There was a problem hiding this comment.
I think there may be a better way to achieve this, instead of actually re-generating the output columns and comparing them with what's stored in columns.
Today, there are three "knobs" that determine the PIVOT output column naming scheme:
_pivot_column_namesIDENTIFY_PIVOT_STRINGSPREFIXED_PIVOT_COLUMNS
If we find a compact way to represent the information of what the value/logic for each of these knobs is for a dialect and store that in the AST, then it should be trivial to transpile by injecting PivotAliases: if source value vs target value differs, you directly add the aliases based on columns.
The bit that needs some thought is how you cleanly represent this info, given that the first one is currently a method that can be overridden (only in duckdb and spark now, but anyway).
Fixes a Snowflake to DuckDB
PIVOTtranspilation bug triggered afterqualify()expandsSELECT *. Snowflake storesPIVOT IN ('JAN', 'FEB')values as Identifier nodes with quotes included (e.g.'JAN'), which after star-expansion produces column refs like"_0"."'JAN'". DuckDB'sPIVOTdefaults to naming those columnsJAN(no quotes), so the references fail with aBinderException.The fix injects an explicit column alias list onto the
PIVOTalias in DuckDB output — e.g. AS"_0"("EMPID", "'JAN'", "'FEB'")which forces DuckDB to produce column names that match Snowflake's semantics. This preserves column name fidelity rather than silently stripping the quotes.