Skip to content

fix(duckdb): fix for Snowflake transpilation issue related to PIVOT and string literal column names#7660

Open
fivetran-kwoodbeck wants to merge 5 commits into
mainfrom
transpile/snowflake-pivot-string-literal-column-names
Open

fix(duckdb): fix for Snowflake transpilation issue related to PIVOT and string literal column names#7660
fivetran-kwoodbeck wants to merge 5 commits into
mainfrom
transpile/snowflake-pivot-string-literal-column-names

Conversation

@fivetran-kwoodbeck
Copy link
Copy Markdown
Collaborator

@fivetran-kwoodbeck fivetran-kwoodbeck commented May 19, 2026

Fixes a Snowflake to DuckDB PIVOT transpilation bug triggered after qualify() expands SELECT *. Snowflake stores PIVOT IN ('JAN', 'FEB') values as Identifier nodes with quotes included (e.g. 'JAN'), which after star-expansion produces column refs like "_0"."'JAN'". DuckDB's PIVOT defaults to naming those columns JAN (no quotes), so the references fail with a BinderException.

The fix injects an explicit column alias list onto the PIVOT alias in DuckDB output — e.g. AS "_0"("EMPID", "'JAN'", "'FEB'") which forces DuckDB to produce column names that match Snowflake's semantics. This preserves column name fidelity rather than silently stripping the quotes.

@georgesittas
Copy link
Copy Markdown
Collaborator

@fivetran-kwoodbeck I think this is working as intended: Snowflake does produce delimited columns that contain the literal quotes in their name:

(sqlglot) ➜  sqlglot git:(main) runsf "SELECT *
  FROM (
    SELECT * FROM (VALUES
      ('A', 'JAN', 100),
      ('A', 'FEB', 200),
      ('A', 'MAR', 150),
      ('B', 'JAN', 300),
      ('B', 'FEB', 250),
      ('B', 'MAR', 400)
    ) AS sales(product, month, amount)
  )
  PIVOT (
    SUM(amount) FOR month IN ('JAN', 'FEB', 'MAR')
  ) AS p;"
SELECT *
  FROM (
    SELECT * FROM (VALUES
      ('A', 'JAN', 100),
      ('A', 'FEB', 200),
      ('A', 'MAR', 150),
      ('B', 'JAN', 300),
      ('B', 'FEB', 250),
      ('B', 'MAR', 400)
    ) AS sales(product, month, amount)
  )
  PIVOT (
    SUM(amount) FOR month IN ('JAN', 'FEB', 'MAR')
  ) AS p;
+---------------------------------+
| PRODUCT | 'JAN' | 'FEB' | 'MAR' |
|---------+-------+-------+-------|
| A       | 100   | 200   | 150   |
| B       | 300   | 250   | 400   |
+---------------------------------+

(sqlglot) ➜  sqlglot git:(main) runsf "SELECT \"'JAN'\"
  FROM (
    SELECT * FROM (VALUES
      ('A', 'JAN', 100),
      ('A', 'FEB', 200),
      ('A', 'MAR', 150),
      ('B', 'JAN', 300),
      ('B', 'FEB', 250),
      ('B', 'MAR', 400)
    ) AS sales(product, month, amount)
  )
  PIVOT (
    SUM(amount) FOR month IN ('JAN', 'FEB', 'MAR')
  ) AS p;"
SELECT "'JAN'"
  FROM (
    SELECT * FROM (VALUES
      ('A', 'JAN', 100),
      ('A', 'FEB', 200),
      ('A', 'MAR', 150),
      ('B', 'JAN', 300),
      ('B', 'FEB', 250),
      ('B', 'MAR', 400)
    ) AS sales(product, month, amount)
  )
  PIVOT (
    SUM(amount) FOR month IN ('JAN', 'FEB', 'MAR')
  ) AS p;
+-------+
| 'JAN' |
|-------|
| 100   |
| 300   |
+-------+

I'm not sure if this is the right way to solve the problem. One idea that comes to mind is forcing the per-dialect pivot column names as outputs by using named aliases (PIVOT(...) AS _0("'JAN'", ...), etc).

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 19, 2026

SQLGlot Integration Test Results

Comparing:

  • this branch (sqlglot:transpile/snowflake-pivot-string-literal-column-names, sqlglot version: transpile/snowflake-pivot-string-literal-column-names)
  • baseline (main, sqlglot version: 0.0.1.dev1)

By Dialect

dialect main sqlglot:transpile/snowflake-pivot-string-literal-column-names transitions links
bigquery -> bigquery 24647/24652 passed (100.0%) 23497/23497 passed (100.0%) No change full result / delta
bigquery -> duckdb 867/1154 passed (75.1%) 0/0 passed (0.0%) Results not found full result / delta
duckdb -> duckdb 5823/5823 passed (100.0%) 0/0 passed (0.0%) Results not found full result / delta
snowflake -> duckdb 1123/1862 passed (60.3%) 0/0 passed (0.0%) Results not found full result / delta
snowflake -> snowflake 65190/65243 passed (99.9%) 63027/63027 passed (100.0%) No change full result / delta
databricks -> databricks 1370/1370 passed (100.0%) 1370/1370 passed (100.0%) No change full result / delta
postgres -> postgres 6042/6042 passed (100.0%) 6042/6042 passed (100.0%) No change full result / delta
redshift -> redshift 7101/7101 passed (100.0%) 7101/7101 passed (100.0%) No change full result / delta

Overall

main: 113247 total, 112163 passed (pass rate: 99.0%), sqlglot version: 0.0.1.dev1

sqlglot:transpile/snowflake-pivot-string-literal-column-names: 101037 total, 101037 passed (pass rate: 100.0%), sqlglot version: transpile/snowflake-pivot-string-literal-column-names

Transitions:
No change

Dialect pair changes: 0 previous results not found, 3 current results not found

✅ 67 test(s) passed

@fivetran-kwoodbeck
Copy link
Copy Markdown
Collaborator Author

@fivetran-kwoodbeck I think this is working as intended: Snowflake does produce delimited columns that contain the literal quotes in their name:

(sqlglot) ➜  sqlglot git:(main) runsf "SELECT *
  FROM (
    SELECT * FROM (VALUES
      ('A', 'JAN', 100),
      ('A', 'FEB', 200),
      ('A', 'MAR', 150),
      ('B', 'JAN', 300),
      ('B', 'FEB', 250),
      ('B', 'MAR', 400)
    ) AS sales(product, month, amount)
  )
  PIVOT (
    SUM(amount) FOR month IN ('JAN', 'FEB', 'MAR')
  ) AS p;"
SELECT *
  FROM (
    SELECT * FROM (VALUES
      ('A', 'JAN', 100),
      ('A', 'FEB', 200),
      ('A', 'MAR', 150),
      ('B', 'JAN', 300),
      ('B', 'FEB', 250),
      ('B', 'MAR', 400)
    ) AS sales(product, month, amount)
  )
  PIVOT (
    SUM(amount) FOR month IN ('JAN', 'FEB', 'MAR')
  ) AS p;
+---------------------------------+
| PRODUCT | 'JAN' | 'FEB' | 'MAR' |
|---------+-------+-------+-------|
| A       | 100   | 200   | 150   |
| B       | 300   | 250   | 400   |
+---------------------------------+

(sqlglot) ➜  sqlglot git:(main) runsf "SELECT \"'JAN'\"
  FROM (
    SELECT * FROM (VALUES
      ('A', 'JAN', 100),
      ('A', 'FEB', 200),
      ('A', 'MAR', 150),
      ('B', 'JAN', 300),
      ('B', 'FEB', 250),
      ('B', 'MAR', 400)
    ) AS sales(product, month, amount)
  )
  PIVOT (
    SUM(amount) FOR month IN ('JAN', 'FEB', 'MAR')
  ) AS p;"
SELECT "'JAN'"
  FROM (
    SELECT * FROM (VALUES
      ('A', 'JAN', 100),
      ('A', 'FEB', 200),
      ('A', 'MAR', 150),
      ('B', 'JAN', 300),
      ('B', 'FEB', 250),
      ('B', 'MAR', 400)
    ) AS sales(product, month, amount)
  )
  PIVOT (
    SUM(amount) FOR month IN ('JAN', 'FEB', 'MAR')
  ) AS p;
+-------+
| 'JAN' |
|-------|
| 100   |
| 300   |
+-------+

I'm not sure if this is the right way to solve the problem. One idea that comes to mind is forcing the per-dialect pivot column names as outputs by using named aliases (PIVOT(...) AS _0("'JAN'", ...), etc).

You're right, that does seem more correct. I'll give it a try.

@fivetran-kwoodbeck fivetran-kwoodbeck force-pushed the transpile/snowflake-pivot-string-literal-column-names branch 2 times, most recently from 1b4bc8b to 32a14cd Compare May 19, 2026 19:31
@georgesittas
Copy link
Copy Markdown
Collaborator

I checked the approach and thought about it a little more. I'm still skeptical, after all, and not sure if this is the best way to address the problem at hand. My gut instinct tells me that it is not.

  1. We iterate over pivot_cols every time we generate SQL for a Pivot, even when round-tripping. This feels unnecessary and counter-intuitive.
  2. We only handle references to these PIVOT-generated columns that are simple col AS alias and appear in the projection list of the parent SELECT. Things like col + 1 AS alias, or column references elsewhere, aren't detected.
  3. If you wanted to cover other dialect pairs, like Snowflake -> BigQuery, this approach would need to be duplicated.
  4. Related to (3), other naming schemes may not be transpiled correctly today, e.g., Spark/BigQuery -> DuckDB.

Since every dialect's PIVOT already computes its output column names as a function of the IN values and the aggregate functions used, aliasing the IN values would redirect that computation at the source. So, instead of fighting the target's naming, you tell it what to produce. This may work well but needs some investigation.

For example, this is my attempt at emulating Snowflake's naming scheme in DuckDB, without emitting a table alias (PIVOT ... AS t(col1, col2, ...)), which in BigQuery's case wouldn't be supported syntactically anyway:

(sqlglot) ➜  sqlglot git:(main) runddb "SELECT * FROM (SELECT * FROM (VALUES ('A', 'JAN', 100), ('A', 'FEB', 200), ('A', 'MAR', 150), ('B', 'JAN', 300), ('B', 'FEB', 250), ('B', 'MAR', 400)) AS sales(product, month, amount)) PIVOT(SUM(amount) FOR month IN ('JAN', 'FEB', 'MAR')) AS p"
┌─────────┬────────┬────────┬────────┐
│ product │  JAN   │  FEB   │  MAR   │
│ varchar │ int128 │ int128 │ int128 │
├─────────┼────────┼────────┼────────┤
│ B       │    300 │    250 │    400 │
│ A       │    100 │    200 │    150 │
└─────────┴────────┴────────┴────────┘
(sqlglot) ➜  sqlglot git:(main) runddb "SELECT * FROM (SELECT * FROM (VALUES ('A', 'JAN', 100), ('A', 'FEB', 200), ('A', 'MAR', 150), ('B', 'JAN', 300), ('B', 'FEB', 250), ('B', 'MAR', 400)) AS sales(product, month, amount)) PIVOT(SUM(amount) FOR month IN ('JAN' AS \"'JAN'\", 'FEB' AS \"'FEB'\", 'MAR' AS \"'MAR'\")) AS p"
┌─────────┬────────┬────────┬────────┐
│ product │ 'JAN'  │ 'FEB'  │ 'MAR'  │
│ varchar │ int128 │ int128 │ int128 │
├─────────┼────────┼────────┼────────┤
│ A       │    100 │    200 │    150 │
│ B       │    300 │    250 │    400 │
└─────────┴────────┴────────┴────────┘

To fix the original problem, we must respect Snowflake's (or whichever source dialect's) naming rules. Otherwise, enforcing DuckDB's naming and only fixing the projections may result in invalid queries, because Columns found elsewhere (e.g., filters) will be invalid / stale w.r.t. the rewrite.

@fivetran-kwoodbeck
Copy link
Copy Markdown
Collaborator Author

fivetran-kwoodbeck commented May 20, 2026

@georgesittas One thing that might be an issue with IN is the challenge with multi-field PIVOTs e.g. FOR country IN (...) year IN (...) since the output column names are a cartesian product across all the IN fields (e.g. nl_2000_amsterdam_total) which makes it unclear which IN to attach the alias to.

As an alternative, updating the current alias.columns approach, we can make it universal by simply removing the IDENTIFY_PIVOT_STRINGS guard and always set alias.columns on the PIVOT alias when pivot_columns is non-empty. This covers BigQuery (PREFIXED_PIVOT_COLUMNS) and any other source dialect's naming scheme automatically. Dialects that don't support AS alias(col_list) on PIVOT (e.g. BigQuery as a target) already have SUPPORTS_TABLE_ALIAS_COLUMNS = False, so they silently drop it.

Does that work?

@georgesittas
Copy link
Copy Markdown
Collaborator

georgesittas commented May 20, 2026

One thing that might be an issue with IN is the challenge with multi-field PIVOTs e.g. FOR country IN (...) year IN (...) since the output column names are a cartesian product across all the IN fields (e.g. nl_2000_amsterdam_total) which makes it unclear which IN to attach the alias to.

Based on our PIVOT column tests, I think only DuckDB supports multi-IN clause syntax (worth verifying, but checks out based on docs), so it should be trivial to mark it as "unsupported" in other dialects and avoid dealing with it.

For transpilation purposes, I think we're only interested in the "normal" PIVOT syntax with one IN clause, possible aliases and multiple aggregates. Can the alias injection scheme I suggested work, given these assumptions/facts?

As an alternative, updating the current alias.columns approach, we can make it universal by simply removing the IDENTIFY_PIVOT_STRINGS guard and always set alias.columns on the PIVOT alias when pivot_columns is non-empty. This covers BigQuery (PREFIXED_PIVOT_COLUMNS) and any other source dialect's naming scheme automatically. Dialects that don't support AS alias(col_list) on PIVOT (e.g. BigQuery as a target) already have SUPPORTS_TABLE_ALIAS_COLUMNS = False, so they silently drop it.

Does that work?

I want to avoid an approach that messes up the existing roundtrip. The condition should be: if the columns that the pivot produces are different from the ones stored in columns, it means we're transpiling and so we must inject aliases. This way, you will preserve the current roundtripping behavior.

Also, the above approach won't work when targeting BigQuery, right? Because despite dropping the alias columns, hence producing valid syntax, you'll still miss the proper column names that may be needed for the query's columns to be resolvable.

@fivetran-kwoodbeck fivetran-kwoodbeck force-pushed the transpile/snowflake-pivot-string-literal-column-names branch from 32a14cd to ce2b1be Compare May 22, 2026 18:15
@fivetran-kwoodbeck
Copy link
Copy Markdown
Collaborator Author

Good points. You're right, multi field PIVOT is DuckDB only, so for transpilation purposes we can assume there's only 1 IN clause.

I implemented a new iteration in the base pivot_sql. When the stored column name differs from what the target dialect would naturally produce, it adds a PivotAlias. The injection only happens after qualify() has expanded SELECT * and when it differs, so roundtrip stays clean and plain transpiles are unaffected.

Let me know if this is closer to what you were thinking.

@georgesittas georgesittas self-assigned this May 25, 2026
Comment thread sqlglot/generator.py
Comment on lines +3903 to +3905
elif getattr(
getattr(self.dialect, "parser_class", None), "IDENTIFY_PIVOT_STRINGS", False
):
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are we using getattr here?

Comment thread sqlglot/generator.py
Comment on lines +2465 to +2468
# Only inject when SELECT * has already been expanded by qualify()
parent_select = expression.find_ancestor(exp.Select)
if not parent_select or any(isinstance(e, exp.Star) for e in parent_select.expressions):
return None
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we want to do this. The expanded columns will still differ in the source and target engines if you don't rename them during transpilation, so you're just pushing the problem further downstream.

Also, for future reference, you can generally do is_star on queries.

Comment thread sqlglot/generator.py
Comment on lines +2478 to +2479
if not isinstance(e, (exp.Literal, exp.PivotAlias)):
return None
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we should return here, what if some values have aliases but others don't?

Comment thread sqlglot/generator.py

agg_aliases = [agg.alias for agg in expression.expressions if agg.alias]
step = len(agg_aliases) or 1
suffix = ("_" + agg_aliases[0]) if agg_aliases else ""
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this picking only the first alias? You can have multiple aggregations, each with a different alias.

Comment thread sqlglot/generator.py
# name would differ from the stored name recorded by the source dialect's parser.
new_field_exprs = self._pivot_in_value_aliases(expression)
if new_field_exprs is not None:
expression = expression.copy()
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We shouldn't copy here, right? The generator preemptively copies all ASTs it converts into SQL.

Comment thread sqlglot/generator.py
Comment on lines +3906 to +3909
# For IDENTIFY_PIVOT_STRINGS targets (e.g. Snowflake), string literals in the
# IN-list already produce the correct column names natively, so strip any alias
# that _pivot_in_value_aliases may have injected for other dialects.
return self.sql(expression, "this")
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I understand this comment, can you give an example for why this is needed? I'd expect a simple "do source/target names differ?" check to suffice for transpilation purposes.

Comment thread sqlglot/generator.py
Comment on lines +2480 to +2483
if i >= len(stored) or (suffix and not stored[i].name.endswith(suffix)):
return None
stored_name = stored[i].name[: -len(suffix)] if suffix else stored[i].name
if not isinstance(e, exp.PivotAlias) and stored_name.lower() != e.alias_or_name.lower():
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that some dialects don't use the aggregation alias as a suffix, but rather as a prefix. See here.

Comment thread sqlglot/generator.py
Comment on lines +2484 to +2487
new_exprs.append(
exp.PivotAlias(this=e, alias=exp.to_identifier(stored_name, quoted=True))
)
modified = True
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is quoted set to True preemptively?

Comment thread sqlglot/generator.py

return f" {tablesample_keyword or self.TABLESAMPLE_KEYWORDS} {method}{expr}{seed}"

def _pivot_in_value_aliases(self, expression: exp.Pivot) -> list[exp.Expression] | None:
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there may be a better way to achieve this, instead of actually re-generating the output columns and comparing them with what's stored in columns.

Today, there are three "knobs" that determine the PIVOT output column naming scheme:

  • _pivot_column_names
  • IDENTIFY_PIVOT_STRINGS
  • PREFIXED_PIVOT_COLUMNS

If we find a compact way to represent the information of what the value/logic for each of these knobs is for a dialect and store that in the AST, then it should be trivial to transpile by injecting PivotAliases: if source value vs target value differs, you directly add the aliases based on columns.

The bit that needs some thought is how you cleanly represent this info, given that the first one is currently a method that can be overridden (only in duckdb and spark now, but anyway).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants