Skip to content

Add BigQuery CREATE EXTERNAL TABLE WITH PARTITION COLUMNS parsing support#66

Closed
mesmacosta wants to merge 2 commits intotobilg:mainfrom
mesmacosta:feat/bigquery-create-external-table
Closed

Add BigQuery CREATE EXTERNAL TABLE WITH PARTITION COLUMNS parsing support#66
mesmacosta wants to merge 2 commits intotobilg:mainfrom
mesmacosta:feat/bigquery-create-external-table

Conversation

@mesmacosta
Copy link
Copy Markdown
Contributor

Summary

BigQuery's CREATE EXTERNAL TABLE syntax supports clauses like WITH PARTITION COLUMNS and WITH CONNECTION that use the WITH keyword differently from the generic WITH properties parser. The generic parser was causing parse failures whenever column definitions or partition columns were present.

Example: failing query before this fix

CREATE EXTERNAL TABLE IF NOT EXISTS `{project}.{dataset}.{table}`
WITH PARTITION COLUMNS (
  table_name  STRING,
  sync_date   DATE,
  start_date  DATE,
  end_date    DATE,
  sync_id     STRING
)
OPTIONS (
  format                        = 'PARQUET',
  uris                          = ['gs://{bucket}/data/table_name={table}/*'],
  hive_partition_uri_prefix     = 'gs://{bucket}/data',
  require_hive_partition_filter = false
);
Error: Expected identifier, got LParen

Fix:

  • Consolidate all BigQuery EXTERNAL TABLE WITH handling into a single location at the WITH token consumption site, with a clean early return
  • Add EXTERNAL to BigQuery's special_modifier dialect list
  • Add generation support (builder + generator) for the new AST fields (with_partition_columns, with_connection)

This does not affect parsing of regular BigQuery CREATE TABLE statements, CTEs in AS SELECT, or any other dialect's WITH handling — the logic only activates when table_modifier == "EXTERNAL" AND dialect == BigQuery.

Changes

  • parser.rs — BigQuery EXTERNAL TABLE parsing + 19 new tests
  • expressions.rs — New with_partition_columns and with_connection fields on CreateTable
  • builder.rs — Builder support for new fields
  • generator.rs — SQL generation for BigQuery EXTERNAL TABLE syntax
  • dialects/mod.rs — Add EXTERNAL to BigQuery's special modifiers

Test plan

  • 12 new BigQuery EXTERNAL TABLE tests (basic, with columns, with partition columns, with connection, full, roundtrips, IF NOT EXISTS, OR REPLACE, PR description SQL)
  • 8 new BigQuery WITH syntax compatibility tests (regular AS SELECT, single CTE, multiple CTEs, PARTITION BY + CLUSTER BY + OPTIONS, OR REPLACE, IF NOT EXISTS)
  • All 20 BigQuery-specific tests pass
  • Full test suite: 917 passed, 0 failed, 0 regressions

- Parse CREATE EXTERNAL TABLE with column definitions, WITH PARTITION
  COLUMNS, WITH CONNECTION, and OPTIONS clauses
- Add is_bigquery_external guard to prevent generic with_properties
  from consuming the WITH token needed by BigQuery-specific clauses
- Add BigQuery EXTERNAL TABLE generation (builder + generator)
- Add EXTERNAL to BigQuery dialect's special_modifier list
- Add 19 tests covering parsing, roundtrip, and WITH syntax
  compatibility (CTEs, PARTITION BY, CLUSTER BY, OR REPLACE, etc.)
- All 916 tests pass with no regressions
@tobilg
Copy link
Copy Markdown
Owner

tobilg commented Mar 15, 2026

Nice work! Clean early-return pattern that keeps the BigQuery-specific logic well-scoped. A few things to address:

Must fix

match_identifier("COLUMNS") ignores its return value (parser.rs, around the WITH PARTITION COLUMNS parsing block):

  self.advance(); // consume PARTITION
  self.match_identifier("COLUMNS");   // ← return value discarded
  if self.check(TokenType::LParen) {

If someone writes WITH PARTITION (dt DATE) (missing COLUMNS), this silently accepts it — match_identifier returns false without advancing, and the LParen check still succeeds.
Should check the return value and error, e.g.:

  if !self.match_identifier("COLUMNS") {
      // error or handle bare WITH PARTITION case if BigQuery allows it
  }

Suggestions

  • Reversed clause order: The while loop allows WITH PARTITION COLUMNS and WITH CONNECTION in any order (good!), but there's no test for WITH CONNECTION ... WITH PARTITION
    COLUMNS .... Worth adding one to lock in that behavior.
  • Bare WITH PARTITION COLUMNS (no column list): BigQuery allows this for auto-detected hive partition columns. Currently the code just skips column parsing if no ( follows,
    which happens to work, but a dedicated test would make this intentional.
  • Duplicate clauses: Multiple WITH PARTITION COLUMNS clauses silently overwrite. Not a real-world concern but worth a comment in the code.

Nits

  • Extra blank line at the end of the BigQuery early-return block (between the closing } and // For DYNAMIC/ICEBERG/EXTERNAL tables...).

What looks good

  • Early-return keeps the generic WITH properties path untouched
  • Generator placement is correct (WITH PARTITION COLUMNS → WITH CONNECTION → OPTIONS)
  • Pretty-print and compact modes both handled
  • Great test coverage — especially the 8 compatibility tests verifying non-EXTERNAL BigQuery syntax isn't broken
  • Roundtrip tests for all major variants
  • Proper #[serde(default, skip_serializing_if = ...)] on new fields

@mesmacosta mesmacosta force-pushed the feat/bigquery-create-external-table branch from 1a9add1 to 0f99e5f Compare March 16, 2026 17:11
@mesmacosta
Copy link
Copy Markdown
Contributor Author

@tobilg addressed the review items.

@tobilg
Copy link
Copy Markdown
Owner

tobilg commented Mar 16, 2026

Will have a look, thank you!

@mesmacosta
Copy link
Copy Markdown
Contributor Author

@tobilg did you get a chance to look at this?

@MartinSahlen
Copy link
Copy Markdown

Hey @tobilg (I work with Marcelo), as I stated here (#28 (comment)) we really like this library and it's potential and we could see ourselves contribute - hence this PR.

One observation however is that while issues seem to get merged fast, there is currently 3 PRs that are moving very slow, comparably, even if contributors seem to turn around fast. In terms of adoption and "feeling good" about using a library like this I feel a bit uneasy. Our options are either keep using sqlparser-rs which is stable both in terms of functionality, adoption and community. My assumption is that this at the moment lives mainly on your computer and it's much easier to feed the issues via MCP or CLI directly to an LLM and crank out code vs integrating external code.

This is not a critique per se, we know where this library comes from and it's an impressive piece of work. But it would be good to understand your plans and thinking going forward. Will it stay a "pet project" along with other initiatives or would you look to add other maintainers etc?

tobilg added a commit that referenced this pull request Apr 7, 2026
@tobilg tobilg closed this Apr 7, 2026
@tobilg
Copy link
Copy Markdown
Owner

tobilg commented Apr 7, 2026

Hey @MartinSahlen and @mesmacosta,

thanks for the PR and your contributions. Sorry for the delay. During the last few weeks this repo was hammered with automated GH issues (assumingly) created by clankers, and I maintain multiple dozens of other projects, many of which are related to DuckDB which had two releases in short term as well that had me busy updating them as well.

This PR was stale and not mergeable since nearly three weeks if Claude Code is correct. I merged the PRs contents to the main branch this morning, this should be available in the next version 0.3.1 that will be released this week. I'll try to integrate the other PRs as well.

I was on holidays for one week and didn't touch a computer in this time. I attribute multiple hours a week to my OSS projects, and kept doing this for the last 8-10 years. I pay a monthly 3-digit dollar amount for my AI subscriptions, and a 2-digit dollar amount for running CI and test environments, all of it used to create and maintain my OSS projects, for free.

Regarding the question whether to use sqlparser-rs or polyglot, I think this is up to you guys to decide. I try my best to do many releases, some with very noticeable improvements regarding speed etc., but the project is currently just me with Claude Code & Codex, some occasional PRs and a lot of (automated) GH issue reports of sometimes questionable quality. All I can offer is best effort, as I do have a full day job... I'm open to onboard other maintainers, but so far nobody came forward with such an idea.

@MartinSahlen
Copy link
Copy Markdown

All good, as you understand this was not meant as harsh critique, more as an outsiders perspective and genuine question. Hence my question of where this sits in your "portfolio" of other OSS work and projects, and how you see it evolve.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants