
[SPARK-56755][SQL] Fix SHOW CREATE TABLE for v2 table partitioned by bucket transform #55718

Open
pan3793 wants to merge 2 commits into apache:master from pan3793:SPARK-56755

Conversation

@pan3793
Member

@pan3793 pan3793 commented May 6, 2026

What changes were proposed in this pull request?

In ShowCreateTableExec, convert BucketTransform to CLUSTERED BY ... [SORTED BY ...] INTO n BUCKETS only for v1 tables. For v2 tables, treat BucketTransform as a normal transform and preserve it in the PARTITIONED BY ... clause.
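The proposed dispatch can be sketched as follows. This is a hedged toy model: Transform, BucketTransform, V1Table, etc. here are simplified stand-ins, not Spark's actual connector classes, and the rendering logic is condensed for illustration.

```scala
// Simplified stand-ins for Spark's connector types (names borrowed, shapes simplified).
sealed trait Transform
case class IdentityTransform(col: String) extends Transform
case class BucketTransform(numBuckets: Int, col: String) extends Transform

sealed trait Table { def partitioning: Seq[Transform] }
case class V1Table(partitioning: Seq[Transform]) extends Table
case class V2Table(partitioning: Seq[Transform]) extends Table

def describe(t: Transform): String = t match {
  case IdentityTransform(c)  => c
  case BucketTransform(n, c) => s"bucket($n, $c)"
}

// V1Table: hoist the (at most one) bucket transform into the v1 CLUSTERED BY clause.
// V2 tables: render every transform inline inside PARTITIONED BY.
def showPartitioning(table: Table): String = table match {
  case V1Table(parts) =>
    val (buckets, others) = parts.partition(_.isInstanceOf[BucketTransform])
    require(buckets.size <= 1, "V1Table can carry at most one bucket transform")
    val partClause =
      if (others.nonEmpty) s"PARTITIONED BY (${others.map(describe).mkString(", ")})" else ""
    val bucketClause = buckets.headOption.map {
      case BucketTransform(n, c) => s"CLUSTERED BY ($c) INTO $n BUCKETS"
      case _                     => ""
    }.getOrElse("")
    Seq(partClause, bucketClause).filter(_.nonEmpty).mkString("\n")
  case V2Table(parts) =>
    s"PARTITIONED BY (${parts.map(describe).mkString(", ")})"
}
```

With this shape, a v2 table partitioned by bucket(4, user_id), bucket(2, item_id), dt renders all three transforms inline, while the v1 single-bucket rendering is unchanged.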

Why are the changes needed?

BucketTransform is a v1-specific case: a v1 table is restricted to at most one bucket transform. No such restriction applies to v2 tables, so the SHOW CREATE TABLE output is incorrect and misleading for, e.g., an Iceberg table partitioned by two bucket transforms:

spark-sql (default)> create table t1(id int, user_id int, item_id int, dt string) using iceberg partitioned by (bucket(4, user_id), bucket(2, item_id), dt);
Time taken: 1.397 seconds
spark-sql (default)> show create table t1;
CREATE TABLE spark_catalog.default.t1 (
  id INT,
  user_id INT,
  item_id INT,
  dt STRING COLLATE UTF8_BINARY)
USING iceberg
PARTITIONED BY (dt)
CLUSTERED BY (item_id)
INTO 2 BUCKETS
LOCATION 'hdfs://hadoop-master1.orb.local:8020/warehouse/t1'
TBLPROPERTIES (
  'current-snapshot-id' = 'none',
  'format' = 'iceberg/parquet',
  'format-version' = '2',
  'write.parquet.compression-codec' = 'zstd')

Time taken: 0.253 seconds, Fetched 1 row(s)

Does this PR introduce any user-facing change?

Yes, SHOW CREATE TABLE ... now correctly displays the PARTITIONED BY clause for v2 tables that have multiple bucket partition transforms.

How was this patch tested?

New UT.

Was this patch authored or co-authored using generative AI tooling?

No.

@pan3793
Member Author

pan3793 commented May 7, 2026

cc @peter-toth @cloud-fan

Contributor

@cloud-fan cloud-fan left a comment


Summary

Prior state and problem. ShowCreateTableExec.showTablePartitioning extracts every BucketTransform from a v2 table's partitioning into a bucketSpec variable in order to render it as the v1-style CLUSTERED BY (col) [SORTED BY (col)] INTO N BUCKETS clause. The variable is a single Option[BucketSpec], so when a v2 table has multiple bucket transforms (legal in Iceberg, etc.), each iteration overwrites the previous one and the SHOW output silently keeps only the last bucket — the example in the PR description shows an Iceberg table with bucket(4, user_id), bucket(2, item_id), dt rendering as PARTITIONED BY (dt) CLUSTERED BY (item_id) INTO 2 BUCKETS.
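The overwrite can be reproduced with a toy fold (a simplified model, not Spark's actual code): because the accumulator is a single Option, each bucket transform replaces the previous one and only the last survives.

```scala
case class BucketSpec(numBuckets: Int, col: String)

// Toy model of the old loop in showTablePartitioning: a single Option[BucketSpec]
// that every bucket transform overwrites, so only the last bucket is kept.
def collectBucketSpec(buckets: Seq[(Int, String)]): Option[BucketSpec] = {
  var spec: Option[BucketSpec] = None
  buckets.foreach { case (n, c) => spec = Some(BucketSpec(n, c)) }
  spec
}
```

Feeding it the PR's example, bucket(4, user_id) then bucket(2, item_id), yields only the item_id spec, matching the CLUSTERED BY (item_id) INTO 2 BUCKETS output shown above.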

The CLUSTERED BY / INTO BUCKETS syntax is fundamentally a v1 (Hive) artifact: the v2 CREATE TABLE grammar already supports bucket(N, col) directly inside PARTITIONED BY, and V1Table.partitioning is the only Table type whose partitioning is guaranteed to carry at most one BucketTransform (it's built from a single Option[BucketSpec]).

Design approach. Restrict the v1-style extraction to V1Table via a pattern guard; let v2 tables fall through to the generic t.describe() branch so each BucketTransform is rendered inline as bucket(N, col) in PARTITIONED BY. This preserves existing output for v1 datasource tables that reach this exec (non-Hive v1 with spark.sql.legacy.useV1Command=false) and fixes the multi-bucket v2 case. As a side effect, single-bucket v2 tables also switch from CLUSTERED BY (b) INTO N BUCKETS to inline bucket(N, b) — the existing [multi-partition] test was updated to assert this.

Implementation sketch. One conditional in showTablePartitioning plus a defensive bucketSpec.nonEmpty throw mirroring CatalogV2Implicits.convertTransforms. The v1 ShowCreateTableCommand path (used for views, Hive tables, and v1 tables with useV1Command=true) operates on CatalogTable.bucketSpec directly and is unaffected.

A few low-priority observations inline / below; the core change LGTM.

The PR description and the new test focus on the multi-bucket case, but the change also alters output for v2 tables with a single bucket transform — the existing SPARK-33898: show create table[multi-partition] test had to be updated to expect PARTITIONED BY (a, bucket(16, b), ...) instead of separate CLUSTERED BY (b) INTO 16 BUCKETS. Worth one line in the description so users picking up the changelog know the single-bucket v2 case is also affected — the new format is more correct for v2 since v2 CREATE TABLE supports bucket(...) directly in PARTITIONED BY, but it is still a user-visible change.

Comment on lines +103 to +105

if (bucketSpec.nonEmpty) {
  throw QueryExecutionErrors.unsupportedMultipleBucketTransformsError()
}
Contributor


The defensive bucketSpec.nonEmpty throw can't fire on the V1Table arm: V1Table.partitioning constructs partitioning from v1Table.bucketSpec.foreach { spec => partitions += spec.asTransform }, so a V1Table is guaranteed to surface at most one BucketTransform. The pattern is borrowed from CatalogV2Implicits.convertTransforms where the input is user-supplied transforms and the guard is a real check, but here it's unreachable. Either drop it, or convert to an assert so future readers don't infer the V1Table path can produce multiple bucket transforms.
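The unreachability argument can be made concrete with a toy model (hedged: simplified shapes, not Spark's actual V1Table): because the bucket spec is an Option and Option.foreach appends at most one element, the assembled partitioning can never contain two bucket transforms.

```scala
case class BucketSpec(numBuckets: Int, col: String)

// Toy model of how V1Table.partitioning is assembled: partition columns first,
// then at most one bucket transform via Option.foreach.
def v1Partitioning(partCols: Seq[String], bucketSpec: Option[BucketSpec]): Seq[String] = {
  val parts = scala.collection.mutable.ArrayBuffer.empty[String]
  partCols.foreach(c => parts += c)
  bucketSpec.foreach(s => parts += s"bucket(${s.numBuckets}, ${s.col})")
  parts.toSeq
}
```

By construction the result holds at most one bucket entry, which is why the defensive throw on the V1Table arm can never fire.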

Member Author


replaced it with a require assertion

@cloud-fan
Contributor

Actually, on second thought, this v2 table behavior change can be avoided: we can still use the v1 syntax when there is only one bucket transform.

@pan3793
Member Author

pan3793 commented May 7, 2026

@cloud-fan, it seems we cannot distinguish between partitioned by (bucket(4, user_id), dt) and partitioned by (dt, bucket(4, user_id)) using the v1 syntax.

In a v1 table, the bucket always sits under the leaf partition in the physical layout, but this is not true for v2 tables.
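The ambiguity can be demonstrated with a toy renderer (simplified types, not Spark's): the v1 syntax hoists every bucket into a trailing CLUSTERED BY clause, so two partitionings that differ only in where the bucket sits collapse to the same string.

```scala
sealed trait Part
case class Col(name: String) extends Part
case class Bucket(n: Int, col: String) extends Part

// v1-style rendering: the bucket is always pulled out into a trailing
// CLUSTERED BY clause, so its position among the partition columns is lost.
def renderV1(parts: Seq[Part]): String = {
  val cols    = parts.collect { case Col(c) => c }
  val buckets = parts.collect { case b: Bucket => b }
  val clustered =
    buckets.map(b => s"CLUSTERED BY (${b.col}) INTO ${b.n} BUCKETS").mkString(" ")
  s"PARTITIONED BY (${cols.mkString(", ")}) $clustered"
}
```

Both bucket(4, user_id), dt and dt, bucket(4, user_id) render identically, so a round trip through SHOW CREATE TABLE could not recreate the original v2 partition ordering.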
