
[SPARK-56755][SQL] Fix SHOW CREATE TABLE for v2 table partitioned by bucket transform #55718

Open
pan3793 wants to merge 2 commits into apache:master from pan3793:SPARK-56755

Conversation

@pan3793
Member

@pan3793 pan3793 commented May 6, 2026

What changes were proposed in this pull request?

In ShowCreateTableExec, convert BucketTransform to CLUSTERED BY ... [SORTED BY ...] INTO n BUCKETS only for v1 tables. For v2 tables, treat BucketTransform as a normal transform and preserve it in the PARTITIONED BY ... clause.
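The proposed dispatch can be sketched as follows. This is a hedged toy model: Transform, BucketTransform, V1Table, etc. here are simplified stand-ins, not Spark's actual connector classes, and the rendering logic is condensed for illustration.

```scala
// Simplified stand-ins for Spark's connector types (names borrowed, shapes simplified).
sealed trait Transform
case class IdentityTransform(col: String) extends Transform
case class BucketTransform(numBuckets: Int, col: String) extends Transform

sealed trait Table { def partitioning: Seq[Transform] }
case class V1Table(partitioning: Seq[Transform]) extends Table
case class V2Table(partitioning: Seq[Transform]) extends Table

def describe(t: Transform): String = t match {
  case IdentityTransform(c)  => c
  case BucketTransform(n, c) => s"bucket($n, $c)"
}

// V1Table: hoist the (at most one) bucket transform into the v1 CLUSTERED BY clause.
// V2 tables: render every transform inline inside PARTITIONED BY.
def showPartitioning(table: Table): String = table match {
  case V1Table(parts) =>
    val (buckets, others) = parts.partition(_.isInstanceOf[BucketTransform])
    require(buckets.size <= 1, "V1Table can carry at most one bucket transform")
    val partClause =
      if (others.nonEmpty) s"PARTITIONED BY (${others.map(describe).mkString(", ")})" else ""
    val bucketClause = buckets.headOption.map {
      case BucketTransform(n, c) => s"CLUSTERED BY ($c) INTO $n BUCKETS"
      case _                     => ""
    }.getOrElse("")
    Seq(partClause, bucketClause).filter(_.nonEmpty).mkString("\n")
  case V2Table(parts) =>
    s"PARTITIONED BY (${parts.map(describe).mkString(", ")})"
}
```

With this shape, a v2 table partitioned by bucket(4, user_id), bucket(2, item_id), dt renders all three transforms inline, while the v1 single-bucket rendering is unchanged.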

Why are the changes needed?

BucketTransform is a v1-specific case: a v1 table is restricted to at most one bucket transform. No such restriction applies to v2 tables, so the SHOW CREATE TABLE output is incorrect and misleading for, e.g., an Iceberg table partitioned by two bucket transforms:

spark-sql (default)> create table t1(id int, user_id int, item_id int, dt string) using iceberg partitioned by (bucket(4, user_id), bucket(2, item_id), dt);
Time taken: 1.397 seconds
spark-sql (default)> show create table t1;
CREATE TABLE spark_catalog.default.t1 (
  id INT,
  user_id INT,
  item_id INT,
  dt STRING COLLATE UTF8_BINARY)
USING iceberg
PARTITIONED BY (dt)
CLUSTERED BY (item_id)
INTO 2 BUCKETS
LOCATION 'hdfs://hadoop-master1.orb.local:8020/warehouse/t1'
TBLPROPERTIES (
  'current-snapshot-id' = 'none',
  'format' = 'iceberg/parquet',
  'format-version' = '2',
  'write.parquet.compression-codec' = 'zstd')

Time taken: 0.253 seconds, Fetched 1 row(s)

Does this PR introduce any user-facing change?

Yes, SHOW CREATE TABLE ... now correctly displays the PARTITIONED BY clause for v2 tables that have multiple bucket partition transforms.

How was this patch tested?

New UT.

Was this patch authored or co-authored using generative AI tooling?

No.

@pan3793
Member Author

pan3793 commented May 7, 2026

cc @peter-toth @cloud-fan

Contributor

@cloud-fan cloud-fan left a comment


Summary

Prior state and problem. ShowCreateTableExec.showTablePartitioning extracts every BucketTransform from a v2 table's partitioning into a bucketSpec variable in order to render it as the v1-style CLUSTERED BY (col) [SORTED BY (col)] INTO N BUCKETS clause. The variable is a single Option[BucketSpec], so when a v2 table has multiple bucket transforms (legal in Iceberg, etc.), each iteration overwrites the previous one and the SHOW output silently keeps only the last bucket — the example in the PR description shows an Iceberg table with bucket(4, user_id), bucket(2, item_id), dt rendering as PARTITIONED BY (dt) CLUSTERED BY (item_id) INTO 2 BUCKETS.
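The overwrite can be reproduced with a toy fold (a simplified model, not Spark's actual code): because the accumulator is a single Option, each bucket transform replaces the previous one and only the last survives.

```scala
case class BucketSpec(numBuckets: Int, col: String)

// Toy model of the old loop in showTablePartitioning: a single Option[BucketSpec]
// that every bucket transform overwrites, so only the last bucket is kept.
def collectBucketSpec(buckets: Seq[(Int, String)]): Option[BucketSpec] = {
  var spec: Option[BucketSpec] = None
  buckets.foreach { case (n, c) => spec = Some(BucketSpec(n, c)) }
  spec
}
```

Feeding it the PR's example, bucket(4, user_id) then bucket(2, item_id), yields only the item_id spec, matching the CLUSTERED BY (item_id) INTO 2 BUCKETS output shown above.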

The CLUSTERED BY / INTO BUCKETS syntax is fundamentally a v1 (Hive) artifact: the v2 CREATE TABLE grammar already supports bucket(N, col) directly inside PARTITIONED BY, and V1Table.partitioning is the only Table type whose partitioning is guaranteed to carry at most one BucketTransform (it's built from a single Option[BucketSpec]).

Design approach. Restrict the v1-style extraction to V1Table via a pattern guard; let v2 tables fall through to the generic t.describe() branch so each BucketTransform is rendered inline as bucket(N, col) in PARTITIONED BY. This preserves existing output for v1 datasource tables that reach this exec (non-Hive v1 with spark.sql.legacy.useV1Command=false) and fixes the multi-bucket v2 case. As a side effect, single-bucket v2 tables also switch from CLUSTERED BY (b) INTO N BUCKETS to inline bucket(N, b) — the existing [multi-partition] test was updated to assert this.

Implementation sketch. One conditional in showTablePartitioning plus a defensive bucketSpec.nonEmpty throw mirroring CatalogV2Implicits.convertTransforms. The v1 ShowCreateTableCommand path (used for views, Hive tables, and v1 tables with useV1Command=true) operates on CatalogTable.bucketSpec directly and is unaffected.

A few low-priority observations inline / below; the core change LGTM.

The PR description and the new test focus on the multi-bucket case, but the change also alters output for v2 tables with a single bucket transform — the existing SPARK-33898: show create table[multi-partition] test had to be updated to expect PARTITIONED BY (a, bucket(16, b), ...) instead of separate CLUSTERED BY (b) INTO 16 BUCKETS. Worth one line in the description so users picking up the changelog know the single-bucket v2 case is also affected — the new format is more correct for v2 since v2 CREATE TABLE supports bucket(...) directly in PARTITIONED BY, but it is still a user-visible change.

Comment on lines +103 to +105

if (bucketSpec.nonEmpty) {
  throw QueryExecutionErrors.unsupportedMultipleBucketTransformsError()
}
Contributor


The defensive bucketSpec.nonEmpty throw can't fire on the V1Table arm: V1Table.partitioning constructs partitioning from v1Table.bucketSpec.foreach { spec => partitions += spec.asTransform }, so a V1Table is guaranteed to surface at most one BucketTransform. The pattern is borrowed from CatalogV2Implicits.convertTransforms where the input is user-supplied transforms and the guard is a real check, but here it's unreachable. Either drop it, or convert to an assert so future readers don't infer the V1Table path can produce multiple bucket transforms.
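The unreachability argument can be made concrete with a toy model (hedged: simplified shapes, not Spark's actual V1Table): because the bucket spec is an Option and Option.foreach appends at most one element, the assembled partitioning can never contain two bucket transforms.

```scala
case class BucketSpec(numBuckets: Int, col: String)

// Toy model of how V1Table.partitioning is assembled: partition columns first,
// then at most one bucket transform via Option.foreach.
def v1Partitioning(partCols: Seq[String], bucketSpec: Option[BucketSpec]): Seq[String] = {
  val parts = scala.collection.mutable.ArrayBuffer.empty[String]
  partCols.foreach(c => parts += c)
  bucketSpec.foreach(s => parts += s"bucket(${s.numBuckets}, ${s.col})")
  parts.toSeq
}
```

By construction the result holds at most one bucket entry, which is why the defensive throw on the V1Table arm can never fire.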

Member Author


replaced it with a require assertion

@cloud-fan
Contributor

Actually, on second thought, this v2 table behavior change can be avoided: we can still use the v1 syntax when there is only one bucket transform.

@pan3793
Member Author

pan3793 commented May 7, 2026

@cloud-fan, it seems we cannot distinguish between partitioned by (bucket(4, user_id), dt) and partitioned by (dt, bucket(4, user_id)) using the v1 syntax.

In a v1 table, the bucket always sits under the leaf partition in the physical layout, but this is not true for v2 tables.
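The ambiguity can be demonstrated with a toy renderer (simplified types, not Spark's): the v1 syntax hoists every bucket into a trailing CLUSTERED BY clause, so two partitionings that differ only in where the bucket sits collapse to the same string.

```scala
sealed trait Part
case class Col(name: String) extends Part
case class Bucket(n: Int, col: String) extends Part

// v1-style rendering: the bucket is always pulled out into a trailing
// CLUSTERED BY clause, so its position among the partition columns is lost.
def renderV1(parts: Seq[Part]): String = {
  val cols    = parts.collect { case Col(c) => c }
  val buckets = parts.collect { case b: Bucket => b }
  val clustered =
    buckets.map(b => s"CLUSTERED BY (${b.col}) INTO ${b.n} BUCKETS").mkString(" ")
  s"PARTITIONED BY (${cols.mkString(", ")}) $clustered"
}
```

Both bucket(4, user_id), dt and dt, bucket(4, user_id) render identically, so a round trip through SHOW CREATE TABLE could not recreate the original v2 partition ordering.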
