[SPARK-56755][SQL] Fix SHOW CREATE TABLE for v2 table partitioned by bucket transform#55718
[SPARK-56755][SQL] Fix SHOW CREATE TABLE for v2 table partitioned by bucket transform#55718pan3793 wants to merge 2 commits intoapache:masterfrom
Conversation
cloud-fan
left a comment
There was a problem hiding this comment.
Summary
Prior state and problem. ShowCreateTableExec.showTablePartitioning extracts every BucketTransform from a v2 table's partitioning into a bucketSpec variable in order to render it as the v1-style CLUSTERED BY (col) [SORTED BY (col)] INTO N BUCKETS clause. The variable is a single Option[BucketSpec], so when a v2 table has multiple bucket transforms (legal in Iceberg, etc.), each iteration overwrites the previous one and the SHOW output silently keeps only the last bucket — the example in the PR description shows an Iceberg table with bucket(4, user_id), bucket(2, item_id), dt rendering as PARTITIONED BY (dt) CLUSTERED BY (item_id) INTO 2 BUCKETS.
The CLUSTERED BY / INTO BUCKETS syntax is fundamentally a v1 (Hive) artifact: the v2 CREATE TABLE grammar already supports bucket(N, col) directly inside PARTITIONED BY, and V1Table.partitioning is the only Table type whose partitioning is guaranteed to carry at most one BucketTransform (it's built from a single Option[BucketSpec]).
Design approach. Restrict the v1-style extraction to V1Table via a pattern guard; let v2 tables fall through to the generic t.describe() branch so each BucketTransform is rendered inline as bucket(N, col) in PARTITIONED BY. This preserves existing output for v1 datasource tables that reach this exec (non-Hive v1 with spark.sql.legacy.useV1Command=false) and fixes the multi-bucket v2 case. As a side effect, single-bucket v2 tables also switch from CLUSTERED BY (b) INTO N BUCKETS to inline bucket(N, b) — the existing [multi-partition] test was updated to assert this.
Implementation sketch. One conditional in showTablePartitioning plus a defensive bucketSpec.nonEmpty throw mirroring CatalogV2Implicits.convertTransforms. The v1 ShowCreateTableCommand path (used for views, Hive tables, and v1 tables with useV1Command=true) operates on CatalogTable.bucketSpec directly and is unaffected.
A few low-priority observations inline / below; the core change LGTM.
The PR description and the new test focus on the multi-bucket case, but the change also alters output for v2 tables with a single bucket transform — the existing SPARK-33898: show create table[multi-partition] test had to be updated to expect PARTITIONED BY (a, bucket(16, b), ...) instead of separate CLUSTERED BY (b) INTO 16 BUCKETS. Worth one line in the description so users picking up the changelog know the single-bucket v2 case is also affected — the new format is more correct for v2 since v2 CREATE TABLE supports bucket(...) directly in PARTITIONED BY, but it is still a user-visible change.
| if (bucketSpec.nonEmpty) { | ||
| throw QueryExecutionErrors.unsupportedMultipleBucketTransformsError() | ||
| } |
There was a problem hiding this comment.
The defensive bucketSpec.nonEmpty throw can't fire on the V1Table arm: V1Table.partitioning constructs partitioning from v1Table.bucketSpec.foreach { spec => partitions += spec.asTransform }, so a V1Table is guaranteed to surface at most one BucketTransform. The pattern is borrowed from CatalogV2Implicits.convertTransforms where the input is user-supplied transforms and the guard is a real check, but here it's unreachable. Either drop it, or convert to an assert so future readers don't infer the V1Table path can produce multiple bucket transforms.
There was a problem hiding this comment.
replaced it with require assertion
|
Actually after a second thought, this v2 table behavior change can be avoided, we can still use the v1 syntax if there is only one bucket transform. |
|
@cloud-fan, seems we can not distinguish between in v1 table, the bucket is always under the leaf partition in the physical layout, but this is not true for v2 table. |
What changes were proposed in this pull request?
In
ShowCreateTableExec, transformBucketTransformtoCLUSTERED BY ... [SORTED BY ...] INTO n BUCKETSonly for v1 tables. For v2 tables, treatBucketTransformas a normal transform, preserved inPARTITIONED BY ...clause.Why are the changes needed?
BucketTransformis a specific case for v1 table, and it is restricted to have no more than one bucket transform. While such restrictions do not apply to v2 table, for example,SHOW CREATE TABLEoutput is incorrect and misleading for an iceberg table that is partitioned by two bucket transforms.Does this PR introduce any user-facing change?
Yes,
SHOW CREATE TABLE ...correctly displays thePARTITIONED BYclause for v2 table that has multi bucket partition transforms.How was this patch tested?
New UT.
Was this patch authored or co-authored using generative AI tooling?
No.