[SPARK-56726][CONNECT] Add Dataset.getNumPartitions to Spark Connect client #55689
Open

andreAmorimF wants to merge 12 commits into apache:master
What changes were proposed in this pull request?
Add `Dataset.getNumPartitions: Int` to the Spark Connect client (Scala and Python), matching the semantics of `df.rdd.getNumPartitions` in classic Spark.

In classic Spark, users call `df.rdd.getNumPartitions` to get the number of partitions a DataFrame will produce during execution. Spark Connect's client-side `Dataset` does not expose `.rdd`, so this operation was unavailable. This PR adds `Dataset.getNumPartitions` directly on the Connect client `Dataset`.

Design choices:

- Name: `getNumPartitions`, matching the classic RDD API to aid migration.
- Placement: directly on `Dataset`, called as `df.getNumPartitions`.
- Server side: `executedPlan.execute().getNumPartitions`. This triggers physical planning (file splits, bucket assignments, etc.) and RDD partition construction, but no data scan; it is the same work classic Spark does for `rdd.getNumPartitions`.
- Protocol: a new `AnalyzePlan` case (field 19 in the request, field 17 in the response), consistent with `isLocal`, `isStreaming`, and `inputFiles`.
- Rejected alternative: `outputPartitioning.numPartitions` would be lighter, but the default `SparkPlan.outputPartitioning` returns `UnknownPartitioning(0)` for any operator that does not override it (including `ExistingRDD`, `Expand`, full-outer `SortMergeJoin`, and all third-party operators), so it would silently return 0 in those cases.

Why are the changes needed?
`df.rdd.getNumPartitions` is a commonly used operation in classic Spark for understanding physical partitioning without triggering a data scan. Spark Connect clients have no equivalent, which makes it harder to migrate workloads from classic Spark to Spark Connect.

Does this PR introduce any user-facing change?
Yes. It adds a new method `getNumPartitions(): Int` on `Dataset` in both the Scala and Python Spark Connect clients.

How was this patch tested?
- `SparkConnectServiceSuite` ("Test schema in analyze response"): validates the full server-side handler path using an embedded SparkSession: `repartition(4).getNumPartitions === 4`.
- `ClientE2ETestSuite` ("Dataset inspection"): `df.repartition(4).getNumPartitions === 4` and `df.coalesce(1).getNumPartitions === 1`. These exercise the full client → gRPC → server → response round-trip.
- Python: added `test_get_num_partitions` to `test_connect_basic.py`.

Was this patch authored or co-authored using generative AI tooling?
Yes, co-authored with Claude (Anthropic).
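Appendix: the rejected `outputPartitioning.numPartitions` alternative discussed above can be illustrated with a small pure-Python model. This is a hypothetical sketch, not Spark code; the class and method names are stand-ins for `SparkPlan`'s default `outputPartitioning` versus counting the partitions of the executed RDD, which is what this PR does.

```python
# Pure-Python model of the design trade-off; these are NOT real Spark classes.

class UnknownPartitioning:
    """Stand-in for Spark's UnknownPartitioning(0), the default report."""
    numPartitions = 0


class PlanNode:
    """Stand-in for a physical operator that does NOT override outputPartitioning."""

    def __init__(self, partitions):
        # The partitions the operator would actually produce when executed.
        self._partitions = partitions

    @property
    def outputPartitioning(self):
        # Models the SparkPlan default: operators that don't override this
        # report UnknownPartitioning(0), regardless of the real partition count.
        return UnknownPartitioning()

    def execute_num_partitions(self):
        # Models executedPlan.execute().getNumPartitions: counts the partitions
        # of the constructed RDD without scanning any data inside them.
        return len(self._partitions)


# An operator that really produces 4 partitions of 3 rows each.
plan = PlanNode(partitions=[list(range(3)) for _ in range(4)])

print(plan.outputPartitioning.numPartitions)  # 0: the silent-zero pitfall
print(plan.execute_num_partitions())          # 4: the count this PR returns
```

The model shows why the lighter metadata-only path was rejected: any operator relying on the default would report 0, while counting the executed plan's RDD partitions is always accurate at the cost of physical planning.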