[SPARK-56726][CONNECT] Add Dataset.getNumPartitions to Spark Connect client #55689

Open
andreAmorimF wants to merge 12 commits into apache:master from andreAmorimF:feature/spark-connect-get-num-partitions

@andreAmorimF

What changes were proposed in this pull request?

Add Dataset.getNumPartitions: Int to the Spark Connect client (Scala and Python), matching the semantics of df.rdd.getNumPartitions in classic Spark.

In classic Spark users call df.rdd.getNumPartitions to get the number of partitions a DataFrame will produce during execution. Spark Connect's client-side Dataset does not expose .rdd, so this operation was unavailable. This PR adds Dataset.getNumPartitions directly on the Connect client Dataset.
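For code that has to run against both classic Spark and Spark Connect during a migration, a small compatibility helper is one way to use the proposed method. The sketch below is an illustration only and is not part of this PR: `num_partitions` is a hypothetical helper name, and it assumes the Python method is exposed as `df.getNumPartitions()` as proposed here. It is duck-typed, so it needs no pyspark import.

```python
def num_partitions(df):
    """Return the partition count of a DataFrame-like object.

    Hypothetical helper (not part of this PR). Prefers the proposed
    DataFrame-level ``getNumPartitions`` and falls back to the classic
    ``df.rdd.getNumPartitions()`` path when the method is absent.
    """
    direct = getattr(df, "getNumPartitions", None)
    if callable(direct):
        return direct()
    # Classic Spark: go through the underlying RDD.
    return df.rdd.getNumPartitions()
```

On a Connect client without the new method, the fallback still fails because `.rdd` is unavailable there, which is the gap this PR addresses.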

Design choices:

  • Method name: getNumPartitions — matches classic RDD API, aids migration
  • Placement: directly on Dataset, called as df.getNumPartitions
  • Server implementation: executedPlan.execute().getNumPartitions — triggers physical planning (file splits, bucket assignments, etc.) and RDD partition construction, but no data scan. This is the same work classic Spark does for rdd.getNumPartitions
  • Protocol: new AnalyzePlan case (field 19 in request, 17 in response) — consistent with isLocal, isStreaming, inputFiles
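The "physical planning but no data scan" distinction in the server implementation bullet can be illustrated with a toy model. This is not Spark code: `ToyFileScan`, its fields, and the fixed-size splitting rule are invented for illustration, and real split planning is more involved. The sketch only shows that a partition count can be derived from file metadata while the read path is never invoked.

```python
import math
from dataclasses import dataclass


@dataclass
class ToyFileScan:
    """Toy stand-in for a file scan: sizes are metadata, read_rows is the scan."""

    file_sizes: list       # bytes per file, known from a directory listing
    max_split_bytes: int   # hypothetical fixed split size
    rows_read: int = 0     # counts work done by the (simulated) scan

    def num_partitions(self) -> int:
        # Planning only: one partition per split, computed from sizes.
        return sum(math.ceil(size / self.max_split_bytes) for size in self.file_sizes)

    def read_rows(self) -> int:
        # The expensive step; never called by num_partitions().
        self.rows_read += sum(self.file_sizes) // 100
        return self.rows_read


scan = ToyFileScan(file_sizes=[250, 100, 30], max_split_bytes=100)
print(scan.num_partitions())  # 3 + 1 + 1 splits
print(scan.rows_read)         # still 0: nothing was scanned
```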

outputPartitioning.numPartitions was considered as a lighter alternative but rejected: the default SparkPlan.outputPartitioning returns UnknownPartitioning(0) for any operator that does not override it (including ExistingRDD, Expand, full-outer SortMergeOuterJoin, and all third-party operators), which would silently return 0 in those cases.
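The failure mode described above can be sketched with a toy model of the default-and-override pattern. The class names below are made up for this explanation, and the code is a simplified Python illustration rather than Spark's implementation: an operator that does not override the default reports 0 through the lightweight path even though execution would produce 4 partitions.

```python
class Operator:
    """Toy physical operator with a default output partitioning."""

    def __init__(self, actual_partitions: int):
        self._actual = actual_partitions

    def output_partitioning_num(self) -> int:
        # Default: partitioning unknown, reported as 0 partitions.
        return 0

    def execute_num_partitions(self) -> int:
        # What building the RDD would report.
        return self._actual


class HashExchange(Operator):
    def output_partitioning_num(self) -> int:
        # Overrides the default, so the lightweight path is accurate.
        return self._actual


class ThirdPartyScan(Operator):
    pass  # does not override the default


for op in (HashExchange(4), ThirdPartyScan(4)):
    print(type(op).__name__, op.output_partitioning_num(), op.execute_num_partitions())
```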

Why are the changes needed?

df.rdd.getNumPartitions is a commonly used operation in classic Spark for understanding physical partitioning without triggering data scanning. Spark Connect clients have no equivalent, making it harder to migrate workloads from classic Spark to Spark Connect.

Does this PR introduce any user-facing change?

Yes — adds a new method getNumPartitions(): Int on Dataset in both the Scala and Python Spark Connect clients.

How was this patch tested?

  • Server unit test: Added assertion to SparkConnectServiceSuite ("Test schema in analyze response") that validates the full server-side handler path using an embedded SparkSession: repartition(4).getNumPartitions === 4.
  • E2E test: Added assertions to ClientE2ETestSuite ("Dataset inspection"): df.repartition(4).getNumPartitions === 4 and df.coalesce(1).getNumPartitions === 1. These exercise the full client → gRPC → server → response round-trip.
  • Python test: Added test_get_num_partitions to test_connect_basic.py.
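As a rough sketch of what these assertions amount to on the Python side: the body of `test_get_num_partitions` is not shown in this description, so the function below is an assumption that mirrors the Scala E2E assertions. `check_num_partitions` is a hypothetical name, and it accepts any DataFrame-like object so it can be exercised without a server.

```python
def check_num_partitions(df):
    """Assumed shape of the check; mirrors the Scala E2E assertions."""
    # An explicit repartition(n) should yield exactly n partitions.
    assert df.repartition(4).getNumPartitions() == 4
    # coalesce(1) should collapse the plan to a single partition.
    assert df.coalesce(1).getNumPartitions() == 1
```

With a Spark Connect session that includes this change, it might be called as `check_num_partitions(spark.range(100))`.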

Was this patch authored or co-authored using generative AI tooling?

Yes, co-authored with Claude (Anthropic).

andreAmorimF changed the title from "[SPARK-XXXXX][CONNECT] Add Dataset.getNumPartitions to Spark Connect client" to "[SPARK-55689][CONNECT] Add Dataset.getNumPartitions to Spark Connect client" on May 5, 2026
andreAmorimF changed the title from "[SPARK-55689][CONNECT] Add Dataset.getNumPartitions to Spark Connect client" to "Add Dataset.getNumPartitions to Spark Connect client" on May 5, 2026
andreAmorimF changed the title from "Add Dataset.getNumPartitions to Spark Connect client" to "[SPARK-XXXXX][CONNECT] Add Dataset.getNumPartitions to Spark Connect client" on May 5, 2026
andreAmorimF marked this pull request as draft on May 5, 2026 17:28
andreAmorimF changed the title from "[SPARK-XXXXX][CONNECT] Add Dataset.getNumPartitions to Spark Connect client" to "[SPARK-56726][CONNECT] Add Dataset.getNumPartitions to Spark Connect client" on May 5, 2026
andreAmorimF force-pushed the feature/spark-connect-get-num-partitions branch from f799171 to 4b8f987 on May 5, 2026 18:30
andreAmorimF marked this pull request as ready for review on May 5, 2026 18:40