[SPARK-56726][CONNECT] Add Dataset.getNumPartitions to Spark Connect client #55689
Open

andreAmorimF wants to merge 12 commits into apache:master
What changes were proposed in this pull request?
Add `Dataset.getNumPartitions: Int` to the Spark Connect client (Scala and Python), matching the semantics of `df.rdd.getNumPartitions` in classic Spark.

In classic Spark, users call `df.rdd.getNumPartitions` to get the number of partitions a DataFrame will produce during execution. Spark Connect's client-side `Dataset` does not expose `.rdd`, so this operation was unavailable. This PR adds `Dataset.getNumPartitions` directly on the Connect client `Dataset`.

Design choices:

- Name: `getNumPartitions`, matching the classic RDD API to aid migration.
- Placement: directly on `Dataset`, called as `df.getNumPartitions`.
- Server side: `executedPlan.execute().getNumPartitions`. This triggers physical planning (file splits, bucket assignments, etc.) and RDD partition construction, but no data scan; it is the same work classic Spark does for `rdd.getNumPartitions`.
- Protocol: a new `AnalyzePlan` case (field 19 in the request, field 17 in the response), consistent with `isLocal`, `isStreaming`, and `inputFiles`.
- Rejected alternative: `outputPartitioning.numPartitions` would be lighter, but the default `SparkPlan.outputPartitioning` returns `UnknownPartitioning(0)` for any operator that does not override it (including `ExistingRDD`, `Expand`, full-outer `SortMergeJoin`, and all third-party operators), so it would silently return 0 in those cases.

Why are the changes needed?
`df.rdd.getNumPartitions` is a commonly used operation in classic Spark for understanding physical partitioning without triggering a data scan. Spark Connect clients have no equivalent, which makes it harder to migrate workloads from classic Spark to Spark Connect.

Does this PR introduce any user-facing change?
Yes. It adds a new method `getNumPartitions(): Int` on `Dataset` in both the Scala and Python Spark Connect clients.

How was this patch tested?
- `SparkConnectServiceSuite` ("Test schema in analyze response"): validates the full server-side handler path using an embedded SparkSession: `repartition(4).getNumPartitions === 4`.
- `ClientE2ETestSuite` ("Dataset inspection"): `df.repartition(4).getNumPartitions === 4` and `df.coalesce(1).getNumPartitions === 1`. These exercise the full client → gRPC → server → response round-trip.
- Python: added `test_get_num_partitions` to `test_connect_basic.py`.

Was this patch authored or co-authored using generative AI tooling?
Yes, co-authored with Claude (Anthropic).
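Appendix: the rejected `outputPartitioning.numPartitions` alternative discussed above can be illustrated with a small pure-Python model. This is a hypothetical sketch, not Spark code; the class and method names are stand-ins for `SparkPlan`'s default `outputPartitioning` versus counting the partitions of the executed RDD, which is what this PR does.

```python
# Pure-Python model of the design trade-off; these are NOT real Spark classes.

class UnknownPartitioning:
    """Stand-in for Spark's UnknownPartitioning(0), the default report."""
    numPartitions = 0


class PlanNode:
    """Stand-in for a physical operator that does NOT override outputPartitioning."""

    def __init__(self, partitions):
        # The partitions the operator would actually produce when executed.
        self._partitions = partitions

    @property
    def outputPartitioning(self):
        # Models the SparkPlan default: operators that don't override this
        # report UnknownPartitioning(0), regardless of the real partition count.
        return UnknownPartitioning()

    def execute_num_partitions(self):
        # Models executedPlan.execute().getNumPartitions: counts the partitions
        # of the constructed RDD without scanning any data inside them.
        return len(self._partitions)


# An operator that really produces 4 partitions of 3 rows each.
plan = PlanNode(partitions=[list(range(3)) for _ in range(4)])

print(plan.outputPartitioning.numPartitions)  # 0: the silent-zero pitfall
print(plan.execute_num_partitions())          # 4: the count this PR returns
```

The model shows why the lighter metadata-only path was rejected: any operator relying on the default would report 0, while counting the executed plan's RDD partitions is always accurate at the cost of physical planning.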