
Conversation

@MaxGekk MaxGekk commented Jul 13, 2025

What changes were proposed in this pull request?

In this PR, I propose to refactor internal operations over Catalyst's data types and introduce a new Type framework as a set of interfaces. Every data type should have a companion Ops object which implements type-specific interfaces. This PR adds only one ops object, for the TimeType data type, which implements the following interfaces (a rough sketch of such an object is shown after the list):

  1. PhyTypeOps - operations over the underlying physical type. For example, the physical type of TimeType is Long.
  2. LiteralTypeOps - operations over type values as literals in SQL/Java.
  3. ExternalTypeOps - conversions to/from external types. In the case of TimeType, the ops object implements conversions to/from java.time.LocalTime.
  4. FormatTypeOps - formatting of type values as strings.
  5. EncodeTypeOps - serialization of row values from/to specific objects.
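
A minimal, self-contained sketch (not code from this PR) of how such a companion Ops object could compose the interfaces above; the trait members and the nanoseconds-of-day physical encoding are illustrative assumptions:

import java.time.LocalTime

// Illustrative slices of the proposed interfaces; the real definitions live in the PR.
trait PhyTypeOps      { def physicalClass: Class[_] }          // underlying physical type
trait LiteralTypeOps  { def toSQLLiteral(value: Any): String } // SQL literal rendering
trait ExternalTypeOps { def toExternal(value: Any): Any }      // conversion to an external type
trait FormatTypeOps   { def format(value: Any): String }       // string formatting

// Hypothetical companion ops object for TimeType, assuming the physical value
// is a Long count of nanoseconds since midnight.
case class TimeTypeOpsSketch(precision: Int)
    extends PhyTypeOps with LiteralTypeOps with ExternalTypeOps with FormatTypeOps {
  override def physicalClass: Class[_] = classOf[Long]
  override def toExternal(value: Any): Any =
    LocalTime.ofNanoOfDay(value.asInstanceOf[Long])
  override def format(value: Any): String = toExternal(value).toString
  override def toSQLLiteral(value: Any): String = s"TIME '${format(value)}'"
}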

Client and server-side Ops objects

Some interfaces are useful on the client side of Spark Connect and can be implemented only inside the spark-api package because the implementation requires internal functions/classes of that package. As a consequence, the Ops objects are split into two: a client-side Ops object called ApiOps (after the name of the package) and an Ops object on the server side. For example, the TimeType data type has two companion Ops: the TimeTypeApiOps class in spark-api and the case class TimeTypeOps.
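
A rough sketch of the split (names with the Sketch suffix are hypothetical; only TimeTypeApiOps and TimeTypeOps are names from the PR), assuming the client-side object carries only API-level operations while the server-side object adds the internal conversions:

import java.time.LocalTime

// Client side (spark-api): operations expressible with public API-level classes only.
class TimeTypeApiOpsSketch(precision: Int) {
  def typeName: String = s"time($precision)"
}

// Server side (Catalyst): operations that need internal classes, e.g. conversions
// between the external java.time.LocalTime and the internal physical Long.
case class TimeTypeServerOpsSketch(precision: Int) {
  def toExternal(nanosOfDay: Long): LocalTime = LocalTime.ofNanoOfDay(nanosOfDay)
  def toInternal(time: LocalTime): Long = time.toNanoOfDay
}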

Why are the changes needed?

In fact, Catalyst's data types are handled in an ad-hoc way, and the processing logic is distributed across the entire code base of Spark SQL. According to rough estimates, there are more than 100 places where a new data type needs to be handled. For example, DayTimeIntervalType is matched in:

$ find . -name "*.scala" -print0|xargs -0 grep case|grep '=>'|grep DayTimeIntervalType|grep -v test|wc -l
     133

Here is one of the examples (see the link):

  def default(dataType: DataType): Literal = dataType match {
...
    case DateType => create(0, DateType)
    case TimestampType => create(0L, TimestampType)

Such an approach is error prone because there is a high chance of missing the handling of a particular data type. Such a mistake can be found only at runtime via a user-facing error, and the compiler cannot help in these cases.

  • The type framework encapsulates data-type-specific operations in just a couple of ops classes, and allows Spark devs to focus on their implementation rather than on all the call sites that handle the type (see the sketch after this list).
  • In the future, Spark SQL users will be able to build their own data types using SQL. This type framework could be considered the foundation of that feature.
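
A toy, self-contained illustration of the idea (all names here are invented for the example, not Spark code): the per-type match collapses into a single capability check backed by one registry, so adding a type means registering one ops object instead of touching every call site.

sealed trait ToyType
case object ToyDate extends ToyType
case object ToyTime extends ToyType

trait DefaultValueOps { def defaultValue: Any }

object DefaultValueOps {
  // One registry instead of extending a match expression at every call site.
  private val registry: Map[ToyType, DefaultValueOps] = Map(
    ToyTime -> new DefaultValueOps { def defaultValue: Any = 0L })
  def supports(t: ToyType): Boolean = registry.contains(t)
  def apply(t: ToyType): DefaultValueOps = registry(t)
}

def default(t: ToyType): Any = t match {
  case dt if DefaultValueOps.supports(dt) => DefaultValueOps(dt).defaultValue
  case ToyDate => 0 // legacy ad-hoc handling for types not migrated yet
}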

Does this PR introduce any user-facing change?

No. This is just refactoring.

How was this patch tested?

By running the affected test suites:

$ build/sbt "test:testOnly *RowEncoderSuite"
$ build/sbt "test:testOnly *CatalystTypeConvertersSuite"
$ build/sbt "test:testOnly *HiveResultSuite"

Was this patch authored or co-authored using generative AI tooling?

No.

@github-actions github-actions bot added the SQL label Jul 13, 2025

MaxGekk commented Jul 14, 2025

@milastdbx @mkaravel @uros-db Please, have a look at this prototype.


def dataTypeJavaClass(dt: DataType): Class[_] = {
  dt match {
    case _ if PhyTypeOps.supports(dt) => PhyTypeOps(dt).getJavaClass
@MaxGekk MaxGekk (Member Author) Jul 14, 2025

In the future, we will get an Ops object here and match on it instead of calling PhyTypeOps.supports. The current implementation is just a workaround to avoid passing TypeOps instead of DataType.
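
A hedged sketch of that end state, assuming the caller already holds a TypeOps instance (the trait shapes below are assumptions; only the PhyTypeOps and getJavaClass names come from the diff above):

trait TypeOps
trait PhyTypeOps extends TypeOps { def getJavaClass: Class[_] }

def dataTypeJavaClass(ops: TypeOps): Class[_] = ops match {
  case phy: PhyTypeOps => phy.getJavaClass
  case _ => classOf[AnyRef] // the real code would handle the remaining built-in types here
}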


MaxGekk commented Jul 14, 2025

@cloud-fan Please, take a look at this prototype.

Comment on lines 30 to 33
def supports(dt: DataType): Boolean = dt match {
  case _: TimeType => true
  case _ => false
}
Contributor

This looks like duplicate code compared to PhyTypeOps, i.e. something that we wanted to avoid in the first place (adding TimeType to a bunch of matches).

Member Author

This looks like duplicate code, compared to PhyTypeOps

This check in LiteralTypeOps is not a duplicate of the code in PhyTypeOps, because the code in LiteralTypeOps checks that a DataType supports the LiteralTypeOps interface, while the same-looking code in PhyTypeOps checks another interface. I think when we support more DataTypes besides TimeType, we can replace this pattern matching with a set-based check like:

private val supportedDataTypes = Set(AnyTimeType, AnsiIntervalType)
def supports(dt: DataType): Boolean = supportedDataTypes.contains(dt)

that we wanted to avoid in the first place (adding TimeType to a bunch of matches).

Yep, this is the goal. Ideally we should propagate TypeOps objects instead of DataType, but for now that requires tons of changes. The current approach is a temporary workaround.

Comment on lines 36 to 39
def supports(dt: DataType): Boolean = dt match {
  case _: TimeType => true
  case _ => false
}
Contributor

Even this is somewhat of a duplicate, given that we already kind of match TimeType as part of TypeOps. The ideal scenario would be to do this logic only once, without the need to add TimeType to several matches.

Member Author

The ideal scenario would be to deal only with TypeOps objects rather than with DataType. In that case, you can match on the required interface and use its methods, e.g.:
create a TypeOps from a DataType -> propagate it everywhere -> match it at the end:

def handle(ops: TypeOps, value: Any): Any = ops match {
  case hashOps: HashTypeOps => hashOps.sha256(value)
  case _ => throw new UnsupportedOperationException(s"unsupported type ops: $ops")
}

@uros-db uros-db (Contributor) left a comment

Do we have a general idea of how much stuff is actually covered here? Seems to me like this is just a very thin slice (i.e. 5-10%) of all the stuff that needs to be done when adding a new type from scratch. I do think that this is a good general direction, but I am just wondering how reflective this prototype is of the real-world thing. @MaxGekk you have more context in this sense, could you provide a better estimate?


MaxGekk commented Jul 17, 2025

Do we have a general idea of how much stuff is actually covered here?

@uros-db The purpose of the prototype is not to cover much, but to show that the proposed approach is applicable to Spark's code base. This prototype covers <1%, I think.


uros-db commented Jul 17, 2025

This prototype covers <1%, I think.

But there are many things that this approach can never cover, e.g. casting rules, type coercion, function support, specific operations related to a particular type (think collated strings for example), etc. So let's exclude this from the equation.

What I'm trying to make sense of here is the following: does this kind of approach actually make a large impact on new data type introduction, or does it just complicate the codebase with only minor advantages? To quantify this decision better, it may be useful to have a reasonable table of approximations, such as:

  • total time to implement a new data type: 30 weeks of work
  • implementing stuff that's out of scope for this prototype (casting, coercion, functions): 20 weeks of work
  • implementing stuff that's in scope for this prototype: 10 weeks of work
  • total time to implement the actual, full TypeOps framework approach and apply it to TIME: 6 weeks of work
  • total time to apply the new approach to new data types in the future: 2 weeks of work
  • total time saved in the future per each new added data type: 8 weeks of work

The numbers here are blind guesses, please provide a better approximation if you have one. I'm just trying to point to the bigger picture here.

@MaxGekk MaxGekk changed the title [WIP][SQL] Incapsulate type operations [WIP][SPARK-53504][SQL] Type framework Sep 5, 2025

MaxGekk commented Sep 8, 2025

@cloud-fan Could you take a look at this PR, please? I added and implemented a few more interfaces.

import org.apache.spark.sql.types.TimeType

class TimeTypeApiOps(t: TimeType)
extends TypeApiOps
Member

nit. indentation?

Member Author

@dongjoon-hyun I have to violate the coding style here because a GitHub Actions check fails otherwise and requires the formatting produced by:

$ ./build/mvn scalafmt:format -Dscalafmt.skip=false -Dscalafmt.validateOnly=false -Dscalafmt.changedOnly=false -pl sql/api -pl sql/connect/common -pl sql/connect/server -pl sql/connect/shims -pl sql/connect/client/jvm

def apply(dt: DataType): FormatTypeOps = TypeApiOps(dt).asInstanceOf[FormatTypeOps]
}


Member

nit. extra empty lines.

@MaxGekk MaxGekk changed the title [WIP][SPARK-53504][SQL] Type framework [SPARK-53504][SQL] Type framework Sep 20, 2025
@MaxGekk MaxGekk marked this pull request as ready for review September 20, 2025 09:29

MaxGekk commented Sep 20, 2025

@dongjoon-hyun @holdenk @cloud-fan Could you review this PR, please.

@MaxGekk MaxGekk requested a review from cloud-fan December 22, 2025 14:34

MaxGekk commented Dec 23, 2025

@dongjoon-hyun May I ask you to review this PR, please.

@davidm-db (Contributor)

Can we close this PR?

I started working on this effort and updated the main work item with a more thorough explanation and a design doc. I've also created sub-tasks for the different implementation phases, and I've started on the initial phase in #54223.
