
Conversation

@MaxGekk MaxGekk commented Jul 13, 2025

What changes were proposed in this pull request?

In this PR, I propose to refactor internal operations over Catalyst's data types and introduce a new Type framework as a set of interfaces. Every data type should have a companion Ops object which implements type-specific interfaces. This PR adds only one ops object, for the TimeType data type, which implements the following interfaces (a rough sketch of such an object is shown after the list):

  1. PhyTypeOps - operations over the underlying physical type. For example, the physical type of TimeType is Long.
  2. LiteralTypeOps - operations over type values as literals in SQL/Java.
  3. ExternalTypeOps - conversions to/from external types. In the case of TimeType, the ops object implements conversions to/from java.time.LocalTime.
  4. FormatTypeOps - formatting of type values as strings.
  5. EncodeTypeOps - serialization of row values from/to specific objects.
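
A minimal, self-contained sketch (not code from this PR) of how such a companion Ops object could compose the interfaces above; the trait members and the nanoseconds-of-day physical encoding are illustrative assumptions:

import java.time.LocalTime

// Illustrative slices of the proposed interfaces; the real definitions live in the PR.
trait PhyTypeOps      { def physicalClass: Class[_] }          // underlying physical type
trait LiteralTypeOps  { def toSQLLiteral(value: Any): String } // SQL literal rendering
trait ExternalTypeOps { def toExternal(value: Any): Any }      // conversion to an external type
trait FormatTypeOps   { def format(value: Any): String }       // string formatting

// Hypothetical companion ops object for TimeType, assuming the physical value
// is a Long count of nanoseconds since midnight.
case class TimeTypeOpsSketch(precision: Int)
    extends PhyTypeOps with LiteralTypeOps with ExternalTypeOps with FormatTypeOps {
  override def physicalClass: Class[_] = classOf[Long]
  override def toExternal(value: Any): Any =
    LocalTime.ofNanoOfDay(value.asInstanceOf[Long])
  override def format(value: Any): String = toExternal(value).toString
  override def toSQLLiteral(value: Any): String = s"TIME '${format(value)}'"
}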

Client and server-side Ops objects

Some interfaces are useful on the client side of Spark Connect and can be implemented only inside the spark-api package because the implementation requires internal functions/classes of that package. As a consequence, the Ops objects are split into two: a client-side Ops object called ApiOps (after the name of the package) and an Ops object on the server side. For example, the TimeType data type has two companion Ops: the TimeTypeApiOps class in spark-api and the case class TimeTypeOps.
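
A rough sketch of the split (names with the Sketch suffix are hypothetical; only TimeTypeApiOps and TimeTypeOps are names from the PR), assuming the client-side object carries only API-level operations while the server-side object adds the internal conversions:

import java.time.LocalTime

// Client side (spark-api): operations expressible with public API-level classes only.
class TimeTypeApiOpsSketch(precision: Int) {
  def typeName: String = s"time($precision)"
}

// Server side (Catalyst): operations that need internal classes, e.g. conversions
// between the external java.time.LocalTime and the internal physical Long.
case class TimeTypeServerOpsSketch(precision: Int) {
  def toExternal(nanosOfDay: Long): LocalTime = LocalTime.ofNanoOfDay(nanosOfDay)
  def toInternal(time: LocalTime): Long = time.toNanoOfDay
}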

Why are the changes needed?

In fact, Catalyst's data types are handled in an ad-hoc way, and the processing logic is distributed across the entire code base of Spark SQL. According to rough estimates, there are more than 100 places where a new data type needs to be handled. For example, DayTimeIntervalType is matched in:

$ find . -name "*.scala" -print0|xargs -0 grep case|grep '=>'|grep DayTimeIntervalType|grep -v test|wc -l
     133

Here is one of the examples (see the link):

  def default(dataType: DataType): Literal = dataType match {
...
    case DateType => create(0, DateType)
    case TimestampType => create(0L, TimestampType)

Such an approach is error prone because there is a high chance of missing the handling of a particular data type. Such a mistake can be found only at runtime via a user-facing error, and the compiler cannot help in these cases.

  • The type framework encapsulates data-type-specific operations in just a couple of ops classes, and allows Spark devs to focus on their implementation rather than on all the call sites that handle the type (see the sketch after this list).
  • In the future, Spark SQL users will be able to build their own data types using SQL. This type framework could be considered the foundation of that feature.
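
A toy, self-contained illustration of the idea (all names here are invented for the example, not Spark code): the per-type match collapses into a single capability check backed by one registry, so adding a type means registering one ops object instead of touching every call site.

sealed trait ToyType
case object ToyDate extends ToyType
case object ToyTime extends ToyType

trait DefaultValueOps { def defaultValue: Any }

object DefaultValueOps {
  // One registry instead of extending a match expression at every call site.
  private val registry: Map[ToyType, DefaultValueOps] = Map(
    ToyTime -> new DefaultValueOps { def defaultValue: Any = 0L })
  def supports(t: ToyType): Boolean = registry.contains(t)
  def apply(t: ToyType): DefaultValueOps = registry(t)
}

def default(t: ToyType): Any = t match {
  case dt if DefaultValueOps.supports(dt) => DefaultValueOps(dt).defaultValue
  case ToyDate => 0 // legacy ad-hoc handling for types not migrated yet
}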

Does this PR introduce any user-facing change?

No. This is just refactoring.

How was this patch tested?

By running the affected test suites:

$ build/sbt "test:testOnly *RowEncoderSuite"
$ build/sbt "test:testOnly *CatalystTypeConvertersSuite"
$ build/sbt "test:testOnly *HiveResultSuite"

Was this patch authored or co-authored using generative AI tooling?

No.

@github-actions github-actions bot added the SQL label Jul 13, 2025

MaxGekk commented Jul 14, 2025

@milastdbx @mkaravel @uros-db Please, have a look at this prototype.


def dataTypeJavaClass(dt: DataType): Class[_] = {
  dt match {
    case _ if PhyTypeOps.supports(dt) => PhyTypeOps(dt).getJavaClass
@MaxGekk MaxGekk (Member Author) Jul 14, 2025

In the future, we will get an Ops object here and match on it instead of calling PhyTypeOps.supports. The current implementation is just a workaround to avoid passing TypeOps instead of DataType.
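
A hedged sketch of that end state, assuming the caller already holds a TypeOps instance (the trait shapes below are assumptions; only the PhyTypeOps and getJavaClass names come from the diff above):

trait TypeOps
trait PhyTypeOps extends TypeOps { def getJavaClass: Class[_] }

def dataTypeJavaClass(ops: TypeOps): Class[_] = ops match {
  case phy: PhyTypeOps => phy.getJavaClass
  case _ => classOf[AnyRef] // the real code would handle the remaining built-in types here
}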


MaxGekk commented Jul 14, 2025

@cloud-fan Please, take a look at this prototype.

Comment on lines 30 to 33
def supports(dt: DataType): Boolean = dt match {
  case _: TimeType => true
  case _ => false
}
Contributor

This looks like duplicate code compared to PhyTypeOps, i.e. something that we wanted to avoid in the first place (adding TimeType to a bunch of matches).

Member Author

This looks like duplicate code, compared to PhyTypeOps

This check in LiteralTypeOps is not a duplicate of the code in PhyTypeOps, because the code in LiteralTypeOps checks that a DataType supports the LiteralTypeOps interface, while the same-looking code in PhyTypeOps checks another interface. I think when we support more DataTypes besides TimeType, we can replace this pattern matching with a set-based check like:

private val supportedDataTypes = Set(AnyTimeType, AnsiIntervalType)
def supports(dt: DataType): Boolean = supportedDataTypes.contains(dt)

that we wanted to avoid in the first place (adding TimeType to a bunch of matches).

Yep, this is the goal. Ideally we should propagate TypeOps objects instead of DataType, but for now that requires tons of changes. The current approach is a temporary workaround.

Comment on lines 36 to 39
def supports(dt: DataType): Boolean = dt match {
  case _: TimeType => true
  case _ => false
}
Contributor

Even this is somewhat of a duplicate, given that we already kind of match TimeType as part of TypeOps. The ideal scenario would be to do this logic only once, without the need to add TimeType to several matches.

Member Author

The ideal scenario would be to deal only with TypeOps objects rather than with DataType. In that case, you can match on the required interface and use its methods, e.g.:
create a TypeOps from a DataType -> propagate it everywhere -> match it at the end:

def handle(ops: TypeOps, value: Any): Any = ops match {
  case hashOps: HashTypeOps => hashOps.sha256(value)
  case _ => throw new UnsupportedOperationException(s"unsupported type ops: $ops")
}

@uros-db uros-db (Contributor) left a comment

Do we have a general idea of how much stuff is actually covered here? Seems to me like this is just a very thin slice (i.e. 5-10%) of all the stuff that needs to be done when adding a new type from scratch. I do think that this is a good general direction, but I am just wondering how reflective this prototype is of the real-world thing. @MaxGekk you have more context in this sense, could you provide a better estimate?


MaxGekk commented Jul 17, 2025

Do we have a general idea of how much stuff is actually covered here?

@uros-db The purpose of the prototype is not to cover much, but to show that the proposed approach is applicable to Spark's code base. This prototype covers <1%, I think.


uros-db commented Jul 17, 2025

This prototype covers <1%, I think.

But there are many things that this approach can never cover, e.g. casting rules, type coercion, function support, specific operations related to a particular type (think collated strings for example), etc. So let's exclude this from the equation.

What I'm trying to make sense of here is the following: does this kind of approach actually make a large impact on new data type introduction, or does it just complicate the codebase with only minor advantages? To quantify this decision better, it may be useful to have a reasonable table of approximations, such as:

  • total time to implement a new data type: 30 weeks of work
  • implementing stuff that's out of scope for this prototype (casting, coercion, functions): 20 weeks of work
  • implementing stuff that's in scope for this prototype: 10 weeks of work
  • total time to implement the actual, full TypeOps framework approach and apply it to TIME: 6 weeks of work
  • total time to apply the new approach to new data types in the future: 2 weeks of work
  • total time saved in the future per each new added data type: 8 weeks of work

The numbers here are blind guesses, please provide a better approximation if you have one. I'm just trying to point to the bigger picture here.

@MaxGekk MaxGekk changed the title [WIP][SQL] Incapsulate type operations [WIP][SPARK-53504][SQL] Type framework Sep 5, 2025

MaxGekk commented Sep 8, 2025

@cloud-fan Could you take a look at this PR, please? I added and implemented a few more interfaces.

import org.apache.spark.sql.types.TimeType

class TimeTypeApiOps(t: TimeType)
extends TypeApiOps
Member

nit. indentation?

Member Author

@dongjoon-hyun I have to violate the coding style here because a GitHub Actions check fails otherwise and requires the formatting produced by:

$ ./build/mvn scalafmt:format -Dscalafmt.skip=false -Dscalafmt.validateOnly=false -Dscalafmt.changedOnly=false -pl sql/api -pl sql/connect/common -pl sql/connect/server -pl sql/connect/shims -pl sql/connect/client/jvm

def apply(dt: DataType): FormatTypeOps = TypeApiOps(dt).asInstanceOf[FormatTypeOps]
}


Member

nit. extra empty lines.

@MaxGekk MaxGekk changed the title [WIP][SPARK-53504][SQL] Type framework [SPARK-53504][SQL] Type framework Sep 20, 2025
@MaxGekk MaxGekk marked this pull request as ready for review September 20, 2025 09:29

MaxGekk commented Sep 20, 2025

@dongjoon-hyun @holdenk @cloud-fan Could you review this PR, please.

@MaxGekk MaxGekk requested a review from cloud-fan December 22, 2025 14:34

MaxGekk commented Dec 23, 2025

@dongjoon-hyun May I ask you to review this PR, please.

@davidm-db (Contributor)

Can we close this PR?

I started working on this effort and updated the main work item with a more thorough explanation and a design doc. I've also created sub-tasks for the different implementation phases, and I've started on the initial phase in #54223.
