[SPARK-53504][SQL] Type framework #51467
Conversation
@milastdbx @mkaravel @uros-db Please have a look at this prototype.
```scala
def dataTypeJavaClass(dt: DataType): Class[_] = {
  dt match {
    case _ if PhyTypeOps.supports(dt) => PhyTypeOps(dt).getJavaClass
    // ...
```
In the future, we will get an `Ops` object here and match on it instead of calling `PhyTypeOps.supports`. The current implementation is just a workaround to avoid passing `TypeOps` instead of `DataType`.
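A rough sketch of that future shape, assuming a `TypeOps` factory that resolves the companion object from a `DataType` (the factory is hypothetical; `PhyTypeOps.getJavaClass` comes from this PR's diff):

```scala
// Hypothetical future shape: resolve the Ops object once,
// then match on the interface instead of calling supports().
def dataTypeJavaClass(dt: DataType): Class[_] =
  TypeOps(dt) match {
    case phyOps: PhyTypeOps => phyOps.getJavaClass
    case _ =>
      throw new UnsupportedOperationException(s"$dt has no physical type")
  }
```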
@cloud-fan Please take a look at this prototype.
```scala
def supports(dt: DataType): Boolean = dt match {
  case _: TimeType => true
  case _ => false
}
```
This looks like duplicate code compared to `PhyTypeOps`, i.e. something that we wanted to avoid in the first place (adding `TimeType` to a bunch of matches).
> This looks like duplicate code, compared to PhyTypeOps
This check in `LiteralTypeOps` is not a duplicate of the code in `PhyTypeOps`: the code in `LiteralTypeOps` checks that a `DataType` supports the `LiteralTypeOps` interface, while the same-looking code in `PhyTypeOps` checks another interface. I think when we support more data types like `TimeType`, we can replace this pattern matching by a set lookup like:

```scala
private val supportedDataTypes = Set(AnyTimeType, AnsiIntervalType)

def supports(dt: DataType): Boolean = supportedDataTypes.contains(dt)
```

> that we wanted to avoid in the first place (adding TimeType to a bunch of matches).

Yep, this is the goal. Ideally we should propagate `TypeOps` objects instead of `DataType`, but for now that requires tons of changes. The current approach is a temporary workaround.
```scala
def supports(dt: DataType): Boolean = dt match {
  case _: TimeType => true
  case _ => false
}
```
Even this is somewhat of a duplicate, given that we already match `TimeType` as part of `TypeOps`. The ideal scenario would be to do this logic only once, without the need to add `TimeType` to several matches.
The ideal scenario would be to deal only with `TypeOps` objects and not with `DataType`. In that case, you can match them by a required interface and use its methods. Create a `TypeOps` from a `DataType` -> propagate it everywhere -> match it at the end:

```scala
def handle(ops: TypeOps, value: Any): Array[Byte] = ops match {
  case hashOps: HashTypeOps => hashOps.sha256(value)
  case _ => throw new UnsupportedOperationException("type does not support hashing")
}
```
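A minimal usage sketch of that flow, assuming the hypothetical `TypeOps(dt)` factory and `HashTypeOps` interface from the comment above (neither is part of this PR):

```scala
// Resolve the Ops object once at the boundary.
def hashColumn(dt: DataType, value: Any): Array[Byte] = {
  val ops: TypeOps = TypeOps(dt) // hypothetical factory
  // Propagate `ops` instead of the DataType; dispatch on the
  // interface only at the very end.
  handle(ops, value)
}
```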
uros-db left a comment:
Do we have a general idea of how much is actually covered here? It seems to me that this is just a very thin slice (i.e. 5-10%) of everything that needs to be done when adding a new type from scratch. I do think this is a good general direction, but I am wondering how reflective this prototype is of the real-world thing. @MaxGekk you have more context in this sense, could you provide a better estimate?
@uros-db The purpose of the prototype is not to cover much, but to show that the proposed approach is applicable to Spark's code base. This prototype covers <1%, I think.
But there are many things that this approach can never cover, e.g. casting rules, type coercion, function support, specific operations related to a particular type (think of collated strings, for example), etc. So let's exclude those from the equation. What I'm trying to make sense of here is the following: does this kind of approach actually make a large impact on new data type introduction, or does it just complicate the codebase with only minor advantages? To quantify this decision better, it may be useful to have a reasonable table of approximations, such as:
The numbers here are blind guesses; please provide a better approximation if you have one. I'm just trying to point to the bigger picture here.
@cloud-fan Could you take a look at this PR, please? I added and implemented a few more interfaces.
```scala
import org.apache.spark.sql.types.TimeType

class TimeTypeApiOps(t: TimeType)
    extends TypeApiOps
```
nit. indentation?
@dongjoon-hyun I have to violate the coding style because a GitHub Actions check fails otherwise and requires the formatting produced by:

```
$ ./build/mvn scalafmt:format -Dscalafmt.skip=false -Dscalafmt.validateOnly=false -Dscalafmt.changedOnly=false -pl sql/api -pl sql/connect/common -pl sql/connect/server -pl sql/connect/shims -pl sql/connect/client/jvm
```
```scala
  def apply(dt: DataType): FormatTypeOps = TypeApiOps(dt).asInstanceOf[FormatTypeOps]
}


```
nit. extra empty lines.
@dongjoon-hyun @holdenk @cloud-fan Could you review this PR, please?
@dongjoon-hyun May I ask you to review this PR, please?
Can we close this PR? I started working on this effort: I updated the main work item with a more thorough explanation and a design doc, and I've created sub-tasks for the different implementation phases. I've started working on the initial phase in #54223.
What changes were proposed in this pull request?
In the PR, I propose to refactor internal operations over Catalyst's data types and introduce a new Type framework as a set of interfaces. Every data type should have a companion `Ops` object which implements type-specific interfaces. This PR adds only one ops object, for the `TimeType` data type, which implements the following interfaces:
- `PhyTypeOps` - operations over the physical underlying type. For example, the physical type of `TimeType` is `Long`.
- `LiteralTypeOps` - operations over type values as literals in SQL/Java.
- `ExternalTypeOps` - conversions to external types. In the case of `TimeType`, the ops object implements conversions to/from `java.time.LocalTime`.
- `FormatTypeOps` - formatting of type values as strings.
- `EncodeTypeOps` - serialization of row values from/to specific objects.
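For illustration, the split into interfaces could look roughly like the sketch below. The trait names `PhyTypeOps` and `FormatTypeOps` come from this PR; the base trait and the method signatures (apart from `getJavaClass`, which appears in the diff) are assumptions:

```scala
// Base marker trait: every data type gets a companion Ops object
// implementing the type-specific interfaces it supports.
trait TypeOps

// Operations over the physical underlying type (e.g. Long for TimeType).
trait PhyTypeOps extends TypeOps {
  def getJavaClass: Class[_] // from this PR's diff
}

// Formatting of type values as strings (assumed signature).
trait FormatTypeOps extends TypeOps {
  def format(value: Any): String
}
```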
Client and server-side `Ops` objects

Some interfaces are useful on the client side of Spark Connect and can be implemented only inside of the `spark-api` package, because the implementation requires internal functions/classes of the package. As a consequence, `Ops` objects are split into two objects: a client `Ops` object called `ApiOps` (after the name of the package) and an `Ops` object on the server side. For example, the `TimeType` data type has two companion `Ops` objects: the `TimeTypeApiOps` class in `spark-api` and the case class `TimeTypeOps`.
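A sketch of how the two companions could relate for `TimeType`. Only the class names `TimeTypeApiOps`, `TimeTypeOps`, and `TypeApiOps` come from this PR; the bodies are illustrative and assume the traits sketched above:

```scala
import java.time.LocalTime
import org.apache.spark.sql.types.TimeType

// Client side (spark-api): interfaces implementable against the public API.
class TimeTypeApiOps(t: TimeType) extends TypeApiOps with FormatTypeOps {
  // Assumed behavior: render the physical value (nanoseconds of the day)
  // as a time string.
  override def format(value: Any): String =
    LocalTime.ofNanoOfDay(value.asInstanceOf[Long]).toString
}

// Server side: interfaces that need Catalyst internals.
case class TimeTypeOps(t: TimeType) extends PhyTypeOps {
  // The physical type of TimeType is Long (see the PR description).
  override def getJavaClass: Class[_] = classOf[java.lang.Long]
}
```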
Why are the changes needed?

In fact, Catalyst's data types are handled in an ad-hoc way, and the processing logic is distributed across the entire code base of Spark SQL. According to rough estimates, there are more than 100 places where a new data type needs to be handled. For example, `DayTimeIntervalType` is matched in:

Here is one of the examples, see the link:
Such an approach is error-prone because there is a high chance of missing the handling of a particular data type. The mistake can be found only at runtime via a user-facing error, and the compiler cannot help in such cases.
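To illustrate the failure mode with a contrived example (not code from Spark): a scattered match that forgets one type compiles fine and only fails at runtime:

```scala
// Ad-hoc style: every call site repeats a match like this one and can
// silently miss a type. Forgetting the TimeType case below is not a
// compile error; it surfaces only as a runtime error on user queries.
def javaClassAdHoc(dt: DataType): Class[_] = dt match {
  case LongType => classOf[java.lang.Long]
  // case _: TimeType => classOf[java.lang.Long]  // easy to forget
  case other => throw new UnsupportedOperationException(s"Unsupported type: $other")
}
```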
Does this PR introduce any user-facing change?
No. This is just refactoring.
How was this patch tested?
By running the affected test suites:
Was this patch authored or co-authored using generative AI tooling?
No.