[SPARK-56981][SQL] Add physical representation and UnsafeRow support for nanosecond timestamps#56059
[SPARK-56981][SQL] Add physical representation and UnsafeRow support for nanosecond timestamps#56059MaxGekk wants to merge 9 commits into
Conversation
Register physical types and UnsafeRow/InternalRow accessors for TimestampNTZNanosType and TimestampLTZNanosType so values can be stored and read using the SPIP composite layout (epoch micros + sub-micro nanos).
…anosSuite Wrap the `assertThrows` calls in `invalidNanosWithinMicroNTZ`/`LTZ` so each line stays within the 100-character limit enforced by checkstyle. Co-authored-by: Isaac
…estamps Route nullable TIMESTAMP_NTZ/LTZ nanos fields through UnsafeWriter.write instead of setNull8Bytes so the 16-byte variable-length region is reserved and in-place setTimestampNTZNanos/LTZ remains valid.
…ndian safety writePayload stores nanos via putLong; reading with getShort returned zero on big-endian. Match the write path with getLong and a 16-bit mask.
…timestamps Document unsafe projection support for TimestampNTZNanosType and TimestampLTZNanosType instead of relying only on DatetimeType extending AtomicType. Add a test to pin the contract.
…types Provide defaults in Literal.defaultDefault so planning rules that call Literal.default on TimestampNTZNanosType/TimestampLTZNanosType (e.g. outer- join null replacement, window tolerance) don't throw noDefaultForDataTypeError. Co-authored-by: Isaac
|
@stevomitric Please, have a look at the PR. |
… and UnsafeRow storage Add internal scaladoc and code comments explaining the composite value representation, 16-byte UnsafeRow payload, and deferred ordering support.
…entation Wrap scaladoc and Javadoc lines to satisfy scalastyle and checkstyle.
|
@dongjoon-hyun @cloud-fan @felixcheung @peter-toth @mridulm @sunchao friendly ping for review when you have a moment. This PR wires physical storage for nanosecond timestamp types (TimestampNTZNanos / TimestampLTZNanos) in InternalRow and UnsafeRow, on top of the logical types and parser work already in master. There is no user-facing SQL change in this PR, but it unblocks the rest of the SPARK-56822 SPIP — casts, Parquet read, analysis gating, ordering/hash, and expression parity all need rows that can hold these values. |
stevomitric
left a comment
There was a problem hiding this comment.
Overall looks good - one concern i have here is that we are not following types framework. I found a couple of places where we could integrate - there might be a few more.
| assert(unsafeRow.getLong(0) >>> 32 === offset) | ||
| } | ||
|
|
||
| testBothCodegenAndInterpreted("SPARK-41535: nanos timestamps initialized as null") { |
There was a problem hiding this comment.
The SPARK-41535 relates to a bug for calendar interval types. I think we shouldn't reuse other bug-tickets when introducing a new type.
There was a problem hiding this comment.
The purpose of reference to JIRA tickets is to provide more contexts of why the test is needed at all. But I agree that relation to the calendar interval type might confuse. Let me remove SPARK-41535 from the test title and add more comments.
| case TimestampNTZType => PhysicalLongType | ||
| case CalendarIntervalType => PhysicalCalendarIntervalType | ||
| case _: TimestampNTZNanosType => PhysicalTimestampNTZNanosType | ||
| case _: TimestampLTZNanosType => PhysicalTimestampLTZNanosType |
There was a problem hiding this comment.
This should go through types framework.
| case _: TimestampNTZNanosType => | ||
| (input, v) => input.setTimestampNTZNanos(ordinal, v.asInstanceOf[TimestampNTZNanos]) | ||
| case _: TimestampLTZNanosType => | ||
| (input, v) => input.setTimestampLTZNanos(ordinal, v.asInstanceOf[TimestampLTZNanos]) |
There was a problem hiding this comment.
We can integrate this into types as well.
I did that intentionally to: 1. minimize the changes in this PR, 2. Integration to the Type framework could be done later in parallel to other tasks that require functions/APIs introduced by this PR. |
stevomitric
left a comment
There was a problem hiding this comment.
I did that intentionally to: 1. minimize the changes in this PR, 2. Integration to the Type framework could be done later in parallel to other tasks that require functions/APIs introduced by this PR.
I see. In that case LGTM.
…s timestamps Rename the test and document its purpose with reference to the CalendarInterval null-init regression (SPARK-41535).
What changes were proposed in this pull request?
This PR implements the physical row layer for nanosecond-capable timestamp types, as defined in SPARK-56822 SPIP: Timestamps with nanosecond precision.
Logical types
TimestampNTZNanosType(p)andTimestampLTZNanosType(p)(p in [7, 9]) were added in #55952; they still mapped toUninitializedPhysicalType, so values could not be stored or read fromInternalRow/UnsafeRow. This change wires up the minimum physical infrastructure that downstream work (casts, Parquet, expressions) can depend on.SPIP internal representation
Per the SPIP, a value is a composite of:
Logical defaultSize remains 10 bytes on the types. In UnsafeRow, values use the same variable-length pattern as CalendarInterval: an 8-byte field slot (offset + size) pointing to a 16-byte aligned payload (epochMicros + nanosWithinMicro with padding), so in-place updates remain possible.
Implementation summary
TimestampNTZNanos- physical value forTIMESTAMP_NTZ(p)TimestampLTZNanos- physical value forTIMESTAMP_LTZ(p)PhysicalTimestampNTZNanosType,PhysicalTimestampLTZNanosTyperegistered inPhysicalDataType.applyDefaultInternalRow,UnsafeRow,UnsafeArrayData, codegen (CodeGenerator,InterpretedUnsafeProjection,SpecializedGettersReader), and literal validationColumnVectorstubs throwSparkUnsupportedOperationExceptionuntil columnar support is addedWhy are the changes needed?
Without a concrete physical type and
UnsafeRowaccessors, any code path that materializes rows for nanosecond timestamps fails or falls through toUninitializedPhysicalType. This is the unblocker for the rest of sub-tasks.Does this PR introduce any user-facing change?
No. Logical types exist but are not yet exposed through SQL; behavior of
TimestampType,TimestampNTZType, and microsecond storage is unchanged.How was this patch tested?
unsafe/testOnly *TimestampNanosSuite— unsafe value equality, hashCode, validation (nanosWithinMicro∈ [0, 999])catalyst/testOnly *TimestampNanosRowSuite—GenericInternalRowandUnsafeRowroundtrips (NTZ + LTZ, null/non-null), codegen and interpreted unsafe projection, literal validationDataTypeSuite—PhysicalDataTypeis notUninitializedPhysicalTypefor p in {7, 8, 9};defaultSizeremains 10Was this patch authored or co-authored using generative AI tooling?
Generated-by: Cursor Auto