Skip to content

[SPARK-56981][SQL] Add physical representation and UnsafeRow support for nanosecond timestamps#56059

Open
MaxGekk wants to merge 9 commits into
apache:masterfrom
MaxGekk:nanos-in-rows
Open

[SPARK-56981][SQL] Add physical representation and UnsafeRow support for nanosecond timestamps#56059
MaxGekk wants to merge 9 commits into
apache:masterfrom
MaxGekk:nanos-in-rows

Conversation

@MaxGekk
Copy link
Copy Markdown
Member

@MaxGekk MaxGekk commented May 22, 2026

What changes were proposed in this pull request?

This PR implements the physical row layer for nanosecond-capable timestamp types, as defined in SPARK-56822 SPIP: Timestamps with nanosecond precision.
Logical types TimestampNTZNanosType(p) and TimestampLTZNanosType(p) (p in [7, 9]) were added in #55952; they still mapped to UninitializedPhysicalType, so values could not be stored or read from InternalRow / UnsafeRow. This change wires up the minimum physical infrastructure that downstream work (casts, Parquet, expressions) can depend on.

SPIP internal representation

Per the SPIP, a value is a composite of:

  • Epoch microseconds (long, 8 bytes) — same proleptic-Gregorian epoch microsecond count as existing microsecond timestamps
  • Nanoseconds within that microsecond (short, 0-999) — sub-micro fractional part, not a full nanosecond offset from epoch

Logical defaultSize remains 10 bytes on the types. In UnsafeRow, values use the same variable-length pattern as CalendarInterval: an 8-byte field slot (offset + size) pointing to a 16-byte aligned payload (epochMicros + nanosWithinMicro with padding), so in-place updates remain possible.

Implementation summary

  • Unsafe value types (distinct classes for NTZ vs LTZ semantics at the Java layer; shared byte layout):
    • TimestampNTZNanos - physical value for TIMESTAMP_NTZ(p)
    • TimestampLTZNanos - physical value for TIMESTAMP_LTZ(p)
  • Physical types: PhysicalTimestampNTZNanosType, PhysicalTimestampLTZNanosType registered in PhysicalDataType.applyDefault
  • Row access: specialized getters/setters on InternalRow, UnsafeRow, UnsafeArrayData, codegen (CodeGenerator, InterpretedUnsafeProjection, SpecializedGettersReader), and literal validation
  • Columnar: ColumnVector stubs throw SparkUnsupportedOperationException until columnar support is added

Why are the changes needed?

Without a concrete physical type and UnsafeRow accessors, any code path that materializes rows for nanosecond timestamps fails or falls through to UninitializedPhysicalType. This is the unblocker for the rest of sub-tasks.

Does this PR introduce any user-facing change?

No. Logical types exist but are not yet exposed through SQL; behavior of TimestampType, TimestampNTZType, and microsecond storage is unchanged.

How was this patch tested?

  • unsafe/testOnly *TimestampNanosSuite — unsafe value equality, hashCode, validation (nanosWithinMicro ∈ [0, 999])
  • catalyst/testOnly *TimestampNanosRowSuiteGenericInternalRow and UnsafeRow roundtrips (NTZ + LTZ, null/non-null), codegen and interpreted unsafe projection, literal validation
  • DataTypeSuitePhysicalDataType is not UninitializedPhysicalType for p in {7, 8, 9}; defaultSize remains 10

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Cursor Auto

MaxGekk added 6 commits May 22, 2026 09:59
Register physical types and UnsafeRow/InternalRow accessors for
TimestampNTZNanosType and TimestampLTZNanosType so values can be stored
and read using the SPIP composite layout (epoch micros + sub-micro nanos).
…anosSuite

Wrap the `assertThrows` calls in `invalidNanosWithinMicroNTZ`/`LTZ` so
each line stays within the 100-character limit enforced by checkstyle.

Co-authored-by: Isaac
…estamps

Route nullable TIMESTAMP_NTZ/LTZ nanos fields through UnsafeWriter.write
instead of setNull8Bytes so the 16-byte variable-length region is reserved
and in-place setTimestampNTZNanos/LTZ remains valid.
…ndian safety

writePayload stores nanos via putLong; reading with getShort returned zero on
big-endian. Match the write path with getLong and a 16-bit mask.
…timestamps

Document unsafe projection support for TimestampNTZNanosType and
TimestampLTZNanosType instead of relying only on DatetimeType extending
AtomicType. Add a test to pin the contract.
…types

Provide defaults in Literal.defaultDefault so planning rules that call
Literal.default on TimestampNTZNanosType/TimestampLTZNanosType (e.g. outer-
join null replacement, window tolerance) don't throw noDefaultForDataTypeError.

Co-authored-by: Isaac
@MaxGekk MaxGekk changed the title [WIP][SPARK-56981][SQL] Add physical representation and UnsafeRow support for nanosecond timestamps [SPARK-56981][SQL] Add physical representation and UnsafeRow support for nanosecond timestamps May 22, 2026
@MaxGekk
Copy link
Copy Markdown
Member Author

MaxGekk commented May 23, 2026

@stevomitric Please, have a look at the PR.

MaxGekk added 2 commits May 24, 2026 11:52
… and UnsafeRow storage

Add internal scaladoc and code comments explaining the composite value
representation, 16-byte UnsafeRow payload, and deferred ordering support.
…entation

Wrap scaladoc and Javadoc lines to satisfy scalastyle and checkstyle.
@MaxGekk
Copy link
Copy Markdown
Member Author

MaxGekk commented May 25, 2026

@dongjoon-hyun @cloud-fan @felixcheung @peter-toth @mridulm @sunchao friendly ping for review when you have a moment.

This PR wires physical storage for nanosecond timestamp types (TimestampNTZNanos / TimestampLTZNanos) in InternalRow and UnsafeRow, on top of the logical types and parser work already in master.

There is no user-facing SQL change in this PR, but it unblocks the rest of the SPARK-56822 SPIP — casts, Parquet read, analysis gating, ordering/hash, and expression parity all need rows that can hold these values.

Copy link
Copy Markdown
Contributor

@stevomitric stevomitric left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks good - one concern i have here is that we are not following types framework. I found a couple of places where we could integrate - there might be a few more.

assert(unsafeRow.getLong(0) >>> 32 === offset)
}

testBothCodegenAndInterpreted("SPARK-41535: nanos timestamps initialized as null") {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The SPARK-41535 relates to a bug for calendar interval types. I think we shouldn't reuse other bug-tickets when introducing a new type.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The purpose of reference to JIRA tickets is to provide more contexts of why the test is needed at all. But I agree that relation to the calendar interval type might confuse. Let me remove SPARK-41535 from the test title and add more comments.

case TimestampNTZType => PhysicalLongType
case CalendarIntervalType => PhysicalCalendarIntervalType
case _: TimestampNTZNanosType => PhysicalTimestampNTZNanosType
case _: TimestampLTZNanosType => PhysicalTimestampLTZNanosType
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should go through types framework.

case _: TimestampNTZNanosType =>
(input, v) => input.setTimestampNTZNanos(ordinal, v.asInstanceOf[TimestampNTZNanos])
case _: TimestampLTZNanosType =>
(input, v) => input.setTimestampLTZNanos(ordinal, v.asInstanceOf[TimestampLTZNanos])
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can integrate this into types as well.

@MaxGekk
Copy link
Copy Markdown
Member Author

MaxGekk commented May 25, 2026

one concern i have here is that we are not following types framework.

I did that intentionally to: 1. minimize the changes in this PR, 2. Integration to the Type framework could be done later in parallel to other tasks that require functions/APIs introduced by this PR.

Copy link
Copy Markdown
Contributor

@stevomitric stevomitric left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did that intentionally to: 1. minimize the changes in this PR, 2. Integration to the Type framework could be done later in parallel to other tasks that require functions/APIs introduced by this PR.

I see. In that case LGTM.

…s timestamps

Rename the test and document its purpose with reference to the CalendarInterval
null-init regression (SPARK-41535).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants