Add TimestampWithOffset canonical extension type#558
Add TimestampWithOffset canonical extension type#558serramatutu wants to merge 15 commits intoapache:mainfrom
TimestampWithOffset canonical extension type#558Conversation
…48002) ### Rationale for this change Closes #44248 Arrow has no built-in canonical way of representing the `TIMESTAMP WITH TIME ZONE` SQL type, which is present across multiple different database systems. Not having a native way to represent this forces users to either convert to UTC and drop the time zone, which may have correctness implications, or use bespoke workarounds. A new `arrow.timestamp_with_offset` extension type would introduce a standard canonical way of representing that information. Rust implementation: apache/arrow-rs#8743 Go implementation: apache/arrow-go#558 [DISCUSS] [thread in the mailing list](https://lists.apache.org/thread/yhbr3rj9l59yoxv92o2s6dqlop16sfnk). ### What changes are included in this PR? Proposal and documentation for `arrow.timestamp_with_offset` canonical extension type. ### Are these changes tested? N/A ### Are there any user-facing changes? Yes, this is an extension to the arrow format. * GitHub Issue: #44248 --------- Co-authored-by: David Li <li.davidm96@gmail.com> Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com> Co-authored-by: Felipe Oliveira Carvalho <felipekde@gmail.com>
95230ad to
b9b8bf2
Compare
TimestampWithOffset extension typeTimestampWithOffset extension type
TimestampWithOffset extension typeTimestampWithOffset canonical extension type
b9b8bf2 to
ccdd288
Compare
| values := make([]interface{}, a.Len()) | ||
| a.iterValues(func(i int, timestamp *time.Time) { | ||
| values[i] = timestamp | ||
| }) | ||
| return json.Marshal(values) |
There was a problem hiding this comment.
| values := make([]interface{}, a.Len()) | |
| a.iterValues(func(i int, timestamp *time.Time) { | |
| values[i] = timestamp | |
| }) | |
| return json.Marshal(values) | |
| return json.Marshal(a.Values()) |
There was a problem hiding this comment.
I ended up not being able to do this because I changed iterValues() to return time.Time instead of *time.Time. So we need a check if time.Unix() == 0, meaning it should actually be serialized as null instead of 1970-01-01.
There was a problem hiding this comment.
Using 0 as a sentinel for null seems incorrect as that means you can't actually encode a non-null value of 1970-01-01 and get the correct output. Since we need to be able to represent the difference between a value of 0 and a null, then either iterValues() needs to return a sequence of *time.Time or it needs to be a sequence of iter.Seq2[time.Time, bool] where the bool indicates validity (if the bool is false, then it's null).
| timestamps.UnsafeAppendBoolToBitmap(false) | ||
| offsets.UnsafeAppendBoolToBitmap(false) |
There was a problem hiding this comment.
these are non-nullable I thought? We should be pushing default 0,0 in the case where !valids[i]
There was a problem hiding this comment.
Oops, yea this was reminiscent of the old implementation before we all circled around to this being non-nullable again... Good catch :)
There was a problem hiding this comment.
As we discussed in Slack, I found an issue when roundtripping to JSON.
It looks like replacing these with UnsafeAppend results in an array with nulls=0 in the inner (non-nullable) field.
If I write the same array to JSON and parse it back with arrow.RecordFromJSON(), the inner array has nulls=1, even if the field is not nullable.
This leads to arrow.RecordEqual() returning false due to original.NullN() != roundtrip.NullN().
There is an ongoing thread on the mainling list about how to resolve this, so I'll hold off on fixing it for now.
| timestamps.UnsafeAppendBoolToBitmap(false) | ||
| offsets.UnsafeAppendBoolToBitmap(false) |
There was a problem hiding this comment.
same as above, these are non-nullable according to the spec.
| timestamps.UnsafeAppendBoolToBitmap(false) | ||
| offsets.UnsafeAppendBoolToBitmap(false) |
There was a problem hiding this comment.
same as above. these are non-nullable according to the spec
ccdd288 to
cee17a3
Compare
arrow/array/encoded.go
Outdated
| if !v { | ||
| b.finishRun() | ||
| } |
There was a problem hiding this comment.
this seems incorrect, it could be a run of nulls, couldn't it? And wouldn't the run itself be handled by the value append? I don't think we need to call b.finishRun here
| naiveUtc := time.Date(t.Year(), t.Month(), t.Day(), t.Hour(), t.Minute(), t.Second(), t.Nanosecond(), time.UTC) | ||
| offsetMinutes := int16(naiveUtc.Sub(t).Minutes()) |
There was a problem hiding this comment.
can you give me an example where it messes up the values?
Just to verify, the goal here is to take t and provide the same instant of time in UTC while figuring out what the offset was in minutes for the timezone, yes? i.e. given 2009-11-10 15:00:00 -0800 PST this should return 2009-11-10 23:00:00 +0000 UTC and 480, correct?
| values := make([]interface{}, a.Len()) | ||
| a.iterValues(func(i int, timestamp *time.Time) { | ||
| values[i] = timestamp | ||
| }) | ||
| return json.Marshal(values) |
There was a problem hiding this comment.
Using 0 as a sentinel for null seems incorrect as that means you can't actually encode a non-null value of 1970-01-01 and get the correct output. Since we need to be able to represent the difference between a value of 0 and a null, then either iterValues() needs to return a sequence of *time.Time or it needs to be a sequence of iter.Seq2[time.Time, bool] where the bool indicates validity (if the bool is false, then it's null).
| } | ||
| } else { | ||
| for i := 0; i < a.Len(); i++ { | ||
| ts := time.Unix(0, 0) |
There was a problem hiding this comment.
this should be equivalent to just doing var ts time.Time, except faster since it avoids the extra processing of calling the Unix function.
|
|
||
| // Iterates over the array returning the timestamp at each position. | ||
| // | ||
| // If the timestamp is null, the returned time will be the unix epoch. |
There was a problem hiding this comment.
what happens if the value is not-null but is the unix epoch? this iteration would be unable to represent that case
This commit adds a new `TimestampWithOffset` extension type that can be used to represent timestamps with per-row timezone information. It stores information in a `struct` with 2 fields, `timestamp=[T, "UTC"]`, where `T` can be any `arrow.TimeUnit` and `offset_minutes=int16`, which represents the offset in minutes from the UTC timestamp.
This commit allows `TimestampWithOffset` to be dict-encoded. - I made `NewTimestampWithOffsetType` take in an input `offsetType arrow.DataType`. It returns an error if the data type is not valid. - I added a new infallible `NewTimestampWithOffsetTypePrimitiveEncoded` to make the encoding explicit. - I added `NewTimestampWithOffsetTypeDictionaryEncoded` which returns an error in case the given type is not a valid dictionary key type. - I made all tests run in a for loop with all possible allowed encoding types, ensuring all encodings work.
Smartly iterate over offsets if they're run-end encoded instead of doing a binary search at every iteration. This makes the loops O(n) instead of O(n*logn).
Changed a lot of things based on Matt's suggestions.
These builder implementations were inheriting the default implementation from `Builder`, which does not bump the length of the inner builders. This would leave the builder in an inconsistent state where the top-level builder had the correct length but the inner builder doesn't.
This reverts commit 5bdd53f.
2762d64 to
0249806
Compare
Which issue does this PR close?
This PR implements the new
arrow.timestamp_with_offsetcanonical extension type forarrow-go.Rationale for this change
Be compatible with Arrow spec.
What changes are included in this PR?
This commit adds a new
TimestampWithOffsetextension type. This type represents a timestamp column that stores potentially different timezone offsets per value. The timestamp is stored in UTC alongside the original timezone offset in minutes. The offset in minutes can be primitive encoded, dictionary encoded or run-end encoded.Are these changes tested?
Yes.
Are there any user-facing changes?
Yes, this is a new canonical extension type.