Description
Describe the bug, including details regarding any error messages, version, and platform.
It appears that arrow::write_parquet() does not preserve haven::tagged_na() missing values and converts them to regular missing values instead, which loses information. The reprex below goes through arrow::write_dataset(), which writes Parquet by default.
library(arrow)
library(haven)
library(dplyr)
library(labelled)
# write_dataset() treats this path as a directory of Parquet files
this_pq <- tempfile(fileext = ".parquet")
mydf <- data.frame(x = c(1, NA, haven::tagged_na("a")))
mydf
# save as parquet and reopen
arrow::write_dataset(dataset = mydf, path = this_pq)
this_ds <- open_dataset(sources = this_pq)
mydf2 <- collect(this_ds)
# expected result
is_tagged_na(mydf$x, "a")
# actual
is_tagged_na(mydf2$x, "a")
is_regular_na(mydf2$x)
packageVersion("arrow")
Output:
> # expected result
> is_tagged_na(mydf$x, "a")
[1] FALSE FALSE TRUE
> # actual
> is_tagged_na(mydf2$x, "a")
[1] FALSE FALSE FALSE
> is_regular_na(mydf2$x)
[1] FALSE TRUE TRUE
> packageVersion("arrow")
[1] ‘23.0.0’
https://haven.tidyverse.org/reference/tagged_na.html
Background: In SAS, the special missing values .a, .b, ..., .z are implemented as values near negative infinity. In Stata, the special missing values .a, .b, ..., .z are implemented as values near positive infinity: Stata literally reserves the top few values of each storage type to be interpreted as missing values rather than numbers. For the byte (int8) type, valid values run from -127 to 100, 101 is the system missing value ".", and 102 through 127 are .a through .z; see https://www.stata.com/help.cgi?datatypes and https://www.stata.com/help.cgi?missing.

What the implementation is in haven, I don't really know. The labels are implemented as attributes() and are more or less retained (see the somewhat extended reprex below).

The main value added by the whole concept is that you can distinguish the reasons for missing values, with labels such as haven::labelled(your_numeric_vector, labels = c("Don't know" = tagged_na("d"), "Refused" = tagged_na("r"), "Valid skip" = tagged_na("s"), "Not in universe" = tagged_na("u"))).
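On the haven side, my understanding (an assumption based on haven's documentation and source, not something verified against arrow) is that a tagged NA is an ordinary double NA with one extra byte of the NaN payload set to the tag character. If that is right, any writer that normalizes NaN/NA values on the way out will drop the tag. A minimal sketch to inspect the raw bytes; double_bytes() is a helper made up for this illustration:

```r
library(haven)

# Hypothetical helper: the 8 bytes of a double, most significant byte first.
double_bytes <- function(x) writeBin(x, raw(), endian = "big")

regular <- NA_real_
tagged  <- tagged_na("a")

# Print both bit patterns for comparison.
double_bytes(regular)
double_bytes(tagged)

# Positions where the two patterns differ; the differing byte(s) are
# converted back to text and shown next to the tag haven reports.
differs <- double_bytes(regular) != double_bytes(tagged)
rawToChar(double_bytes(tagged)[differs])
na_tag(tagged)

# Base R still treats the tagged value as a plain NA.
is.na(tagged)
```

Since the tag is only visible at the bit level, a plain comparison of the data frames before and after the round trip will not flag the loss; is_tagged_na() or na_tag() is needed to see it.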
Labels are OK-ish:
mydf <- data.frame(x = labelled(c(1,NA, haven::tagged_na("a")), labels = c("Blah" = 1, "aaa" = tagged_na("a"))))
arrow::write_dataset(dataset = mydf, path = this_pq, format="parquet")
this_ds <- open_dataset(sources = this_pq)
mydf2 <- collect(this_ds)
get_value_labels(mydf$x) |> labelled::print_tagged_na()
get_value_labels(mydf2$x) |> labelled::print_tagged_na()
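Until this is handled in arrow, one possible workaround is to carry the tags in a separate character column and re-apply them after reading. This is only a sketch: extract_na_tags() and restore_na_tags() are hypothetical helpers, not part of arrow or haven, and it assumes a plain double column. The Parquet round trip is simulated here by resetting every missing value to a regular NA, which matches the output reported above.

```r
library(haven)

# Hypothetical helpers (not part of arrow or haven).
# na_tag() returns NA_character_ for non-missing values and untagged NAs.
extract_na_tags <- function(x) na_tag(x)

restore_na_tags <- function(x, tags) {
  has_tag <- !is.na(tags)
  if (any(has_tag)) {
    x[has_tag] <- tagged_na(tags[has_tag])
  }
  x
}

x <- c(1, NA, tagged_na("a"))

# Before writing: save the tags next to the data, e.g. as a column x_tag.
tags <- extract_na_tags(x)

# Simulated round trip: every missing value comes back as a regular NA.
x_after <- x
x_after[is.na(x_after)] <- NA_real_

# After reading: put the tags back.
x_restored <- restore_na_tags(x_after, tags)
is_tagged_na(x_restored, "a")
```

The extra column costs some space, but it is an ordinary string column, so it should survive any format that arrow can write.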
Component(s)
R