parquet does not retain haven::tagged_na() #49149

@skolenik

Describe the bug, including details regarding any error messages, version, and platform.

It appears that writing a data frame to Parquet with arrow (the reprex below uses arrow::write_dataset()) does not preserve haven::tagged_na() missing values: they come back as regular missing values, so the reason for missingness is lost (loss of data).

library(arrow)
library(haven)
library(dplyr)
library(labelled)

this_pq <- tempfile(fileext = ".parquet")

mydf <- data.frame(x = c(1,NA, haven::tagged_na("a")))
mydf

# save as parquet and reopen
arrow::write_dataset(dataset = mydf, path = this_pq)
this_ds <- open_dataset(sources = this_pq)
mydf2 <- collect(this_ds)

# expected result
is_tagged_na(mydf$x, "a")

# actual
is_tagged_na(mydf2$x, "a")
is_regular_na(mydf2$x)

packageVersion("arrow")

Output:

> # expected result
> is_tagged_na(mydf$x, "a")
[1] FALSE FALSE  TRUE
> # actual
> is_tagged_na(mydf2$x, "a")
[1] FALSE FALSE FALSE
> is_regular_na(mydf2$x)
[1] FALSE  TRUE  TRUE
> packageVersion("arrow")
[1] ‘23.0.0’

 
https://haven.tidyverse.org/reference/tagged_na.html

Background: In SAS, the special missing values .a, .b, ..., .z are implemented as near negative infinity. In Stata, the special missing values .a, .b, ..., .z are implemented as near positive infinity: Stata literally reserves the top few values of each storage type to be interpreted as missing values rather than numbers. For the one-byte integer type (byte), valid values run from -127 to 100, 101 is the system missing value ., and 102 through 127 are .a through .z; see https://www.stata.com/help.cgi?datatypes and https://www.stata.com/help.cgi?missing.

I don't really know how haven implements tagged missing values. (The value labels are stored as attributes() and are more or less retained; see the somewhat extended reprex below.)

The main value of the whole concept is that you can distinguish the reasons for missing values, with labels such as haven::labelled(your_numeric_vector, labels = c("Don't know" = tagged_na("d"), "Refused" = tagged_na("r"), "Valid skip" = tagged_na("s"), "Not in universe" = tagged_na("u"))).
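To make the use case concrete, here is a minimal sketch that uses only haven functions (tagged_na(), na_tag(), is_tagged_na(), labelled()). The variable names `values` and `income` and the numbers are made up for illustration; they are not from any real data set.

```r
library(haven)

# Hypothetical survey item: two valid answers, a "Don't know", a "Refused",
# and one ordinary NA.
values <- c(1200, tagged_na("d"), 800, tagged_na("r"), NA)

# Base R treats all three missing entries the same way ...
is.na(values)

# ... but haven can still tell them apart by their tag.
# na_tag() gives the tag letter, or NA for untagged elements.
na_tag(values)
is_tagged_na(values, "r")

# Attaching value labels to the tags records the reason for missingness.
income <- labelled(
  values,
  labels = c("Don't know" = tagged_na("d"), "Refused" = tagged_na("r"))
)
print_tagged_na(values)
```

This is the information that is at stake in the round trip: once a tagged NA is converted to a regular NA, the label "Refused" no longer has a value to point at.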

Labels are OK-ish:

mydf <- data.frame(x = labelled(c(1,NA, haven::tagged_na("a")), labels = c("Blah" = 1, "aaa" = tagged_na("a"))))
arrow::write_dataset(dataset = mydf, path = this_pq, format="parquet")
this_ds <- open_dataset(sources = this_pq)
mydf2 <- collect(this_ds)
get_value_labels(mydf$x) |> labelled::print_tagged_na()
get_value_labels(mydf2$x) |> labelled::print_tagged_na()
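Until this is addressed, one possible stopgap is to carry the tags in a companion character column and rebuild the tagged NAs after reading. The sketch below is only an illustration under that assumption: `add_tag_column()` and `restore_tags()` are hypothetical helpers, not part of arrow, haven, or labelled, and the Parquet round trip is simulated by overwriting every missing value with a regular NA so that the example runs without arrow.

```r
library(haven)

# Hypothetical helper: store the tag of each value in a "<col>_tag" column.
add_tag_column <- function(df, col) {
  df[[paste0(col, "_tag")]] <- na_tag(df[[col]])
  df
}

# Hypothetical helper: rebuild tagged NAs from the companion column.
restore_tags <- function(df, col) {
  tags <- df[[paste0(col, "_tag")]]
  has_tag <- !is.na(tags)
  df[[col]][has_tag] <- tagged_na(tags[has_tag])
  df
}

mydf <- data.frame(x = c(1, NA, tagged_na("a")))
with_tags <- add_tag_column(mydf, "x")

# Stand-in for the write/read cycle reported above:
# every tagged NA comes back as a regular NA.
after_roundtrip <- with_tags
after_roundtrip$x[is.na(after_roundtrip$x)] <- NA_real_

restored <- restore_tags(after_roundtrip, "x")
is_tagged_na(restored$x, "a")
```

In real use, the simulated step would be replaced by the write_dataset() / open_dataset() / collect() calls from the reprex. I would expect the plain character column to survive Parquet, but I have not verified this against a real round trip.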

Component(s)

R
