-
Notifications
You must be signed in to change notification settings - Fork 1k
Description
Hi, I found a very weird behavior that could be critical for all other data.table operation. When you have a large data set, and try to take the sum of a binary indicator by a group indicator, you will get the inconsistent and wrong results.
# [Minimal reproducible example] here's the code
N = 19926003
dt = data.table(
id = as.integer64(sample(1:3000,N,replace=TRUE)),
X = ifelse(rnorm(N) > 1, 1, 0))
dt[id == 2, sum(X)]
sum_dt = dt[, .(sumX = sum(X)),by='id']
sum_dt[id==2,]
When you run this code, you get the following output.
> dt[id == 2, sum(X)]
[1] 1069
> sum_dt[id==2,]
id sumX
1: 2 1081
These two numbers must be the same, but different. But interestingly, the following code will give you the correct results.
> unique(dt[, sumX := sum(X),by='id'][, c('id','sumX')])[id==2,]
id sumX
1: 2 1069
That said, actually you can fix this issue by setting your data type as integer, for example,
> dt = data.table(
+ id = as.integer64(sample(1:3000,N,replace=TRUE)),
+ X = as.integer(ifelse(rnorm(N) > 1, 1, 0)))
>
> dt[id == 2, sum(X)]
sum_dt = dt[, .(sumX = sum(X)),by='id']
sum_dt[id==2,]
[1] 1108
> sum_dt = dt[, .(sumX = sum(X)),by='id']
> sum_dt[id==2,]
id sumX
1: 2 1108
I start to make it very explicit about the variable type in my code, but I think that it is a big issue given that people in general use numeric as a default data type.
# Output of sessionInfo()
> sessionInfo()
R version 4.0.3 (2020-10-10)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: SUSE Linux Enterprise Server 15 SP1
Matrix products: default
BLAS: /geode2/soft/hps/cle7/r/4.0.3/lib64/R/lib/libRblas.so
LAPACK: /geode2/soft/hps/cle7/r/4.0.3/lib64/R/lib/libRlapack.so
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] bit64_4.0.5 bit_4.0.4 data.table_1.14.0
loaded via a namespace (and not attached):
[1] compiler_4.0.3