Skip to content

sum operation of a numeric column by group in a large data set produces incorrect results. #5066

@letitbk

Description

@letitbk

Hi, I found a very weird behavior that could be critical for all other data.table operation. When you have a large data set, and try to take the sum of a binary indicator by a group indicator, you will get the inconsistent and wrong results.

# [Minimal reproducible example] here's the code

N = 19926003
dt = data.table(
	id = as.integer64(sample(1:3000,N,replace=TRUE)), 
	X = ifelse(rnorm(N) > 1, 1, 0))

dt[id == 2, sum(X)]
sum_dt = dt[, .(sumX = sum(X)),by='id']
sum_dt[id==2,]

When you run this code, you get the following output.

> dt[id == 2, sum(X)]
[1] 1069
> sum_dt[id==2,]
   id sumX
1:  2 1081

These two numbers must be the same, but different. But interestingly, the following code will give you the correct results.

> unique(dt[, sumX := sum(X),by='id'][, c('id','sumX')])[id==2,]
   id sumX
1:  2 1069

That said, actually you can fix this issue by setting your data type as integer, for example,

> dt = data.table(
+   id = as.integer64(sample(1:3000,N,replace=TRUE)), 
+   X = as.integer(ifelse(rnorm(N) > 1, 1, 0)))
> 
> dt[id == 2, sum(X)]
sum_dt = dt[, .(sumX = sum(X)),by='id']
sum_dt[id==2,]
[1] 1108
> sum_dt = dt[, .(sumX = sum(X)),by='id']
> sum_dt[id==2,]
   id sumX
1:  2 1108

I start to make it very explicit about the variable type in my code, but I think that it is a big issue given that people in general use numeric as a default data type.

# Output of sessionInfo()

> sessionInfo()
R version 4.0.3 (2020-10-10)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: SUSE Linux Enterprise Server 15 SP1

Matrix products: default
BLAS:   /geode2/soft/hps/cle7/r/4.0.3/lib64/R/lib/libRblas.so
LAPACK: /geode2/soft/hps/cle7/r/4.0.3/lib64/R/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] bit64_4.0.5       bit_4.0.4         data.table_1.14.0

loaded via a namespace (and not attached):
[1] compiler_4.0.3

Metadata

Metadata

Assignees

No one assigned

    Labels

    GForceissues relating to optimized grouping calculations (GForce)bugnot reproducible

    Type

    No type

    Projects

    No projects

    Milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions