
Low performance of setkey function for version 1.12.0 and later #4788

@oliviermeslin


tl;dr

The setkey function is much slower in all versions of data.table from 1.12.0 onward.

Summary

Context: I manipulate large datasets (~50 million rows, 50 columns) with data.table on a daily basis. I work on three different computers: an old legacy Windows 2008 server, a more recent Windows 10 server, and my local computer. The available versions of R and data.table differ significantly between these settings.

Problem: I have noticed several times that the speed of the setkey function varies considerably depending on the setting I work in: for one of the datasets I work with (54 million rows, with a key uniquely identifying each row), the setkey call may take either 2 seconds or 13 minutes.

To find out where this came from, I ran the same code with several versions of data.table, from 1.10.4 to 1.13.2, in the three settings. The code and all session info are below. I found the same result every time: versions up to and including 1.11.8 are very fast, and later versions are much slower (approximately 200 to 400 times slower).

Results

The table below gives the execution time of setkey on a fake dataset (5 million rows), measured with system.time().

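| Setting | R version | data.table version | User time (s) | System time (s) | Elapsed time (s) |
|---|---|---|---|---|---|
| New server | 3.6.3 | 1.10.4.3 | 5.72 | 0.15 | 5.73 |
| New server | 3.6.3 | 1.11.0 | 5.52 | 0.15 | 5.53 |
| New server | 3.6.3 | 1.11.2 | 5.54 | 0.14 | 5.55 |
| New server | 3.6.3 | 1.11.4 | 5.64 | 0.11 | 5.61 |
| New server | 3.6.3 | 1.11.6 | 5.48 | 0.19 | 5.55 |
| New server | 3.6.3 | 1.11.8 | 5.54 | 0.23 | 5.64 |
| New server | 3.6.3 | 1.12.0 | 11.67 | 19.15 | 76.69 |
| New server | 3.6.3 | 1.12.8 | 8.58 | 10.78 | 65.73 |
| New server | 3.6.3 | 1.13.2 | 10.55 | 24.20 | 70.95 |
| Legacy server | 3.3.3 | 1.10.4.3 | 3.57 | 0.08 | 3.52 |
| Legacy server | 3.3.3 | 1.11.0 | 3.61 | 0.07 | 3.56 |
| Legacy server | 3.3.3 | 1.11.2 | 3.37 | 0.17 | 3.42 |
| Legacy server | 3.3.3 | 1.11.4 | 3.55 | 0.03 | 3.47 |
| Legacy server | 3.3.3 | 1.11.6 | 3.45 | 0.06 | 3.38 |
| Legacy server | 3.3.3 | 1.11.8 | 3.46 | 0.06 | 3.43 |
| Legacy server | 3.3.3 | 1.12.0 | 8.47 | 16.19 | 39.59 |
| Legacy server | 3.3.3 | 1.12.8 | 8.81 | 16.91 | 39.54 |
| Legacy server | 3.3.3 | 1.13.2 | 8.14 | 13.61 | 39.49 |
| Local computer | 3.5.3 | 1.10.4.3 | 4.15 | 0.22 | 4.29 |
| Local computer | 3.5.3 | 1.11.0 | 4.28 | 0.04 | 4.17 |
| Local computer | 3.5.3 | 1.11.2 | 4.49 | 0.06 | 4.62 |
| Local computer | 3.5.3 | 1.11.4 | 4.35 | 0.12 | 4.40 |
| Local computer | 3.5.3 | 1.11.6 | 4.29 | 0.08 | 4.26 |
| Local computer | 3.5.3 | 1.11.8 | 4.21 | 0.06 | 4.21 |
| Local computer | 3.5.3 | 1.12.0 | 13.34 | 17.22 | 31.16 |
| Local computer | 3.5.3 | 1.12.8 | 12.92 | 15.73 | 28.62 |
| Local computer | 3.5.3 | 1.13.2 | 12.77 | 16.28 | 29.02 |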
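
To put a number on the regression, the elapsed times can be turned into slowdown factors. The snippet below is only a rough sketch: it copies the 1.11.8 and 1.12.0 elapsed times from the results above and the "2 seconds or 13 minutes" figures from the Problem paragraph. The column names (`v1_11_8`, `v1_12_0`, `slowdown`) are made up for illustration.

```r
# Elapsed times in seconds, copied by hand from the results above
# (last fast version, 1.11.8, versus first slow version, 1.12.0).
elapsed <- data.frame(
  setting = c("New server", "Legacy server", "Local computer"),
  v1_11_8 = c(5.64, 3.43, 4.21),
  v1_12_0 = c(76.69, 39.59, 31.16)
)

# Slowdown factor on the fake 5-million-row dataset
elapsed$slowdown <- round(elapsed$v1_12_0 / elapsed$v1_11_8, 1)
print(elapsed)

# Slowdown on the real 54-million-row dataset: 13 minutes versus 2 seconds
real_slowdown <- (13 * 60) / 2
print(real_slowdown)
```

On the fake dataset the slowdown is roughly 7 to 14 times, whereas the real dataset figures give about 390 times, so the gap appears to grow with the size of the data.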

Code

This code installs several versions of data.table in separate libraries, and measures the execution time of setkey on an artificial dataset.

# The directory you use for the tests
# (it must end with a path separator, because library paths are built with paste0)
test_dir <-  # Do not forget to define the directory

# Keep the R version for the final table
Rversion <- paste0(R.version$major, ".", R.version$minor)

# All the versions we will test
versions <- list(
  c(paste0("R", Rversion, "-", "dt1-10-4"), "1.10.4-3"),
  c(paste0("R", Rversion, "-", "dt1-11-0"), "1.11.0"),
  c(paste0("R", Rversion, "-", "dt1-11-2"), "1.11.2"),
  c(paste0("R", Rversion, "-", "dt1-11-4"), "1.11.4"),
  c(paste0("R", Rversion, "-", "dt1-11-6"), "1.11.6"),
  c(paste0("R", Rversion, "-", "dt1-11-8"), "1.11.8"),
  c(paste0("R", Rversion, "-", "dt1-12-0"), "1.12.0"),
  c(paste0("R", Rversion, "-", "dt1-12-8"), "1.12.8"),
  c(paste0("R", Rversion, "-", "dt1-13-2"), "1.13.2")
)

#############################
# Part 1: installing all data.table versions


# Function installing all versions of data.table in separate temporary libraries
install_version_dt <- function(infos) {
  
  package_lib <- paste0(test_dir, infos[1])
  package_version <- infos[2]
  
  # Create temporary library
  try(unlink(package_lib, recursive = TRUE))
  dir.create(package_lib)
  
  # Install package version
  devtools::install_version("data.table", version = package_version, lib = package_lib)
  # The namespace name must be quoted, otherwise the call always fails
  try(unloadNamespace("data.table"), silent = TRUE)
}

# Install all data.table versions
lapply(versions, install_version_dt)


#############################
# Part 2: measuring execution time of setkey

# Function measuring the execution time of setkey on artificial data
# with different versions of data.table
test_version_dt <- function(package_version, Rversion) {
  
  # Keep the old library paths
  old_libpath <- .libPaths()
  
  adresse_lib <- paste0(test_dir, package_version)
  .libPaths(adresse_lib)
  print(packageVersion("data.table"))
  dt_version <- packageVersion("data.table")
  
  library("data.table")    
  print(.libPaths())

  
  set.seed(1L)
  dt <- data.table::data.table(
    x = as.character(sample(5e6L, 5e6L, FALSE)), 
    y = runif(100L))
  
  results <- system.time(
    {
      data.table::setkey(dt, x, verbose = TRUE)
    }
  )
  
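  # Make sure we detach and unload the package before testing the next version
  # (the namespace name must be quoted in unloadNamespace)
  try(detach("package:data.table", unload = TRUE), silent = TRUE)
  try(unloadNamespace("data.table"), silent = TRUE)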

  # Restore the old library paths
  .libPaths(old_libpath)
  print(.libPaths())
  
  
  return(
    list(
      "Rversion"   = Rversion, 
      "dt_version" = as.character(dt_version),
      "user.self"  = as.numeric(results["user.self"]),
      "sys.self"   = as.numeric(results["sys.self"]),
      "elapsed"    = as.numeric(results["elapsed"])))
  
}

# Make the list of all temporary libraries (first element of each entry)
folder_list <- vapply(versions, function(x) x[1], character(1))

results_list <- lapply(folder_list, test_version_dt, Rversion)

#############################
# Part 3: summarizing results

results_df <- data.table::rbindlist(results_list)
print(results_df)

Session info

Old Windows 2008 legacy server

R version 3.3.3 (2017-03-06)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows Server 2008 R2 x64 (build 7601) Service Pack 1

locale:
[1] LC_COLLATE=French_France.1252  LC_CTYPE=French_France.1252    LC_MONETARY=French_France.1252 LC_NUMERIC=C                  
[5] LC_TIME=French_France.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
[1] tools_3.3.3

New Windows 10 server

R version 3.6.3 (2020-02-29)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows Server x64 (build 17763)

Matrix products: default

locale:
[1] LC_COLLATE=French_France.1252  LC_CTYPE=French_France.1252    LC_MONETARY=French_France.1252
[4] LC_NUMERIC=C                   LC_TIME=French_France.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
[1] compiler_3.6.3 parallel_3.6.3 tools_3.6.3    Rcpp_1.0.5     fst_0.9.4 

Local computer

R version 3.5.3 (2019-03-11)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

Matrix products: default

locale:
[1] LC_COLLATE=French_France.1252  LC_CTYPE=French_France.1252   
[3] LC_MONETARY=French_France.1252 LC_NUMERIC=C                  
[5] LC_TIME=French_France.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.3        rstudioapi_0.11   magrittr_1.5      usethis_1.5.1    
 [5] devtools_2.2.2    pkgload_1.0.2     R6_2.4.1          rlang_0.4.6      
 [9] fansi_0.4.1       tools_3.5.3       pkgbuild_1.0.6    data.table_1.13.2
[13] sessioninfo_1.1.1 cli_2.0.2         withr_2.1.2       ellipsis_0.3.1   
[17] remotes_2.1.1     yaml_2.2.1        assertthat_0.2.1  digest_0.6.25    
[21] rprojroot_1.3-2   crayon_1.3.4      processx_3.4.2    callr_3.4.2      
[25] fs_1.3.1          ps_1.3.2          testthat_2.3.1    memoise_1.1.0    
[29] glue_1.4.1        compiler_3.5.3    desc_1.2.0        backports_1.1.5  
[33] prettyunits_1.1.1
