tl;dr
The setkey function is much slower in every version of data.table I tested from 1.12.0 onward.
Summary
Context: I manipulate large datasets (~50 million rows, 50 columns) with data.table on a daily basis. I work on three different computers: an old legacy Windows 2008 server, a more recent Windows 10 server, and my local computer. The available versions of R and data.table differ significantly in each setting.
Problem: I noticed several times that the speed of the setkey function varies considerably depending on the setting I work in: for one of the datasets I work with (54 million rows, with a key uniquely identifying each row), the setkey call may take 2 seconds or 13 minutes.
To find out where this comes from, I ran the same code with several versions of data.table, from 1.10.4-3 to 1.13.2, in all three settings. The code and all session info are below. I found the same result every time: versions up to and including 1.11.8 are very fast, and later versions are much slower (approximately 200 to 400 times slower).
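To check whether a given machine falls in the slower range, the installed version can be compared against 1.12.0. This is a minimal sketch that assumes the threshold observed above; `in_slow_range` is a hypothetical helper, and releases that were not timed (for example 1.12.2) are classified by assumption only.

```r
# Versions timed in this report (1.10.4-3 is written 1.10.4.3 in the table).
tested <- numeric_version(c("1.10.4.3", "1.11.0", "1.11.2", "1.11.4", "1.11.6",
                            "1.11.8", "1.12.0", "1.12.8", "1.13.2"))

# Hypothetical helper: TRUE for versions at or above the first slow release
# observed here. Versions that were not timed are classified by assumption.
in_slow_range <- function(v, first_slow = "1.12.0") {
  numeric_version(as.character(v)) >= numeric_version(first_slow)
}

classification <- data.frame(
  version    = as.character(tested),
  slow_range = in_slow_range(tested)
)
print(classification)

# On a given machine, the installed version can be checked the same way:
# in_slow_range(utils::packageVersion("data.table"))
```

Comparing `numeric_version` objects rather than strings avoids mistakes such as "1.9.8" sorting after "1.12.0".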
Results
The table below gives the execution time of setkey on a fake dataset (5 million rows), measured with system.time().
| Setting | R version | data.table version | user time (s) | system time (s) | elapsed time (s) |
|---|---|---|---|---|---|
| New server | 3.6.3 | 1.10.4.3 | 5.72 | 0.15 | 5.73 |
| New server | 3.6.3 | 1.11.0 | 5.52 | 0.15 | 5.53 |
| New server | 3.6.3 | 1.11.2 | 5.54 | 0.14 | 5.55 |
| New server | 3.6.3 | 1.11.4 | 5.64 | 0.11 | 5.61 |
| New server | 3.6.3 | 1.11.6 | 5.48 | 0.19 | 5.55 |
| New server | 3.6.3 | 1.11.8 | 5.54 | 0.23 | 5.64 |
| New server | 3.6.3 | 1.12.0 | 11.67 | 19.15 | 76.69 |
| New server | 3.6.3 | 1.12.8 | 8.58 | 10.78 | 65.73 |
| New server | 3.6.3 | 1.13.2 | 10.55 | 24.20 | 70.95 |
| Legacy server | 3.3.3 | 1.10.4.3 | 3.57 | 0.08 | 3.52 |
| Legacy server | 3.3.3 | 1.11.0 | 3.61 | 0.07 | 3.56 |
| Legacy server | 3.3.3 | 1.11.2 | 3.37 | 0.17 | 3.42 |
| Legacy server | 3.3.3 | 1.11.4 | 3.55 | 0.03 | 3.47 |
| Legacy server | 3.3.3 | 1.11.6 | 3.45 | 0.06 | 3.38 |
| Legacy server | 3.3.3 | 1.11.8 | 3.46 | 0.06 | 3.43 |
| Legacy server | 3.3.3 | 1.12.0 | 8.47 | 16.19 | 39.59 |
| Legacy server | 3.3.3 | 1.12.8 | 8.81 | 16.91 | 39.54 |
| Legacy server | 3.3.3 | 1.13.2 | 8.14 | 13.61 | 39.49 |
| Local computer | 3.5.3 | 1.10.4.3 | 4.15 | 0.22 | 4.29 |
| Local computer | 3.5.3 | 1.11.0 | 4.28 | 0.04 | 4.17 |
| Local computer | 3.5.3 | 1.11.2 | 4.49 | 0.06 | 4.62 |
| Local computer | 3.5.3 | 1.11.4 | 4.35 | 0.12 | 4.40 |
| Local computer | 3.5.3 | 1.11.6 | 4.29 | 0.08 | 4.26 |
| Local computer | 3.5.3 | 1.11.8 | 4.21 | 0.06 | 4.21 |
| Local computer | 3.5.3 | 1.12.0 | 13.34 | 17.22 | 31.16 |
| Local computer | 3.5.3 | 1.12.8 | 12.92 | 15.73 | 28.62 |
| Local computer | 3.5.3 | 1.13.2 | 12.77 | 16.28 | 29.02 |
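A quick way to read these results is to divide the elapsed time of a slow version by the elapsed time of 1.11.8 in the same setting. The sketch below copies three elapsed-time values per setting from the table; the data frame and its column names are made up for illustration.

```r
# Elapsed times in seconds, copied from the results above.
elapsed <- data.frame(
  setting   = c("New server", "Legacy server", "Local computer"),
  dt_1_11_8 = c(5.64, 3.43, 4.21),
  dt_1_12_0 = c(76.69, 39.59, 31.16),
  dt_1_13_2 = c(70.95, 39.49, 29.02)
)

# Slowdown factor relative to 1.11.8, rounded to one decimal place.
elapsed$ratio_1_12_0 <- round(elapsed$dt_1_12_0 / elapsed$dt_1_11_8, 1)
elapsed$ratio_1_13_2 <- round(elapsed$dt_1_13_2 / elapsed$dt_1_11_8, 1)

print(elapsed[, c("setting", "ratio_1_12_0", "ratio_1_13_2")])
```

On this 5-million-row dataset the slowdown is roughly 7 to 14 times, which is smaller than the gap reported above for the 54-million-row dataset (2 seconds versus 13 minutes).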
Code
This code installs several versions of data.table in separate libraries, and measures the execution time of setkey on an artificial dataset.
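# The directory you use for the tests. It must already exist and end with a
# path separator, because library folders are appended to it with paste0().
# Fill in a path on the next line before running: as written, the assignment
# is incomplete and would silently pick up the next expression.
test_dir <- # Do not forget to define the directory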
# Keep the R version for the final table
Rversion <- paste0(R.version$major, ".", R.version$minor)
# All the versions we will test
versions <- list(
  c(paste0("R", Rversion, "-", "dt1-10-4"), "1.10.4-3"),
  c(paste0("R", Rversion, "-", "dt1-11-0"), "1.11.0"),
  c(paste0("R", Rversion, "-", "dt1-11-2"), "1.11.2"),
  c(paste0("R", Rversion, "-", "dt1-11-4"), "1.11.4"),
  c(paste0("R", Rversion, "-", "dt1-11-6"), "1.11.6"),
  c(paste0("R", Rversion, "-", "dt1-11-8"), "1.11.8"),
  c(paste0("R", Rversion, "-", "dt1-12-0"), "1.12.0"),
  c(paste0("R", Rversion, "-", "dt1-12-8"), "1.12.8"),
  c(paste0("R", Rversion, "-", "dt1-13-2"), "1.13.2")
)
#############################
# Part 1: installing all data.table versions
# Function installing one version of data.table in its own temporary library
# (infos is a pair: library folder name, data.table version)
install_version_dt <- function(infos) {
  package_lib <- paste0(test_dir, infos[1])
  package_version <- infos[2]
  # Create temporary library (removing any previous one first)
  unlink(package_lib, recursive = TRUE)
  dir.create(package_lib)
  # Install package version
  devtools::install_version("data.table", version = package_version, lib = package_lib)
  # Unload the namespace in case the installation loaded it
  # (the name must be a quoted string)
  try(unloadNamespace("data.table"), silent = TRUE)
}
# Install all data.table versions
lapply(versions, install_version_dt)
#############################
# Part 2: measuring execution time of setkey
# Function measuring the execution time of setkey on artificial data
# with one version of data.table (lib_folder is the name of its library folder)
test_version_dt <- function(lib_folder, Rversion) {
  # Keep the old library paths
  old_libpath <- .libPaths()
  adresse_lib <- paste0(test_dir, lib_folder)
  .libPaths(adresse_lib)
  dt_version <- packageVersion("data.table")
  print(dt_version)
  library("data.table")
  print(.libPaths())
  set.seed(1L)
  dt <- data.table::data.table(
    x = as.character(sample(5e6L, 5e6L, FALSE)),
    y = runif(100L))
  results <- system.time(
    {
      data.table::setkey(dt, x, verbose = TRUE)
    }
  )
  # Make sure we unload the package (names must be quoted strings)
  try(detach("package:data.table", unload = TRUE, character.only = TRUE), silent = TRUE)
  try(unloadNamespace("data.table"), silent = TRUE)
  # Restore the old library paths
  .libPaths(old_libpath)
  print(.libPaths())
  return(
    list(
      "Rversion" = Rversion,
      "dt_version" = as.character(dt_version),
      "user.self" = as.numeric(results["user.self"]),
      "sys.self" = as.numeric(results["sys.self"]),
      "elapsed" = as.numeric(results["elapsed"])))
}
# Make the list of all temporary libraries (first element of each pair)
folder_list <- vapply(versions, function(x) x[1], character(1))
results_list <- lapply(folder_list, test_version_dt, Rversion)
#############################
# Part 3: summarizing results
results_df <- data.table::rbindlist(results_list)
print(results_df)
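One variable that may be worth isolating is the number of threads data.table uses, since the slow versions show a large system time and an elapsed time well above user time. The sketch below is only a diagnostic under that assumption, not an explanation established by the timings above. It relies on `setDTthreads()` and `getDTthreads()` from data.table; `time_setkey_by_threads` is a hypothetical helper, and the small row count is chosen so that the example runs quickly.

```r
library(data.table)

# Hypothetical helper: time setkey() on a fresh copy of the artificial data
# for each requested thread count.
time_setkey_by_threads <- function(n_rows, thread_counts) {
  old_threads <- getDTthreads()
  on.exit(setDTthreads(old_threads))  # restore the previous setting on exit
  rows <- lapply(thread_counts, function(n_threads) {
    setDTthreads(n_threads)
    set.seed(1L)
    dt <- data.table(
      x = as.character(sample(n_rows, n_rows, FALSE)),
      y = runif(n_rows))
    timing <- system.time(setkey(dt, x))
    data.frame(
      threads_requested = n_threads,
      user.self = as.numeric(timing["user.self"]),
      sys.self  = as.numeric(timing["sys.self"]),
      elapsed   = as.numeric(timing["elapsed"]))
  })
  do.call(rbind, rows)
}

# Small run for illustration; use 5e6L to match the dataset used above.
timings <- time_setkey_by_threads(n_rows = 1e5L, thread_counts = c(1L, 2L))
print(timings)
```

If the single-thread timing on a slow version is close to the 1.11.8 timing, the thread count is a plausible contributor; if not, the cause lies elsewhere.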
Session info
Old Windows 2008 legacy server
R version 3.3.3 (2017-03-06)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows Server 2008 R2 x64 (build 7601) Service Pack 1
locale:
[1] LC_COLLATE=French_France.1252 LC_CTYPE=French_France.1252 LC_MONETARY=French_France.1252 LC_NUMERIC=C
[5] LC_TIME=French_France.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] tools_3.3.3
New Windows 10 server
R version 3.6.3 (2020-02-29)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows Server x64 (build 17763)
Matrix products: default
locale:
[1] LC_COLLATE=French_France.1252 LC_CTYPE=French_France.1252 LC_MONETARY=French_France.1252
[4] LC_NUMERIC=C LC_TIME=French_France.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] compiler_3.6.3 parallel_3.6.3 tools_3.6.3 Rcpp_1.0.5 fst_0.9.4
Local computer
R version 3.5.3 (2019-03-11)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1
Matrix products: default
locale:
[1] LC_COLLATE=French_France.1252 LC_CTYPE=French_France.1252
[3] LC_MONETARY=French_France.1252 LC_NUMERIC=C
[5] LC_TIME=French_France.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] Rcpp_1.0.3 rstudioapi_0.11 magrittr_1.5 usethis_1.5.1
[5] devtools_2.2.2 pkgload_1.0.2 R6_2.4.1 rlang_0.4.6
[9] fansi_0.4.1 tools_3.5.3 pkgbuild_1.0.6 data.table_1.13.2
[13] sessioninfo_1.1.1 cli_2.0.2 withr_2.1.2 ellipsis_0.3.1
[17] remotes_2.1.1 yaml_2.2.1 assertthat_0.2.1 digest_0.6.25
[21] rprojroot_1.3-2 crayon_1.3.4 processx_3.4.2 callr_3.4.2
[25] fs_1.3.1 ps_1.3.2 testthat_2.3.1 memoise_1.1.0
[29] glue_1.4.1 compiler_3.5.3 desc_1.2.0 backports_1.1.5
[33] prettyunits_1.1.1