-
Notifications
You must be signed in to change notification settings - Fork 1k
Closed
Labels
HighjoinsUse label:"non-equi joins" for rolling, overlapping, and non-equi joinsUse label:"non-equi joins" for rolling, overlapping, and non-equi joinsperformanceregression
Milestone
Description
I noticed that merging data.tables on a character column is surprisingly slow (compared to dplyr). See the reprex below.
Am I missing something obvious or is there room for improvement? Don't get me wrong: I love data.table and use it daily, but I was very surprised that this kind of merge is so comparatively slow.
Minimal reproducible example
library(data.table)
#> Warning: package 'data.table' was built under R version 3.6.3
library(dplyr)
#> Warning: package 'dplyr' was built under R version 3.6.3
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:data.table':
#>
#> between, first, last
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(microbenchmark)
#> Warning: package 'microbenchmark' was built under R version 3.6.2
dfa_char <- data.table(id = as.character(seq_len(10000)))
dfb_char <- data.table(id = as.character(seq_len(100000)), val = rnorm(100000))
microbenchmark(
data.table = merge.data.table(dfa_char, dfb_char, by = "id", sort = FALSE),
dplyr = inner_join(dfa_char, dfb_char, by = "id"),
times = 1L
)
#> Unit: milliseconds
#> expr min lq mean median uq max neval
#> data.table 1774.640 1774.640 1774.640 1774.640 1774.640 1774.640 1
#> dplyr 35.852 35.852 35.852 35.852 35.852 35.852 1
setkey(dfa_char, id)
setkey(dfb_char, id)
microbenchmark(
data.table = merge.data.table(dfa_char, dfb_char, by = "id", sort = FALSE),
dplyr = inner_join(dfa_char, dfb_char, by = "id"),
times = 1L
)
#> Unit: milliseconds
#> expr min lq mean median uq max neval
#> data.table 6.3679 6.3679 6.3679 6.3679 6.3679 6.3679 1
#> dplyr 56.9153 56.9153 56.9153 56.9153 56.9153 56.9153 1
dfa_int <- data.table(id = seq_len(10000))
dfb_int <- data.table(id = seq_len(100000), val = rnorm(100000))
microbenchmark(
data.table = merge.data.table(dfa_int, dfb_int, by = "id", sort = FALSE),
dplyr = inner_join(dfa_int, dfb_int, by = "id"),
times = 1L
)
#> Unit: milliseconds
#> expr min lq mean median uq max neval
#> data.table 5.4953 5.4953 5.4953 5.4953 5.4953 5.4953 1
#> dplyr 14.6955 14.6955 14.6955 14.6955 14.6955 14.6955 1Created on 2020-10-02 by the reprex package (v0.3.0)
Output of sessionInfo()
devtools::session_info()
#> - Session info ---------------------------------------------------------------
#> setting value
#> version R version 3.6.1 (2019-07-05)
#> os Windows 10 x64
#> system x86_64, mingw32
#> ui RTerm
#> language (EN)
#> collate German_Germany.1252
#> ctype German_Germany.1252
#> tz Europe/Berlin
#> date 2020-10-02
#>
#> - Packages -------------------------------------------------------------------
#> package * version date lib source
#> assertthat 0.2.1 2019-03-21 [1] CRAN (R 3.6.2)
#> backports 1.1.5 2019-10-02 [1] CRAN (R 3.6.1)
#> callr 3.4.3 2020-03-28 [1] CRAN (R 3.6.3)
#> cli 2.0.2 2020-02-28 [1] CRAN (R 3.6.3)
#> crayon 1.3.4 2017-09-16 [1] CRAN (R 3.6.2)
#> data.table * 1.13.0 2020-07-24 [1] CRAN (R 3.6.3)
#> desc 1.2.0 2018-05-01 [1] CRAN (R 3.6.2)
#> devtools 2.3.0 2020-04-10 [1] CRAN (R 3.6.3)
#> digest 0.6.25 2020-02-23 [1] CRAN (R 3.6.3)
#> dplyr * 1.0.0 2020-05-29 [1] CRAN (R 3.6.3)
#> ellipsis 0.3.0 2019-09-20 [1] CRAN (R 3.6.2)
#> evaluate 0.14 2019-05-28 [1] CRAN (R 3.6.2)
#> fansi 0.4.1 2020-01-08 [1] CRAN (R 3.6.2)
#> fs 1.3.1 2019-05-06 [1] CRAN (R 3.6.2)
#> generics 0.0.2 2018-11-29 [1] CRAN (R 3.6.2)
#> glue 1.4.1 2020-05-13 [1] CRAN (R 3.6.3)
#> highr 0.8 2019-03-20 [1] CRAN (R 3.6.2)
#> htmltools 0.4.0 2019-10-04 [1] CRAN (R 3.6.2)
#> knitr 1.28 2020-02-06 [1] CRAN (R 3.6.3)
#> lifecycle 0.2.0 2020-03-06 [1] CRAN (R 3.6.3)
#> magrittr 1.5 2014-11-22 [1] CRAN (R 3.6.2)
#> memoise 1.1.0 2017-04-21 [1] CRAN (R 3.6.2)
#> microbenchmark * 1.4-7 2019-09-24 [1] CRAN (R 3.6.2)
#> pillar 1.4.4 2020-05-05 [1] CRAN (R 3.6.3)
#> pkgbuild 1.0.8 2020-05-07 [1] CRAN (R 3.6.3)
#> pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 3.6.2)
#> pkgload 1.0.2 2018-10-29 [1] CRAN (R 3.6.2)
#> prettyunits 1.1.1 2020-01-24 [1] CRAN (R 3.6.2)
#> processx 3.4.1 2019-07-18 [1] CRAN (R 3.6.2)
#> ps 1.3.0 2018-12-21 [1] CRAN (R 3.6.2)
#> purrr 0.3.3 2019-10-18 [1] CRAN (R 3.6.2)
#> R6 2.4.1 2019-11-12 [1] CRAN (R 3.6.2)
#> Rcpp 1.0.3 2019-11-08 [1] CRAN (R 3.6.2)
#> remotes 2.1.1 2020-02-15 [1] CRAN (R 3.6.3)
#> rlang 0.4.6 2020-05-02 [1] CRAN (R 3.6.3)
#> rmarkdown 2.1 2020-01-20 [1] CRAN (R 3.6.3)
#> rprojroot 1.3-2 2018-01-03 [1] CRAN (R 3.6.2)
#> sessioninfo 1.1.1 2018-11-05 [1] CRAN (R 3.6.2)
#> stringi 1.4.4 2020-01-09 [1] CRAN (R 3.6.2)
#> stringr 1.4.0 2019-02-10 [1] CRAN (R 3.6.2)
#> testthat 2.3.2 2020-03-02 [1] CRAN (R 3.6.3)
#> tibble 3.0.1 2020-04-20 [1] CRAN (R 3.6.3)
#> tidyselect 1.1.0 2020-05-11 [1] CRAN (R 3.6.3)
#> usethis 1.6.0 2020-04-09 [1] CRAN (R 3.6.3)
#> vctrs 0.3.1 2020-06-05 [1] CRAN (R 3.6.3)
#> withr 2.1.2 2018-03-15 [1] CRAN (R 3.6.2)
#> xfun 0.12 2020-01-13 [1] CRAN (R 3.6.2)
#> yaml 2.2.1 2020-02-01 [1] CRAN (R 3.6.2)
#>
#> [1] C:/Program Files/R/libraryPS: Not sure if this is already adressed by #4538.
Metadata
Metadata
Assignees
Labels
HighjoinsUse label:"non-equi joins" for rolling, overlapping, and non-equi joinsUse label:"non-equi joins" for rolling, overlapping, and non-equi joinsperformanceregression