Skip to content

Merging by character columns is very slow #4733

@matthiasgomolka

Description

@matthiasgomolka

I noticed that merging data.tables on a character column is surprisingly slow (compared to dplyr). See the reprex below.

Am I missing something obvious or is there room for improvement? Don't get me wrong: I love data.table and use it daily, but I was very surprised that this kind of merge is so comparatively slow.

Minimal reproducible example

library(data.table)
#> Warning: package 'data.table' was built under R version 3.6.3
library(dplyr)
#> Warning: package 'dplyr' was built under R version 3.6.3
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:data.table':
#> 
#>     between, first, last
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(microbenchmark)
#> Warning: package 'microbenchmark' was built under R version 3.6.2

dfa_char <- data.table(id = as.character(seq_len(10000)))
dfb_char <- data.table(id = as.character(seq_len(100000)), val = rnorm(100000))

microbenchmark(
  data.table = merge.data.table(dfa_char, dfb_char, by = "id", sort = FALSE),
  dplyr = inner_join(dfa_char, dfb_char, by = "id"),
  times = 1L
)
#> Unit: milliseconds
#>        expr      min       lq     mean   median       uq      max neval
#>  data.table 1774.640 1774.640 1774.640 1774.640 1774.640 1774.640     1
#>       dplyr   35.852   35.852   35.852   35.852   35.852   35.852     1

setkey(dfa_char, id)
setkey(dfb_char, id)

microbenchmark(
  data.table = merge.data.table(dfa_char, dfb_char, by = "id", sort = FALSE),
  dplyr = inner_join(dfa_char, dfb_char, by = "id"),
  times = 1L
)
#> Unit: milliseconds
#>        expr     min      lq    mean  median      uq     max neval
#>  data.table  6.3679  6.3679  6.3679  6.3679  6.3679  6.3679     1
#>       dplyr 56.9153 56.9153 56.9153 56.9153 56.9153 56.9153     1

dfa_int <- data.table(id = seq_len(10000))
dfb_int <- data.table(id = seq_len(100000), val = rnorm(100000))

microbenchmark(
  data.table = merge.data.table(dfa_int, dfb_int, by = "id", sort = FALSE),
  dplyr = inner_join(dfa_int, dfb_int, by = "id"),
  times = 1L
)
#> Unit: milliseconds
#>        expr     min      lq    mean  median      uq     max neval
#>  data.table  5.4953  5.4953  5.4953  5.4953  5.4953  5.4953     1
#>       dplyr 14.6955 14.6955 14.6955 14.6955 14.6955 14.6955     1

Created on 2020-10-02 by the reprex package (v0.3.0)

Output of sessionInfo()

devtools::session_info()
#> - Session info ---------------------------------------------------------------
#>  setting  value                       
#>  version  R version 3.6.1 (2019-07-05)
#>  os       Windows 10 x64              
#>  system   x86_64, mingw32             
#>  ui       RTerm                       
#>  language (EN)                        
#>  collate  German_Germany.1252         
#>  ctype    German_Germany.1252         
#>  tz       Europe/Berlin               
#>  date     2020-10-02                  
#> 
#> - Packages -------------------------------------------------------------------
#>  package        * version date       lib source        
#>  assertthat       0.2.1   2019-03-21 [1] CRAN (R 3.6.2)
#>  backports        1.1.5   2019-10-02 [1] CRAN (R 3.6.1)
#>  callr            3.4.3   2020-03-28 [1] CRAN (R 3.6.3)
#>  cli              2.0.2   2020-02-28 [1] CRAN (R 3.6.3)
#>  crayon           1.3.4   2017-09-16 [1] CRAN (R 3.6.2)
#>  data.table     * 1.13.0  2020-07-24 [1] CRAN (R 3.6.3)
#>  desc             1.2.0   2018-05-01 [1] CRAN (R 3.6.2)
#>  devtools         2.3.0   2020-04-10 [1] CRAN (R 3.6.3)
#>  digest           0.6.25  2020-02-23 [1] CRAN (R 3.6.3)
#>  dplyr          * 1.0.0   2020-05-29 [1] CRAN (R 3.6.3)
#>  ellipsis         0.3.0   2019-09-20 [1] CRAN (R 3.6.2)
#>  evaluate         0.14    2019-05-28 [1] CRAN (R 3.6.2)
#>  fansi            0.4.1   2020-01-08 [1] CRAN (R 3.6.2)
#>  fs               1.3.1   2019-05-06 [1] CRAN (R 3.6.2)
#>  generics         0.0.2   2018-11-29 [1] CRAN (R 3.6.2)
#>  glue             1.4.1   2020-05-13 [1] CRAN (R 3.6.3)
#>  highr            0.8     2019-03-20 [1] CRAN (R 3.6.2)
#>  htmltools        0.4.0   2019-10-04 [1] CRAN (R 3.6.2)
#>  knitr            1.28    2020-02-06 [1] CRAN (R 3.6.3)
#>  lifecycle        0.2.0   2020-03-06 [1] CRAN (R 3.6.3)
#>  magrittr         1.5     2014-11-22 [1] CRAN (R 3.6.2)
#>  memoise          1.1.0   2017-04-21 [1] CRAN (R 3.6.2)
#>  microbenchmark * 1.4-7   2019-09-24 [1] CRAN (R 3.6.2)
#>  pillar           1.4.4   2020-05-05 [1] CRAN (R 3.6.3)
#>  pkgbuild         1.0.8   2020-05-07 [1] CRAN (R 3.6.3)
#>  pkgconfig        2.0.3   2019-09-22 [1] CRAN (R 3.6.2)
#>  pkgload          1.0.2   2018-10-29 [1] CRAN (R 3.6.2)
#>  prettyunits      1.1.1   2020-01-24 [1] CRAN (R 3.6.2)
#>  processx         3.4.1   2019-07-18 [1] CRAN (R 3.6.2)
#>  ps               1.3.0   2018-12-21 [1] CRAN (R 3.6.2)
#>  purrr            0.3.3   2019-10-18 [1] CRAN (R 3.6.2)
#>  R6               2.4.1   2019-11-12 [1] CRAN (R 3.6.2)
#>  Rcpp             1.0.3   2019-11-08 [1] CRAN (R 3.6.2)
#>  remotes          2.1.1   2020-02-15 [1] CRAN (R 3.6.3)
#>  rlang            0.4.6   2020-05-02 [1] CRAN (R 3.6.3)
#>  rmarkdown        2.1     2020-01-20 [1] CRAN (R 3.6.3)
#>  rprojroot        1.3-2   2018-01-03 [1] CRAN (R 3.6.2)
#>  sessioninfo      1.1.1   2018-11-05 [1] CRAN (R 3.6.2)
#>  stringi          1.4.4   2020-01-09 [1] CRAN (R 3.6.2)
#>  stringr          1.4.0   2019-02-10 [1] CRAN (R 3.6.2)
#>  testthat         2.3.2   2020-03-02 [1] CRAN (R 3.6.3)
#>  tibble           3.0.1   2020-04-20 [1] CRAN (R 3.6.3)
#>  tidyselect       1.1.0   2020-05-11 [1] CRAN (R 3.6.3)
#>  usethis          1.6.0   2020-04-09 [1] CRAN (R 3.6.3)
#>  vctrs            0.3.1   2020-06-05 [1] CRAN (R 3.6.3)
#>  withr            2.1.2   2018-03-15 [1] CRAN (R 3.6.2)
#>  xfun             0.12    2020-01-13 [1] CRAN (R 3.6.2)
#>  yaml             2.2.1   2020-02-01 [1] CRAN (R 3.6.2)
#> 
#> [1] C:/Program Files/R/library

PS: Not sure if this is already adressed by #4538.

Metadata

Metadata

Assignees

No one assigned

    Labels

    HighjoinsUse label:"non-equi joins" for rolling, overlapping, and non-equi joinsperformanceregression

    Type

    No type

    Projects

    No projects

    Milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions