---
title: "Testing parallel in R"
date: "October 2015"
output:
  html_document:
    fig_width: 15
    fig_height: 4
---
```{r libs_and_global_options, include = FALSE}
library(knitr)
opts_chunk$set(results="hold", warning=FALSE, message=FALSE, cache=TRUE) ## change to FALSE to re-eval whole doc
library(dplyr); library(magrittr); library(ggplot2)
library(parallel)
library(numbers) # for the isPrime function
```
This note has two purposes:
- to test the performance of R's `parallel` package on a basic multicore machine
- to test a wrapper function around `base::lapply` and `parallel::parLapply`
### 1. Definition of varLapply
`varLapply` is meant to be a no-hassle wrapper around `lapply` and `parLapply`, with the same syntax as `lapply` and a simple boolean `use_parallel` argument that can be switched on or off programmatically.
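As an illustration of switching `use_parallel` programmatically, here is a minimal sketch of a rule that could produce the flag. The helper `should_parallelise` and its `min_tasks` threshold of 100 are hypothetical (they are not part of this note's code), and the right threshold would depend on the cost of each task.

```r
# Hypothetical helper (an assumption, not defined elsewhere in this note):
# decide whether a job is large enough to be worth running in parallel.
should_parallelise <- function(n_tasks,
                               min_tasks = 100L,  # arbitrary illustrative threshold
                               cores = parallel::detectCores()) {
  # detectCores() can return NA when the core count cannot be determined
  if (is.na(cores)) cores <- 1L
  n_tasks >= min_tasks && cores > 1L
}

should_parallelise(10, cores = 4)    # FALSE: too few tasks
should_parallelise(1000, cores = 4)  # TRUE
should_parallelise(1000, cores = 1)  # FALSE: only one core available
```

The result could then be passed as the `use_parallel` argument, so that small jobs avoid the overhead of starting workers.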
```{r varLapply_definition}
varLapply <- function(X, FUN,                                        # same syntax as base::lapply
                      use_parallel = TRUE,                           # use parLapply by default
                      number_of_nodes = parallel::detectCores() - 1, # keep 1 core free by default
                      par_apply_function = "parLapply",              # allows switching to mclapply
                      ...) {
  if (!use_parallel) {
    return(lapply(X, FUN, ...))
  }
  # detectCores() may return NA or 1: always use at least one node
  number_of_nodes <- max(number_of_nodes, 1, na.rm = TRUE)
  if (par_apply_function == "parLapply") {
    tmp_cluster <- parallel::makeForkCluster(nnodes = number_of_nodes)
    on.exit(parallel::stopCluster(tmp_cluster)) # stop the workers even if FUN fails
    parallel::clusterExport(cl = tmp_cluster, varlist = c(), envir = parent.frame())
    parallel::parLapply(cl = tmp_cluster, X = X, fun = FUN, ...)
  } else if (par_apply_function == "mclapply") {
    # pass ... through so extra arguments reach FUN, as in the other branches
    parallel::mclapply(X = X, FUN = FUN, ..., mc.cores = number_of_nodes)
  } else {
    stop("unknown par_apply_function: ", par_apply_function)
  }
}
```
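The call `clusterExport(..., envir = parent.frame())` is intended to make the variables, functions and other objects of `varLapply`'s caller available on the workers. Note that with an empty `varlist` it exports nothing: the approach has worked so far because `makeForkCluster` forks the current R process, so the workers already start with a copy of the master's workspace. With a non-fork (PSOCK) cluster, the names to export would have to be listed in `varlist`.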
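To show what `clusterExport` does when the workers are not forked, here is a minimal sketch using a PSOCK cluster (the default type of `makeCluster`), whose workers are fresh R sessions. The names `offset` and `add_offset` are illustrative only, and the example assumes the machine can start two local workers.

```r
library(parallel)

offset <- 10                          # defined on the master only
add_offset <- function(x) x + offset  # looks up `offset` where it runs

cl <- makeCluster(2)                  # PSOCK cluster: workers start empty
# copy `offset` from the current environment to every worker
clusterExport(cl, varlist = "offset", envir = environment())
res <- parLapply(cl, 1:3, add_offset)
stopCluster(cl)

unlist(res)  # 11 12 13
```

With a fork cluster the export step is not needed, which is why the empty `varlist` in `varLapply` causes no visible problem.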
### 2. Test 1: isPrime
```{r prime_test_definition}
library(dplyr); library(magrittr); library(ggplot2)
library(parallel)
library(numbers) # for the isPrime function
prime_test <- function(N,   # a vector, the integers we test for primality
                       K,   # a vector, the lengths of the [l/parL/mcl]apply loops
                       P) { # an integer, the number of times each test is repeated for averaging
  N.l <- length(N)
  K.l <- length(K)
  S.l <- N.l * K.l * P * 2
  tests <- data.frame(matrix(NA, nrow = S.l, ncol = 5))
  colnames(tests) <- c("n",    # the integer tested for primality
                       "k",    # the number of times n is tested
                       "p",    # the index of the repetition (1 to P)
                       "b",    # if 1, uses a parallel version of lapply (stored as 0/1)
                       "time") # the measured test time (elapsed in system.time)
  i <- 1
  for (n in 1:N.l) {
    for (k in 1:K.l) {
      for (p in 1:P) {
        for (b in c(FALSE, TRUE)) {
          tests[i, ] <- c(N[n], K[k], p, b, 0.0)
          i <- i + 1
        }
      }
    }
  }
  # each row of tests corresponds to a test unit where we run
  # isPrime(n) k times using varLapply(..., use_parallel = b);
  # output of system.time (elapsed) is stored in the time column;
  # each case is repeated P times for averaging;
  # we sample test units in random order to reduce potential bias
  for (j in sample(1:S.l)) {
    tests[j, "time"] <- system.time(
      varLapply(X = rep(tests[j, "n"], tests[j, "k"]),
                FUN = isPrime,
                use_parallel = as.logical(tests[j, "b"]), # b is stored as 0/1
                par_apply_function = "mclapply")
    )["elapsed"]
    gc()
  }
  return(tests)
}
```
```{r run_test_1, cache=FALSE}
set.seed(123456)
rdm_primes <- Primes(10^2,10^9) %>% sample(size = 1000)
loop_sizes <- c(10,25,50,100,250,1000)
test_1 <- prime_test(N = rdm_primes, K = loop_sizes, P = 5)
save(test_1, file = "./test_1.Rdata")
```
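Since each test unit is repeated `P` times for averaging, the resulting data frame can be summarised by loop length `k` and by the flag `b`. The sketch below uses a toy data frame with the same columns as the output of `prime_test`. Its timings are made-up placeholders, not measurements, and the names `toy`, `avg` and `speedup` are illustrative.

```r
# Toy stand-in for the output of prime_test(); the times are placeholders,
# NOT measured results.
toy <- data.frame(
  n    = 101,
  k    = rep(c(10, 1000), each = 4),
  p    = rep(1:2, times = 4),
  b    = rep(c(0, 0, 1, 1), times = 2),   # 0 = lapply, 1 = parallel
  time = c(0.010, 0.012, 0.050, 0.054, 0.80, 0.84, 0.30, 0.32)
)

# mean elapsed time for each (loop length, serial/parallel) combination
avg <- aggregate(time ~ k + b, data = toy, FUN = mean)

# put serial and parallel means side by side and compute their ratio
speedup <- merge(avg[avg$b == 0, c("k", "time")],
                 avg[avg$b == 1, c("k", "time")],
                 by = "k", suffixes = c("_serial", "_parallel"))
speedup$ratio <- speedup$time_serial / speedup$time_parallel
speedup
```

A ratio above 1 means the parallel version was faster for that loop length. The same two steps should apply to `test_1` once the real timings are available.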
---
Links that I have found useful:
- http://stackoverflow.com/questions/12019638/using-parallels-parlapply-unable-to-access-variables-within-parallel-code
- https://trinkerrstuff.wordpress.com/2012/08/19/parallelization-speed-up-functions-in-a-package/
- https://stat.ethz.ch/R-manual/R-devel/library/parallel/doc/parallel.pdf
- http://journal.r-project.org/archive/2012-2/RJournal_2012-2_Baaaath.pdf
Links that look useful and I still need to read:
- http://blog.dominodatalab.com/simple-parallelization/
---
R version:
```{r, echo=FALSE}
version[c("platform", "os", "version.string")]
```
```{r}
detectCores()
```
*Author: Alexandre Halm*
*Edited with [RStudio](http://www.rstudio.com/products/RStudio/#Desk)*