A quick review and demonstration of how to get started with parallel computing in R. Includes an example of setting up a SNOW cluster in the departmental lab.
Parallel Computing with R
Literature Seminar
Abhirup Mallik
malli066@umn.edu
School of Statistics
University of Minnesota
November 15, 2013
Why Parallel?
R does not take advantage of multiple cores by default
Does not support passing by reference
Cannot read files dynamically, etc.
What is Parallel?
"Parallel": doing more than one task at the same time.
Use different cores of the same CPU for different tasks.
Use different computers in a cluster for different tasks.
How to go Parallel?
Using Multicore (Implicit Parallelism)
The main process forks into child processes, which run in parallel on different cores.
    library(parallel)
    mclapply(X, FUN, ...)
Or use
    library(parallel)
    # ... setup stuff ...
    for (isplit in 1:nsplit) {
        mcparallel(some R expression involving isplit)
    }
    # In the parallel package the collector is mccollect()
    # (collect() belonged to the older multicore package).
    out <- mccollect()
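As a minimal sketch of the two forking interfaces above, the following runs a toy function first through `mclapply` and then through `mcparallel`/`mccollect`. The function `slow_square`, the input `1:4`, and the choice of two cores are assumptions made for illustration. Forking is not available on Windows, so the sketch falls back to serial execution there.

```r
library(parallel)

# Hypothetical stand-in for an expensive computation.
slow_square <- function(i) {
    Sys.sleep(0.1)
    i^2
}

# Forking is unavailable on Windows, so fall back to a single core there.
on_windows <- .Platform$OS.type == "windows"
n_cores <- if (on_windows) 1L else 2L

# Implicit parallelism: same calling convention as lapply().
res_mclapply <- unlist(mclapply(1:4, slow_square, mc.cores = n_cores))

# Explicit fork-and-collect (Unix-alikes only): start one child per task,
# then gather the results in the order the jobs were created.
if (!on_windows) {
    jobs <- lapply(1:4, function(i) mcparallel(slow_square(i)))
    res_collect <- unlist(mccollect(jobs), use.names = FALSE)
} else {
    res_collect <- unlist(lapply(1:4, slow_square))
}
```

Both `res_mclapply` and `res_collect` hold the squares of 1 to 4. Passing the job list to `mccollect` explicitly, rather than calling it with no arguments, keeps the result order tied to the jobs that were started.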
Warnings:
All child processes compete for memory.
Closing the terminal or any graphical window kills only the parent.
Ctrl + C kills the parent, not the children.
Kill the children if they are unresponsive.
Using SNOW (Explicit Parallelism)
Make a cluster using any one of these options:
    cl <- makeCluster(spec, type, ...)
    cl <- makePSOCKcluster(names, ...)
    cl <- makeForkCluster(nnodes = , ...)
Export essential objects to the cluster (object names are passed as character strings):
    clusterExport(cl, c("var1", "fun1", ...))
Evaluate on the cluster:
    clusterEvalQ(cl, expr)
    parLapply(cl = NULL, X, fun, ...)
    parSapply(cl = NULL, X, fun, ...)
Stop the cluster:
    stopCluster(cl)
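The four steps above (make, export, evaluate, stop) can be sketched end to end on a local socket cluster. The object names `offset` and `add_offset` and the cluster size of two are assumptions chosen for illustration; only functions from the `parallel` package are used.

```r
library(parallel)

# Objects that exist only on the master until exported (hypothetical names).
offset <- 10
add_offset <- function(x) x + offset

# Two local socket workers; the size is an arbitrary choice.
cl <- makePSOCKcluster(2)

# clusterExport() takes the *names* of the objects as a character vector.
# envir = environment() looks them up where this code is running.
clusterExport(cl, c("offset", "add_offset"), envir = environment())

# Run an expression on every worker, e.g. to load a package.
invisible(clusterEvalQ(cl, library(stats)))

# Parallel analogues of lapply() and sapply().
res_list <- parLapply(cl, 1:5, add_offset)
res_vec <- parSapply(cl, 1:5, add_offset)

# Always release the workers when done.
stopCluster(cl)
```

Unlike forked children, socket workers start as fresh R sessions, which is why `offset` has to be exported before `add_offset` can use it.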
Demonstration
Using Swiss fertility data from 1888 (R-base).
    > str(swiss)
    'data.frame': 47 obs. of 6 variables:
     $ Fertility       : num 80.2 83.1 92.5 85.8 76.9 76.1 ...
     $ Agriculture     : num 17 45.1 39.7 36.5 43.5 35.3 ...
     $ Examination     : int 15 6 5 12 17 9 16 14 12 16 ...
     $ Education       : int 12 9 5 7 15 7 7 8 7 13 ...
     $ Catholic        : num 9.96 84.84 93.4 33.77 5.16 ...
     $ Infant.Mortality: num 22.2 22.2 20.2 20.3 20.6 26.6 ...
10-fold cross-validation
    # Assign each row of swiss to one of 10 folds at random.
    fold <- sample(seq(1, 10), size = nrow(swiss), replace = TRUE)
Cross-validation for the i-th fold
    library(randomForest)
    # Fit on all folds except i; return the sum of squared errors on fold i.
    fold.cv <- function(i) {
        train <- swiss[fold != i, ]
        test <- swiss[fold == i, ]
        swiss.rf <- randomForest(sqrt(Fertility) ~ . - Catholic + I(Catholic < 50),
                                 data = train)
        predict.test <- predict(swiss.rf, test, type = "response")
        actual.test <- sqrt(test$Fertility)
        err <- predict.test - actual.test
        sum(err * err)
    }
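Because `sample(..., replace = TRUE)` draws each fold label independently, the folds above can differ noticeably in size, and with only 47 rows a fold could in principle come out empty. One common alternative, shown here as a sketch and not as the presenter's method, is to shuffle a repeated label vector so that fold sizes differ by at most one. The name `fold_balanced` and the seed are assumptions.

```r
set.seed(42)  # arbitrary seed, only for reproducibility

n <- nrow(swiss)  # swiss ships with base R (datasets package)
k <- 10

# Repeat the labels 1..k out to length n, then shuffle them.
fold_balanced <- sample(rep(seq_len(k), length.out = n))

# Every label appears, and fold sizes differ by at most one.
fold_sizes <- table(fold_balanced)
```

Substituting `fold_balanced` for `fold` would leave the rest of the cross-validation code unchanged.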
How to create a cluster?
Create a local cluster of size 4 (parallel socket):
    cl <- makePSOCKcluster(4)
Create a local cluster on different cores of the CPU (8 cores):
    cl <- makeForkCluster(8)
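The sizes 4 and 8 above are specific to the presenter's machine. A sketch of choosing the size from the hardware instead uses `detectCores()` from the `parallel` package. Leaving one core free and capping the demo at two workers are assumptions, not recommendations from the slides.

```r
library(parallel)

# Number of logical cores reported by the OS (can be NA on some platforms).
n_detected <- detectCores()

# Leave one core free, always use at least one worker,
# and cap at 2 here only to keep the demo light.
n_workers <- if (is.na(n_detected)) 1L else max(1L, min(2L, n_detected - 1L))

cl <- makePSOCKcluster(n_workers)
n_started <- length(cl)  # a cluster object is a list with one element per worker
stopCluster(cl)
```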
How to create a cluster in our LAB?
Create a password-less login using ssh-keygen (from the shell):
    ssh-keygen -t dsa
    cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
Check which computers are running:
    grephosts LAB
    # Then ssh once to each computer you want to connect to,
    # and it will be remembered for the session.
Now we are ready to make a cluster:
    library(parallel)
    machines <- c("crab", "sugar", "strike", "hyland", "lovejoy", "driller")
    # Look up the IP address of each host name.
    address <- rapply(lapply(machines, nsl), c)
    cl <- makePSOCKcluster(address)
If you are connecting to stat.umn.edu from your own computer, create a password-less ssh session as follows:
    ssh-keygen -t dsa
    # Then use scp to copy id_dsa.pub to ~/.ssh/authorized_keys
Comparison
On cluster:
    > system.time({
    +     garbage <- clusterEvalQ(cl, data(swiss))
    +     garbage <- clusterEvalQ(cl, library(randomForest))
    +     clusterExport(cl, c("fold", "fold.cv"))
    +     clusterSetRNGStream(cl, 123)
    +     res3 <- do.call(c, parLapply(cl, 1:10, fold.cv))
    +     stopCluster(cl)
    + })
       user  system elapsed
      0.008   0.000   0.838
On Multicore:
    > system.time({
    +     res1 <- do.call(c, mclapply(1:10, fold.cv, mc.cores = 8))
    + })
       user  system elapsed
      0.386   0.162   0.120
Using Fork cluster:
    > system.time({
    +     cl <- makeForkCluster(8)
    +     garbage <- clusterEvalQ(cl, data(swiss))
    +     garbage <- clusterEvalQ(cl, library(randomForest))
    +     clusterExport(cl, c("fold", "fold.cv"))
    +     clusterSetRNGStream(cl, 123)
    +     res3 <- do.call(c, parLapply(cl, 1:10, fold.cv))
    +     stopCluster(cl)
    + })
       user  system elapsed
      0.010   0.054   0.153
Without any parallelization:
    > system.time({
    +     res2 <- do.call(c, lapply(1:10, fold.cv))
    + })
       user  system elapsed
      0.233   0.000   0.235
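In the timings above, the socket cluster has the largest elapsed time (0.838 s against 0.235 s for the serial run), which is consistent with communication overhead dominating such a small job. The same kind of comparison can be sketched without `randomForest` by timing a toy task. The function `toy_task`, the sleep length, and the two-worker cluster are assumptions, and the timings will vary by machine.

```r
library(parallel)

# Hypothetical task: sleep to mimic work, then return a value.
toy_task <- function(i) {
    Sys.sleep(0.05)
    sqrt(i)
}

cl <- makePSOCKcluster(2)  # two workers, an arbitrary choice

# Elapsed wall-clock time for the serial and the parallel run.
# Cluster start-up is deliberately left outside the timed region.
t_serial <- system.time(res_serial <- lapply(1:8, toy_task))["elapsed"]
t_parallel <- system.time(res_parallel <- parLapply(cl, 1:8, toy_task))["elapsed"]

stopCluster(cl)

# Both approaches must give the same answer; only the timing differs.
same_result <- identical(res_serial, res_parallel)
```

Comparing `t_serial` with `t_parallel` for different sleep lengths gives a rough feel for when per-task work is large enough to pay for the overhead.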
When to go Parallel?
When the gain from parallelization is much greater than the cost of data transfer, network delays, etc.
If the problem is embarrassingly parallel: no dependency between the parallel tasks.
Cross-validation and bootstrapping are examples where going parallel works well.
For iterative numerical methods like coordinate descent or Newton-Raphson, where each step depends on the previous one, going parallel may not be possible.
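Bootstrapping, mentioned above as a good candidate, can be sketched on the same `swiss` data: every replicate is independent of the others, so the replicates can be spread across workers in any order. The function `boot_mean`, the 200 replicates, the two-worker cluster, and the seed are assumptions made for illustration.

```r
library(parallel)

# Hypothetical task: one bootstrap replicate of the sample mean.
# The replicate index b is unused; it only drives the iteration.
boot_mean <- function(b, values) {
    mean(sample(values, replace = TRUE))
}

fertility <- swiss$Fertility  # swiss ships with base R
cl <- makePSOCKcluster(2)

# Give each worker its own reproducible random-number stream.
clusterSetRNGStream(cl, 123)

# Replicates do not depend on each other, so they parallelize trivially.
boot_means <- parSapply(cl, 1:200, boot_mean, values = fertility)

stopCluster(cl)

# Bootstrap standard error of the mean.
boot_se <- sd(boot_means)
```

By contrast, a Newton-Raphson update needs the previous iterate before it can compute the next one, so its iterations cannot be handed out to workers this way.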
To infinity and beyond
What is beyond the wall?
Parallelization in a big-data framework: RHadoop
Other, related implementations of parallelization: MPI, NWS, etc.
Other cool libraries: foreach, snowfall, etc.
GPU!
Where to get the code?
All the code in this presentation is available at:
https://github.com/abhirupkgp/parallelseminar/blob/master/cv.R
Acknowledgements and References
Sincere thanks to Charles Geyer
Resourceful slides by Ryan Rosario.
Some other and more resourceful slides.
Parallel R Book