R Analytics in the Cloud

2. Introduction
• Radek Maciaszek
• DataMine Lab (www.dataminelab.com): data mining, business intelligence and data warehouse consultancy.
• MSc in Bioinformatics at Birkbeck, University of London.
• Project at UCL Institute of Healthy Ageing under the supervision of Dr Eugene Schuster.

3. Primer in Bioinformatics
• Bioinformatics: applying computer science to biology (DNA, proteins, drug discovery, etc.)
• Ageing strategy: solve it in a simple organism and apply the findings to more complex organisms (e.g. humans).
• Goal: find the genes responsible for ageing in Caenorhabditis elegans.

4. Central dogma of molecular biology
Genes are encoded by the DNA.
Microarray (100 x 100)
• Database of 50 curated experiments.
• 10k genes compared to each other.
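As a rough sense of scale for comparing 10k genes with each other, the number of comparisons can be worked out directly. This is a back-of-the-envelope sketch that assumes exactly 10,000 genes (the slide only says "10k"); the variable names are illustrative.

```r
# Assumption: exactly 10,000 genes (the slide says "10k").
n.genes <- 10000

# Distinct unordered pairs of genes (gene A vs gene B counted once).
unique.pairs <- choose(n.genes, 2)

# Ordered pairs including each gene against itself, which is what a
# naive all-against-all double loop would compute.
all.pairs <- n.genes^2

# scientific = FALSE keeps the large counts readable.
print(format(unique.pairs, big.mark = ",", scientific = FALSE))
print(format(all.pairs, big.mark = ",", scientific = FALSE))
```

Either way the count is in the tens of millions, which is why the work is split across many machines rather than run in one R session.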

5. Why R?
• Very popular in bioinformatics
• Functional scripting language
• Swiss-army knife for statisticians
• Designed by statisticians for statisticians
• Lots of ready-to-use packages (CRAN)

6. R limitations & Hadoop
• Data needs to fit in memory
• Single-threaded
• Hadoop integration:
  • Hadoop Streaming
  • Rhipe: http://ml.stat.purdue.edu/rhipe/
  • Segue: http://code.google.com/p/segue/
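Of the integration options listed, Hadoop Streaming is the most generic: the mapper is any program that reads lines from standard input and writes tab-separated key/value lines to standard output. A minimal sketch of such a mapper in R follows. The input format ("probe<TAB>comma-separated values") and the function names `map_line` and `run_mapper` are hypothetical, chosen only for illustration.

```r
# Hypothetical input format: "probe<TAB>v1,v2,...".
# Emits "probe<TAB>mean" for one input line.
map_line <- function(line) {
  fields <- strsplit(line, "\t", fixed = TRUE)[[1]]
  values <- as.numeric(strsplit(fields[2], ",", fixed = TRUE)[[1]])
  paste(fields[1], mean(values), sep = "\t")
}

# Read an open connection line by line and write one output line each.
run_mapper <- function(con) {
  while (length(line <- readLines(con, n = 1)) > 0) {
    cat(map_line(line), "\n", sep = "")
  }
}

# In a streaming job the connection would be an opened stdin, e.g.
#   con <- file("stdin", open = "r"); run_mapper(con)
# Here a text connection stands in for it so the sketch runs locally.
demo <- textConnection(c("probeA\t1,2,3", "probeB\t10,20"))
run_mapper(demo)
close(demo)
```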

7. Segue
• Works with Amazon Elastic MapReduce.
• Creates a cluster for you.
• Designed for Big Computations (rather than Big Data).
• Implements a cloud version of the lapply() function.

8. Segue workflow (emrlapply)
[Diagram labels: List (local), List (remote), Amazon AWS]

9. R very quick example
    m <- list(a = 1:10, b = exp(-3:3))
    lapply(m, mean)
    $a
    [1] 5.5

    $b
    [1] 4.535125
lapply(X, FUN) returns a list of the same length as X, each element of which is the result of applying FUN to the corresponding element of X.

10. Segue – large scale example
    # For one probe, test its correlation against every probe in the
    # matrix and return the vector of p-values.
    AnalysePearsonCorelation <- function(probe) {
      A.vector <- experiments.matrix[probe, ]
      p.values <- c()
      for (probe.name in rownames(experiments.matrix)) {
        B.vector <- experiments.matrix[probe.name, ]
        p.values <- c(p.values, cor.test(A.vector, B.vector)$p.value)
      }
      return(p.values)
    }

    pearson.cor <- lapply(probes, AnalysePearsonCorelation)
[Figure label: RNA Probes]
Moving to the cloud in 3 lines of code!
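The slide's function depends on `experiments.matrix` and `probes`, which are not shown. A self-contained sketch that can be run locally is given below. It assumes a tiny random matrix (6 probes, 50 experiments) as a stand-in for the real data, and it swaps the growing `c()` loop for `vapply()`, which is a stylistic choice of this sketch rather than the author's code.

```r
# Toy stand-in for the real data: 6 probes (rows) x 50 experiments (columns).
# The values are random and purely illustrative.
set.seed(42)
experiments.matrix <- matrix(rnorm(6 * 50), nrow = 6,
                             dimnames = list(paste0("probe", 1:6), NULL))
probes <- rownames(experiments.matrix)

# Same computation as on the slide: p-values of the Pearson correlation
# test between one probe and every probe in the matrix. vapply() builds
# the result in one go instead of appending inside a loop.
AnalysePearsonCorelation <- function(probe) {
  A.vector <- experiments.matrix[probe, ]
  vapply(rownames(experiments.matrix), function(probe.name) {
    B.vector <- experiments.matrix[probe.name, ]
    cor.test(A.vector, B.vector)$p.value
  }, numeric(1))
}

# One list element per probe, each holding one p-value per probe.
pearson.cor <- lapply(probes, AnalysePearsonCorelation)
names(pearson.cor) <- probes
```

Each list element is independent of the others, which is what makes the computation a natural fit for an lapply()-style split across machines.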

11. Segue – large scale example
    AnalysePearsonCorelation <- function(probe) {
      A.vector <- experiments.matrix[probe, ]
      p.values <- c()
      for (probe.name in rownames(experiments.matrix)) {
        B.vector <- experiments.matrix[probe.name, ]
        p.values <- c(p.values, cor.test(A.vector, B.vector)$p.value)
      }
      return(p.values)
    }

    # Local version, replaced by the three Segue calls below:
    # pearson.cor <- lapply(probes, AnalysePearsonCorelation)
    myCluster <- createCluster(numInstances = 5,
                               masterBidPrice = "0.68", slaveBidPrice = "0.68",
                               masterInstanceType = "c1.xlarge",
                               slaveInstanceType = "c1.xlarge",
                               copy.image = TRUE)
    pearson.cor <- emrlapply(myCluster, probes, AnalysePearsonCorelation)
    stopCluster(myCluster)
[Figure label: RNA Probes]
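The create / apply / stop pattern can be tried without an AWS account by using R's bundled `parallel` package as a local analogue. This is not Segue and not the author's setup: it is a sketch that assumes two local worker processes and a small random matrix, and it passes the matrix as an explicit argument because local workers start with an empty workspace.

```r
library(parallel)  # ships with R; used here only as a local analogue of Segue

# Illustrative toy data (random values, not from the original study).
set.seed(1)
experiments.matrix <- matrix(rnorm(4 * 30), nrow = 4,
                             dimnames = list(paste0("probe", 1:4), NULL))
probes <- rownames(experiments.matrix)

# Variant of the slide's function that takes the matrix as an argument,
# so each worker receives the data it needs along with the call.
AnalysePearsonCorelation <- function(probe, experiments.matrix) {
  A.vector <- experiments.matrix[probe, ]
  vapply(rownames(experiments.matrix), function(probe.name) {
    cor.test(A.vector, experiments.matrix[probe.name, ])$p.value
  }, numeric(1))
}

# 1. Create the workers (the step createCluster() performs on the slide).
localCluster <- makeCluster(2)

# 2. Apply the function to every list element (the emrlapply() step).
pearson.cor <- parLapply(localCluster, probes, AnalysePearsonCorelation,
                         experiments.matrix = experiments.matrix)

# 3. Shut the workers down (the stopCluster() step).
stopCluster(localCluster)
```

The result has the same shape as the output of a plain lapply() call, so downstream code does not need to know where the work was done.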

12. Discovering genes
Topomaps of clustered genes
This work was based on a similar approach to: A Gene Expression Map for Caenorhabditis elegans, Stuart K. Kim et al., Science 293, 2087 (2001).

13. Conclusions
• R is great for statistics.
• It's easy to scale up R using Segue.
• We are all going to live very long.

14. Thanks!
• Questions?
• References:
  http://code.google.com/r/radek-segue/
  http://www.dataminelab.com

Editor's notes

Check Segue, LISP, R, circle
