SlideShare ist ein Scribd-Unternehmen logo
1 von 14
R Analytics
in the Cloud
Introduction
   Radek Maciaszek
     DataMine Lab (www.dataminelab.com) - Data mining,
      business intelligence and data warehouse
      consultancy.
     MSc in Bioinformatics at Birkbeck, University of
      London.
     Project at UCL Institute of Healthy Ageing under
      supervision of Dr Eugene Schuster.




                                                          2
Primer in Bioinformatics
   Bioinformatics - applying computer
    science to biology (DNA, Proteins,
    Drug discovery, etc)
   Ageing strategy – solve it in simple
    organism and apply findings to more
    complex organisms (i.e. humans).
   Goal: find genes responsible for ageing

Caenorhabditis Elegans
                                              3
Central dogma of molecular biology




Genes are encoded
by the DNA.                                             Microarray
                                                        (100 x 100)
                • Database of 50 curated experiments.
                • 10k genes compare to each other
                                                                      4
Why R?
   Very popular in bioinformatics
   Functional, scripting programming
    language
   Swiss-army knife for statistician
   Designed by statisticians for
    statisticians
   Lots of ready to use packages (CRAN)


                                           5
R limitations & Hadoop
   Data needs to fit in the memory
   Single-threaded
   Hadoop integration:
       Hadoop Streaming
       Rhipe: http://ml.stat.purdue.edu/rhipe/
       Segue: http://code.google.com/p/segue/




                                                  6
Segue
   Works with Amazon Elastic MapReduce.
   Creates a cluster for you.
   Designed for Big Computations (rather than
    Big Data)
   Implements a cloud version of lapply()
    function.



                                                 7
Segue workflow (emrlapply)

List (local)




               List (remote)


                               Amazon AWS   8
R very quick example
m <- list(a = 1:10, b = exp(-3:3))
lapply(m, mean)
$a
[1] 5.5
$b
[1] 4.535125

lapply(X, FUN) returns a list of the same length as X,
each element of which is the result of applying FUN to
the corresponding element of X.
                                                         9
Segue – large scale example
> AnalysePearsonCorelation <- function(probe) {
  A.vector <- experiments.matrix[probe,]
  p.values <- c()
  for(probe.name in rownames(experiments.matrix)) {
     B.vector <- experiments.matrix[probe.name,]
     p.values <- c(p.values, cor.test(A.vector, B.vector)$p.value)
  }
  return (p.values)
}
                                                                     RNA Probes
> pearson.cor <- lapply(probes, AnalysePearsonCorelation)

Moving to the cloud in 3 lines of code!



                                                                                  10
Segue – large scale example
> AnalysePearsonCorelation <- function(probe) {
  A.vector <- experiments.matrix[probe,]
  p.values <- c()
  for(probe.name in rownames(experiments.matrix)) {
     B.vector <- experiments.matrix[probe.name,]
     p.values <- c(p.values, cor.test(A.vector, B.vector)$p.value)
  }
  return (p.values)
}
                                                                     RNA Probes
> # pearson.cor <- lapply(probes, AnalysePearsonCorelation)
> myCluster <- createCluster(numInstances=5, masterBidPrice="0.68”,
               slaveBidPrice="0.68”, masterInstanceType=”c1.xlarge”,
               slaveInstanceType=”c1.xlarge”, copy.image=TRUE)
> pearson.cor <- emrlapply(myCluster, probes, AnalysePearsonCorelation)
> stopCluster(myCluster)                                                          11
Discovering genes




                       Topomaps of clustered genes
This work was based on a similar approach to:
A Gene Expression Map for Caenorhabditis elegans, Stuart K. Kim, et al.,   12
Science 293, 2087 (2001)
Conclusions
   R is great for statistics.
   It’s easy to scale up R using Segue.
   We are all going to live very long.




                                           13
Thanks!
   Questions?

   References:
    http://code.google.com/r/radek-segue/
    http://www.dataminelab.com




                                            14

Weitere ähnliche Inhalte

Andere mochten auch

Extending lifespan with Hadoop and R
Extending lifespan with Hadoop and RExtending lifespan with Hadoop and R
Extending lifespan with Hadoop and R
Radek Maciaszek
 
Real time analytics with Netty, Storm, Kafka
Real time analytics with Netty, Storm, KafkaReal time analytics with Netty, Storm, Kafka
Real time analytics with Netty, Storm, Kafka
Trieu Nguyen
 
Storm: distributed and fault-tolerant realtime computation
Storm: distributed and fault-tolerant realtime computationStorm: distributed and fault-tolerant realtime computation
Storm: distributed and fault-tolerant realtime computation
nathanmarz
 
Realtime Analytics with Storm and Hadoop
Realtime Analytics with Storm and HadoopRealtime Analytics with Storm and Hadoop
Realtime Analytics with Storm and Hadoop
DataWorks Summit
 

Andere mochten auch (11)

Extending lifespan with Hadoop and R
Extending lifespan with Hadoop and RExtending lifespan with Hadoop and R
Extending lifespan with Hadoop and R
 
Experience with Kafka & Storm
Experience with Kafka & StormExperience with Kafka & Storm
Experience with Kafka & Storm
 
Data Stream Algorithms in Storm and R
Data Stream Algorithms in Storm and RData Stream Algorithms in Storm and R
Data Stream Algorithms in Storm and R
 
Real time analytics with Netty, Storm, Kafka
Real time analytics with Netty, Storm, KafkaReal time analytics with Netty, Storm, Kafka
Real time analytics with Netty, Storm, Kafka
 
Resource Aware Scheduling in Apache Storm
Resource Aware Scheduling in Apache StormResource Aware Scheduling in Apache Storm
Resource Aware Scheduling in Apache Storm
 
Scaling Apache Storm - Strata + Hadoop World 2014
Scaling Apache Storm - Strata + Hadoop World 2014Scaling Apache Storm - Strata + Hadoop World 2014
Scaling Apache Storm - Strata + Hadoop World 2014
 
Storm: distributed and fault-tolerant realtime computation
Storm: distributed and fault-tolerant realtime computationStorm: distributed and fault-tolerant realtime computation
Storm: distributed and fault-tolerant realtime computation
 
Realtime Analytics with Storm and Hadoop
Realtime Analytics with Storm and HadoopRealtime Analytics with Storm and Hadoop
Realtime Analytics with Storm and Hadoop
 
Yahoo compares Storm and Spark
Yahoo compares Storm and SparkYahoo compares Storm and Spark
Yahoo compares Storm and Spark
 
Apache Storm 0.9 basic training - Verisign
Apache Storm 0.9 basic training - VerisignApache Storm 0.9 basic training - Verisign
Apache Storm 0.9 basic training - Verisign
 
Hadoop Summit Europe 2014: Apache Storm Architecture
Hadoop Summit Europe 2014: Apache Storm ArchitectureHadoop Summit Europe 2014: Apache Storm Architecture
Hadoop Summit Europe 2014: Apache Storm Architecture
 

Ähnlich wie R Analytics in the Cloud

Extreme Scripting July 2009
Extreme Scripting July 2009Extreme Scripting July 2009
Extreme Scripting July 2009
Ian Foster
 
Phylogenetics Analysis in R
Phylogenetics Analysis in RPhylogenetics Analysis in R
Phylogenetics Analysis in R
Klaus Schliep
 

Ähnlich wie R Analytics in the Cloud (20)

Interactive Analysis of Large-Scale Sequencing Genomics Data Sets using a Rea...
Interactive Analysis of Large-Scale Sequencing Genomics Data Sets using a Rea...Interactive Analysis of Large-Scale Sequencing Genomics Data Sets using a Rea...
Interactive Analysis of Large-Scale Sequencing Genomics Data Sets using a Rea...
 
Multi-omics infrastructure and data for R/Bioconductor
Multi-omics infrastructure and data for R/BioconductorMulti-omics infrastructure and data for R/Bioconductor
Multi-omics infrastructure and data for R/Bioconductor
 
Sharing massive data analysis: from provenance to linked experiment reports
Sharing massive data analysis: from provenance to linked experiment reportsSharing massive data analysis: from provenance to linked experiment reports
Sharing massive data analysis: from provenance to linked experiment reports
 
Extreme Scripting July 2009
Extreme Scripting July 2009Extreme Scripting July 2009
Extreme Scripting July 2009
 
2014 11-13-sbsm032-reproducible research
2014 11-13-sbsm032-reproducible research2014 11-13-sbsm032-reproducible research
2014 11-13-sbsm032-reproducible research
 
Computation and Knowledge
Computation and KnowledgeComputation and Knowledge
Computation and Knowledge
 
Bioinformatics Data Pipelines built by CSIRO on AWS
Bioinformatics Data Pipelines built by CSIRO on AWSBioinformatics Data Pipelines built by CSIRO on AWS
Bioinformatics Data Pipelines built by CSIRO on AWS
 
R Basics
R BasicsR Basics
R Basics
 
Lightning fast genomics with Spark, Adam and Scala
Lightning fast genomics with Spark, Adam and ScalaLightning fast genomics with Spark, Adam and Scala
Lightning fast genomics with Spark, Adam and Scala
 
MMDS 2014: Myria (and Scalable Graph Clustering with RelaxMap)
MMDS 2014: Myria (and Scalable Graph Clustering with RelaxMap)MMDS 2014: Myria (and Scalable Graph Clustering with RelaxMap)
MMDS 2014: Myria (and Scalable Graph Clustering with RelaxMap)
 
Phylogenetics Analysis in R
Phylogenetics Analysis in RPhylogenetics Analysis in R
Phylogenetics Analysis in R
 
User biglm
User biglmUser biglm
User biglm
 
PPT
PPTPPT
PPT
 
High-Dimensional Machine Learning for Medicine
High-Dimensional Machine Learning for MedicineHigh-Dimensional Machine Learning for Medicine
High-Dimensional Machine Learning for Medicine
 
Data mining
Data mining Data mining
Data mining
 
Complementing Computation with Visualization in Genomics
Complementing Computation with Visualization in GenomicsComplementing Computation with Visualization in Genomics
Complementing Computation with Visualization in Genomics
 
Opportunities for X-Ray science in future computing architectures
Opportunities for X-Ray science in future computing architecturesOpportunities for X-Ray science in future computing architectures
Opportunities for X-Ray science in future computing architectures
 
2017 nov reflow sbtb
2017 nov reflow sbtb2017 nov reflow sbtb
2017 nov reflow sbtb
 
Natural language processing for extracting synthesis recipes and applications...
Natural language processing for extracting synthesis recipes and applications...Natural language processing for extracting synthesis recipes and applications...
Natural language processing for extracting synthesis recipes and applications...
 
2015 10-7-11am-reproducible research
2015 10-7-11am-reproducible research2015 10-7-11am-reproducible research
2015 10-7-11am-reproducible research
 

Kürzlich hochgeladen

Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
panagenda
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
WSO2
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Victor Rentea
 

Kürzlich hochgeladen (20)

Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
 
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
 
Ransomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfRansomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdf
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
 

R Analytics in the Cloud

  • 2. Introduction  Radek Maciaszek  DataMine Lab (www.dataminelab.com) - Data mining, business intelligence and data warehouse consultancy.  MSc in Bioinformatics at Birkbeck, University of London.  Project at UCL Institute of Healthy Ageing under supervision of Dr Eugene Schuster. 2
  • 3. Primer in Bioinformatics  Bioinformatics - applying computer science to biology (DNA, Proteins, Drug discovery, etc)  Ageing strategy – solve it in simple organism and apply findings to more complex organisms (i.e. humans).  Goal: find genes responsible for ageing Caenorhabditis Elegans 3
  • 4. Central dogma of molecular biology Genes are encoded by the DNA. Microarray (100 x 100) • Database of 50 curated experiments. • 10k genes compare to each other 4
  • 5. Why R?  Very popular in bioinformatics  Functional, scripting programming language  Swiss-army knife for statistician  Designed by statisticians for statisticians  Lots of ready to use packages (CRAN) 5
  • 6. R limitations & Hadoop  Data needs to fit in the memory  Single-threaded  Hadoop integration:  Hadoop Streaming  Rhipe: http://ml.stat.purdue.edu/rhipe/  Segue: http://code.google.com/p/segue/ 6
  • 7. Segue  Works with Amazon Elastic MapReduce.  Creates a cluster for you.  Designed for Big Computations (rather than Big Data)  Implements a cloud version of lapply() function. 7
  • 8. Segue workflow (emrlapply) List (local) List (remote) Amazon AWS 8
  • 9. R very quick example m <- list(a = 1:10, b = exp(-3:3)) lapply(m, mean) $a [1] 5.5 $b [1] 4.535125 lapply(X, FUN) returns a list of the same length as X, each element of which is the result of applying FUN to the corresponding element of X. 9
  • 10. Segue – large scale example > AnalysePearsonCorelation <- function(probe) { A.vector <- experiments.matrix[probe,] p.values <- c() for(probe.name in rownames(experiments.matrix)) { B.vector <- experiments.matrix[probe.name,] p.values <- c(p.values, cor.test(A.vector, B.vector)$p.value) } return (p.values) } RNA Probes > pearson.cor <- lapply(probes, AnalysePearsonCorelation) Moving to the cloud in 3 lines of code! 10
  • 11. Segue – large scale example > AnalysePearsonCorelation <- function(probe) { A.vector <- experiments.matrix[probe,] p.values <- c() for(probe.name in rownames(experiments.matrix)) { B.vector <- experiments.matrix[probe.name,] p.values <- c(p.values, cor.test(A.vector, B.vector)$p.value) } return (p.values) } RNA Probes > # pearson.cor <- lapply(probes, AnalysePearsonCorelation) > myCluster <- createCluster(numInstances=5, masterBidPrice="0.68”, slaveBidPrice="0.68”, masterInstanceType=”c1.xlarge”, slaveInstanceType=”c1.xlarge”, copy.image=TRUE) > pearson.cor <- emrlapply(myCluster, probes, AnalysePearsonCorelation) > stopCluster(myCluster) 11
  • 12. Discovering genes Topomaps of clustered genes This work was based on a similar approach to: A Gene Expression Map for Caenorhabditis elegans, Stuart K. Kim, et al., 12 Science 293, 2087 (2001)
  • 13. Conclusions  R is great for statistics.  It’s easy to scale up R using Segue.  We are all going to live very long. 13
  • 14. Thanks!  Questions?  References: http://code.google.com/r/radek-segue/ http://www.dataminelab.com 14

Hinweis der Redaktion

  1. Check Segue, LISP, R, circle