Extending lifespan
with R and Hadoop
    Radek Maciaszek
    Founder of DataMine Lab, CTO of
    Ad4Game, studying towards a
    PhD in Bioinformatics at UCL
Agenda
●   Project background
●   Parallel computing in R
●   Hadoop + R
●   Future work (Storm)
●   Results and summary




Project background
●   Lifespan extension: a project at UCL during an MSc in
    Bioinformatics
●   Bioinformatics: computer science applied to biology (DNA,
    proteins, drug discovery, etc.)
●   Institute of Healthy Ageing at UCL: lifespan is king.
    Dozens of scientists, dedicated journals.
●   Ageing is a complex process, or is it? C. elegans lifespan:
    2x from a single gene (DAF-2), up to 10x.
●   Goal of the project: find genes responsible for ageing



                    Caenorhabditis elegans
Primer in Bioinformatics
●   Central dogma of molecular biology
●   Analogy: cell (OS + 3D), gene (program), transcription factor, TF (read head on an HDD)
●   How to find ageing genes (such as DAF-2)?




                                                Images: Wikipedia
RNA microarray




   DAF-2 pathway in C. elegans
   Source: Partridge & Gems, 2002   Source: Staal et al., 2003
Goal: raw data → network
                         Genes Network
                         ● Pairwise comparisons of
                           10k x 10k genes +
                           clustering




   100 x 100 x 50 x 10
     (~10k genes)
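
The "pairwise comparisons + clustering" step can be sketched on a small synthetic matrix. This is only an illustration: it uses Pearson correlation and hierarchical clustering as stand-ins, the matrix `expr` and all sizes are made up, and the project itself may have used a different similarity measure and clustering method.

```r
# Synthetic stand-in for an expression matrix: 20 genes (rows) x 12 experiments (columns).
set.seed(42)
expr <- matrix(rnorm(20 * 12), nrow = 20,
               dimnames = list(paste0("gene", 1:20), paste0("exp", 1:12)))

# cor() works column-wise, so transpose to compare genes with each other.
gene.cor <- cor(t(expr))                 # 20 x 20 gene-by-gene matrix

# Turn similarity into a distance and cluster the genes hierarchically.
gene.dist <- as.dist(1 - gene.cor)
tree <- hclust(gene.dist, method = "average")
clusters <- cutree(tree, k = 4)          # assign each gene to one of 4 clusters
```

The gene-by-gene matrix grows quadratically with the number of genes, which is why the 10k x 10k case becomes the bottleneck.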

Why R?
●   Incredibly powerful for data science with
    big data
●   Functional, scripting programming
    language with many packages.
●   Popular in mathematics, bioinformatics,
    finance, social science and more.
●   TechCrunch lists R as a trendy technology for
    big data.
●   Designed by statisticians for statisticians
R example
K-Means clustering
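require(graphics)

# Two Gaussian clouds of 50 points each, centred at (0, 0) and (1, 1)
x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))
colnames(x) <- c("x", "y")
# Fit k-means with 2 clusters; the outer parentheses print the result
(cl <- kmeans(x, 2))
# Colour points by assigned cluster and mark the cluster centres
plot(x, col = cl$cluster)
points(cl$centers, col = 1:2, pch = 8, cex = 2)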




R limitations & Hadoop
●   Computing 10k x 10k (100 million) pairwise
    Fisher exact tests is slow
●   Memory allocation is a common problem
●   Single-threaded
●   Hadoop integration:
    –   Hadoop Streaming
    –   Rhipe: http://ml.stat.purdue.edu/rhipe/
    –   Segue: http://code.google.com/p/segue/
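
A single pairwise Fisher exact test might look like the sketch below. It assumes, purely for illustration, that each gene has a TRUE/FALSE call per experiment and that the test is run on the 2 x 2 table of co-occurrence; the vectors `geneA` and `geneB` are invented, and the slides do not say how the real contingency tables were built.

```r
# Hypothetical calls: TRUE where a gene is differentially expressed in an experiment.
geneA <- c(TRUE, TRUE, TRUE, FALSE, TRUE, FALSE, TRUE, TRUE, FALSE, FALSE)
geneB <- c(TRUE, TRUE, FALSE, FALSE, TRUE, FALSE, TRUE, TRUE, FALSE, TRUE)

# 2 x 2 contingency table of co-occurrence across the 10 experiments.
# Fixing the factor levels keeps the table 2 x 2 even if a level is absent.
tab <- table(factor(geneA, levels = c(FALSE, TRUE)),
             factor(geneB, levels = c(FALSE, TRUE)))

res <- fisher.test(tab)
res$p.value
```

Each call is cheap on its own; the cost comes from repeating it for every pair of genes in a single R process.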

Scaling R
●   Explicit
    –   snow, parallel, foreach
●   Implicit
    –   multicore (2.14.0)
●   Hadoop
    –   RHIPE, rmr, Segue, RHadoop
●   Storage
    –   rhbase, rredis, Rcassandra, rhdfs
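
As a minimal sketch of the explicit route, the `parallel` package can spread an `lapply()`-style computation over local worker processes. The function `slow.square` and the worker count are placeholders chosen for the example.

```r
library(parallel)

# Placeholder for an expensive, independent per-item computation.
slow.square <- function(i) {
  Sys.sleep(0.01)
  i^2
}

cl <- makeCluster(2)                       # start 2 local worker processes
res <- parLapply(cl, 1:20, slow.square)    # same call shape as lapply()
stopCluster(cl)                            # always release the workers

# Results come back in input order, matching the serial version.
serial <- lapply(1:20, function(i) i^2)
```

Because `parLapply()` mirrors `lapply()`, code written in the apply style can usually be parallelised without restructuring.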

R and Hadoop
  ●   Streaming API (low level)
mapper.R

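#!/usr/bin/env Rscript
# Hadoop Streaming mapper: read CSV lines from stdin, write results to stdout
con <- file("stdin", "r")
while (length(lineStr <- readLines(con, n = 1)) > 0) {
   fields <- unlist(strsplit(lineStr, ","))
   # expensiveCalculations() is a placeholder for the per-record work
   ret <- expensiveCalculations(fields)
   cat(ret, "\n", sep = "")
}
close(con)

hadoop jar hadoop-streaming-*.jar -input data.csv -output data.out -mapper mapper.R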
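
Before submitting a streaming job, the mapper logic can be exercised locally by feeding it an in-memory connection instead of stdin. The sketch below is an assumption-laden example: `run.mapper` is a hypothetical wrapper, and summing the fields stands in for the real per-record calculation.

```r
# Mapper logic factored into a function that reads from any connection.
run.mapper <- function(con) {
  while (length(lineStr <- readLines(con, n = 1)) > 0) {
    fields <- unlist(strsplit(lineStr, ","))
    # Stand-in for the real per-record work: sum the numeric fields.
    ret <- sum(as.numeric(fields))
    cat(ret, "\n", sep = "")
  }
}

# Simulate stdin with a text connection holding two CSV records.
input <- textConnection(c("1,2,3", "4,5,6"))
output <- capture.output(run.mapper(input))
close(input)
```

In the real script the same function would be called with `file("stdin", "r")`, so the local test and the Hadoop job share one code path.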




RHIPE
●   Works with your own Hadoop cluster
●   Write mappers/reducers using R only
map <- expression({
  f <- table(unlist(strsplit(unlist(map.values), " ")))
  n <- names(f)
  p <- as.numeric(f)
  sapply(seq_along(n), function(r) rhcollect(n[r], p[r]))
})

reduce <- expression(
  pre    = { total <- 0 },
  reduce = { total <- total + sum(unlist(reduce.values)) },
  post   = { rhcollect(reduce.key, total) }
)

z <- rhmr(map = map, reduce = reduce,
          inout = c("text", "sequence"),
          ifolder = filename,
          ofolder = sprintf("%s-out", filename))

job.result <- rhstatus(rhex(z, async = TRUE), mon.sec = 2)

Example from Rhipe Wiki
Segue
●   Works with Amazon Elastic MapReduce.
●   Creates a cluster for you.
●   Designed for Big Computations (rather than
    Big Data)
●   Implements a cloud version of lapply()
●   Parallelization in 2 lines of code!
●   Cut our calculation time down to 2 hours
    using 16 servers

Segue workflow (emrlapply)




lapply()
m <- list(a = 1:10, b = exp(-3:3))

lapply(m, mean)
$a
[1] 5.5

$b
[1] 4.535125

lapply(X, FUN)
returns a list of the same length as X, each element of which is
the result of applying FUN to the corresponding element of X.




Segue in a cluster
> AnalysePearsonCorelation <- function(probe) {
  A.vector <- experiments.matrix[probe,]
  p.values <- c()
  for(probe.name in rownames(experiments.matrix)) {
     B.vector <- experiments.matrix[probe.name,]
     p.values <- c(p.values, cor.test(A.vector, B.vector)$p.value)
  }
  return (p.values)
}

> # pearson.cor <- lapply(probes, AnalysePearsonCorelation)

Moving to the cloud in 3 lines of code!




Segue in a cluster
> # pearson.cor <- lapply(probes, AnalysePearsonCorelation)
> myCluster <- createCluster(numInstances=5,
     masterBidPrice="0.68", slaveBidPrice="0.68",
     masterInstanceType="c1.xlarge",
     slaveInstanceType="c1.xlarge", copy.image=TRUE)
> pearson.cor <- emrlapply(myCluster, probes,
   AnalysePearsonCorelation)
> stopCluster(myCluster)
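
Since a cluster costs money while it runs, it can help to dry-run the per-probe function locally on a tiny matrix first. The sketch below is hypothetical: the 5 x 8 `experiments.matrix` is random data, and `AnalysePearsonVapply` is an illustrative rewrite of the function above that uses `vapply()` instead of growing a vector in a loop.

```r
# Tiny synthetic stand-in for the real probes x experiments matrix.
set.seed(1)
experiments.matrix <- matrix(rnorm(5 * 8), nrow = 5,
                             dimnames = list(paste0("probe", 1:5), NULL))
probes <- rownames(experiments.matrix)

# Same computation as AnalysePearsonCorelation: p-values of the Pearson
# correlation test between one probe and every probe in the matrix.
AnalysePearsonVapply <- function(probe) {
  A.vector <- experiments.matrix[probe, ]
  vapply(rownames(experiments.matrix), function(probe.name) {
    cor.test(A.vector, experiments.matrix[probe.name, ])$p.value
  }, numeric(1))
}

# Local, serial run; this is the call that emrlapply() replaces.
pearson.cor <- lapply(probes, AnalysePearsonVapply)
```

Once the local result looks right, the same function can be handed to the cluster call unchanged.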


R + HBase
library(rhbase)
hb.init(serialize="raw")

#create new table
hb.new.table("mytable", "x","y","z",opts=list(y=list(compression='GZ')))

#insert some values into the table
hb.insert("mytable",list( list(1,c("x","y","z"),list("apple","berry","cherry"))))

rows<-hb.scan.ex("mytable",filterstring="ValueFilter(=,'substring:ber')")
rows$get()




    https://github.com/RevolutionAnalytics/RHadoop/wiki/rhbase


Discovering genes
                                         Topomaps of clustered genes




  This work was based on: Stuart K. Kim et al., "A Gene Expression
  Map for Caenorhabditis elegans", Science 293, 2087 (2001)

Gene clusters


                                 Clusters based on pairwise Fisher
                                 exact comparisons of genes




    Green lines represent random probes
    Red lines represent up-regulated probes
    Blue lines are down-regulated probes
    (in daf-2 vs daf-2;daf-16 experiment)
Gene networks




    Network created with Cytoscape, platform
    for complex network analysis:
    http://www.cytoscape.org/
Future work - real time R
●   Hadoop has high throughput but is slow for
    small tasks, so it is a poor fit for continuous
    calculations.
●   A possible solution is to use Storm
●   Storm multilang can be used with any
    language, including R




Storm R

                                   Storm may be easily integrated with
                                   third party languages and databases:

                                   ●   Java
                                   ●   Python
                                   ●   Ruby

                                   ●   Redis
                                   ●   HBase
                                   ●   Cassandra



 Image source: Storm github wiki



Storm R
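 source("storm.R")

 # Called once when the bolt starts
 initialize <- function()
 {
    emitBolt(list("bolt initializing"))
 }

 # Called for every incoming tuple
 process <- function(tup)
 {
   word <- tup$tuple
   rand <- runif(1)
   if (rand < 0.75) {
       # R has no "+" for strings; use paste0() to concatenate
       emitBolt(list(paste0(word, "lalala")))
   } else {
       # log() is assumed to be the logging helper from storm.R
       log(paste0(word, " randomly skipped!"))
   }
 }

 boltRun(process, initialize)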

                          https://github.com/rathko/storm
Summary
●   It’s easy to scale R using Hadoop.
●   R is not only great for statistics, it is a versatile
    programming language.
●   Is ageing a disease? Are we all going to live very long
    lives?




Questions?
●   References:
    http://hadoop.apache.org/
    http://hbase.apache.org/
    http://code.google.com/p/segue/
    http://www.datadr.org/
    https://github.com/RevolutionAnalytics/
    https://github.com/rathko/storm




                                              26

Weitere ähnliche Inhalte

Was ist angesagt?

Wprowadzenie do technologii Big Data / Intro to Big Data Ecosystem
Wprowadzenie do technologii Big Data / Intro to Big Data EcosystemWprowadzenie do technologii Big Data / Intro to Big Data Ecosystem
Wprowadzenie do technologii Big Data / Intro to Big Data EcosystemSages
 
Hypertable - massively scalable nosql database
Hypertable - massively scalable nosql databaseHypertable - massively scalable nosql database
Hypertable - massively scalable nosql databasebigdatagurus_meetup
 
Hypertable
HypertableHypertable
Hypertablebetaisao
 
Python for R Users
Python for R UsersPython for R Users
Python for R UsersAjay Ohri
 
Database Architectures and Hypertable
Database Architectures and HypertableDatabase Architectures and Hypertable
Database Architectures and Hypertablehypertable
 
Practical Hadoop using Pig
Practical Hadoop using PigPractical Hadoop using Pig
Practical Hadoop using PigDavid Wellman
 
Introducción a hadoop
Introducción a hadoopIntroducción a hadoop
Introducción a hadoopdatasalt
 
Tokyo Cabinet & Tokyo Tyrant
Tokyo Cabinet & Tokyo TyrantTokyo Cabinet & Tokyo Tyrant
Tokyo Cabinet & Tokyo Tyrant輝 子安
 
GraphFrames Access Methods in DSE Graph
GraphFrames Access Methods in DSE GraphGraphFrames Access Methods in DSE Graph
GraphFrames Access Methods in DSE GraphJim Hatcher
 
Apache pig presentation_siddharth_mathur
Apache pig presentation_siddharth_mathurApache pig presentation_siddharth_mathur
Apache pig presentation_siddharth_mathurSiddharth Mathur
 
Spark 4th Meetup Londond - Building a Product with Spark
Spark 4th Meetup Londond - Building a Product with SparkSpark 4th Meetup Londond - Building a Product with Spark
Spark 4th Meetup Londond - Building a Product with Sparksamthemonad
 
20141111 파이썬으로 Hadoop MR프로그래밍
20141111 파이썬으로 Hadoop MR프로그래밍20141111 파이썬으로 Hadoop MR프로그래밍
20141111 파이썬으로 Hadoop MR프로그래밍Tae Young Lee
 
Hadoop Pig: MapReduce the easy way!
Hadoop Pig: MapReduce the easy way!Hadoop Pig: MapReduce the easy way!
Hadoop Pig: MapReduce the easy way!Nathan Bijnens
 
Apache pig presentation_siddharth_mathur
Apache pig presentation_siddharth_mathurApache pig presentation_siddharth_mathur
Apache pig presentation_siddharth_mathurSiddharth Mathur
 
Parquet - Data I/O - Philadelphia 2013
Parquet - Data I/O - Philadelphia 2013Parquet - Data I/O - Philadelphia 2013
Parquet - Data I/O - Philadelphia 2013larsgeorge
 
Introduction to Pig & Pig Latin | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to Pig & Pig Latin | Big Data Hadoop Spark Tutorial | CloudxLabIntroduction to Pig & Pig Latin | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to Pig & Pig Latin | Big Data Hadoop Spark Tutorial | CloudxLabCloudxLab
 
How to use Parquet as a basis for ETL and analytics
How to use Parquet as a basis for ETL and analyticsHow to use Parquet as a basis for ETL and analytics
How to use Parquet as a basis for ETL and analyticsJulien Le Dem
 

Was ist angesagt? (20)

Wprowadzenie do technologii Big Data / Intro to Big Data Ecosystem
Wprowadzenie do technologii Big Data / Intro to Big Data EcosystemWprowadzenie do technologii Big Data / Intro to Big Data Ecosystem
Wprowadzenie do technologii Big Data / Intro to Big Data Ecosystem
 
Hypertable - massively scalable nosql database
Hypertable - massively scalable nosql databaseHypertable - massively scalable nosql database
Hypertable - massively scalable nosql database
 
Hypertable
HypertableHypertable
Hypertable
 
Python for R Users
Python for R UsersPython for R Users
Python for R Users
 
Database Architectures and Hypertable
Database Architectures and HypertableDatabase Architectures and Hypertable
Database Architectures and Hypertable
 
Practical Hadoop using Pig
Practical Hadoop using PigPractical Hadoop using Pig
Practical Hadoop using Pig
 
Introducción a hadoop
Introducción a hadoopIntroducción a hadoop
Introducción a hadoop
 
Spark training-in-bangalore
Spark training-in-bangaloreSpark training-in-bangalore
Spark training-in-bangalore
 
Tokyo Cabinet & Tokyo Tyrant
Tokyo Cabinet & Tokyo TyrantTokyo Cabinet & Tokyo Tyrant
Tokyo Cabinet & Tokyo Tyrant
 
GraphFrames Access Methods in DSE Graph
GraphFrames Access Methods in DSE GraphGraphFrames Access Methods in DSE Graph
GraphFrames Access Methods in DSE Graph
 
Apache pig presentation_siddharth_mathur
Apache pig presentation_siddharth_mathurApache pig presentation_siddharth_mathur
Apache pig presentation_siddharth_mathur
 
Spark 4th Meetup Londond - Building a Product with Spark
Spark 4th Meetup Londond - Building a Product with SparkSpark 4th Meetup Londond - Building a Product with Spark
Spark 4th Meetup Londond - Building a Product with Spark
 
20141111 파이썬으로 Hadoop MR프로그래밍
20141111 파이썬으로 Hadoop MR프로그래밍20141111 파이썬으로 Hadoop MR프로그래밍
20141111 파이썬으로 Hadoop MR프로그래밍
 
Apache Drill
Apache DrillApache Drill
Apache Drill
 
Hadoop Pig: MapReduce the easy way!
Hadoop Pig: MapReduce the easy way!Hadoop Pig: MapReduce the easy way!
Hadoop Pig: MapReduce the easy way!
 
Python for R users
Python for R usersPython for R users
Python for R users
 
Apache pig presentation_siddharth_mathur
Apache pig presentation_siddharth_mathurApache pig presentation_siddharth_mathur
Apache pig presentation_siddharth_mathur
 
Parquet - Data I/O - Philadelphia 2013
Parquet - Data I/O - Philadelphia 2013Parquet - Data I/O - Philadelphia 2013
Parquet - Data I/O - Philadelphia 2013
 
Introduction to Pig & Pig Latin | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to Pig & Pig Latin | Big Data Hadoop Spark Tutorial | CloudxLabIntroduction to Pig & Pig Latin | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to Pig & Pig Latin | Big Data Hadoop Spark Tutorial | CloudxLab
 
How to use Parquet as a basis for ETL and analytics
How to use Parquet as a basis for ETL and analyticsHow to use Parquet as a basis for ETL and analytics
How to use Parquet as a basis for ETL and analytics
 

Andere mochten auch

R Analytics in the Cloud
R Analytics in the CloudR Analytics in the Cloud
R Analytics in the CloudDataMine Lab
 
Data Stream Algorithms in Storm and R
Data Stream Algorithms in Storm and RData Stream Algorithms in Storm and R
Data Stream Algorithms in Storm and RRadek Maciaszek
 
Scaling Apache Storm - Strata + Hadoop World 2014
Scaling Apache Storm - Strata + Hadoop World 2014Scaling Apache Storm - Strata + Hadoop World 2014
Scaling Apache Storm - Strata + Hadoop World 2014P. Taylor Goetz
 
Storm: distributed and fault-tolerant realtime computation
Storm: distributed and fault-tolerant realtime computationStorm: distributed and fault-tolerant realtime computation
Storm: distributed and fault-tolerant realtime computationnathanmarz
 
Realtime Analytics with Storm and Hadoop
Realtime Analytics with Storm and HadoopRealtime Analytics with Storm and Hadoop
Realtime Analytics with Storm and HadoopDataWorks Summit
 
Apache Storm 0.9 basic training - Verisign
Apache Storm 0.9 basic training - VerisignApache Storm 0.9 basic training - Verisign
Apache Storm 0.9 basic training - VerisignMichael Noll
 
Hadoop Summit Europe 2014: Apache Storm Architecture
Hadoop Summit Europe 2014: Apache Storm ArchitectureHadoop Summit Europe 2014: Apache Storm Architecture
Hadoop Summit Europe 2014: Apache Storm ArchitectureP. Taylor Goetz
 

Andere mochten auch (9)

R Analytics in the Cloud
R Analytics in the CloudR Analytics in the Cloud
R Analytics in the Cloud
 
Data Stream Algorithms in Storm and R
Data Stream Algorithms in Storm and RData Stream Algorithms in Storm and R
Data Stream Algorithms in Storm and R
 
Resource Aware Scheduling in Apache Storm
Resource Aware Scheduling in Apache StormResource Aware Scheduling in Apache Storm
Resource Aware Scheduling in Apache Storm
 
Scaling Apache Storm - Strata + Hadoop World 2014
Scaling Apache Storm - Strata + Hadoop World 2014Scaling Apache Storm - Strata + Hadoop World 2014
Scaling Apache Storm - Strata + Hadoop World 2014
 
Storm: distributed and fault-tolerant realtime computation
Storm: distributed and fault-tolerant realtime computationStorm: distributed and fault-tolerant realtime computation
Storm: distributed and fault-tolerant realtime computation
 
Realtime Analytics with Storm and Hadoop
Realtime Analytics with Storm and HadoopRealtime Analytics with Storm and Hadoop
Realtime Analytics with Storm and Hadoop
 
Yahoo compares Storm and Spark
Yahoo compares Storm and SparkYahoo compares Storm and Spark
Yahoo compares Storm and Spark
 
Apache Storm 0.9 basic training - Verisign
Apache Storm 0.9 basic training - VerisignApache Storm 0.9 basic training - Verisign
Apache Storm 0.9 basic training - Verisign
 
Hadoop Summit Europe 2014: Apache Storm Architecture
Hadoop Summit Europe 2014: Apache Storm ArchitectureHadoop Summit Europe 2014: Apache Storm Architecture
Hadoop Summit Europe 2014: Apache Storm Architecture
 

Ähnlich wie Extending lifespan with Hadoop and R

RR & Docker @ MuensteR Meetup (Sep 2017)
RR & Docker @ MuensteR Meetup (Sep 2017)RR & Docker @ MuensteR Meetup (Sep 2017)
RR & Docker @ MuensteR Meetup (Sep 2017)Daniel Nüst
 
Next generation analytics with yarn, spark and graph lab
Next generation analytics with yarn, spark and graph labNext generation analytics with yarn, spark and graph lab
Next generation analytics with yarn, spark and graph labImpetus Technologies
 
Spark 计算模型
Spark 计算模型Spark 计算模型
Spark 计算模型wang xing
 
Hadoop trainingin bangalore
Hadoop trainingin bangaloreHadoop trainingin bangalore
Hadoop trainingin bangaloreappaji intelhunt
 
MLconf NYC Shan Shan Huang
MLconf NYC Shan Shan HuangMLconf NYC Shan Shan Huang
MLconf NYC Shan Shan HuangMLconf
 
IIUG 2016 Gathering Informix data into R
IIUG 2016 Gathering Informix data into RIIUG 2016 Gathering Informix data into R
IIUG 2016 Gathering Informix data into RKevin Smith
 
introtorandrstudio.ppt
introtorandrstudio.pptintrotorandrstudio.ppt
introtorandrstudio.pptMalkaParveen3
 
Apache Spark: What? Why? When?
Apache Spark: What? Why? When?Apache Spark: What? Why? When?
Apache Spark: What? Why? When?Massimo Schenone
 
Rattle Graphical Interface for R Language
Rattle Graphical Interface for R LanguageRattle Graphical Interface for R Language
Rattle Graphical Interface for R LanguageMajid Abdollahi
 
Rstudio is an integrated development environment for R that allows users to i...
Rstudio is an integrated development environment for R that allows users to i...Rstudio is an integrated development environment for R that allows users to i...
Rstudio is an integrated development environment for R that allows users to i...SWAROOP KUMAR K
 
Hadoop trainting-in-hyderabad@kelly technologies
Hadoop trainting-in-hyderabad@kelly technologiesHadoop trainting-in-hyderabad@kelly technologies
Hadoop trainting-in-hyderabad@kelly technologiesKelly Technologies
 
Microsoft R - Data Science at Scale
Microsoft R - Data Science at ScaleMicrosoft R - Data Science at Scale
Microsoft R - Data Science at ScaleSascha Dittmann
 
Overview of Apache SystemML by Berthold Reinwald and Nakul Jindal
Overview of Apache SystemML by Berthold Reinwald and Nakul JindalOverview of Apache SystemML by Berthold Reinwald and Nakul Jindal
Overview of Apache SystemML by Berthold Reinwald and Nakul JindalArvind Surve
 
Overview of Apache SystemML by Berthold Reinwald and Nakul Jindal
Overview of Apache SystemML by Berthold Reinwald and Nakul JindalOverview of Apache SystemML by Berthold Reinwald and Nakul Jindal
Overview of Apache SystemML by Berthold Reinwald and Nakul JindalArvind Surve
 

Ähnlich wie Extending lifespan with Hadoop and R (20)

User biglm
User biglmUser biglm
User biglm
 
RR & Docker @ MuensteR Meetup (Sep 2017)
RR & Docker @ MuensteR Meetup (Sep 2017)RR & Docker @ MuensteR Meetup (Sep 2017)
RR & Docker @ MuensteR Meetup (Sep 2017)
 
Next generation analytics with yarn, spark and graph lab
Next generation analytics with yarn, spark and graph labNext generation analytics with yarn, spark and graph lab
Next generation analytics with yarn, spark and graph lab
 
Spark 计算模型
Spark 计算模型Spark 计算模型
Spark 计算模型
 
Hadoop trainingin bangalore
Hadoop trainingin bangaloreHadoop trainingin bangalore
Hadoop trainingin bangalore
 
Cloud jpl
Cloud jplCloud jpl
Cloud jpl
 
R Language Introduction
R Language IntroductionR Language Introduction
R Language Introduction
 
MLconf NYC Shan Shan Huang
MLconf NYC Shan Shan HuangMLconf NYC Shan Shan Huang
MLconf NYC Shan Shan Huang
 
IIUG 2016 Gathering Informix data into R
IIUG 2016 Gathering Informix data into RIIUG 2016 Gathering Informix data into R
IIUG 2016 Gathering Informix data into R
 
introtorandrstudio.ppt
introtorandrstudio.pptintrotorandrstudio.ppt
introtorandrstudio.ppt
 
Apache Spark: What? Why? When?
Apache Spark: What? Why? When?Apache Spark: What? Why? When?
Apache Spark: What? Why? When?
 
Rattle Graphical Interface for R Language
Rattle Graphical Interface for R LanguageRattle Graphical Interface for R Language
Rattle Graphical Interface for R Language
 
Rstudio is an integrated development environment for R that allows users to i...
Rstudio is an integrated development environment for R that allows users to i...Rstudio is an integrated development environment for R that allows users to i...
Rstudio is an integrated development environment for R that allows users to i...
 
Hadoop trainting-in-hyderabad@kelly technologies
Hadoop trainting-in-hyderabad@kelly technologiesHadoop trainting-in-hyderabad@kelly technologies
Hadoop trainting-in-hyderabad@kelly technologies
 
Collections forceawakens
Collections forceawakensCollections forceawakens
Collections forceawakens
 
An Intoduction to R
An Intoduction to RAn Intoduction to R
An Intoduction to R
 
Microsoft R - Data Science at Scale
Microsoft R - Data Science at ScaleMicrosoft R - Data Science at Scale
Microsoft R - Data Science at Scale
 
Overview of Apache SystemML by Berthold Reinwald and Nakul Jindal
Overview of Apache SystemML by Berthold Reinwald and Nakul JindalOverview of Apache SystemML by Berthold Reinwald and Nakul Jindal
Overview of Apache SystemML by Berthold Reinwald and Nakul Jindal
 
Overview of Apache SystemML by Berthold Reinwald and Nakul Jindal
Overview of Apache SystemML by Berthold Reinwald and Nakul JindalOverview of Apache SystemML by Berthold Reinwald and Nakul Jindal
Overview of Apache SystemML by Berthold Reinwald and Nakul Jindal
 
Lrz kurse: r as superglue
Lrz kurse: r as superglueLrz kurse: r as superglue
Lrz kurse: r as superglue
 

Kürzlich hochgeladen

Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Jeffrey Haguewood
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MIND CTI
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAndrey Devyatkin
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...DianaGray10
 
Spring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUKSpring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUKJago de Vreede
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Victor Rentea
 
WSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusZilliz
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024The Digital Insurer
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...apidays
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfOrbitshub
 
JohnPollard-hybrid-app-RailsConf2024.pptx
JohnPollard-hybrid-app-RailsConf2024.pptxJohnPollard-hybrid-app-RailsConf2024.pptx
JohnPollard-hybrid-app-RailsConf2024.pptxJohnPollard37
 
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot ModelMcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot ModelDeepika Singh
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesrafiqahmad00786416
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoffsammart93
 
Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)Zilliz
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businesspanagenda
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyKhushali Kathiriya
 

Kürzlich hochgeladen (20)

Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...

Extending lifespan with Hadoop and R

  • 1. Extending lifespan with R and Hadoop
      Radek Maciaszek, Founder of DataMine Lab, CTO of Ad4Game, studying towards a PhD in Bioinformatics at UCL
  • 2. Agenda
      ● Project background
      ● Parallel computing in R
      ● Hadoop + R
      ● Future work (Storm)
      ● Results and summary
  • 3. Project background
      ● Lifespan extension: project at UCL during an MSc in Bioinformatics
      ● Bioinformatics: computer science applied to biology (DNA, proteins, drug discovery, etc.)
      ● Institute of Healthy Ageing at UCL: lifespan is king. Dozens of scientists, dedicated journals.
      ● Ageing is a complex process, or is it? In C. elegans, lifespan extends 2x through a single gene (DAF-2), and up to 10x.
      ● Goal of the project: find genes responsible for ageing
      [Image: Caenorhabditis elegans]
  • 4. Primer in Bioinformatics
      ● Central dogma of molecular biology
      ● Analogy: cell = OS + 3D, gene = program, transcription factor (TF) = read head on an HDD
      ● How to find ageing genes (such as DAF-2)?
      [Images: Wikipedia]
  • 5. RNA microarray
      [Figures: DAF-2 pathway in C. elegans, source: Partridge & Gems, 2002; second image, source: Staal et al., 2003]
  • 6. Goal: raw data → network
      Gene network
      ● Pairwise comparisons of 10k x 10k genes + clustering
      Raw data: 100 x 100 x 50 x 10 (~10k genes)
  • 7. Why R?
      ● Incredibly powerful for data science with big data
      ● Functional scripting language with many packages
      ● Popular in mathematics, bioinformatics, finance, social science and more
      ● TechCrunch lists R as a trendy technology for Big Data
      ● Designed by statisticians for statisticians
  • 8. R example: K-Means clustering
      require(graphics)
      x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
                 matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))
      colnames(x) <- c("x", "y")
      (cl <- kmeans(x, 2))
      plot(x, col = cl$cluster)
      points(cl$centers, col = 1:2, pch = 8, cex = 2)
  • 9. R limitations & Hadoop
      ● Running 10k x 10k (100 million) Fisher exact tests is slow
      ● Memory allocation is a common problem
      ● R is single-threaded
      ● Hadoop integration:
          – Hadoop Streaming
          – Rhipe: http://ml.stat.purdue.edu/rhipe/
          – Segue: http://code.google.com/p/segue/
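To give a feel for the workload on slide 9, here is a minimal sketch of a single pairwise comparison and of the size of the full comparison grid. The slides do not say how the contingency tables were built, so the binarisation used here (up-regulated or not in each experiment) and the toy vectors `gene.a` and `gene.b` are assumptions made only for illustration.

```r
# Hypothetical toy data: TRUE means the gene is up-regulated in that experiment.
gene.a <- c(TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE)
gene.b <- c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE)

# 2 x 2 contingency table; explicit levels keep both rows and columns
# even if one of the states never occurs.
contingency <- table(factor(gene.a, levels = c(FALSE, TRUE)),
                     factor(gene.b, levels = c(FALSE, TRUE)))

# One Fisher exact test for one pair of genes (stats::fisher.test).
p.value <- fisher.test(contingency)$p.value

# Size of the problem for ~10k genes.
n.genes <- 10000
all.cells <- n.genes^2             # full 10k x 10k grid: 100 million tests
unique.pairs <- choose(n.genes, 2) # unordered pairs only: 49,995,000 tests
```

Even when only the unordered pairs are tested, roughly 50 million calls remain, which is why a single R process struggles with this step.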
  • 10. Scaling R
      ● Explicit: snow, parallel, foreach
      ● Implicit: multicore (2.14.0)
      ● Hadoop: RHIPE, rmr, Segue, RHadoop
      ● Storage: rhbase, rredis, Rcassandra, rhdfs
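As a small illustration of the single-machine options on slide 10, the sketch below uses `mclapply()` from the `parallel` package, which has shipped with R since 2.14.0. The function `slow.square` and the core count are hypothetical placeholders; `mclapply()` relies on forking, so on Windows it only runs with one core.

```r
library(parallel)

# Hypothetical stand-in for an expensive per-item computation.
slow.square <- function(x) {
  Sys.sleep(0.01)
  x^2
}

inputs <- 1:8

# Serial baseline.
serial <- lapply(inputs, slow.square)

# Same call shape, spread over forked worker processes.
cores <- if (.Platform$OS.type == "windows") 1L else 2L
forked <- mclapply(inputs, slow.square, mc.cores = cores)
```

The point of the sketch is that the parallel version keeps the `lapply()` interface, so existing code changes very little.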
  • 11. R and Hadoop
      ● Streaming API (low level)
      mapper.R:
      #!/usr/bin/env Rscript
      # "in" is a reserved word in R, so the connection is named "input"
      input <- file("stdin", "r")
      # read one line at a time until stdin is exhausted
      while (length(lineStr <- readLines(input, n = 1)) > 0) {
        line <- unlist(strsplit(lineStr, ","))
        ret <- expensiveCalculations(line)
        cat(ret, "\n", sep = "")
      }
      close(input)
      Running the job:
      hadoop jar hadoop-streaming-*.jar -input data.csv -output data.out -mapper mapper.R
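A streaming mapper only reads lines and prints lines, so its logic can be dry-run locally before it is submitted to Hadoop. The sketch below assumes a comma-separated record whose first field is a key; `expensive.calculation` is a hypothetical placeholder for the real per-record work, and the tab-separated key/value output follows the usual Hadoop Streaming convention.

```r
# Hypothetical stand-in for the expensive per-record computation:
# sums the numeric fields that follow the key.
expensive.calculation <- function(fields) sum(as.numeric(fields[-1]))

# Reads records from an open connection and prints "key<TAB>value" lines.
run.mapper <- function(input) {
  while (length(line.str <- readLines(input, n = 1)) > 0) {
    fields <- unlist(strsplit(line.str, ",", fixed = TRUE))
    cat(fields[1], "\t", expensive.calculation(fields), "\n", sep = "")
  }
}

# Local dry run on an in-memory connection instead of Hadoop's stdin.
demo.input <- textConnection(c("geneA,1,2,3", "geneB,4,5"))
demo.output <- capture.output(run.mapper(demo.input))
close(demo.input)
```

In the real mapper script the same function would be called on an opened standard input connection, for example `run.mapper(file("stdin", "r"))`.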
  • 12. RHIPE
      ● Can be used with your own Hadoop cluster
      ● Write mappers and reducers using R only
      Example from the Rhipe wiki (word count):
      map <- expression({
        f <- table(unlist(strsplit(unlist(map.values), " ")))
        n <- names(f)
        p <- as.numeric(f)
        sapply(seq_along(n), function(r) rhcollect(n[r], p[r]))
      })
      reduce <- expression(
        pre = { total <- 0 },
        reduce = { total <- total + sum(unlist(reduce.values)) },
        post = { rhcollect(reduce.key, total) }
      )
      z <- rhmr(map = map, reduce = reduce,
                inout = c("text", "sequence"),
                ifolder = filename,
                ofolder = sprintf("%s-out", filename))
      job.result <- rhstatus(rhex(z, async = TRUE), mon.sec = 2)
  • 13. Segue
      ● Works with Amazon Elastic MapReduce
      ● Creates a cluster for you
      ● Designed for Big Computations (rather than Big Data)
      ● Implements a cloud version of lapply()
      ● Parallelization in 2 lines of code!
      ● Cut our calculation time down to 2 h using 16 servers
  • 15. lapply()
      m <- list(a = 1:10, b = exp(-3:3))
      lapply(m, mean)
      $a
      [1] 5.5
      $b
      [1] 4.535125
      lapply(X, FUN) returns a list of the same length as X, each element of which is the result of applying FUN to the corresponding element of X.
  • 16. Segue in a cluster
      > AnalysePearsonCorelation <- function(probe) {
          A.vector <- experiments.matrix[probe,]
          p.values <- c()
          for (probe.name in rownames(experiments.matrix)) {
            B.vector <- experiments.matrix[probe.name,]
            p.values <- c(p.values, cor.test(A.vector, B.vector)$p.value)
          }
          return (p.values)
        }
      > # pearson.cor <- lapply(probes, AnalysePearsonCorelation)
      Moving to the cloud in 3 lines of code!
  • 17. Segue in a cluster
      The same AnalysePearsonCorelation function as on slide 16, with the local lapply() call replaced by:
      > myCluster <- createCluster(numInstances=5,
          masterBidPrice="0.68", slaveBidPrice="0.68",
          masterInstanceType="c1.xlarge", slaveInstanceType="c1.xlarge",
          copy.image=TRUE)
      > pearson.cor <- emrlapply(myCluster, probes, AnalysePearsonCorelation)
      > stopCluster(myCluster)
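Segue needs an Amazon account, so the following is only a local sketch of the same per-probe function run with plain `lapply()`. The toy `experiments.matrix` (5 probes, 12 experiments) is invented for illustration, and `vapply()` replaces the loop that grows `p.values`; this is a stylistic variation, not the code from the talk.

```r
set.seed(42)

# Hypothetical toy data: 5 probes (rows) measured in 12 experiments (columns).
experiments.matrix <- matrix(rnorm(5 * 12), nrow = 5,
                             dimnames = list(paste0("probe", 1:5), NULL))
probes <- rownames(experiments.matrix)

# P-values of Pearson correlation tests between one probe and every probe.
analyse.pearson.correlation <- function(probe) {
  a.vector <- experiments.matrix[probe, ]
  # vapply fills a preallocated numeric vector instead of growing one.
  vapply(probes, function(other) {
    cor.test(a.vector, experiments.matrix[other, ])$p.value
  }, numeric(1))
}

# Local run; on slide 17 this call becomes emrlapply(myCluster, probes, ...).
pearson.cor <- lapply(probes, analyse.pearson.correlation)

# Stack the per-probe vectors into a probes x probes matrix of p-values.
p.matrix <- do.call(rbind, pearson.cor)
rownames(p.matrix) <- probes
```

Because each probe is processed independently, the list returned by `lapply()` has one entry per probe, which is the property that lets the work be spread across machines.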
  • 18. R + HBase
      library(rhbase)
      hb.init(serialize="raw")
      # create new table
      hb.new.table("mytable", "x", "y", "z", opts=list(y=list(compression='GZ')))
      # insert some values into the table
      hb.insert("mytable", list(
        list(1, c("x","y","z"), list("apple","berry","cherry"))))
      rows <- hb.scan.ex("mytable", filterstring="ValueFilter(=,'substring:ber')")
      rows$get()
      https://github.com/RevolutionAnalytics/RHadoop/wiki/rhbase
  • 19. Discovering genes
      Topomaps of clustered genes
      This work was based on: A Gene Expression Map for Caenorhabditis elegans, Stuart K. Kim, et al., Science 293, 2087 (2001)
  • 20. Gene clusters
      Clusters based on pairwise Fisher exact comparisons of genes
      ● Green lines represent random probes
      ● Red lines represent up-regulated probes
      ● Blue lines represent down-regulated probes (in the daf-2 vs daf-2;daf-16 experiment)
  • 21. Gene networks
      Network created with Cytoscape, a platform for complex network analysis: http://www.cytoscape.org/
  • 22. Future work: real-time R
      ● Hadoop has high throughput but is slow for small tasks, so it is not well suited to continuous calculations
      ● A possible solution is to use Storm
      ● Storm multilang can be used with any language, including R
  • 23. Storm R
      Storm can be easily integrated with third-party languages and databases:
      ● Java
      ● Python
      ● Ruby
      ● Redis
      ● HBase
      ● Cassandra
      [Image source: Storm github wiki]
  • 24. Storm R
      source("storm.R")
      initialize <- function() {
        emitBolt(list("bolt initializing"))
      }
      process <- function(tup) {
        word <- tup$tuple
        rand <- runif(1)
        # R has no "+" for strings, so paste0() does the concatenation
        if (rand < 0.75) {
          emitBolt(list(paste0(word, "lalala")))
        } else {
          # log() is assumed here to be the logging helper from storm.R
          log(paste0(word, " randomly skipped!"))
        }
      }
      boltRun(process, initialize)
      https://github.com/rathko/storm
  • 25. Summary
      ● It’s easy to scale R using Hadoop.
      ● R is not only great for statistics, it is a versatile programming language.
      ● Is ageing a disease? Are we all going to live very long lives?
  • 26. Questions?
      ● References:
          http://hadoop.apache.org/
          http://hbase.apache.org/
          http://code.google.com/p/segue/
          http://www.datadr.org/
          https://github.com/RevolutionAnalytics/
          https://github.com/rathko/storm