2. Introduction
Radek Maciaszek
DataMine Lab (www.dataminelab.com) - Data mining,
business intelligence and data warehouse
consultancy.
MSc in Bioinformatics at Birkbeck, University of
London.
Project at UCL Institute of Healthy Ageing under
supervision of Dr Eugene Schuster.
3. Primer in Bioinformatics
Bioinformatics: applying computer science to biology (DNA, proteins, drug discovery, etc.)
Ageing research strategy: solve the problem in a simple organism and apply the findings to more complex organisms (e.g. humans).
Goal: find genes responsible for ageing
Model organism: Caenorhabditis elegans
4. Central dogma of molecular biology
Genes are encoded by the DNA.
Microarray (100 x 100)
• Database of 50 curated experiments.
• 10k genes compared to each other.
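To give a feel for the scale of comparing 10k genes to each other, here is a rough arithmetic sketch in R. The variable names are illustrative and not from the original slides; whether each pair is tested once or the full table is computed depends on the implementation.

```r
# Illustrative arithmetic only; names are assumptions, not from the slides.
n.genes <- 10000

# Unordered pairs of distinct genes (each pair counted once)
n.unordered.pairs <- choose(n.genes, 2)   # 49,995,000

# Full n x n table: both orders plus self-comparisons
n.full.table <- n.genes^2                 # 100,000,000

print(n.unordered.pairs)
print(n.full.table)
```

Either way the workload is tens of millions of correlation tests, which is heavy on computation rather than on data volume.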
5. Why R?
Very popular in bioinformatics
Functional scripting language
Swiss Army knife for statisticians
Designed by statisticians for statisticians
Lots of ready-to-use packages (CRAN)
6. R limitations & Hadoop
Data needs to fit in memory
Single-threaded
Hadoop integration:
Hadoop Streaming
Rhipe: http://ml.stat.purdue.edu/rhipe/
Segue: http://code.google.com/p/segue/
7. Segue
Works with Amazon Elastic MapReduce.
Creates a cluster for you.
Designed for Big Computations (rather than
Big Data)
Implements a cloud version of the lapply() function.
9. A very quick R example
m <- list(a = 1:10, b = exp(-3:3))
lapply(m, mean)
$a
[1] 5.5
$b
[1] 4.535125
lapply(X, FUN) returns a list of the same length as X,
each element of which is the result of applying FUN to
the corresponding element of X.
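As a further illustration of the same idea, the sketch below applies a custom function instead of mean. The toy data and variable names are hypothetical and chosen only for this example.

```r
# Hypothetical toy data: three small numeric vectors
samples <- list(low  = c(1, 2, 3),
                mid  = c(10, 20, 30),
                high = c(100, 200, 300))

# FUN can be any function of a single element; here an anonymous function
# that computes the spread (max minus min) of each vector
ranges <- lapply(samples, function(x) max(x) - min(x))

# The result is a list with the same length and names as the input
print(length(ranges) == length(samples))
print(names(ranges))

# sapply() does the same work but simplifies the result to a named vector
range.vector <- sapply(samples, function(x) max(x) - min(x))
print(range.vector)
```

Because each element is processed independently of the others, this pattern lends itself to being spread across many machines.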
10. Segue – large scale example
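AnalysePearsonCorrelation <- function(probe) {
  # Expression profile of the reference probe across all experiments
  A.vector <- experiments.matrix[probe, ]
  p.values <- c()
  # Test the reference probe against every probe in the matrix
  for (probe.name in rownames(experiments.matrix)) {
    B.vector <- experiments.matrix[probe.name, ]
    p.values <- c(p.values, cor.test(A.vector, B.vector)$p.value)
  }
  return(p.values)
}
# probes: the RNA probes (row identifiers of experiments.matrix)
pearson.cor <- lapply(probes, AnalysePearsonCorrelation)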
Moving to the cloud in 3 lines of code!
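The slide code depends on data that is not shown here, so the following is a minimal local sketch of the same pattern on synthetic data. The matrix, its dimensions, and all names (toy.matrix, CorrelationPValues, toy.probes) are assumptions made for illustration. It uses plain lapply(), so nothing here runs in the cloud.

```r
# Synthetic stand-in for the real data: 6 probes x 20 experiments
set.seed(42)
toy.matrix <- matrix(rnorm(6 * 20), nrow = 6,
                     dimnames = list(paste0("probe", 1:6), NULL))

# P-values of Pearson correlation tests between one probe and every
# probe in the matrix (including itself). vapply() avoids growing a
# vector inside a loop.
CorrelationPValues <- function(probe, expr = toy.matrix) {
  vapply(rownames(expr),
         function(other) cor.test(expr[probe, ], expr[other, ])$p.value,
         numeric(1))
}

toy.probes <- rownames(toy.matrix)
toy.result <- lapply(toy.probes, CorrelationPValues)

# Stack the per-probe vectors into a probe x probe matrix of p-values
p.matrix <- do.call(rbind, toy.result)
rownames(p.matrix) <- toy.probes
print(round(p.matrix, 3))
```

Each call to the function is independent of the others, which is what makes the lapply() step a natural candidate for a distributed replacement.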
12. Discovering genes
Topomaps of clustered genes
This work was based on a similar approach to:
A Gene Expression Map for Caenorhabditis elegans, Stuart K. Kim, et al., Science 293, 2087 (2001)
13. Conclusions
R is great for statistics.
It’s easy to scale up R using Segue.
We are all going to live very long.