1. Crunching
Molecules and
Numbers in R
Rajarshi Guha
Background
Molecules in R
Chemical Data
Parallel
Paradigms
Crunching Molecules and Numbers in R
Rajarshi Guha
NIH Chemical Genomics Center
238th ACS National Meeting
17th August, 2009
Outline
Some background on R
Doing cheminformatics in R
R History
S timeline:
1976: S developed by John Chambers at Bell Labs
1988: S rewritten in C
1993: Licensed to Insightful Corp.
2004: Bought by Insightful Corp. for $2M
2008: Bought by TIBCO for $25M
R timeline:
1991: Created by Ihaka & Gentleman
1993: First public release
1995: Released under GPL
2000: R 1.0.0
2009: R 2.9
An overview of R
An environment for statistical computation
Wide variety of standard and state-of-the-art statistical
methods built in or accessible via packages
But also a complete, interpreted programming language
Well suited for manipulating and operating on datasets -
numerical, categorical or a mixture - and of varying
shape
Impressive visualization facilities (but not very
interactive)
An overview of R
Syntax is pretty much S-Plus
Highly cross-platform
Frequent and regular releases, active development by
core group
The developer and user communities are extremely active
r-help is not just for learning R; you can get a decent
statistics education from the list!
Used by many top statisticians; many cutting-edge
techniques first show up in R
Usability
Default mode is a command-line prompt
GUIs are available
But learning curve is steep
Does force you to think about the analysis
Not a great tool for casual, once-in-a-while usage
R primitives
Numeric, character, list, matrix, data.frame
    > x <- 'Hello World'
    > x <- 1
    > x <- c(1,2,3,4,5,6)
    > x
    [1] 1 2 3 4 5 6
    > x <- data.frame(MW=runif(5, 10, 50),
    +                 hERG=sample(c('active','inactive'),
    +                             5, TRUE))
    > x
            MW     hERG
    1 23.55435   active
    2 42.90365 inactive
    3 49.35149   active
    4 26.85912   active
    5 10.01877   active
Matrix oriented programming
Similar in style to Matlab
Easily access (multiple) rows, columns
Vector/matrix indexing is very powerful and key to
efficient R code
Perform operations on entire rows or columns
Makes subsetting a trivial operation
Perfect for QSAR type analyses
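A minimal sketch of the indexing idioms listed above, using a small synthetic descriptor matrix. The row and column names and the 0.5 cutoff are made up for illustration.

```r
# Synthetic descriptor matrix: 6 molecules x 3 descriptors (random values)
set.seed(1)
m <- matrix(runif(18), nrow = 6,
            dimnames = list(paste0("mol", 1:6), c("MW", "logP", "TPSA")))

# Access multiple rows and columns at once
first_and_third <- m[c(1, 3), ]        # rows 1 and 3, all columns
two_cols <- m[, c("MW", "logP")]       # two columns selected by name

# Operate on an entire column: center MW on zero
centered <- m[, "MW"] - mean(m[, "MW"])

# Subsetting with a logical vector: molecules whose logP exceeds 0.5
high <- m[m[, "logP"] > 0.5, , drop = FALSE]
```

The `drop = FALSE` argument keeps the result a matrix even when only one row matches, which avoids a common surprise in downstream code.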
Functional style
R’s functional paradigms are closely tied to matrix
operations
apply, lapply, sapply, tapply allow you to easily
operate on groups of objects
Elements of a list
Rows and/or columns of a matrix
Subsets of data, using a grouping variable
Anonymous functions are supported
Use of these functional forms can lead to a speedup
compared to traditional for loops
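The three cases listed above (list elements, matrix rows or columns, grouped subsets) can be sketched with base R alone. All values here are invented for illustration.

```r
# Elements of a list (the numbers are illustrative, not computed from structures)
counts <- list(a = c(6, 7, 8), b = c(12, 14), c = 20)
lens  <- sapply(counts, length)      # named vector of lengths
means <- lapply(counts, mean)        # list of per-element means

# Rows of a matrix, using an anonymous function
m <- matrix(1:6, nrow = 2)
row_range <- apply(m, 1, function(r) max(r) - min(r))

# Subsets of data defined by a grouping variable
logp  <- c(1.0, 2.0, 3.0, 4.0)
toxic <- c("yes", "no", "yes", "no")
grp_mean <- tapply(logp, toxic, mean)
```

`sapply` simplifies its result to a vector where it can, while `lapply` always returns a list.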
Non-functional style
    # column std devs
    m <- matrix(runif(100*100), ncol=100)
    sds <- numeric(ncol(m))
    for (i in 1:ncol(m)) sds[i] <- sd(m[,i])
    # mean logP of toxic, non-toxic classes
    m <- data.frame(logp=runif(100),
                    toxic=sample(c('yes','no'),
                                 100, TRUE))
    toxLogP <- 0
    nontoxLogP <- 0
    for (j in 1:nrow(m)) {
      if (m[j,2] == 'yes') toxLogP <- toxLogP + m[j,1]
      else nontoxLogP <- nontoxLogP + m[j,1]
    }
    # the loop accumulates sums; divide by class sizes to get the means
    toxLogP <- toxLogP / sum(m$toxic == 'yes')
    nontoxLogP <- nontoxLogP / sum(m$toxic == 'no')
Functional style
    # column std devs
    m <- matrix(runif(100*100), ncol=100)
    apply(m, 2, sd)
    # mean logP of toxic, non-toxic classes
    m <- data.frame(logp=runif(100),
                    toxic=sample(c('yes','no'),
                                 100, TRUE))
    by(m, m$toxic, function(x) mean(x$logp))
Object oriented style
R supports multiple object oriented mechanisms
Simplest is S3 classes
Object orientation is in terms of function names
Easy to work with, not always flexible enough
S4 classes are much more powerful, but also more
complex
Many problems can ignore these as R primitives provide
sufficient support for attaching meta-data to objects
(crude encapsulation)
Becomes important/useful when writing packages, not
for day to day code
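A rough sketch of the three mechanisms mentioned above: S3, S4, and plain attributes. The class names `molecule` and `Fingerprint` and the generic `nset` are hypothetical, chosen only for illustration.

```r
# S3: dispatch is driven by function names of the form generic.class
mol <- structure(list(smiles = "CCO"), class = "molecule")   # hypothetical class
print.molecule <- function(x, ...) {
  cat("molecule:", x$smiles, "\n")
  invisible(x)
}
print(mol)   # dispatches to print.molecule

# S4: formal class definition with typed slots, plus a generic and a method
setClass("Fingerprint",                                      # hypothetical class
         representation(bits = "numeric", nbit = "numeric"))
fp <- new("Fingerprint", bits = c(1, 5, 9), nbit = 16)
setGeneric("nset", function(object) standardGeneric("nset"))
setMethod("nset", "Fingerprint", function(object) length(object@bits))

# Crude encapsulation without classes: attach metadata as attributes
x <- c(0.2, 1.4, 3.1)
attr(x, "units") <- "logP"
```

S3 needs no declaration at all, which is why it is easy to work with; S4 checks slot types when the object is created, at the cost of more ceremony.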
Interfacing with C & Fortran
R is interpreted;
functional forms help a
bit
Very useful to refactor
inner loops into C (or
Fortran)
Also useful to provide an
R interface to pre-existing
C/Fortran code
Can lead to dramatic
speedups
[Figure: speedup (axis 0 to 50) against fingerprint bit length (1024, 166, 79) for 5000 pairwise Tanimoto similarity calculations; MacBook Pro, 2 GHz, 1 GB RAM]
Visualization
R generates publication quality graphics in a variety of
formats
A huge number of statistical visualization methods (2D,
3D, OpenGL)
Extremely powerful display specifications
core commands
lattice (a.k.a trellis graphics)
Based on sound statistical theories
Standard plots are easy to make, but complex
plots do have a learning curve
Interactivity is limited, though some packages do alleviate
this
Code quality
It’s not enough to just write code
RUnit is a package that supports unit testing, analogous
to JUnit
R comes with well defined package structure that can be
automatically checked for various errors
Packages can be uploaded to CRAN which allows any R
user to install them directly from R
Extensive documentation format
Sweave is an important feature which allows one to
include R code and associated text in a single document
- literate programming or reproducible research
The downsides of R
Memory bound (but can use as much memory as you
have)
Language inconsistencies
Indexing starts from 1, but no error if you use 0 as an
index
See blog posts by Radford Neal (U Toronto)
Debugging environment not so great (though ESS is
good for Emacs users)
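The zero-index behaviour noted above is easy to reproduce. The `1:n` example that follows it is a related trap added here for illustration; it is not taken from the slide.

```r
x <- c(10, 20, 30)

# Index 0 is silently accepted and selects nothing
zero <- x[0]            # a zero-length numeric vector, no error

# Assigning to index 0 is also silent and leaves the vector unchanged
y <- x
y[0] <- 99

# A related trap: 1:n counts down when n is 0, so this loop body runs twice
n <- 0
iterations <- 0
for (i in 1:n) iterations <- iterations + 1

# seq_len(n) gives the empty sequence that was intended
safe_iterations <- 0
for (i in seq_len(n)) safe_iterations <- safe_iterations + 1
```

Neither case raises a warning, so off-by-one errors carried over from 0-based languages can go unnoticed.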
Cheminformatics programming
Fundamental requirement is support for core chemical
concepts
Representation and manipulation of these concepts
Flexibility
Could implement all of this directly in R - lots of wheels
would be reinvented
We also want such functionality to be R-like
Writing Java or C in R is not R-like
The Chemistry Development Kit
Open source Java library for cheminformatics
Wide variety of functionality
Core chemical concepts (atoms, bonds, molecules)
SMARTS, pharmacophores
Molecular descriptors and fingerprints
2D depictions
Used in a variety of tools, applications and services
Steinbeck, C. et al., Curr. Pharm. Des., 2006, 12, 2110–2120
rcdk - CDK from R
[Diagram: components within the R programming environment: rJava, CDK, Jmol, rcdk, XML, rpubchem, fingerprint]
rcdk Motivations
Have access to cheminformatics functionality from
within R
Support processing of data from chemistry databases
Not reimplement cheminformatics methods
Have access to all of this in idiomatic R
Basic molecular operations - I/O
Read in molecular file formats supported by the CDK
Files can be local or remote
Parse SMILES strings
In contrast to the CDK, rcdk will configure molecules
automatically (unless instructed not to)
The resultant molecule objects are Java references and can
be passed to a variety of rcdk functions
    mols <- load.molecules(c('abc.sdf', 'xyz.smi'))
    mol <- parse.smiles('c1ccccc1CC(=O)')
    mols <- sapply(c('CC', 'CCCC', 'CCCNC'),
                   parse.smiles)
Basic molecular operations
Given a molecule, we can extract or add properties
Get lists of atoms and bonds and then manipulate them
Currently doesn’t support a lot of molecular graph
operations
    # get the atoms from a molecule
    mol <- parse.smiles("c1ccccc1C(Cl)(Br)c1ccccc1")
    atoms <- get.atoms(mol)
    # get the coordinate matrix of the molecule
    coords <- do.call('rbind',
                      lapply(atoms, get.point3d))
Working with fingerprints
rcdk will generate a variety of fingerprints via the CDK
Other packages can generate fingerprints
The fingerprint package supports I/O of fingerprint
data and various similarity operations on fingerprints
Provides an S4 class representing binary fingerprints
    m1 <- parse.smiles('c1ccccc1C(COC)N')
    m2 <- parse.smiles('C1CCCCC1C(COC)N')
    # Calculate fingerprints
    fps <- lapply(list(m1, m2),
                  get.fingerprint, type='maccs')
    distance(fps[[1]], fps[[2]], method='tanimoto')
    fps <- fp.read('fp.txt', lf=moe.lf,
                   size=166, header=TRUE)
    fpsim <- fp.sim.matrix(fps, method='tanimoto')
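For reference, the Tanimoto similarity used above is the number of bits set in both fingerprints divided by the number of bits set in either. A base-R sketch follows, assuming each fingerprint is given as a vector of its set-bit positions. The function name `tanimoto` and that representation are assumptions of this sketch, not the API of the fingerprint package.

```r
# Tanimoto coefficient for two binary fingerprints, each given as a vector
# of the positions of its set bits (a representation assumed for this sketch)
tanimoto <- function(fp1, fp2) {
  common <- length(intersect(fp1, fp2))   # bits set in both
  total  <- length(union(fp1, fp2))       # bits set in either
  if (total == 0) return(NA_real_)        # both fingerprints empty: undefined
  common / total
}

fpA <- c(1, 4, 7, 9)
fpB <- c(1, 4, 8)
simAB <- tanimoto(fpA, fpB)   # 2 shared bits out of 5 distinct bits = 0.4

# Pairwise similarity matrix over a list of fingerprints
fps <- list(A = fpA, B = fpB, C = c(2, 4))
sim <- sapply(fps, function(a) sapply(fps, function(b) tanimoto(a, b)))
```

The nested `sapply` is quadratic in the number of fingerprints, which is the kind of inner loop that benefits from being moved into C.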
rcdk and QSAR
[Diagram: Molecular Descriptors, Machine Learning, Property]
rcdk and QSAR
Access to descriptors and fingerprints makes for very
easy QSAR modeling within R
Evaluate the descriptors (individually, by type or all)
Get back a data.frame which can be used as input to
pretty much any modeling method
    mols <- load.molecules('big.sdf')
    dnames <- get.desc.names('topological')
    descs <- eval.desc(mols, dnames)
    str(descs)
    'data.frame': 467 obs. of 180 variables:
     $ ATSc1 : num 0.28 0.279 0.279 0.217 0.479 ...
     $ ATSc2 : num -0.0777 -0.0851 -0.0845 -0.0587 -0.2356 ...
     $ ATSc3 : num -0.05803 -0.04706 -0.04616 -0.0519 0.00129 ..
     $ ATSc4 : num -0.00906 0.00279 -0.01147 0.00241 0.00856 ...
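Once the descriptors are in a data.frame, feeding them to a modeling method takes a few lines. The sketch below uses random numbers as a stand-in for the output of `eval.desc` and an invented activity vector, so it runs without rcdk. The column names `d1` to `d3` and the train/test split are illustrative assumptions.

```r
set.seed(42)
# Stand-in for a descriptor data.frame: 50 molecules x 3 descriptors (random)
descs <- data.frame(d1 = runif(50), d2 = runif(50), d3 = runif(50))
# Hypothetical measured property, constructed so that d1 and d2 matter
activity <- 2 * descs$d1 - descs$d2 + rnorm(50, sd = 0.1)

# Drop constant descriptor columns, which carry no information
keep <- sapply(descs, function(col) sd(col) > 0)
descs <- descs[, keep, drop = FALSE]

# Fit a linear model on a training split and predict the held-out molecules
train <- 1:40
d <- data.frame(activity = activity, descs)
fit <- lm(activity ~ ., data = d[train, ])
pred <- predict(fit, newdata = d[-train, ])
r2 <- cor(pred, activity[-train])^2
```

Swapping `lm` for another method that accepts a formula and a data.frame leaves the rest of the code unchanged.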
Viewing molecules
While numerical modeling is a fundamental task in this
environment, visualization is also important
Either view structures of individual molecules or tables of
structure and data
rcdk supports both (not very well on OS X)
    mol <- parse.smiles('c1ccccc1C(N)CC')
    view.molecule.2d(mol)
    smiles <- c("CCC", "CCN", "CCN(C)(C)",
                "c1ccccc1Cc1ccccc1",
                "C1CCC1CC(CN(C)(C))CC(=O)CC")
    mols <- sapply(smiles, parse.smiles)
    view.molecule.2d(mols)
Downsides to rcdk
Can’t save state of Java objects
Doesn’t take advantage of S4 classes to provide R-side
representations of CDK classes
Incomplete coverage of the CDK API - sometimes need
to go down to rJava to perform an operation
Big datasets are problematic (mainly due to R
limitations)
Access to chemical databases
Useful to be able to transparently access data from
various public data sources
PubChem compounds and assays are supported via
rpubchem
Compound access is primarily by CID, while assay data
can be obtained from keyword searches
End up with a data.frame containing all relevant assay
information (along with meta-data as attributes)
R can also easily access arbitrary RDBMSs (Postgres,
MySQL, Oracle)
Access to PubChem
    > dat <- get.cids(1:30)
    > str(dat)
    'data.frame': 30 obs. of 11 variables:
     $ CID : chr "1" "2" "3" "4" ...
     $ IUPACName : chr "3-acetyloxy-4-(trimethylaz
     $ CanonicalSmile : chr "CC(=O)OC(CC(=O)[O-])C[N+](
     $ MolecularFormula : chr "C9H17NO4" "C9H18NO4+" "C7H
     $ MolecularWeight : num 203.2 204.2 156.1 75.1 169.
    > find.assay.id('LDR')
    [1] 990 1035 1036 1037 1038 1039 1041 1042 1043 1653 1865
    > adat <- get.assay(990)
    > str(adat)
    'data.frame': 51 obs. of 9 variables:
     $ PUBCHEM.SID : int 845800 848472 852502 857608
     $ PUBCHEM.CID : int 648162 6603466 655127 65895
Bioinformatics in R
While the focus is on cheminformatics, many problems
involve bioinformatics to some degree
The Bioconductor project provides a wide variety of
packages
Much of it is focused on gene expression analysis
A number of packages provide access to various
biological databases, annotations, etc.
Protein structure analysis is supported in R via Bio3d
Never have to leave the comfort of R
http://www.bioconductor.org/
Grant, B. et al., Bioinformatics, 2006, 22, 2695–2696
Long calculations, big data
Many statistical methods require long running
calculations
Bootstrap
Bayesian methods
Many problems involve large datasets
A feature common to both scenarios is that they can be
trivially parallelized
As opposed to requiring a parallel version of the underlying
algorithm
R has good support for both trivial and non-trivial
parallelization methods
See R/parallel for a package that will parallelize
actual R code
Vera, G. et al., BMC Bioinformatics, 2008, 9, 390
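To see why the bootstrap is trivially parallel, consider this base-R sketch on synthetic data. Each replicate depends only on the original data, so the replicate indices can be split into pieces and processed separately. The helper `boot_mean` and the four-way split are illustrative assumptions; the serial `lapply` over pieces marks where a parallel apply would go.

```r
set.seed(7)
x <- rnorm(200, mean = 5)          # synthetic data

# One bootstrap replicate: resample with replacement and recompute the statistic
boot_mean <- function(i, data) mean(sample(data, replace = TRUE))

# Serial version: each call is independent of every other call
reps <- sapply(1:1000, boot_mean, data = x)
se_boot <- sd(reps)                # bootstrap standard error of the mean

# Because replicates share no state, the index vector can be split into
# pieces and the partial results combined without changing the algorithm
pieces <- split(1:1000, rep(1:4, length.out = 1000))
reps_split <- unlist(lapply(pieces,
                            function(idx) sapply(idx, boot_mean, data = x)))
```

Nothing about `boot_mean` had to change to make the split version work, which is what distinguishes this from parallelizing the underlying algorithm.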
Simple parallelization
The snow package allows easy use of multiple cores on a
single computer or a cluster of computers
A simple wrapper over other parallel R libraries
Can support PVM, MPI
At the very least you can use all the cores on your own
machine
http://cran.r-project.org/web/packages/snow/
Serial code - Feature Selection
Rather than use GA, SA, etc., just look at all
combinations
Inelegant, but no worries about missing the global
optimum
    x <- matrix(runif(500*40), ncol=40)
    y <- runif(500)
    library(gtools)
    combos <- combinations(40, 3)
    apply(combos, 1, function(z) {
      d <- data.frame(y=y, x=x[,z])
      fit <- lm(y~., data=d)
      cor(y, fit$fitted)^2
    })
Simple parallelization - Feature Selection
Trivially parallelized
    x <- matrix(runif(500*40), ncol=40)
    y <- runif(500)
    library(gtools)
    combos <- combinations(40, 3)
    library(snow)
    cl <- makeSOCKcluster(2)
    clusterExport(cl, "x")
    clusterExport(cl, "y")
    parApply(cl, combos, 1, function(z) {
      d <- data.frame(y=y, x=x[,z])
      fit <- lm(y~., data=d)
      cor(y, fit$fitted)^2
    })
    # shut down the worker processes when done
    stopCluster(cl)
Big data scenarios
The idea behind snow can also be used to handle very
large datasets
Simply chunk the data appropriately and papply over
the list of filenames
Still requires you to perform chunking and keep track of
everything
Hadoop is a nice way to avoid all this
Throw one or more (very) large files at it, let it deal with
chunking and computation
For non-trivial file formats, you need to implement a
chunker
RHIPE provides access to a Hadoop cluster from within R
http://hadoop.apache.org/core/
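The manual chunk-and-apply approach described above can be sketched in base R. The chunk files here are synthetic and written to a temporary directory, the helper `summarise_chunk` is hypothetical, and the serial `lapply` stands in for whichever parallel apply is used on a real cluster.

```r
# Write a synthetic dataset out as four chunk files (stand-ins for real data)
set.seed(3)
chunk_dir <- tempfile("chunks")
dir.create(chunk_dir)
values <- runif(1000)
parts <- split(values, rep(1:4, each = 250))
files <- file.path(chunk_dir, paste0("chunk", seq_along(parts), ".txt"))
for (k in seq_along(parts)) {
  writeLines(format(parts[[k]], digits = 15), files[k])
}

# Per-chunk work: read one file and return a small summary
summarise_chunk <- function(f) {
  v <- as.numeric(readLines(f))
  c(n = length(v), total = sum(v))
}

# Apply over the list of filenames, then combine the partial results
partial <- lapply(files, summarise_chunk)
combined <- Reduce(`+`, partial)
overall_mean <- combined[["total"]] / combined[["n"]]
```

The bookkeeping visible here (splitting, naming files, combining partial results) is exactly the part that a framework such as Hadoop takes over.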
Summary
rcdk successfully integrates cheminformatics
functionality into the R environment
Related packages provide access to other forms of
chemical data (fingerprints) and data sources
An excellent environment for chemical and biological
data mining