1. Cloudy with a Touch of Cheminformatics
Rajarshi Guha, Tyler Peryea, Dac-Trung Nguyen
NIH Center for Advancing Translational Science
ChemAxon UGM, September 26th, 2012, Wellesley, MA
2. Parallel computing in the cloud
• Modern cloud vendors make provisioning compute resources easy
– Allows one to handle unpredictable loads easily
– Pay only for what you need
• Chemistry applications don’t usually have very dynamic loads
• But large scale resources are an opportunity for large scale (parallel) computations
3. All HPC is not equal
Legacy HPC
• Use cloud resources in the same way as a local cluster
• MIT StarCluster makes this easy to do
Cloudy HPC
• Make use of cloud capabilities
• Old algorithms, new infrastructure
• Spot instances, SNS, SQS, SimpleDB, S3, etc
Big Data HPC
• Huge datasets
• Candidates for map-reduce
• Involves algorithm (re)design
http://www.slideshare.net/chrisdag/mapping-life-science-informatics-to-the-cloud
4. Big data & cheminformatics
• Computation over large chemical databases
– PubChem, ChEMBL, GDB-13, …
• What types of computations?
– Searches (substructure, pharmacophore, …)
– QSAR models & predictions over large data
• Fundamentally, “big chemical data” lets us explore larger chemical spaces
5. Map-Reduce
[Figure: map-reduce data flow. Input splits 0, 1 and 2 each feed a Map task; map output is sorted, copied and merged, then passed to Reduce tasks that write output parts 0 and 1.]
map: (K1, V1) → list(K2, V2)
reduce: (K2, list(V2)) → list(K3, V3)
Tom White, Hadoop: The Definitive Guide, 3rd Ed., O’Reilly
6. Counting atoms
• The chemical version of the word counting task
Input: arbitrary line numbers (K1), SMILES (V1)
  1, Nc1ccc2ncccc2c1N
  2, Cl.CC1CCc2nc3ccccc3c(C)c2C1
  ...
  152366, Nc1ccc2ncccc2c1N
MAP output: atom symbol (K2), occurrence (V2)
  N 1
  N 1
  N 1
  N 1
  ...
Grouped input to REDUCE: atom symbol (K2), atom list (V2)
  N, list(1,1,1,1,...)
  C, list(1,1,1,1,...)
REDUCE output: atom symbol (K3), count (V3)
  N,100
  C,5684
  ...
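The atom counting flow above can be sketched without a cluster. The following is a minimal in-memory illustration in plain Java, not the Hadoop job from the talk: the class name `AtomCountSketch` and its `map`, `reduce` and `run` methods are hypothetical, the `run` driver merely stands in for Hadoop's sort and shuffle, and the tokenizer is deliberately naive (it recognises only Cl, Br, the single-letter organic subset and lowercase aromatic atoms, and makes no attempt to parse bracket atoms).

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class AtomCountSketch {

    /** MAP: (line number, SMILES) -> list of (atom symbol, 1). Naive tokenizer (assumption). */
    static List<Map.Entry<String, Integer>> map(long lineNumber, String smiles) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (int i = 0; i < smiles.length(); i++) {
            char c = smiles.charAt(i);
            String symbol = null;
            if (smiles.startsWith("Cl", i) || smiles.startsWith("Br", i)) {
                symbol = smiles.substring(i, i + 2);
                i++; // skip the second letter of the two-letter symbol
            } else if ("BCNOPSFI".indexOf(c) >= 0) {
                symbol = String.valueOf(c);
            } else if ("bcnops".indexOf(c) >= 0) {
                symbol = String.valueOf(Character.toUpperCase(c)); // aromatic atom
            }
            if (symbol != null) {
                Map.Entry<String, Integer> kv = new AbstractMap.SimpleEntry<>(symbol, 1);
                out.add(kv);
            }
        }
        return out;
    }

    /** REDUCE: (atom symbol, list of occurrences) -> count for that symbol. */
    static int reduce(String symbol, List<Integer> occurrences) {
        int sum = 0;
        for (int v : occurrences) sum += v;
        return sum;
    }

    /** In-memory stand-in for the grouping Hadoop performs between map and reduce. */
    static Map<String, Integer> run(List<String> smilesLines) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        long lineNumber = 0;
        for (String smiles : smilesLines) {
            for (Map.Entry<String, Integer> kv : map(++lineNumber, smiles)) {
                grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
            }
        }
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            counts.put(e.getKey(), reduce(e.getKey(), e.getValue()));
        }
        return counts;
    }

    public static void main(String[] args) {
        // The two SMILES strings shown on the slide.
        System.out.println(run(Arrays.asList(
                "Nc1ccc2ncccc2c1N",
                "Cl.CC1CCc2nc3ccccc3c(C)c2C1")));
    }
}
```

In a real job the grouping step is done by the framework, so only the bodies of `map` and `reduce` would carry over, and a cheminformatics toolkit would replace the tokenizer.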
7. The Hadoop ecosystem
[Figure: components of the Hadoop ecosystem: Hadoop Common, Hadoop Distributed Filesystem, Map Reduce Engine, Hive, Hama, HBase, Mahout, Avro, Whirr, Chukwa, Zookeeper, Flume, Pig]
Based on http://www.slideshare.net/informaticacorp/101111-part-3-matt-aslett-the-hadoop-ecosystem
8. Cheminformatics on Hadoop
• Hadoop and Atom Counting
• Hadoop and SD Files
• Cheminformatics, Hadoop and EC2
• Pig and Cheminformatics
But are cheminformatics problems really big enough to justify all of this?
9. Simplifying Hadoop applications
• Raw Hadoop programs can be tedious to write
SMARTS based substructure search:
package gov.nih.ncgc.hadoop;

import chemaxon.formats.MolFormatException;
import chemaxon.formats.MolImporter;
import chemaxon.license.LicenseManager;
import chemaxon.license.LicenseProcessingException;
import chemaxon.sss.search.MolSearch;
import chemaxon.sss.search.SearchException;
import chemaxon.struc.Molecule;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.Iterator;

/**
 * SMARTS searching over a set of files using Hadoop.
 *
 * @author Rajarshi Guha
 */
public class SmartsSearch extends Configured implements Tool {
    private final static IntWritable one = new IntWritable(1);
    private final static IntWritable zero = new IntWritable(0);

    public static class MoleculeMapper extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
        private String pattern = null;
        private MolSearch search;
        private Text matches = new Text();

        // Runs once per map task: install the license and parse the SMARTS query.
        public void configure(JobConf job) {
            try {
                Path[] licFiles = DistributedCache.getLocalCacheFiles(job);
                BufferedReader reader = new BufferedReader(new FileReader(licFiles[0].toString()));
                StringBuilder license = new StringBuilder();
                String line;
                while ((line = reader.readLine()) != null) license.append(line);
                reader.close();
                LicenseManager.setLicense(license.toString());
            } catch (IOException e) {
                throw new RuntimeException("Could not read the license file", e);
            } catch (LicenseProcessingException e) {
                throw new RuntimeException("Could not process the license", e);
            }
            pattern = job.getStrings("pattern")[0];
            search = new MolSearch();
            try {
                Molecule queryMol = MolImporter.importMol(pattern, "smarts");
                search.setQuery(queryMol);
            } catch (MolFormatException e) {
                throw new RuntimeException("Invalid SMARTS pattern: " + pattern, e);
            }
        }

        // Emits (molecule name, 1) for a match and (molecule name, 0) otherwise.
        public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
            Molecule mol = MolImporter.importMol(value.toString());
            matches.set(mol.getName());
            search.setTarget(mol);
            try {
                if (search.isMatching()) {
                    output.collect(matches, one);
                } else {
                    output.collect(matches, zero);
                }
            } catch (SearchException e) {
                // Skip molecules the search engine cannot handle, but leave a trace.
                System.err.println("Search failed for " + matches + ": " + e.getMessage());
            }
        }
    }

    public static class SmartsMatchReducer extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable();

        // Keeps only the molecules that matched the query.
        public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
            while (values.hasNext()) {
                if (values.next().compareTo(one) == 0) {
                    result.set(1);
                    output.collect(key, result);
                }
            }
        }
    }

    public int run(String[] args) throws Exception {
        if (args.length != 4) {
            System.err.println("Usage: ss <in> <out> <pattern> <license file>");
            System.exit(2);
        }

        // The slide passed HeavyAtomCount.class here, apparently left over from another job.
        JobConf jobConf = new JobConf(getConf(), SmartsSearch.class);
        jobConf.setJobName("smartsSearch");

        jobConf.setOutputKeyClass(Text.class);
        jobConf.setOutputValueClass(IntWritable.class);

        jobConf.setMapperClass(MoleculeMapper.class);
        jobConf.setCombinerClass(SmartsMatchReducer.class);
        jobConf.setReducerClass(SmartsMatchReducer.class);

        jobConf.setInputFormat(TextInputFormat.class);
        jobConf.setOutputFormat(TextOutputFormat.class);

        jobConf.setNumMapTasks(5);

        FileInputFormat.setInputPaths(jobConf, new Path(args[0]));
        FileOutputFormat.setOutputPath(jobConf, new Path(args[1]));
        jobConf.setStrings("pattern", args[2]);

        // make the license file available via the distributed cache
        DistributedCache.addCacheFile(new Path(args[3]).toUri(), jobConf);

        JobClient.runJob(jobConf);
        return 0;
    }

    public static void main(String[] args) throws Exception {
        int res = ToolRunner.run(new Configuration(), new SmartsSearch(), args);
        System.exit(res);
    }
}
10. Pig & Pig Latin
• Pig Latin programs are much simpler to write and get translated to Hadoop code
• SQL-like, requires UDF to be implemented to perform non-standard tasks
SMARTS search in Pig Latin:
A = load 'medium.smi' as (smiles:chararray);
B = filter A by gov.nih.ncgc.hadoop.pig.SMATCH(smiles, 'NC(=O)C(=O)N');
store B into 'output.txt';
UDF for SMARTS search:
package gov.nih.ncgc.hadoop.pig;

import chemaxon.formats.MolImporter;
import chemaxon.sss.search.MolSearch;
import chemaxon.sss.search.SearchException;
import chemaxon.struc.Molecule;
import org.apache.pig.FilterFunc;
import org.apache.pig.data.Tuple;

import java.io.IOException;

public class SMATCH extends FilterFunc {
    // The slide left this field null, which would fail on first use.
    static MolSearch search = new MolSearch();

    // tuple = (target SMILES, query SMARTS); returns true when the query matches the target.
    public Boolean exec(Tuple tuple) throws IOException {
        if (tuple == null || tuple.size() < 2) return false;
        String target = (String) tuple.get(0);
        String query = (String) tuple.get(1);
        try {
            Molecule queryMol = MolImporter.importMol(query, "smarts");
            search.setQuery(queryMol);
            search.setTarget(MolImporter.importMol(target, "smiles"));
            return search.isMatching();
        } catch (SearchException e) {
            e.printStackTrace();
        }
        return false;
    }
}
11. Going beyond chunking?
• All the preceding use cases are embarrassingly parallel
– Chunking the input data and applying the same operation to each chunk
– Very nice when you have a big cluster
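As a rough illustration of the chunking pattern described above, the following plain Java sketch splits a list of SMILES into chunks and applies the same filter to each chunk on a thread pool. Everything here is hypothetical: the class `ChunkedSearchSketch`, its methods and parameters are invented for the example, and the demo predicate is a plain string test standing in for a real substructure search.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Predicate;

class ChunkedSearchSketch {

    /** Split the input into consecutive chunks of at most chunkSize items. */
    static <T> List<List<T>> chunk(List<T> items, int chunkSize) {
        List<List<T>> chunks = new ArrayList<>();
        for (int i = 0; i < items.size(); i += chunkSize) {
            int end = Math.min(i + chunkSize, items.size());
            chunks.add(new ArrayList<T>(items.subList(i, end)));
        }
        return chunks;
    }

    /** Apply the same filter to every chunk in parallel; hits are returned in input order. */
    static List<String> search(List<String> smiles, Predicate<String> matches,
                               int chunkSize, int workers) {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            List<Future<List<String>>> pending = new ArrayList<>();
            for (List<String> part : chunk(smiles, chunkSize)) {
                Callable<List<String>> task = () -> {
                    List<String> hits = new ArrayList<>();
                    for (String s : part) {
                        if (matches.test(s)) hits.add(s);
                    }
                    return hits;
                };
                pending.add(pool.submit(task));
            }
            List<String> all = new ArrayList<>();
            for (Future<List<String>> f : pending) {
                all.addAll(f.get()); // chunks are independent, so results simply concatenate
            }
            return all;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("Chunk processing failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("CCO", "CCN", "c1ccccc1", "NC(=O)C");
        // Placeholder predicate: a string test, not a substructure match.
        System.out.println(search(input, s -> s.contains("N"), 2, 2));
    }
}
```

No chunk needs to know about any other, which is what makes this class of problem easy to spread over a cluster and also why it does not exercise map-reduce at the algorithmic level.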
Are there algorithms in cheminformatics that can employ map-reduce at the algorithmic level?
12. Going beyond chunking?
• Applications that make use of pairwise (or higher order) calculations could benefit from a map-reduce incarnation
– Doesn’t necessarily avoid the O(N^2) barrier
– Bioisostere identification is one case that could be rephrased as a map-reduce problem
• Map-Reduce Design Patterns
13. Identifying MMPs
• First step in identifying bioisosteres is to identify candidate matched molecular pairs
– Naïve all pairs comparison
– Predefined list of transformations
  • Birch et al, BMCL, 2009
– Fragment intersection
  • Hussain et al, JCIM, 2010
– MCS based approaches (e.g., WizePairZ)
  • Warner et al, JCIM, 2010
17. Seeded bioisosteres – MR style
MAP
• Do pairwise MCS analysis on scaffold series
• For each pair output SMIRKS transform and the pair of SMILES
REDUCE
• Collect pairs of SMILES for a given SMIRKS
• Store in DB, or
• Filter by activity, or
• …
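The shape of this map and reduce pair can be sketched in plain Java. This is only an illustration under strong assumptions, not the implementation from the talk: the class `SeededPairsSketch` and its `Member` type are hypothetical, each molecule is assumed to be already split into a scaffold and a single substituent, and a simple "A>>B" string built from the two substituents stands in for the SMIRKS that a real MCS analysis would produce.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class SeededPairsSketch {

    /** A molecule already split into scaffold + one substituent (assumed pre-processing). */
    static final class Member {
        final String scaffold, substituent, smiles;
        Member(String scaffold, String substituent, String smiles) {
            this.scaffold = scaffold;
            this.substituent = substituent;
            this.smiles = smiles;
        }
    }

    /** MAP: all pairs within one scaffold series -> (transform key, pair of SMILES). */
    static List<Map.Entry<String, List<String>>> map(List<Member> series) {
        List<Map.Entry<String, List<String>>> out = new ArrayList<>();
        for (int i = 0; i < series.size(); i++) {
            for (int j = i + 1; j < series.size(); j++) {
                Member a = series.get(i), b = series.get(j);
                int cmp = a.substituent.compareTo(b.substituent);
                if (cmp == 0) continue;                      // same substituent, no transform
                if (cmp > 0) { Member t = a; a = b; b = t; } // fix the transform direction
                String key = a.substituent + ">>" + b.substituent; // placeholder for a SMIRKS
                Map.Entry<String, List<String>> kv =
                        new AbstractMap.SimpleEntry<>(key, Arrays.asList(a.smiles, b.smiles));
                out.add(kv);
            }
        }
        return out;
    }

    /** REDUCE: collect every pair of SMILES emitted for the same transform key. */
    static Map<String, List<List<String>>> reduce(List<Map.Entry<String, List<String>>> emitted) {
        Map<String, List<List<String>>> byTransform = new TreeMap<>();
        for (Map.Entry<String, List<String>> kv : emitted) {
            byTransform.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
        }
        return byTransform;
    }

    /** Seeding: group members by scaffold, map each series, then reduce over all output. */
    static Map<String, List<List<String>>> run(List<Member> members) {
        Map<String, List<Member>> seeded = new TreeMap<>();
        for (Member m : members) {
            seeded.computeIfAbsent(m.scaffold, k -> new ArrayList<>()).add(m);
        }
        List<Map.Entry<String, List<String>>> emitted = new ArrayList<>();
        for (List<Member> series : seeded.values()) {
            emitted.addAll(map(series));
        }
        return reduce(emitted);
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList(
                new Member("c1ccccc1", "F", "Fc1ccccc1"),
                new Member("c1ccccc1", "Cl", "Clc1ccccc1"),
                new Member("c1ccncc1", "F", "Fc1ccncc1"),
                new Member("c1ccncc1", "Cl", "Clc1ccncc1"))));
    }
}
```

Comparisons never cross scaffold boundaries, which is where the saving over an all-pairs scan comes from. The reduce output is the per-transform collection that could then be stored or filtered by activity.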
18. Does seeding help?
• Doesn’t bypass the O(N^2) barrier – does reduce the constant
• Depends on how many scaffolds and the number of members for each scaffold
• Certainly useful when there are a few members per scaffold
• Highly populated scaffolds can throw things off
[Figure: log number of pairwise comparisons (1e+05 to 1e+14) vs. log number of molecules (1e+03 to 1e+07), by method: all, seeded.7, seeded.21, seeded.100]
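The comparison counts behind this kind of plot can be approximated with a few lines of arithmetic. The sketch below assumes, purely for illustration, that every molecule belongs to exactly one scaffold and that all scaffolds have the same number of members m (with m dividing n); real scaffold populations are skewed, so these are idealised numbers, and the class and method names are hypothetical.

```java
class PairCountSketch {

    /** All-pairs comparisons among n molecules: n(n-1)/2. */
    static long allPairs(long n) {
        return n * (n - 1) / 2;
    }

    /** Seeded comparisons: n/m equally sized scaffolds, all pairs within each (assumption). */
    static long seeded(long n, long membersPerScaffold) {
        return (n / membersPerScaffold) * allPairs(membersPerScaffold);
    }

    public static void main(String[] args) {
        long n = 10_000_000L;
        System.out.println("all        : " + allPairs(n));
        for (long m : new long[] {7, 21, 100}) {
            System.out.println("seeded." + m + " : " + seeded(n, m));
        }
    }
}
```

Under these assumptions the seeded count is n(m-1)/2, so the benefit shrinks as scaffolds become more populated, in line with the last bullet above.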
19. Data
• Exhaustively fragmented ChEMBL 13
• Identified scaffolds with N_members ! 1.8 N_scaffold
• Ended up with 231,875 scaffolds
– Covers 235,693 unique molecules
– Average of 7 members per scaffold
– 95% of scaffolds had < 21 members
– 99.5% had < 74 members
• The 0.05% are a bit problematic
[Figure: log comparisons (1e+02 to 1e+08) by method: All vs. Seeded]
20. Timing experiments
• Selected 50 scaffolds with 10 or fewer members
• Configured so as to have ~ 5 maps
• Effective running time for the entire job is 3.8 min on Hadoop
– Only needed 5 of 8 map slots on our “cluster”
• Takes ~ 6 min without Hadoop
[Figure: time (s), 0 to 200, for job numbers 1 to 5]
21. Timing experiments
• Selected 1000 scaffolds with 20 or fewer members
– Ran with 10 scaffolds / map
• Hadoop run time was ~ 2 hr
– Most maps were fast (< 20 sec)
• Serial evaluation would be > 7 hr
[Figure: histogram of number of jobs (0 to 15) vs. log time (s), 1.0 to 4.0]
22. An M-R workflow
• We’re currently focused on just the MMP step as an MR example
• Could also include fragmentation step as part of the workflow
– But a pre-calculated set of scaffolds is more sensible
• Store transformations and members in HBase
• Link with activity data and apply structure & activity filters on candidate pairs
23. What Hadoop is not for
• Doesn’t replace an actual database
• It’s not uniformly fast or efficient
• Not good for ad hoc or real-time analysis
• Generally not effective unless dealing with massive datasets
• Not all algorithms are amenable to the map-reduce method
24. Conclusions
• Cheminformatics applications can be rehosted or rewritten to take advantage of cloud resources
– Remotely hosted
– Embarrassingly parallel / chunked
– Map/reduce
• Ability to process larger structure collections lets us explore more chemical space
• “Big data” isn’t really that big in chemistry
25. Conclusions
• Q: But are cheminformatics problems really big enough to justify all of this?
• A: Yes – virtual libraries, integrating chemical structure with other types and scales of data
• Q: Are there algorithms in cheminformatics that can employ map-reduce at the algorithmic level?
• A: Yes – especially when we consider problems with a combinatorial flavor