This document summarizes a method for using cloud computing resources to efficiently explore large model spaces for quantitative structure-activity relationship (QSAR) modeling. Key points:
- The method uses e-Science Central and Windows Azure to run QSAR modeling workflows in parallel across many nodes, allowing exploration of large model spaces.
- Over 250,000 models were generated exploring different modeling methods (e.g. linear regression, neural networks) across 460,000 workflow executions and 4.4 million service calls.
- Scaling to 200 nodes reduced modeling time from over 11 days to under 2 hours, demonstrating near-linear speedups from additional nodes.
1. Fast Exploration of the QSAR Model Space with e-Science Central and Windows Azure
Simon Woodman
Jacek Cala
Hugo Hiden
Paul Watson
2. VENUS-C
Developing technology to ease scientific adoption of Cloud Computing
• EU Funded Project
– 11 Partners
– 5 Technology/Infrastructure
– 6 Scenario Partners
– Open Call – 20 Pilot Studies
– May 2010 to May 2012
3. [Matrix diagram: VENUS-C scenario partners mapped onto the technology stack]
• Scenario / algorithm partners: Architrave, AEGEAN, UPV.Bio, UNEW, COLB, CoSBI, CNR
• Users' programming language: .NET, C++, Java
• Type of workload: batch HTC, parameter sweep, workflow, Map/Reduce, data flow, CEP
• VENUS-C execution environments: EMIC Generic Worker, BSC COMPSs
• Operating system: Windows, Linux, BSC supercomputer (not in the cloud)
• Cloud infrastructure: Windows Azure, EMOTIVE Cloud, OpenNebula, on-premises technology (not in the cloud), …
• Cloud paradigm: PaaS, IaaS
• Cloud provider: MSFT, ENG, KTH, BSC, customer
4. The Problem
What are the properties of this molecule?
• Toxicity
• Biological activity
• Solubility
Perform experiments?
– Time consuming
– Expensive
– Ethical constraints
5. The Alternative to Experiments
Predict likely properties based on similar molecules
• ChEMBL database: data on 622,824 compounds, collected from 33,956 publications
• WOMBAT database: data on 251,560 structures, for over 1,966 targets
• WOMBAT-PK database: data on 1,230 compounds, for over 13,000 clinical measurements
All these databases contain structure information and numerical activity data
6. QSAR
QSAR: Quantitative Structure Activity Relationship
Activity ≈ f( [molecular structure] )
More accurately, activity is related to quantifiable structural attributes:
Activity ≈ f( logP, number of atoms, shape, … )
Currently > 3,000 recognised attributes
http://www.qsarworld.com/
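For the linear-regression flavour used later in the talk, this takes the familiar form (coefficients are fitted to training data; the descriptor names here are illustrative):

```latex
\[
\text{Activity} \approx \beta_0 + \beta_1 \log P + \beta_2 N_{\text{atoms}} + \beta_3\,\text{shape} + \dots
\]
```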
8. Branching Workflows
• Partition training & test data: random split, 80:20 split
• Calculate descriptors: Java CDK descriptors, C++ CDL descriptors
• Select descriptors: correlation analysis, genetic algorithms, random selection
• Build model: linear regression, neural network, partial least squares, classification trees
• Add to database
(see the sketch below)
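The size of the model space comes from multiplying the options at each branch point. A minimal illustrative sketch (a hypothetical class, not e-Science Central code; the real runs also varied descriptor subsets and parameters, hence the far larger totals on the Results slide):

```java
import java.util.List;

// Hypothetical sketch: every combination of branch options is one
// workflow configuration in the QSAR model space.
public class ModelSpace {
    public static void main(String[] args) {
        List<String> splits      = List.of("random split", "80:20 split");
        List<String> descriptors = List.of("Java CDK", "C++ CDL");
        List<String> selection   = List.of("correlation analysis",
                                           "genetic algorithm", "random selection");
        List<String> builders    = List.of("linear regression", "neural network",
                                           "partial least squares", "classification tree");
        int count = 0;
        for (String sp : splits)
            for (String d : descriptors)
                for (String sel : selection)
                    for (String b : builders)
                        count++;   // one branching-workflow configuration
        // 2 * 2 * 3 * 4 = 48 configurations per dataset, before parameter variation
        System.out.println(count + " workflow configurations per dataset");
    }
}
```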
9. e-Science Central
Platform for cloud-based data analysis
• Runs on: Azure, EC2, on premise
• Analysis languages: Java, R, Octave, JavaScript
10. Architecture
[Architecture diagram] The e-Science Central main server (an <<Azure VM>> hosting the Web UI, REST API, JMS queue, e-SC blob store connector and e-SC db backend) dispatches workflow invocations to multiple <<web role>> Generic Workers, each running a workflow engine. e-SC control data flows between the main server and the workers; QSAR workflow data is staged through the Azure Blob store. Users connect via web browsers, a rich client app, and the QSAR Explorer (an <<Azure VM>>).
11. [image-only slide]
12. Workflow Architecture
• Single message queue
– Worker failure semantics
– Elasticity
• Runtime environments: R, Octave, Java
– Deployed only once
• Worker role lifecycle: install JRE → install wf engine → execute the engine, then loop: get job from queue → deploy runtime? → get data → execute job → put data → put next jobs on queue (see the sketch below)
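A minimal sketch of the worker loop above, assuming hypothetical queue, engine and store interfaces (this is not the actual Generic Worker code):

```java
// Hypothetical sketch of the worker loop: the runtime (JRE, workflow engine,
// R/Octave) is deployed once per node; jobs then cycle through
// get-data / execute / put-data. The single message queue is what provides
// elasticity (add workers) and failure semantics (undelivered jobs reappear).
public class WorkerLoop {
    interface JobQueue { Job take() throws InterruptedException; void put(Job j); }
    interface Engine   { Job[] execute(Job j); }   // returns follow-on jobs
    interface Store    { void fetchInputs(Job j); void storeOutputs(Job j); }
    record Job(String id, String runtime) {}

    static void run(JobQueue queue, Engine engine, Store store)
            throws InterruptedException {
        // one-time setup per node happens before this loop (JRE, wf engine, runtimes)
        while (true) {
            Job job = queue.take();           // get job from queue
            store.fetchInputs(job);           // get data (e.g. from the blob store)
            Job[] next = engine.execute(job); // execute job
            store.storeOutputs(job);          // put data
            for (Job n : next) queue.put(n);  // put next jobs on the queue
        }
    }
}
```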
13. Results
• 250k models
– Linear regression
– PLS
– RPartitioning
– Neural net
• 460K workflow executions
• 4.4M service calls
• QSAR Explorer: browse, search, get predictions
14. Scalability: Large Scale QSAR
480 datasets; sequential time: 11 days

                  100 Nodes     200 Nodes
  Response Time   3hr 19mins    1hr 50mins
  Speedup         94x           156x
  Efficiency      94%           78%
  Cost            $55.68        $51.84

[Charts: execution time [hh:mm] and relative processing speed-up for GW on Azure vs. number of processors (0–250), plotted against the ideal linear speed-up]
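For reference, the efficiency row is simply the measured speedup divided by the node count:

```latex
\[
S_n = \frac{T_{\text{seq}}}{T_n}, \qquad E_n = \frac{S_n}{n},
\qquad\text{so}\qquad
E_{100} = \frac{94}{100} = 94\%, \quad E_{200} = \frac{156}{200} = 78\%.
\]
```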
15. Cloud Applicability
• Bursty workload
– ChEMBLdb updates (delta 10%)
– New modelling methods (???)
• Performance depends on how chatty the problem is
– Deploy (incl. download) dependencies once
– Avoid storage bottlenecks
16. Performance is great but …
Drug development requires us to capture the data and the process
17. Provenance/Audit Requirements
• How was a model generated?
– What algorithm?
– What descriptors?
• Are these results reproducible?
• How have bugs manifested?
– Which models are affected?
– How do we regenerate affected models?
• Performance characteristics
• How do we deal with new data?
18. Storing Provenance
• Neo4j
– Open-source graph database
– Nodes/relationships + properties
– Querying/traversing (see the sketch below)
• Access
– Java lib for OPM
– e-SC library built on top of the OPM lib
– REST interface
• Options for HA and sharding for performance
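A minimal sketch of how an OPM-style edge might be recorded through the embedded Neo4j Java API of that era (Neo4j 1.9/2.x-style calls; the database path and node properties are hypothetical, not the actual e-SC provenance schema):

```java
import org.neo4j.graphdb.*;
import org.neo4j.graphdb.factory.GraphDatabaseFactory;

// Hypothetical sketch: storing a "model was generated by workflow invocation"
// provenance edge as nodes + a relationship in embedded Neo4j.
public class ProvenanceStore {
    // OPM edge types modelled as Neo4j relationship types
    enum OpmRel implements RelationshipType { USED, WAS_GENERATED_BY, WAS_CONTROLLED_BY }

    public static void main(String[] args) {
        GraphDatabaseService db =
            new GraphDatabaseFactory().newEmbeddedDatabase("provenance.db");
        Transaction tx = db.beginTx();
        try {
            Node process = db.createNode();              // OPM Process
            process.setProperty("type", "workflow-invocation");
            Node artifact = db.createNode();             // OPM Artifact
            artifact.setProperty("type", "qsar-model");
            artifact.createRelationshipTo(process, OpmRel.WAS_GENERATED_BY);
            tx.success();
        } finally {
            tx.finish();
        }
        db.shutdown();
    }
}
```

Graph traversal from any model node back along WAS_GENERATED_BY and USED edges is what answers the "how was this model generated, and from what?" questions on the previous slide.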
19. Provenance Model
• Based on OPM
– Processes, Artifacts, Agents
• Directed graph
• Multiple views of provenance
– Dependent on security privileges
20. Adding new model builders
1. Add a new block
2. Mine the provenance
3. Dynamically create virtual workflows
4. One invocation per cross-validation data set
[Diagram: the provenance of an existing "enumerate descriptors → build and cross-validate RPart-m → test RPart-m" workflow is mined to build and cross-validate a new kind of model, "?-m", and test it]
• Work in progress…
21. Future Work
• Scalability and reliability
– SQL Azure
– Application server replication
• Provenance visualization
• Meta-QSAR
– Provenance Mining
• Cloud4Science
– Applying lessons learned to new scenarios
22. MOVEeCloud Project
• Investigating the links between physical activity and common diseases – type 2 diabetes, cardiovascular disease, …
• Wrist accelerometers worn over a 1-week period
• Measures movement at 100Hz in three axes
• Processing ideal for Azure
– Bursty data processing as new data is gathered
– Embarrassingly parallel
– Large datasets
23. MOVEeCloud Process
[Pipeline diagram] Accelerometer data flows through analysis and classification (R, Java, Octave) into activity classes (sleep, sedentary, walking, activity), which feed a methodology section for papers, a clinician's report, and patient interventions.
24. Data Sizes
  100 samples / second  →  100 rows / second
  3,600 seconds / hour  →  360,000 rows / hour
  24 hours / day        →  8,640,000 rows / day
  7 days / study        →  60,480,000 rows / patient / visit
Cohort size of 800 patients and multiple visits
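The row counts are just the 100 Hz sampling rate compounded over time:

```latex
\[
100 \times 3600 = 360{,}000, \qquad
360{,}000 \times 24 = 8{,}640{,}000, \qquad
8{,}640{,}000 \times 7 = 60{,}480{,}000 \ \text{rows per patient per visit.}
\]
```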
25. Working with larger data sets
• As we add more workflow engines, server load increases
– One server can cope with 200 engines if files are small
• This is not the case with movement data
– Only 4 engines are supported
• Increase the bandwidth to the engines
– Clustering the app server / database?
26. HDFS
• Implemented prior to native HDFS on Azure
• Easy to integrate with e-SC
– e-SC is a Java system, so it just requires the libraries to be included (see the sketch below)
• Distributed store where bandwidth increases with the number of machines
– Bits of data spread around lots of machines
• Concept of data location
– Potential to route workflows to execute as close as possible to storage
• Other applications are also built on top of HDFS
– OpenTSDB to store time series for movement data
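A minimal sketch of staging a file into HDFS through the standard Hadoop Java API (the namenode address and file path are hypothetical):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import java.io.IOException;

// Hypothetical sketch: writing a movement-data file into HDFS from Java.
// Because e-SC is Java-based, HDFS access is just a library dependency.
public class HdfsStage {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020");  // hypothetical namenode

        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("/movecloud/patient-001/week1.csv"); // hypothetical path
        try (FSDataOutputStream out = fs.create(path)) {
            out.writeBytes("t,x,y,z\n");   // 100 Hz, three-axis accelerometer samples
        }
        // the file is split into blocks replicated across datanodes, which is
        // why aggregate read bandwidth grows with the number of machines
        System.out.println("block size: " + fs.getFileStatus(path).getBlockSize());
    }
}
```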
27. Drawbacks to HDFS
• Needs a NAMENODE to coordinate everything
– Single point of failure (in current HDFS)
• Metadata stored in RAM
– Doesn't scale beyond a million or so files
– Bad for drug discovery
• Not particularly efficient for small files
– There is an overhead to connecting to the filesystem
• If instances terminate, data can be lost if not backed up
– Redundancy helps
– Back up in the cloud and stage to HDFS for experiments
– Might use it as a cache of "hot" files and use Blobstore/S3 to back it all up
29. System Setup
• One machine for the e-SC server
– 4 CPUs, 7GB RAM, 1TB local storage
• One machine for the namenode
– 4 CPUs, 7GB RAM, 1TB local storage
• Four workflow engines
– 2 CPUs, 3.5GB RAM, 500GB local storage
• 2TB HDFS storage using the workflow engines
– Mounts up quickly
• Increased priority of HDFS on the engines
– Competing for resources with the workflow engine
30. Initial Results
For a single data set, processing went from 60 to 16 minutes using 4 workflow engines running HDFS
• 4 engines is the limit for one e-SC server
– The main server hit 100% CPU delivering data
– No further improvement with more engines
• Using HDFS, server CPU was consistently below 5%
– More like our earlier scalability results
• Once data had been chunked, processing was the same for each chunk
– The improvement lay entirely in staging data and uploading results
31. Questions?
• Thank you to our generous funders
– Microsoft Cloud 4 Science
– EU FP7 - VENUS-C (RI-261565)
– RCUK – SiDE (EP/G066019/1)
• The Team
– Jacek Cala
– Paul Watson
– Hugo Hiden
Editor's Notes
[JC] Shouldn’t the title be “e-SC in Azure”?
Work Stealing vs Work scheduling
Important for drug discovery due to traceability requirements
Natural fit for storing provenance as it’s a graph to start with
C4S – Mutation Detection, NGS, e-SC available on Azure, data sets for bioinformatics
Large datasets need smart moving of data around – HDFS? otherwise hit the limit on storage account b/w
Single POF. Not great for small files or 1M+ files. Instances can terminate, resulting in loss of data.