4. Opportunities for HPC
in pharma R&D
Peter Coveney
Centre for Computational Science,
University College London
United Kingdom
5. Drug Screening
Searching for a needle in a haystack
To make use of HPC in pharma R&D:
• Predictions must be rapid, accurate and reproducible
• Requires high performance computing & automation
6. Virtual Screening Tools Based on Molecular Dynamics
A virtual screening tool — a binding affinity calculator (BAC) — can reliably predict the binding affinities of compounds to target proteins, and can potentially be used as a drug-ranking tool in pharmaceutical lead discovery or in clinical applications.
[Diagram: compounds pass through the black-box-like BAC, which outputs a ranking of binding affinities.]
S. K. Sadiq, D. Wright, S. J. Watson, S. J. Zasada, I. Stoica and P. V. Coveney, "Automated Molecular Simulation-Based Binding Affinity Calculator for Ligand-Bound HIV-1 Proteases", Journal of Chemical Information and Modeling, 48 (9), 1909-1919 (2008), DOI: 10.1021/ci8000937.
The virtual screening tool requires a combination of hardware and software.
7. BAC: Rapid, Accurate, Reproducible and Automatic
BAC performs rapid and accurate binding affinity calculation on timescales relevant to pharmaceutical lead discovery. The architecture is that of an HPC machine (either multicore or manycore/GPU based).
• Rapid: less than 10 wallclock hours per study.
• Reproducible: two independent studies of the same target and ligands produce identical results.
• A total of roughly 10,000 cores on HPC/cloud resources is required per study.
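The reproducibility claim rests on ensemble averaging. The following is a minimal Python sketch, not the BAC code itself: the per-replica energies are invented synthetic numbers, used only to show how two independent ensembles of the same system converge to the same mean within error.

```python
import random
import statistics

def ensemble_average(replica_energies):
    """Mean binding free energy (kcal/mol) over an ensemble of MD
    replicas, with the standard error of that mean."""
    mean = statistics.mean(replica_energies)
    sem = statistics.stdev(replica_energies) / len(replica_energies) ** 0.5
    return mean, sem

# Two independent ensembles of the same target/ligand pair
# (synthetic numbers; a real BAC study would obtain these from MD).
rng = random.Random(0)
run_a = [rng.gauss(-9.5, 0.8) for _ in range(25)]
run_b = [rng.gauss(-9.5, 0.8) for _ in range(25)]

mean_a, sem_a = ensemble_average(run_a)
mean_b, sem_b = ensemble_average(run_b)
# Reproducibility check: the two ensemble means agree within error,
# which a single long trajectory cannot guarantee.
```

Averaging over an ensemble of replicas, rather than relying on one long trajectory, is what lets two independent studies of the same system agree.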
8. Drug Ranking with Schrödinger
Schrödinger products with binding affinity calculation capability:
• Desmond: high-performance molecular dynamics simulations for biomolecular systems.
• FEP+: a rigorous approach to computing binding free energies that provides significant value to industrial drug discovery efforts.
Both Desmond and FEP+ support GPGPUs. The Schrödinger Suite comprises proprietary software and runs on low-end GPU boards. FEP is one of the methods available for binding affinity prediction; it has a limited domain of validity (congeneric series, the same charges at both end points, etc.), and reproducibility remains an issue.
Our own capabilities, based on a BAC for MMPBSA and TI, address reproducibility through the requirement to perform large-scale ensembles of molecular dynamics calculations. This calls for HPC architectures, whether multicore or GPGPU [i.e. we exploit big machines, not lower-end resources; whether GPU or multicore, these cannot support the turnaround required].
9. Automation & Integration of Services
The BAC workflow requires resources of different scales to execute.
[Diagram: a coordinating workflow engine drives three stages — BAC Prepare, BAC Simulate and BAC Post-Process. Preparation and post-processing run on EGI/cloud resources via AHE; simulation runs on PRACE resources. EUDAT data staging services move inputs from the project data warehouse and stage results into long-term EUDAT storage.]
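The stage ordering that the coordinating engine enforces can be sketched in a few lines of Python. The `Job` class and stage names below are illustrative only, not the actual BAC/AHE interfaces; the comments note which resource class each stage targets in the diagram.

```python
from dataclasses import dataclass, field

@dataclass
class Job:
    """One BAC pipeline unit: prepare the model, simulate the
    ensemble, then post-process into a binding affinity."""
    ligand: str
    log: list = field(default_factory=list)

    def prepare(self):       # small-scale step: EGI/cloud resources
        self.log.append("prepare")
        return self

    def simulate(self):      # large-scale step: PRACE-class HPC
        self.log.append("simulate")
        return self

    def post_process(self):  # reduction step; result staged to archive
        self.log.append("post_process")
        return self

def run_workflow(ligands):
    # The coordinating engine's core duty: enforce stage order per ligand.
    return [Job(l).prepare().simulate().post_process() for l in ligands]

jobs = run_workflow(["lig01", "lig02"])
```

A real engine would also handle data staging, retries and resource selection; the point here is only the prepare → simulate → post-process dependency chain.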
10. Big Data in Biomedicine & Healthcare
Use of HPC in the context of genomics and gene sequencing:
• Genome sequencing: the time to sequence one human genome has fallen from ~5 years (2001) to 2 years (2004), 4 days (Jan 2008), 16 hours (Oct 2008), 3 hours (Nov 2009) and about 6 minutes (recently).
• Electronic health records.
• Integration of omics & imaging data.
This requires rapid development of computational science and informatics capabilities to deal with the management and analysis of data.
New Machines
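The turnaround improvement in that sequencing timeline is easy to quantify. Assuming the quoted endpoints of ~5 years and ~6 minutes:

```python
# Turnaround for one human genome: ~5 years in 2001 vs ~6 minutes now.
minutes_2001 = 5 * 365 * 24 * 60     # ~2.6 million minutes
minutes_recent = 6
speedup = minutes_2001 / minutes_recent
# roughly a 440,000-fold (over five orders of magnitude) reduction
```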
11. Cray Solutions for Life Sciences
Healthcare Provider: The Promise of Precision Treatment
Cray® XK7™ supercomputer
Cray’s Urika™ platform
Case study:
Oak Ridge National Laboratory (ORNL) is using
computing to delve deeper into big health data and
is proposing innovative solutions to grand
challenges in the country's health care system.
ORNL researchers are using Titan to simulate
outcomes of interventions, Urika for pattern
discovery, and cloud computing to understand what
happened.
12. Petascale Computing Facilities Used by Us
Kraken, Stampede, Lonestar, Anton, HECToR, PRACE, ARCHER, EMERALD, Blue Joule, Blue Wonder, and a GPGPU cluster.
14. C O M P U T E | S T O R E | A N A L Y Z E
Pistoia Alliance 2015
Oct 1, 2015
15. About Cray
Cray Inc.
Seymour Cray founded Cray Research in 1972:
• 1972-1996: Cray Research grew to leadership in supercomputing.
• 1996-2000: Cray was a subsidiary of SGI.
• April 2000: Cray Inc. formed.
• 2000-present: Cray Inc. grew to $525M in revenue in 2013.
Cray Inc. today:
• NASDAQ: CRAY
• Over 1,000 employees across 30 countries
• Headquartered in Seattle, WA
Three focus areas: computation, storage and analytics.
Seven major development sites: Austin, TX; Chippewa Falls, WI; Pleasanton, CA; St. Paul, MN; San Jose, CA; Seattle, WA; Bristol, UK.
16. Cray's Vision: The Fusion of Supercomputing and Big & Fast Data
Modeling the world: Cray supercomputers solve "grand challenges" in science, engineering and analytics across three pillars — compute, store, analyze.
• Math models: modeling and simulation augmented with data to provide the highest-fidelity virtual reality results.
• Data-intensive processing: high-throughput event processing and data capture from sensors, data feeds and instruments.
• Data models: integration of datasets and math models for search, analysis, predictive modeling and knowledge discovery.
17. Cray Product Range and LS Applicability
Two compute platforms, the CS400/Storm cluster supercomputer and the XC40 supercomputer, offering: the Aries interconnect, scalability, package density, accelerators and accelerator density, upgradeability, an integrated hardware and software stack, best-in-class power and cooling, proven operation at scale, and developer productivity.
Life-science applicability of both systems: molecular modeling, structural biology, machine learning, NGS, bioinformatics, image analysis.
18. Unprecedented Scalability
[Image: satellite tobacco mosaic virus. Source: Jim Phillips, SC'12; image: http://www.ks.uiuc.edu/Research/vmd/minitutorials/gelato/]
Cray Inc. Proprietary, 10/1/2015
19. Applying HPC Best Practice to Speed Up MegaSeq
Megan Puckelwartz et al. (University of Chicago) exploit the fact that the cost of sequencing an entire human genome is now moving into the range where sequencing is being broadly applied in both research and clinical settings.
http://beagle.ci.uchicago.edu/science-at-beagle/
Puckelwartz, M. J., et al., Bioinformatics (2014), bioinformatics.btu071.
Parallelization
20. Project: Parallelizing Inchworm
Trinity is a software tool developed for de novo reconstruction of transcriptomes from RNA-seq data (http://trinityrnaseq.sourceforge.net/). It runs in three stages:
• Inchworm uses a greedy search on the k-mer graph to assemble sequence contigs.
• Chrysalis bundles the contigs and builds individual de Bruijn graphs.
• Butterfly computes the final assembly.
Contributors: Dr Pierre Carrier, Dr Carlos P. Sosa, Dr Bill Long, Dr Brian Haas, Dr Timothy Tickle.
M. G. Grabherr et al., Nat. Biotechnol. 29 (7), 644-652 (2011).
R. Henschel et al., "Trinity RNA-Seq assembler performance optimization", XSEDE 2012: Proceedings of the 1st Conference of the Extreme Science and Engineering Discovery Environment: Bridging from the eXtreme to the campus and beyond.
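Inchworm's greedy step can be illustrated with a toy Python sketch: count k-mers, seed a contig with the most abundant one, then repeatedly extend with the highest-count overlapping k-mer. This is a deliberate simplification of the real algorithm (tiny reads, right-extension only, no reverse complements):

```python
from collections import Counter

def kmers(reads, k):
    """Count all k-mers across a set of reads."""
    counts = Counter()
    for r in reads:
        for i in range(len(r) - k + 1):
            counts[r[i:i + k]] += 1
    return counts

def greedy_contig(counts, k):
    """Seed with the most abundant k-mer, then greedily extend right
    with the highest-count overlapping k-mer (Inchworm-style)."""
    seed, _ = counts.most_common(1)[0]
    used = {seed}
    contig = seed
    while True:
        suffix = contig[-(k - 1):]
        candidates = [(counts[suffix + b], suffix + b) for b in "ACGT"
                      if counts[suffix + b] > 0 and suffix + b not in used]
        if not candidates:
            return contig
        _, best = max(candidates)
        used.add(best)
        contig += best[-1]

# Three overlapping toy reads assemble into one longer contig.
reads = ["ATGGCGT", "GGCGTAC", "CGTACGA"]
counts = kmers(reads, k=5)
contig = greedy_contig(counts, k=5)
```

Because every extension step is a local decision on the k-mer table, the expensive part (building and querying the table) is what the Cray project parallelizes.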
21. Cray Product Range and LS Applicability
Lustre parallel file system:
• Single POSIX namespace
• Modular scaling from 7.5 GB/s to 1.7 TB/s
• Integrated and preconfigured
• Reliability and availability at scale
• LS applicability: improved scalability; converged storage across grid, analytics and Hadoop; a storage layer for Cassandra, Spark and RDBs; improved I/O.
Archive:
• Multi-tier, single-namespace archive
• Rule-based policy migration
• Flexible integration with most OEM tape and disk
• Preconfigured and integrated
• LS applicability: data lake, analytical data and market data archival; an NGS data archive; data no longer 'deep-sixed'.
22. Cray Product Range and LS Applicability
Urika-GD Graph Discovery Appliance:
• The most scalable graph processor available
• Whole-graph analytics possible
• Open RDF/SPARQL
• Single memory space and extreme threaded processor
• LS applicability: precision medicine, drug repurposing, cybersecurity, data integration, cohort selection.
Urika-XA Extreme Analytics Platform:
• Cloudera 5.2/YARN; open to non-CDH apps
• Dense compute and memory
• SSD layer for HDFS; Lustre/POSIX for scale-out storage
• Spark-optimized
• LS applicability: real-time streaming analytics converged with regular analytics, machine learning, NGS workflow and analytics.
23. Life Science Market and Technology Drivers
• Data science: new data sources and emerging analytical approaches enable predictive modeling and knowledge discovery.
• Rise of high-performance analytics: the convergence of analytics and supercomputing is opening new opportunities to meet the pace of discovery.
• Cluster sprawl: ad-hoc cluster infrastructures are increasing complexity, reliability and usability challenges.
• Pace of technology: sites struggle to keep compute infrastructures current with rapidly changing life-sciences technologies.
• Precision medicine: the race to understand patients, diseases and treatments at the molecular level.
24. The Quest for In-Time Analytics
Response timeframes span from under 30 ms (low-latency applications on streaming data, today the province of a few data scientists who wrangle data) through interactive windows of up to ~10 minutes (business analysts accustomed to interactive time frames) to batch processing of stationary data beyond 10 minutes.
Low-latency applications require performance optimizations:
• Memory-storage hierarchies
• Fast interconnects
25. Explosion in Data Volume, Variety and Complexity
27. Explosion in Data Volume, Variety and Complexity
ELN
Medical Records
29. Existing tools are failing to keep up
30. Modern NGS Multi-Step Analytics Pipelines
Next-generation sequencers feed a multi-step pipeline:
• Data prep/acquisition: mRNA, miRNA, protein, SNP and metabolite data.
• Base analytics: background correction, normalization, QC, SNP calling.
• Contextualization: dbSNP, ClinVar, ANNOVAR, UniProt, biobank/LIMS.
• Advanced analytics: correlation analysis, regression, hypothesis testing, visualization.
The pipeline combines "big data" (trial data, drug data, patient data) into actionable insight.
31. Cray Multi-Step Analytics Pipelines: Manage All Aspects of the NGS Pipeline in One Environment
• Data prep/acquisition: mRNA, miRNA, protein, SNP and metabolite data.
• Base analytics: background correction, normalization, QC, SNP calling.
• Data integration: dbSNP, ClinVar, ANNOVAR, UniProt, biobank/LIMS.
• Advanced analytics: correlation analysis, regression, hypothesis testing, visualization.
The result: actionable insight.
32. Apache Spark Enables Modern Bioinformatics
[Diagram: the Spark stack, including MLlib and Spark SQL.]
33. Connecting to the Enterprise
[Diagram: data sources feed a big data platform, which is consumed by BI and visualization, BI on Hadoop, advanced analytics, and data transformation tools.]
34. Urika-XA Enables Spark
• Memory: Urika-XA is configured with 6 TB per rack, supporting complex NGS workflows and providing the freedom to model data based on the requirements of the analysis rather than the limitations of the machine.
• Compute: Urika-XA provides over 1,500 cores per rack, bringing complex analysis of big data to interactive time scales.
• Network: Urika-XA's high-speed interconnects accelerate complex data joins and graph analytics at scale.
• Storage:
  • Lustre: 120 TB of global, POSIX-compliant file system.
  • SSD: 38 TB of high-speed local SSD storage.
35. BioDT is an Open Platform
• Users aren't relegated to a limited set of proprietary tools.
• Includes 250+ popular tools, including tools from the Galaxy and GATK libraries.
• Supports ADAM.
• Easy to add new tools, to optimize tools for Hadoop, and to search tools.
• Tools can be R, Perl or Python scripts.
36. Lumenogix Bioinformatics-in-a-Box™ with Urika-XA
Whole human genome in 45 minutes.
[Chart: time to process a 50x whole human genome — 164 minutes on AWS vs 45 minutes on Urika-XA.]
Process times on Urika-XA:
• BWA: 17 minutes
• Tag & shuffle reads: 2 minutes
• Sort and compress: 1 minute
• Mark duplicates: 1 minute
• Realignment: 6 minutes
• Genotyping: 18 minutes
• Total: 45 minutes
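As a sanity check, the per-step times sum to the quoted 45-minute total, roughly 3.6x faster than the 164-minute AWS baseline:

```python
# Per-step times for the 50x whole-genome run on Urika-XA (minutes).
steps = {
    "BWA": 17,
    "tag & shuffle reads": 2,
    "sort and compress": 1,
    "mark duplicates": 1,
    "realignment": 6,
    "genotyping": 18,
}
total_minutes = sum(steps.values())
aws_minutes = 164
speedup = aws_minutes / total_minutes  # speedup vs the AWS baseline
```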
37. In a Cancer Biology Research Project
Research questions:
• Understanding the relationships between genes: does gene X regulate the expression of gene Y?
• How do mutations affect these relationships?
• What is the effect on the cell cycle?
• What are the effects on genome stability?
The original t-test analysis used R running on a gene-by-gene basis: a single gene takes ~1-3 minutes to analyze, so at that rate it would take 25 days to complete the entire 36K-sample experiment.
Using Spark on Urika-XA, the t-test was implemented with Scala and the Apache Commons Mathematics Library, in parallel across 1,500 cores, completing the entire experiment in under 20 minutes, with bioinformatician-friendly code and an interactive environment.
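The per-gene test is embarrassingly parallel, which is why it maps so well onto 1,500 cores. Below is a plain-Python sketch of the same pattern; the production code used Scala and Apache Commons Math on Spark, and the gene counts, distributions and pool size here are invented for illustration.

```python
import math
import random
import statistics
from concurrent.futures import ThreadPoolExecutor

def welch_t(a, b):
    """Welch's t statistic for two independent samples."""
    va, vb = statistics.variance(a), statistics.variance(b)
    return (statistics.mean(a) - statistics.mean(b)) / math.sqrt(
        va / len(a) + vb / len(b))

# Synthetic expression data: 200 genes, two conditions, 20 samples each.
rng = random.Random(1)
genes = {f"gene{i:03d}": ([rng.gauss(0.0, 1.0) for _ in range(20)],
                          [rng.gauss(0.3, 1.0) for _ in range(20)])
         for i in range(200)}

# Every gene's test is independent of every other gene's, so the whole
# table maps cleanly over a worker pool; the same shape of computation
# fans out across the machine's cores on Urika-XA.
with ThreadPoolExecutor(max_workers=8) as pool:
    t_stats = dict(zip(genes, pool.map(lambda ab: welch_t(*ab),
                                       genes.values())))
```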
38. Analytics Solutions
Urika-XA Extreme Analytics Platform:
• Turnkey advanced analytics platform
• Next-generation system architecture
• Engineered for performance
Urika-GD Graph Discovery Appliance:
• Discover unknown and hidden relationships in big data
• Real-time data discovery
• Realize rapid time-to-value
39. Thank you
40. Panel & audience discussion
Please enter your questions into the question or chat boxes