4. Opportunities for HPC
in pharma R&D
Peter Coveney
Centre for Computational Science,
University College London
United Kingdom
5. Drug Screening
Searching for a needle in a haystack
To make use of HPC in pharma R&D:
• Predictions must be rapid, accurate and reproducible
• Requires high performance computing & automation
6. Virtual Screening Tools Based on Molecular Dynamics
A virtual screening tool — a binding affinity calculator (BAC) — can reliably predict the binding affinities of compounds to target proteins, and can potentially be used as a drug-ranking tool in pharmaceutical lead discovery or in clinical applications.
[Diagram: compounds pass through the black-box-like BAC, which outputs a ranking of binding affinities.]
S. K. Sadiq, D. Wright, S. J. Watson, S. J. Zasada, I. Stoica and P. V. Coveney, "Automated Molecular Simulation-Based Binding Affinity Calculator for Ligand-Bound HIV-1 Proteases", Journal of Chemical Information and Modeling, 48 (9), 1909-1919 (2008), DOI: 10.1021/ci8000937.
The virtual screening tool requires a combination of hardware and software.
7. BAC: Rapid, Accurate, Reproducible and Automatic
BAC performs rapid and accurate binding affinity calculation on timescales relevant to pharmaceutical lead discovery. The architecture is that of an HPC machine (either multicore or manycore/GPU based).
• Rapid: less than 10 wallclock hours per study.
• Reproducible: two independent studies of the same target and ligands produce identical results.
• A total of roughly 10,000 cores on HPC/cloud resources is required per study.
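The reproducibility claim rests on ensemble averaging. The following is a minimal Python sketch, not the BAC code itself: the per-replica energies are invented synthetic numbers, used only to show how two independent ensembles of the same system converge to the same mean within error.

```python
import random
import statistics

def ensemble_average(replica_energies):
    """Mean binding free energy (kcal/mol) over an ensemble of MD
    replicas, with the standard error of that mean."""
    mean = statistics.mean(replica_energies)
    sem = statistics.stdev(replica_energies) / len(replica_energies) ** 0.5
    return mean, sem

# Two independent ensembles of the same target/ligand pair
# (synthetic numbers; a real BAC study would obtain these from MD).
rng = random.Random(0)
run_a = [rng.gauss(-9.5, 0.8) for _ in range(25)]
run_b = [rng.gauss(-9.5, 0.8) for _ in range(25)]

mean_a, sem_a = ensemble_average(run_a)
mean_b, sem_b = ensemble_average(run_b)
# Reproducibility check: the two ensemble means agree within error,
# which a single long trajectory cannot guarantee.
```

Averaging over an ensemble of replicas, rather than relying on one long trajectory, is what lets two independent studies of the same system agree.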
8. Drug Ranking with Schrödinger
Schrödinger products with binding affinity calculation capability:
• Desmond: high-performance molecular dynamics simulations for biomolecular systems.
• FEP+: a rigorous approach to computing binding free energies that provides significant value to industrial drug discovery efforts.
Both Desmond and FEP+ support GPGPUs. The Schrödinger Suite comprises proprietary software and runs on low-end GPU boards. FEP is one of the methods available for binding affinity prediction; it has a limited domain of validity (congeneric series, the same charges at both end points, etc.), and reproducibility remains an issue.
Our own capabilities, based on a BAC for MMPBSA and TI, address reproducibility through the requirement to perform large-scale ensembles of molecular dynamics calculations. This calls for HPC architectures, whether multicore or GPGPU [i.e. we exploit big machines, not lower-end resources; whether GPU or multicore, these cannot support the turnaround required].
9. Automation & Integration of Services
The BAC workflow requires resources of different scales to execute.
[Diagram: a coordinating workflow engine drives three stages — BAC Prepare, BAC Simulate and BAC Post-Process. Preparation and post-processing run on EGI/cloud resources via AHE; simulation runs on PRACE resources. EUDAT data staging services move inputs from the project data warehouse and stage results into long-term EUDAT storage.]
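The stage ordering that the coordinating engine enforces can be sketched in a few lines of Python. The `Job` class and stage names below are illustrative only, not the actual BAC/AHE interfaces; the comments note which resource class each stage targets in the diagram.

```python
from dataclasses import dataclass, field

@dataclass
class Job:
    """One BAC pipeline unit: prepare the model, simulate the
    ensemble, then post-process into a binding affinity."""
    ligand: str
    log: list = field(default_factory=list)

    def prepare(self):       # small-scale step: EGI/cloud resources
        self.log.append("prepare")
        return self

    def simulate(self):      # large-scale step: PRACE-class HPC
        self.log.append("simulate")
        return self

    def post_process(self):  # reduction step; result staged to archive
        self.log.append("post_process")
        return self

def run_workflow(ligands):
    # The coordinating engine's core duty: enforce stage order per ligand.
    return [Job(l).prepare().simulate().post_process() for l in ligands]

jobs = run_workflow(["lig01", "lig02"])
```

A real engine would also handle data staging, retries and resource selection; the point here is only the prepare → simulate → post-process dependency chain.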
10. Big Data in Biomedicine & Healthcare
Use of HPC in the context of genomics and gene sequencing:
• Genome sequencing: the time to sequence one human genome has fallen from ~5 years (2001) to 2 years (2004), 4 days (Jan 2008), 16 hours (Oct 2008), 3 hours (Nov 2009) and about 6 minutes (recently).
• Electronic health records.
• Integration of omics & imaging data.
This requires rapid development of computational science and informatics capabilities to deal with the management and analysis of data.
New Machines
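The turnaround improvement in that sequencing timeline is easy to quantify. Assuming the quoted endpoints of ~5 years and ~6 minutes:

```python
# Turnaround for one human genome: ~5 years in 2001 vs ~6 minutes now.
minutes_2001 = 5 * 365 * 24 * 60     # ~2.6 million minutes
minutes_recent = 6
speedup = minutes_2001 / minutes_recent
# roughly a 440,000-fold (over five orders of magnitude) reduction
```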
11. Cray Solutions for Life Sciences
Healthcare Provider: The Promise of Precision Treatment
Cray® XK7™ supercomputer
Cray’s Urika™ platform
Case study:
Oak Ridge National Laboratory (ORNL) is using
computing to delve deeper into big health data and
is proposing innovative solutions to grand
challenges in the country's health care system.
ORNL researchers are using Titan to simulate
outcomes of interventions, Urika for pattern
discovery, and cloud computing to understand what
happened.
12. Petascale Computing Facilities Used by Us
Kraken, Stampede, Lonestar, Anton, HECToR, PRACE, ARCHER, EMERALD, Blue Joule, Blue Wonder, and a GPGPU cluster.
14. C O M P U T E | S T O R E | A N A L Y Z E
Pistoia Alliance 2015
Oct 1, 2015
15. About Cray
Cray Inc.
Seymour Cray founded Cray Research in 1972:
• 1972-1996: Cray Research grew to leadership in supercomputing.
• 1996-2000: Cray was a subsidiary of SGI.
• April 2000: Cray Inc. formed.
• 2000-present: Cray Inc. grew to $525M in revenue in 2013.
Cray Inc. today:
• NASDAQ: CRAY
• Over 1,000 employees across 30 countries
• Headquartered in Seattle, WA
Three focus areas: computation, storage and analytics.
Seven major development sites: Austin, TX; Chippewa Falls, WI; Pleasanton, CA; St. Paul, MN; San Jose, CA; Seattle, WA; Bristol, UK.
16. Cray's Vision: The Fusion of Supercomputing and Big & Fast Data
Modeling the world: Cray supercomputers solve "grand challenges" in science, engineering and analytics across three pillars — compute, store, analyze.
• Math models: modeling and simulation augmented with data to provide the highest-fidelity virtual reality results.
• Data-intensive processing: high-throughput event processing and data capture from sensors, data feeds and instruments.
• Data models: integration of datasets and math models for search, analysis, predictive modeling and knowledge discovery.
17. Cray Product Range and LS Applicability
Two compute platforms, the CS400/Storm cluster supercomputer and the XC40 supercomputer, offering: the Aries interconnect, scalability, package density, accelerators and accelerator density, upgradeability, an integrated hardware and software stack, best-in-class power and cooling, proven operation at scale, and developer productivity.
Life-science applicability of both systems: molecular modeling, structural biology, machine learning, NGS, bioinformatics, image analysis.
18. Unprecedented Scalability
[Image: satellite tobacco mosaic virus. Source: Jim Phillips, SC'12; image: http://www.ks.uiuc.edu/Research/vmd/minitutorials/gelato/]
Cray Inc. Proprietary, 10/1/2015
19. Applying HPC Best Practice to Speed Up MegaSeq
Megan Puckelwartz et al. (University of Chicago) exploit the fact that the cost of sequencing an entire human genome is now moving into the range where sequencing is being broadly applied in both research and clinical settings.
http://beagle.ci.uchicago.edu/science-at-beagle/
Puckelwartz, M. J., et al., Bioinformatics (2014), bioinformatics.btu071.
Parallelization
20. Project: Parallelizing Inchworm
Trinity is a software tool developed for de novo reconstruction of transcriptomes from RNA-seq data (http://trinityrnaseq.sourceforge.net/). It runs in three stages:
• Inchworm uses a greedy search on the k-mer graph to assemble sequence contigs.
• Chrysalis bundles the contigs and builds individual de Bruijn graphs.
• Butterfly computes the final assembly.
Contributors: Dr Pierre Carrier, Dr Carlos P. Sosa, Dr Bill Long, Dr Brian Haas, Dr Timothy Tickle.
M. G. Grabherr et al., Nat. Biotechnol. 29 (7), 644-652 (2011).
R. Henschel et al., "Trinity RNA-Seq assembler performance optimization", XSEDE 2012: Proceedings of the 1st Conference of the Extreme Science and Engineering Discovery Environment: Bridging from the eXtreme to the campus and beyond.
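Inchworm's greedy step can be illustrated with a toy Python sketch: count k-mers, seed a contig with the most abundant one, then repeatedly extend with the highest-count overlapping k-mer. This is a deliberate simplification of the real algorithm (tiny reads, right-extension only, no reverse complements):

```python
from collections import Counter

def kmers(reads, k):
    """Count all k-mers across a set of reads."""
    counts = Counter()
    for r in reads:
        for i in range(len(r) - k + 1):
            counts[r[i:i + k]] += 1
    return counts

def greedy_contig(counts, k):
    """Seed with the most abundant k-mer, then greedily extend right
    with the highest-count overlapping k-mer (Inchworm-style)."""
    seed, _ = counts.most_common(1)[0]
    used = {seed}
    contig = seed
    while True:
        suffix = contig[-(k - 1):]
        candidates = [(counts[suffix + b], suffix + b) for b in "ACGT"
                      if counts[suffix + b] > 0 and suffix + b not in used]
        if not candidates:
            return contig
        _, best = max(candidates)
        used.add(best)
        contig += best[-1]

# Three overlapping toy reads assemble into one longer contig.
reads = ["ATGGCGT", "GGCGTAC", "CGTACGA"]
counts = kmers(reads, k=5)
contig = greedy_contig(counts, k=5)
```

Because every extension step is a local decision on the k-mer table, the expensive part (building and querying the table) is what the Cray project parallelizes.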
21. Cray Product Range and LS Applicability
Lustre parallel file system:
• Single POSIX namespace
• Modular scaling from 7.5 GB/s to 1.7 TB/s
• Integrated and preconfigured
• Reliability and availability at scale
• LS applicability: improved scalability; converged storage across grid, analytics and Hadoop; a storage layer for Cassandra, Spark and RDBs; improved I/O.
Archive:
• Multi-tier, single-namespace archive
• Rule-based policy migration
• Flexible integration with most OEM tape and disk
• Preconfigured and integrated
• LS applicability: data lake, analytical data and market data archival; an NGS data archive; data no longer 'deep-sixed'.
22. Cray Product Range and LS Applicability
Urika-GD Graph Discovery Appliance:
• The most scalable graph processor available
• Whole-graph analytics possible
• Open RDF/SPARQL
• Single memory space and extreme threaded processor
• LS applicability: precision medicine, drug repurposing, cybersecurity, data integration, cohort selection.
Urika-XA Extreme Analytics Platform:
• Cloudera 5.2/YARN; open to non-CDH apps
• Dense compute and memory
• SSD layer for HDFS; Lustre/POSIX for scale-out storage
• Spark-optimized
• LS applicability: real-time streaming analytics converged with regular analytics, machine learning, NGS workflow and analytics.
23. Life Science Market and Technology Drivers
• Data science: new data sources and emerging analytical approaches enable predictive modeling and knowledge discovery.
• Rise of high-performance analytics: the convergence of analytics and supercomputing is opening new opportunities to meet the pace of discovery.
• Cluster sprawl: ad-hoc cluster infrastructures are increasing complexity, reliability and usability challenges.
• Pace of technology: sites struggle to keep compute infrastructures current with rapidly changing life-sciences technologies.
• Precision medicine: the race to understand patients, diseases and treatments at the molecular level.
24. The Quest for In-Time Analytics
Response timeframes span from under 30 ms (low-latency applications on streaming data, today the province of a few data scientists who wrangle data) through interactive windows of up to ~10 minutes (business analysts accustomed to interactive time frames) to batch processing of stationary data beyond 10 minutes.
Low-latency applications require performance optimizations:
• Memory-storage hierarchies
• Fast interconnects
25. Explosion in Data Volume, Variety and Complexity
27. Explosion in Data Volume, Variety and Complexity
ELN
Medical Records
29. Existing tools are failing to keep up
30. Modern NGS Multi-Step Analytics Pipelines
Next-generation sequencers feed a multi-step pipeline:
• Data prep/acquisition: mRNA, miRNA, protein, SNP and metabolite data.
• Base analytics: background correction, normalization, QC, SNP calling.
• Contextualization: dbSNP, ClinVar, ANNOVAR, UniProt, biobank/LIMS.
• Advanced analytics: correlation analysis, regression, hypothesis testing, visualization.
The pipeline combines "big data" (trial data, drug data, patient data) into actionable insight.
31. Cray Multi-Step Analytics Pipelines: Manage All Aspects of the NGS Pipeline in One Environment
• Data prep/acquisition: mRNA, miRNA, protein, SNP and metabolite data.
• Base analytics: background correction, normalization, QC, SNP calling.
• Data integration: dbSNP, ClinVar, ANNOVAR, UniProt, biobank/LIMS.
• Advanced analytics: correlation analysis, regression, hypothesis testing, visualization.
The result: actionable insight.
32. Apache Spark Enables Modern Bioinformatics
[Diagram: the Spark stack, including MLlib and Spark SQL.]
33. Connecting to the Enterprise
[Diagram: data sources feed a big data platform, which is consumed by BI and visualization, BI on Hadoop, advanced analytics, and data transformation tools.]
34. Urika-XA Enables Spark
• Memory: Urika-XA is configured with 6 TB per rack, supporting complex NGS workflows and providing the freedom to model data based on the requirements of the analysis rather than the limitations of the machine.
• Compute: Urika-XA provides over 1,500 cores per rack, bringing complex analysis of big data to interactive time scales.
• Network: Urika-XA's high-speed interconnects accelerate complex data joins and graph analytics at scale.
• Storage:
  • Lustre: 120 TB of global, POSIX-compliant file system.
  • SSD: 38 TB of high-speed local SSD storage.
35. BioDT is an Open Platform
• Users aren't relegated to a limited set of proprietary tools.
• Includes 250+ popular tools, including tools from the Galaxy and GATK libraries.
• Supports ADAM.
• Easy to add new tools, to optimize tools for Hadoop, and to search tools.
• Tools can be R, Perl or Python scripts.
36. Lumenogix Bioinformatics-in-a-Box™ with Urika-XA
Whole human genome in 45 minutes.
[Chart: time to process a 50x whole human genome — 164 minutes on AWS vs 45 minutes on Urika-XA.]
Process times on Urika-XA:
• BWA: 17 minutes
• Tag & shuffle reads: 2 minutes
• Sort and compress: 1 minute
• Mark duplicates: 1 minute
• Realignment: 6 minutes
• Genotyping: 18 minutes
• Total: 45 minutes
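As a sanity check, the per-step times sum to the quoted 45-minute total, roughly 3.6x faster than the 164-minute AWS baseline:

```python
# Per-step times for the 50x whole-genome run on Urika-XA (minutes).
steps = {
    "BWA": 17,
    "tag & shuffle reads": 2,
    "sort and compress": 1,
    "mark duplicates": 1,
    "realignment": 6,
    "genotyping": 18,
}
total_minutes = sum(steps.values())
aws_minutes = 164
speedup = aws_minutes / total_minutes  # speedup vs the AWS baseline
```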
37. In a Cancer Biology Research Project
Research questions:
• Understanding the relationships between genes: does gene X regulate the expression of gene Y?
• How do mutations affect these relationships?
• What is the effect on the cell cycle?
• What are the effects on genome stability?
The original t-test analysis used R running on a gene-by-gene basis: a single gene takes ~1-3 minutes to analyze, so at that rate it would take 25 days to complete the entire 36K-sample experiment.
Using Spark on Urika-XA, the t-test was implemented with Scala and the Apache Commons Mathematics Library, in parallel across 1,500 cores, completing the entire experiment in under 20 minutes, with bioinformatician-friendly code and an interactive environment.
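The per-gene test is embarrassingly parallel, which is why it maps so well onto 1,500 cores. Below is a plain-Python sketch of the same pattern; the production code used Scala and Apache Commons Math on Spark, and the gene counts, distributions and pool size here are invented for illustration.

```python
import math
import random
import statistics
from concurrent.futures import ThreadPoolExecutor

def welch_t(a, b):
    """Welch's t statistic for two independent samples."""
    va, vb = statistics.variance(a), statistics.variance(b)
    return (statistics.mean(a) - statistics.mean(b)) / math.sqrt(
        va / len(a) + vb / len(b))

# Synthetic expression data: 200 genes, two conditions, 20 samples each.
rng = random.Random(1)
genes = {f"gene{i:03d}": ([rng.gauss(0.0, 1.0) for _ in range(20)],
                          [rng.gauss(0.3, 1.0) for _ in range(20)])
         for i in range(200)}

# Every gene's test is independent of every other gene's, so the whole
# table maps cleanly over a worker pool; the same shape of computation
# fans out across the machine's cores on Urika-XA.
with ThreadPoolExecutor(max_workers=8) as pool:
    t_stats = dict(zip(genes, pool.map(lambda ab: welch_t(*ab),
                                       genes.values())))
```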
38. Analytics Solutions
Urika-XA Extreme Analytics Platform:
• Turnkey advanced analytics platform
• Next-generation system architecture
• Engineered for performance
Urika-GD Graph Discovery Appliance:
• Discover unknown and hidden relationships in big data
• Real-time data discovery
• Realize rapid time-to-value
39. Thank you
40. Panel & audience discussion
Please enter your questions into the question or chat boxes