This session demonstrates how the cloud can accelerate breakthroughs in scientific research by providing on-demand access to powerful computing. You will gain insight into how researchers are using the cloud to solve complex science, engineering, and business problems that demand high-bandwidth, low-latency networking and very high compute capability, and you will hear how leveraging the cloud reduces the cost and time of large-scale, worldwide collaborative research. Researchers gain cost-efficient access to computational power, data storage, supercomputing resources, and data-sharing capabilities without implementation delays. Disease research can be accomplished in a fraction of the time, and innovative researchers at small schools or in distant corners of the world have access to the same computing power as those at major research institutions by leveraging Amazon EC2, Amazon S3, C3 instances, and more to increase collaboration. This session provides best practices and insight from the UC Berkeley AMPLab on the services used to connect disparate sets of data and drive meaningful new insight and impact.
Time to Science/Time to Results: Transforming Research in the Cloud
1. Accelerating Time to Science: Transforming Research in the Cloud
Jamie Kinney - @jamiekinney
Director of Scientific Computing, a.k.a. “SciCo” – Amazon Web Services
Michael Franklin - @amplab
Director, AMPLab - UC Berkeley
2. Agenda
• An introduction to scientific computing on AWS
• How are researchers using AWS today?
• Case study: The UC Berkeley AMP Lab
• Q & A
3. What do we mean by Scientific Computing?
Scientific Computing refers to the application of simulation, mathematical modeling, and quantitative analysis to solve scientific problems.
4. How is AWS Used for Scientific Computing?
• High Performance Computing (HPC) for Engineering and Simulation
• High Throughput Computing (HTC) for Data-Intensive Analytics
• Hybrid Supercomputing centers
• Collaborative Research Environments
• Citizen Science
• Science-as-a-Service
5. Why do researchers love using AWS?
Time to Science – Access research infrastructure in minutes
Low Cost – Pay-as-you-go pricing
Elastic – Easily add or remove capacity
Globally Accessible – Easily collaborate with researchers around the world
Secure – A collection of tools to protect data and privacy
Scalable – Access to effectively limitless capacity
6. Why does AWS care about Scientific Computing?
• We want to improve our world by accelerating the pace of scientific
discovery
• It is a great application of AWS with a broad customer base
• The scientific community helps us innovate on behalf of all customers
– Streaming data processing & analytics
– Exabyte scale data management solutions and exaflop scale compute
– Collaborative research tools and techniques
– New AWS regions
– Significant advances in low-power compute, storage and data centers
– Efficiencies which will lower our costs and therefore pricing for all customers
7. Research Grants
AWS provides free usage
credits to help researchers:
• Teach advanced courses
• Explore new projects
• Create resources for the
scientific community
aws.amazon.com/grants
8. Peering with all global research networks
Image courtesy John Hover - Brookhaven National Lab
11. High Throughput Computing at Scale
The Large Hadron Collider @ CERN includes 6,000+ researchers from over 40 countries and produces approximately 25PB of data each year.
The ATLAS and CMS experiments are using AWS for Monte Carlo simulations and analysis of LHC data.
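The slides don't include code, but the high-throughput pattern is easy to sketch: a minimal, embarrassingly parallel Monte Carlo job in Spark (Scala), estimating π as a stand-in for the actual detector simulations (not CERN's real workload):

import org.apache.spark.{SparkConf, SparkContext}
import scala.util.Random

// Minimal high-throughput Monte Carlo sketch: sample random points in the
// unit square and count how many land inside the quarter circle.
object MonteCarloPi {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("mc-pi"))
    val n = 100000000          // total samples
    val slices = 1000          // independent tasks; HTC jobs scale by adding slices
    val inside = sc.parallelize(1 to n, slices).map { _ =>
      val x = Random.nextDouble(); val y = Random.nextDouble()
      if (x * x + y * y < 1.0) 1L else 0L
    }.reduce(_ + _)
    println(s"pi ~= ${4.0 * inside / n}")
    sc.stop()
  }
}

Each slice is an independent task, so capacity can be added or removed elastically, which is what makes this class of workload such a natural fit for the cloud.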
12. Data-Intensive Computing
The Square Kilometre Array will link 250,000 radio telescopes together, creating the world’s most sensitive telescope. The SKA will generate zettabytes of raw data, publishing exabytes annually over 30-40 years.
Researchers are using AWS to develop and test:
• Data processing pipelines
• Image visualization tools
• Exabyte-scale research data management
• Collaborative research environments
aws.amazon.com/solutions/case-studies/icrar/
13. High Performance Computing
Simulations in the Automotive Sector
• Crash and materials simulations
• Fluid and thermal dynamics simulations
• Car body aerodynamics
• Electronics and electromagnetic simulations
Honda materials science simulations on AWS:
• Deploying scalable HPC clusters on AWS Spot – up to 1000 C3 instances (sketched below)
• Running more simulations than before, for more accurate results
“Cloud offers us an opportunity, as we can innovate faster than before.”
- Ayumi Tada, IT System Administrator, Honda R&D
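Honda's exact tooling isn't described, but programmatically requesting a large Spot fleet looks roughly like this sketch using the AWS SDK for Java (v1) from Scala; the AMI ID and bid price are placeholders, not Honda's actual values:

import com.amazonaws.services.ec2.AmazonEC2ClientBuilder
import com.amazonaws.services.ec2.model.{LaunchSpecification, RequestSpotInstancesRequest}

object SpotClusterSketch {
  def main(args: Array[String]): Unit = {
    val ec2 = AmazonEC2ClientBuilder.defaultClient()
    val spec = new LaunchSpecification()
      .withImageId("ami-xxxxxxxx")   // placeholder AMI for the cluster nodes
      .withInstanceType("c3.8xlarge")
    val request = new RequestSpotInstancesRequest()
      .withSpotPrice("0.50")         // illustrative maximum bid, USD per hour
      .withInstanceCount(1000)
      .withLaunchSpecification(spec)
    // Submit the request and print the request IDs so they can be polled for fulfillment.
    ec2.requestSpotInstances(request).getSpotInstanceRequests
      .forEach(r => println(r.getSpotInstanceRequestId))
  }
}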
14. Schrodinger & Cycle Computing:
Computational Chemistry for Better Solar Power
Simulation by Mark Thompson of the University of Southern California to see which of 205,000 organic compounds could be used in photovoltaic cells for solar panel material.
Estimated computation time of 264 years, completed in 18 hours.
• 156,314 core cluster, 8 regions
• 1.21 petaflops (Rpeak)
• $33,000 or 16¢ per molecule
Loosely coupled workload
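A quick sanity check shows the figures above are mutually consistent (assuming an 8,760-hour year):

\[
\frac{264~\text{yr} \times 8760~\text{hr/yr}}{18~\text{hr}} \approx 1.3 \times 10^{5},
\qquad
\frac{\$33{,}000}{205{,}000~\text{molecules}} \approx \$0.16~\text{per molecule}
\]

An effective speedup of roughly 130,000× is plausible for a 156,314-core cluster running a loosely coupled workload at less-than-perfect efficiency.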
15. Science-as-a-Service
Globus Genomics, DNAnexus, and SevenBridges Genomics offer inexpensive, easy-to-use, and secure platforms for processing and analyzing genomic data.
The Weather Company pushes four gigabytes of data to AWS each second in order to deliver 15 billion forecasts each day to its customers around the world.
aws.amazon.com/solutions/case-studies/the-weather-company/
16. Citizen Science
The Asteroid Data Hunters competition used AWS to develop better mechanisms for
finding near-Earth asteroids. The top algorithm is 18% better at finding asteroids!
19. AMPLab Overview
• 80+ Students, Postdocs, Faculty and Staff from:
Databases, Machine Learning, Systems, Security, and Networking
• 28 Industry Sponsors + White House Big Data Program: NSF CISE Expeditions in Computing and DARPA XData
• Founding Sponsors: Amazon Web Services, Google, and SAP
“… Berkeley’s AMPLab has already left an indelible mark on the world of
information technology, and even the web. But we haven’t yet experienced
the full impact of the group … Not even close.”
– Derrick Harris, GigaOM, Aug 2, 2014
Franklin, Jordan, Stoica, Patterson, Shenker, Recht, Katz, Joseph, Goldberg, Culler
20. AMPLab: Integrating 3 Resources
Algorithms
• Machine Learning, Statistical Methods
• Prediction, Business Intelligence
Machines
• Clusters and Clouds
• Warehouse Scale Computing
People
• Crowdsourcing, Human Computation
• Data Scientists, Analysts
22. Open Source Community Building
MeetUp on MLbase @Twitter (Aug 6, 2013)
Spark Summit SF (June 30, 2014)
23. Apps: Genomics (Patterson et al.)
Using BDAS, SNAP (Scalable Nucleotide Alignment Program) aligns reads in minutes vs. days
Why Speed Matters: A real-world use case
ADAM – Data formats and Processing
Patterns for Genomics on Big Data Platforms
(e.g., Spark)
Collaborations with: UCSF, UCSC, OHSU,
Microsoft Research, Mt. Sinai
M. Wilson, …, and C. Chiu, “Actionable Diagnosis of Neuroleptospirosis by Next-Generation Sequencing”,
June 4, 2014, New England Journal of Medicine.
26. AMPLab Unification Philosophy
Don’t specialize MapReduce – generalize it!
Two additions to Hadoop MR can enable all the models shown earlier (both illustrated in the sketch below)!
1. General Task DAGs
2. Data Sharing
For Users:
Fewer Systems to Use
Less Data Movement
BDAS components built on this generalized engine: Spark Streaming, GraphX, Spark SQL, MLbase, …
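A minimal Spark sketch of those two additions (the input path is hypothetical): the dataset is cached once and shared by two downstream computations, which together form a general task DAG rather than a single map/reduce pass:

import org.apache.spark.{SparkConf, SparkContext}

object DataSharingSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("data-sharing"))
    // Data sharing: cache the dataset once, reuse it across computations.
    val events = sc.textFile("hdfs:///logs/events").cache()
    // Branch 1 of the DAG: count error records.
    val errors = events.filter(_.contains("ERROR")).count()
    // Branch 2 of the DAG: aggregate events per host (first tab-separated field).
    events.map(line => (line.split("\t")(0), 1))
      .reduceByKey(_ + _)
      .saveAsTextFile("hdfs:///logs/by-host")
    println(s"errors: $errors")
    sc.stop()
  }
}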
27. In-Memory Dataflow System
M. Zaharia, M. Chowdhury, M. Franklin, S. Shenker, I. Stoica, “Spark: Cluster Computing with Working Sets”, USENIX HotCloud, 2010.
“It’s only September but it’s already clear that 2014 will
be the year of Apache Spark”
-- Datanami, 9/15/14
• Developed in AMPLab and its predecessor the RADLab
• Alternative to Hadoop MapReduce
• 10-100x speedup for ML and interactive queries
• Central component of the BDAS Stack
• “Graduated” to Apache Foundation -> Apache Spark
35. Memory-Optimized Dataflow View
[Figure: training data in HDFS flows through a chain of map/reduce stages; Spark keeps intermediate results in memory to efficiently move data between stages]
Spark: 10-100× faster than Hadoop MapReduce
36. Resilient Distributed Datasets (RDDs)
API: coarse-grained transformations (map, group-by, join, sort, filter, sample, …) on immutable collections (illustrated in the shell sketch below)
Resilient Distributed Datasets (RDDs):
» Collections of objects that can be stored in memory or disk across a cluster
» Built via parallel transformations (map, filter, …)
» Automatically rebuilt on failure
Rich enough to capture many models:
» Data flow models: MapReduce, Dryad, SQL, …
» Specialized models: Pregel, Hama, …
M. Zaharia, et al., “Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing”, NSDI 2012.
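In the interactive Scala shell (spark-shell), where sc is provided, a minimal illustration of the RDD API looks like this; the input path is hypothetical, and each step is a coarse-grained transformation recorded in the lineage graph, so lost partitions can be recomputed after a failure:

// Each transformation below is lazy; only the final action runs the job.
val reads = sc.textFile("hdfs:///data/reads.txt")
val pairs = reads.map(line => (line.split(",")(0), line))  // key by first field
val grouped = pairs.groupByKey()                           // wide (shuffle) dependency
val sampled = grouped.sample(withReplacement = false, fraction = 0.01)
sampled.count()                                            // action: triggers execution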
38. Apache Spark v1.3 (3/15)
Includes
» Spark (core)
» Spark Streaming
» GraphX
» MLlib
» Spark SQL – Query Processing
Wide range of interfaces:
» Enhanced DataFrames API (example below)
» Python / interactive IPython
» Scala / interactive scala shell
» R / interactive R-shell
» Java
Now included in all major Hadoop distributions
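As a small example of the DataFrames API using the Spark 1.3-era SQLContext in spark-shell (the dataset and column name are hypothetical):

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)                  // sc comes from spark-shell
val people = sqlContext.jsonFile("hdfs:///data/people.json")
people.printSchema()                                 // schema inferred from the JSON
people.filter(people("age") > 21)
  .groupBy("age")
  .count()
  .show()

(In later releases the same read is spelled sqlContext.read.json.)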
39. Data-Intensive Genomics
New population-scale experiments will sequence 10-100k samples
• 100k samples @ 60x WGS will generate ~20PB of read data (roughly 200 GB per sample) and ~300TB of genotype data
End-to-end pipeline latency is important to clinical work
We want to jointly analyze samples to uncover low-frequency variations
40. How can we improve analysis productivity?
Flat file formats sacrifice interoperability but do not improve performance
Common sort order invariants imposed by tools compromise correctness
Genomics APIs tend to be at a lower level of abstraction, which compromises productivity
41. ADAM
An open source, high-performance, distributed platform for genomic analysis
ADAM defines a:
1. Data schema and layout on disk*
2. Programming interface for distributed processing of genomic data**
3. Command line interface
* Via Parquet and Avro
** Work on Python integration is underway
43. Data Format
Schema can be updated without breaking backwards compatibility
Normalize metadata fields into schema for O(1) metadata access
Models are “dumb”; enhance as necessary with rich objects
record AlignmentRecord {
  // Alignment position and read description (mirroring core SAM/BAM fields)
  union { null, Contig } contig = null;
  union { null, long } start = null;
  union { null, long } end = null;
  union { null, int } mapq = null;
  union { null, string } readName = null;
  union { null, string } sequence = null;
  union { null, string } mateReference = null;
  union { null, long } mateAlignmentStart = null;
  union { null, string } cigar = null;
  union { null, string } qual = null;
  union { null, string } recordGroupName = null;
  union { int, null } basesTrimmedFromStart = 0;
  union { int, null } basesTrimmedFromEnd = 0;
  // Read flags (the SAM bitfield expressed as named booleans)
  union { boolean, null } readPaired = false;
  union { boolean, null } properPair = false;
  union { boolean, null } readMapped = false;
  union { boolean, null } mateMapped = false;
  union { boolean, null } firstOfPair = false;
  union { boolean, null } secondOfPair = false;
  union { boolean, null } failedVendorQualityChecks = false;
  union { boolean, null } duplicateRead = false;
  union { boolean, null } readNegativeStrand = false;
  union { boolean, null } mateNegativeStrand = false;
  union { boolean, null } primaryAlignment = false;
  union { boolean, null } secondaryAlignment = false;
  union { boolean, null } supplementaryAlignment = false;
  union { null, string } mismatchingPositions = null;
  union { null, string } origQual = null;
  union { null, string } attributes = null;
  // Read group metadata, denormalized into the record for O(1) access
  union { null, string } recordGroupSequencingCenter = null;
  union { null, string } recordGroupDescription = null;
  union { null, long } recordGroupRunDateEpoch = null;
  union { null, string } recordGroupFlowOrder = null;
  union { null, string } recordGroupKeySequence = null;
  union { null, string } recordGroupLibrary = null;
  union { null, int } recordGroupPredictedMedianInsertSize = null;
  union { null, string } recordGroupPlatform = null;
  union { null, string } recordGroupPlatformUnit = null;
  union { null, string } recordGroupSample = null;
  union { null, Contig } mateContig = null;
}
Schemas at https://www.github.com/bigdatagenomics/bdg-formats
44. Parquet: A Modern Big Data Storage Format
ASF Incubator project, based on Google Dremel
High-performance columnar store with support for projections and push-down predicates (sketched below)
Short read data stored in Parquet achieves a 25% improvement in size over compressed BAM
Enables scale-out using modern Big Data technology (e.g., Spark)
Image from Parquet format definition: https://www.github.com/apache/incubator-parquet-format
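In Spark, projections and predicate push-down come along with the Parquet integration. A sketch in spark-shell against the AlignmentRecord columns shown on the previous slide (the path is hypothetical):

// Spark 1.3-era API; later spelled sqlContext.read.parquet(...).
val reads = sqlContext.parquetFile("hdfs:///genomics/reads.parquet")
// Only the three projected columns are read from disk, and the mapq
// predicate can be pushed down into the Parquet scan.
reads.select("contig", "start", "mapq")
  .filter(reads("mapq") >= 30)
  .show()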
45. ADAM’s API
ADAM is built on top of Apache Spark, which provides the RDD abstraction → distributed arrays
Common primitives include:
• Aggregates: BQSR, Indel Realignment
• Bucketing: Duplicate Marking, Concordance
• Region Joins: Variant Calling and Filtration (see the sketch below)
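ADAM's actual primitives are richer, but a simplified, hypothetical region join conveys the idea: broadcast a small set of target regions, then keep the reads that overlap any of them (all names below are invented for illustration, not ADAM's real API):

// Hypothetical sketch in spark-shell; not ADAM's real API.
case class Region(contig: String, start: Long, end: Long) {
  def overlaps(o: Region): Boolean =
    contig == o.contig && start < o.end && o.start < end
}
// Small side of the join: broadcast the target regions to every executor.
val targets = sc.broadcast(Seq(Region("chr1", 100000L, 200000L)))
// Large side: alignments, represented here by their mapped regions.
val readRegions = sc.parallelize(Seq(
  Region("chr1", 150000L, 150100L),
  Region("chr2", 5000L, 5100L)))
val hits = readRegions.filter(r => targets.value.exists(_.overlaps(r)))
hits.collect().foreach(println)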
46. ADAM Performance Bottom Line
F. Nothaft, et al., “Rethinking Data-Intensive Science Using Scalable Analytics Systems”, ACM SIGMOD Conf., June 2015, to appear.
[Chart: end-to-end analysis cost comparison, $214.39 vs. $78.92]
47. ADAM Performance Update
Analysis run using Amazon EC2; single node was hs1.8xlarge, cluster nodes were m2.4xlarge
Scripts available at https://www.github.com/fnothaft/bdg-recipes.git, “sigmod” branch
Achieves linear scalability out to 128 nodes for most tasks
2-4x improvement over {GATK, samtools, Picard} on a single node
48. Scalable Analytics for Science
Data Model is the “narrow waist” of the architecture
Modern “NoSQL” models support evolution and heterogeneity with high
performance.
BDAS Declarative Analytics: Specify What not How
MLBase chooses:
• Algorithms/Operators
• Ordering and Physical Placement
• Parameter and Hyperparameter Settings
• Featurization
Leverages BDAS (Spark, GraphX, Tachyon) and Hadoop File System
for Speed and Scale
49. To find out more or get involved:
amplab.berkeley.edu
franklin@berkeley.edu
UC BERKELEY
Thanks to NSF CISE Expeditions in Computing, DARPA XData,
Founding Sponsors: Amazon Web Services, Google, and SAP,
the Thomas and Stacy Siebel Foundation,
all our industrial sponsors and partners, and all the members of the AMPLab Team.