This session demonstrates how the cloud can accelerate breakthroughs in scientific research by providing on-demand access to powerful computing. You will gain insight into how researchers are using the cloud to solve complex science, engineering, and business problems that demand high-bandwidth, low-latency networking and very high compute capability, and you will hear how leveraging the cloud reduces the cost and time of large-scale, worldwide collaborative research. Researchers gain cost-efficient access to computational power, data storage, supercomputing resources, and data-sharing capabilities without implementation delays. Disease research can be accomplished in a fraction of the time, and innovative researchers at small schools or in distant corners of the world have access to the same computing power as those at major research institutions by leveraging Amazon EC2, Amazon S3, C3 instances, and more to increase collaboration. This session provides best practices and insight from the UC Berkeley AMPLab on the services used to connect disparate sets of data and drive meaningful new insight and impact.
Time to Science/Time to Results: Transforming Research in the Cloud
1. Accelerating Time to Science: Transforming Research in the Cloud
Jamie Kinney - @jamiekinney
Director of Scientific Computing, a.k.a. “SciCo” – Amazon Web Services
Michael Franklin - @amplab
Director, AMPLab - UC Berkeley
2. Agenda
• An introduction to scientific computing on AWS
• How are researchers using AWS today?
• Case study: The UC Berkeley AMP Lab
• Q & A
3. What do we mean by Scientific Computing?
Scientific Computing refers to the application of simulation, mathematical modeling, and quantitative analysis to solve scientific problems.
4. How is AWS Used for Scientific Computing?
• High Performance Computing (HPC) for Engineering and Simulation
• High Throughput Computing (HTC) for Data-Intensive Analytics
• Hybrid Supercomputing centers
• Collaborative Research Environments
• Citizen Science
• Science-as-a-Service
5. Why do researchers love using AWS?
Time to Science – Access research infrastructure in minutes
Low Cost – Pay-as-you-go pricing
Elastic – Easily add or remove capacity
Globally Accessible – Easily collaborate with researchers around the world
Secure – A collection of tools to protect data and privacy
Scalable – Access to effectively limitless capacity
6. Why does AWS care about Scientific Computing?
• We want to improve our world by accelerating the pace of scientific
discovery
• It is a great application of AWS with a broad customer base
• The scientific community helps us innovate on behalf of all customers
– Streaming data processing & analytics
– Exabyte scale data management solutions and exaflop scale compute
– Collaborative research tools and techniques
– New AWS regions
– Significant advances in low-power compute, storage and data centers
– Efficiencies which will lower our costs and therefore pricing for all customers
7. Research Grants
AWS provides free usage
credits to help researchers:
• Teach advanced courses
• Explore new projects
• Create resources for the
scientific community
aws.amazon.com/grants
8. Peering with all global research networks
Image courtesy John Hover - Brookhaven National Lab
11. High Throughput Computing at Scale
The Large Hadron Collider @ CERN includes 6,000+ researchers from over 40 countries and produces approximately 25PB of data each year.
The ATLAS and CMS experiments are using AWS for Monte Carlo simulations and analysis of LHC data.
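The slides don't include code, but the high-throughput pattern is easy to sketch: a minimal, embarrassingly parallel Monte Carlo job in Spark (Scala), estimating π as a stand-in for the actual detector simulations (not CERN's real workload):

import org.apache.spark.{SparkConf, SparkContext}
import scala.util.Random

// Minimal high-throughput Monte Carlo sketch: sample random points in the
// unit square and count how many land inside the quarter circle.
object MonteCarloPi {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("mc-pi"))
    val n = 100000000          // total samples
    val slices = 1000          // independent tasks; HTC jobs scale by adding slices
    val inside = sc.parallelize(1 to n, slices).map { _ =>
      val x = Random.nextDouble(); val y = Random.nextDouble()
      if (x * x + y * y < 1.0) 1L else 0L
    }.reduce(_ + _)
    println(s"pi ~= ${4.0 * inside / n}")
    sc.stop()
  }
}

Each slice is an independent task, so capacity can be added or removed elastically, which is what makes this class of workload such a natural fit for the cloud.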
12. Data-Intensive Computing
The Square Kilometre Array will link 250,000 radio telescopes together, creating the world’s most sensitive telescope. The SKA will generate zettabytes of raw data, publishing exabytes annually over 30-40 years.
Researchers are using AWS to develop and test:
• Data processing pipelines
• Image visualization tools
• Exabyte-scale research data management
• Collaborative research environments
aws.amazon.com/solutions/case-studies/icrar/
13. High Performance Computing
Simulations in the Automotive Sector
• Crash and materials simulations
• Fluid and thermal dynamics simulations
• Car body aerodynamics
• Electronics and electromagnetic simulations
Honda materials science simulations on AWS:
• Deploying scalable HPC clusters on AWS Spot – up to 1000 C3 instances (sketched below)
• Running more simulations than before, for more accurate results
“Cloud offers us an opportunity, as we can innovate faster than before.”
- Ayumi Tada, IT System Administrator, Honda R&D
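Honda's exact tooling isn't described, but programmatically requesting a large Spot fleet looks roughly like this sketch using the AWS SDK for Java (v1) from Scala; the AMI ID and bid price are placeholders, not Honda's actual values:

import com.amazonaws.services.ec2.AmazonEC2ClientBuilder
import com.amazonaws.services.ec2.model.{LaunchSpecification, RequestSpotInstancesRequest}

object SpotClusterSketch {
  def main(args: Array[String]): Unit = {
    val ec2 = AmazonEC2ClientBuilder.defaultClient()
    val spec = new LaunchSpecification()
      .withImageId("ami-xxxxxxxx")   // placeholder AMI for the cluster nodes
      .withInstanceType("c3.8xlarge")
    val request = new RequestSpotInstancesRequest()
      .withSpotPrice("0.50")         // illustrative maximum bid, USD per hour
      .withInstanceCount(1000)
      .withLaunchSpecification(spec)
    // Submit the request and print the request IDs so they can be polled for fulfillment.
    ec2.requestSpotInstances(request).getSpotInstanceRequests
      .forEach(r => println(r.getSpotInstanceRequestId))
  }
}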
14. Schrodinger & Cycle Computing:
Computational Chemistry for Better Solar Power
Simulation by Mark Thompson of the University of Southern California to see which of 205,000 organic compounds could be used in photovoltaic cells for solar panel material.
Estimated computation time of 264 years, completed in 18 hours.
• 156,314 core cluster, 8 regions
• 1.21 petaflops (Rpeak)
• $33,000 or 16¢ per molecule
Loosely coupled workload
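A quick sanity check shows the figures above are mutually consistent (assuming an 8,760-hour year):

\[
\frac{264~\text{yr} \times 8760~\text{hr/yr}}{18~\text{hr}} \approx 1.3 \times 10^{5},
\qquad
\frac{\$33{,}000}{205{,}000~\text{molecules}} \approx \$0.16~\text{per molecule}
\]

An effective speedup of roughly 130,000× is plausible for a 156,314-core cluster running a loosely coupled workload at less-than-perfect efficiency.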
15. Science-as-a-Service
Globus Genomics, DNAnexus, and SevenBridges Genomics offer inexpensive, easy-to-use, and secure platforms for processing and analyzing genomic data.
The Weather Company pushes four gigabytes of data to AWS each second in order to deliver 15 billion forecasts each day to its customers around the world.
aws.amazon.com/solutions/case-studies/the-weather-company/
16. Citizen Science
The Asteroid Data Hunters competition used AWS to develop better mechanisms for
finding near-Earth asteroids. The top algorithm is 18% better at finding asteroids!
19. AMPLab Overview
• 80+ Students, Postdocs, Faculty and Staff from:
Databases, Machine Learning, Systems, Security, and Networking
• 28 Industry Sponsors + White House Big Data Program: NSF CISE Expeditions in Computing and DARPA XData
• Founding Sponsors: Amazon Web Services, Google, and SAP
“… Berkeley’s AMPLab has already left an indelible mark on the world of
information technology, and even the web. But we haven’t yet experienced
the full impact of the group … Not even close.”
– Derrick Harris, GigaOM, Aug 2, 2014
Franklin, Jordan, Stoica, Patterson, Shenker, Recht, Katz, Joseph, Goldberg, Culler
20. AMPLab: Integrating 3 Resources
Algorithms
• Machine Learning, Statistical Methods
• Prediction, Business Intelligence
Machines
• Clusters and Clouds
• Warehouse Scale Computing
People
• Crowdsourcing, Human Computation
• Data Scientists, Analysts
22. Open Source Community Building
MeetUp on MLbase @Twitter (Aug 6, 2013)
Spark Summit SF (June 30, 2014)
23. Apps: Genomics (Patterson et al.)
Using BDAS, SNAP (Scalable Nucleotide Alignment Program) aligns reads in minutes vs. days
Why Speed Matters: A real-world use case
ADAM – Data formats and Processing
Patterns for Genomics on Big Data Platforms
(e.g., Spark)
Collaborations with: UCSF, UCSC, OHSU,
Microsoft Research, Mt. Sinai
M. Wilson, …, and C. Chiu, “Actionable Diagnosis of Neuroleptospirosis by Next-Generation Sequencing”,
June 4, 2014, New England Journal of Medicine.
26. AMPLab Unification Philosophy
Don’t specialize MapReduce – generalize it!
Two additions to Hadoop MR can enable all the models shown earlier (both illustrated in the sketch below)!
1. General Task DAGs
2. Data Sharing
For Users:
Fewer Systems to Use
Less Data Movement
BDAS components built on this generalized engine: Spark Streaming, GraphX, Spark SQL, MLbase, …
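A minimal Spark sketch of those two additions (the input path is hypothetical): the dataset is cached once and shared by two downstream computations, which together form a general task DAG rather than a single map/reduce pass:

import org.apache.spark.{SparkConf, SparkContext}

object DataSharingSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("data-sharing"))
    // Data sharing: cache the dataset once, reuse it across computations.
    val events = sc.textFile("hdfs:///logs/events").cache()
    // Branch 1 of the DAG: count error records.
    val errors = events.filter(_.contains("ERROR")).count()
    // Branch 2 of the DAG: aggregate events per host (first tab-separated field).
    events.map(line => (line.split("\t")(0), 1))
      .reduceByKey(_ + _)
      .saveAsTextFile("hdfs:///logs/by-host")
    println(s"errors: $errors")
    sc.stop()
  }
}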
27. In-Memory Dataflow System
M. Zaharia, M. Chowdhury, M. Franklin, S. Shenker, I. Stoica, “Spark: Cluster Computing with Working Sets”, USENIX HotCloud, 2010.
“It’s only September but it’s already clear that 2014 will
be the year of Apache Spark”
-- Datanami, 9/15/14
• Developed in AMPLab and its predecessor the RADLab
• Alternative to Hadoop MapReduce
• 10-100x speedup for ML and interactive queries
• Central component of the BDAS Stack
• “Graduated” to Apache Foundation -> Apache Spark
35. Memory-Optimized Dataflow View
[Figure: training data in HDFS flows through a chain of map/reduce stages; Spark keeps intermediate results in memory to efficiently move data between stages]
Spark: 10-100× faster than Hadoop MapReduce
36. Resilient Distributed Datasets (RDDs)
API: coarse-grained transformations (map, group-by, join, sort, filter, sample, …) on immutable collections (illustrated in the shell sketch below)
Resilient Distributed Datasets (RDDs):
» Collections of objects that can be stored in memory or disk across a cluster
» Built via parallel transformations (map, filter, …)
» Automatically rebuilt on failure
Rich enough to capture many models:
» Data flow models: MapReduce, Dryad, SQL, …
» Specialized models: Pregel, Hama, …
M. Zaharia, et al., “Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing”, NSDI 2012.
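In the interactive Scala shell (spark-shell), where sc is provided, a minimal illustration of the RDD API looks like this; the input path is hypothetical, and each step is a coarse-grained transformation recorded in the lineage graph, so lost partitions can be recomputed after a failure:

// Each transformation below is lazy; only the final action runs the job.
val reads = sc.textFile("hdfs:///data/reads.txt")
val pairs = reads.map(line => (line.split(",")(0), line))  // key by first field
val grouped = pairs.groupByKey()                           // wide (shuffle) dependency
val sampled = grouped.sample(withReplacement = false, fraction = 0.01)
sampled.count()                                            // action: triggers execution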
38. Apache Spark v1.3 (3/15)
Includes
» Spark (core)
» Spark Streaming
» GraphX
» MLlib
» Spark SQL – Query Processing
Wide range of interfaces:
» Enhanced DataFrames API (example below)
» Python / interactive IPython
» Scala / interactive scala shell
» R / interactive R-shell
» Java
Now included in all major Hadoop distributions
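As a small example of the DataFrames API using the Spark 1.3-era SQLContext in spark-shell (the dataset and column name are hypothetical):

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)                  // sc comes from spark-shell
val people = sqlContext.jsonFile("hdfs:///data/people.json")
people.printSchema()                                 // schema inferred from the JSON
people.filter(people("age") > 21)
  .groupBy("age")
  .count()
  .show()

(In later releases the same read is spelled sqlContext.read.json.)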
39. Data-Intensive Genomics
New population-scale experiments will sequence 10-100k samples
• 100k samples @ 60x WGS will generate ~20PB of read data (roughly 200 GB per sample) and ~300TB of genotype data
End-to-end pipeline latency is important to clinical work
We want to jointly analyze samples to uncover low-frequency variations
40. How can we improve analysis productivity?
Flat file formats sacrifice interoperability but do not improve performance
Common sort order invariants imposed by tools compromise correctness
Genomics APIs tend to be at a lower level of abstraction, which compromises productivity
41. ADAM
An open source, high-performance, distributed platform for genomic analysis
ADAM defines a:
1. Data schema and layout on disk*
2. Programming interface for distributed processing of genomic data**
3. Command line interface
* Via Parquet and Avro
** Work on Python integration is underway
43. Data Format
Schema can be updated without breaking backwards compatibility
Normalize metadata fields into schema for O(1) metadata access
Models are “dumb”; enhance as necessary with rich objects
record AlignmentRecord {
  // Alignment position and read description (mirroring core SAM/BAM fields)
  union { null, Contig } contig = null;
  union { null, long } start = null;
  union { null, long } end = null;
  union { null, int } mapq = null;
  union { null, string } readName = null;
  union { null, string } sequence = null;
  union { null, string } mateReference = null;
  union { null, long } mateAlignmentStart = null;
  union { null, string } cigar = null;
  union { null, string } qual = null;
  union { null, string } recordGroupName = null;
  union { int, null } basesTrimmedFromStart = 0;
  union { int, null } basesTrimmedFromEnd = 0;
  // Read flags (the SAM bitfield expressed as named booleans)
  union { boolean, null } readPaired = false;
  union { boolean, null } properPair = false;
  union { boolean, null } readMapped = false;
  union { boolean, null } mateMapped = false;
  union { boolean, null } firstOfPair = false;
  union { boolean, null } secondOfPair = false;
  union { boolean, null } failedVendorQualityChecks = false;
  union { boolean, null } duplicateRead = false;
  union { boolean, null } readNegativeStrand = false;
  union { boolean, null } mateNegativeStrand = false;
  union { boolean, null } primaryAlignment = false;
  union { boolean, null } secondaryAlignment = false;
  union { boolean, null } supplementaryAlignment = false;
  union { null, string } mismatchingPositions = null;
  union { null, string } origQual = null;
  union { null, string } attributes = null;
  // Read group metadata, denormalized into the record for O(1) access
  union { null, string } recordGroupSequencingCenter = null;
  union { null, string } recordGroupDescription = null;
  union { null, long } recordGroupRunDateEpoch = null;
  union { null, string } recordGroupFlowOrder = null;
  union { null, string } recordGroupKeySequence = null;
  union { null, string } recordGroupLibrary = null;
  union { null, int } recordGroupPredictedMedianInsertSize = null;
  union { null, string } recordGroupPlatform = null;
  union { null, string } recordGroupPlatformUnit = null;
  union { null, string } recordGroupSample = null;
  union { null, Contig } mateContig = null;
}
Schemas at https://www.github.com/bigdatagenomics/bdg-formats
44. Parquet: A Modern Big Data Storage Format
ASF Incubator project, based on Google Dremel
High-performance columnar store with support for projections and push-down predicates (sketched below)
Short read data stored in Parquet achieves a 25% improvement in size over compressed BAM
Enables scale-out using modern Big Data technology (e.g., Spark)
Image from Parquet format definition: https://www.github.com/apache/incubator-parquet-format
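In Spark, projections and predicate push-down come along with the Parquet integration. A sketch in spark-shell against the AlignmentRecord columns shown on the previous slide (the path is hypothetical):

// Spark 1.3-era API; later spelled sqlContext.read.parquet(...).
val reads = sqlContext.parquetFile("hdfs:///genomics/reads.parquet")
// Only the three projected columns are read from disk, and the mapq
// predicate can be pushed down into the Parquet scan.
reads.select("contig", "start", "mapq")
  .filter(reads("mapq") >= 30)
  .show()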
45. ADAM’s API
ADAM is built on top of Apache Spark, which provides the RDD abstraction → distributed arrays
Common primitives include:
• Aggregates: BQSR, Indel Realignment
• Bucketing: Duplicate Marking, Concordance
• Region Joins: Variant Calling and Filtration (see the sketch below)
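ADAM's actual primitives are richer, but a simplified, hypothetical region join conveys the idea: broadcast a small set of target regions, then keep the reads that overlap any of them (all names below are invented for illustration, not ADAM's real API):

// Hypothetical sketch in spark-shell; not ADAM's real API.
case class Region(contig: String, start: Long, end: Long) {
  def overlaps(o: Region): Boolean =
    contig == o.contig && start < o.end && o.start < end
}
// Small side of the join: broadcast the target regions to every executor.
val targets = sc.broadcast(Seq(Region("chr1", 100000L, 200000L)))
// Large side: alignments, represented here by their mapped regions.
val readRegions = sc.parallelize(Seq(
  Region("chr1", 150000L, 150100L),
  Region("chr2", 5000L, 5100L)))
val hits = readRegions.filter(r => targets.value.exists(_.overlaps(r)))
hits.collect().foreach(println)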
46. ADAM Performance Bottom Line
F. Nothaft, et al., “Rethinking Data-Intensive Science Using Scalable Analytics Systems”, ACM SIGMOD Conf., June 2015, to appear.
[Chart: end-to-end analysis cost comparison, $214.39 vs. $78.92]
47. ADAM Performance Update
Analysis run using Amazon EC2; single node was hs1.8xlarge, cluster nodes were m2.4xlarge
Scripts available at https://www.github.com/fnothaft/bdg-recipes.git, “sigmod” branch
Achieves linear scalability out to 128 nodes for most tasks
2-4x improvement over {GATK, samtools, Picard} on a single node
48. Scalable Analytics for Science
Data Model is the “narrow waist” of the architecture
Modern “NoSQL” models support evolution and heterogeneity with high
performance.
BDAS Declarative Analytics: Specify What not How
MLBase chooses:
• Algorithms/Operators
• Ordering and Physical Placement
• Parameter and Hyperparameter Settings
• Featurization
Leverages BDAS (Spark, GraphX, Tachyon) and Hadoop File System
for Speed and Scale
49. To find out more or get involved:
amplab.berkeley.edu
franklin@berkeley.edu
UC BERKELEY
Thanks to NSF CISE Expeditions in Computing, DARPA XData,
Founding Sponsors: Amazon Web Services, Google, and SAP,
the Thomas and Stacy Siebel Foundation,
all our industrial sponsors and partners, and all the members of the AMPLab Team.