"Running high-performance scientific and engineering applications is challenging no matter where you do it. Join IT executives from Hitachi Global Storage Technology, The Aerospace Corporation, Novartis, and Cycle Computing and learn how they have used the AWS cloud to deploy mission-critical HPC workloads.
Cycle Computing leads the session on how organizations of any scale can run HPC workloads on AWS. Hitachi Global Storage Technology discusses experiences using the cloud to create next-generation hard drives. The Aerospace Corporation provides perspectives on running MPI and other simulations, and offer insights into considerations like security while running rocket science on the cloud. Novartis Institutes for Biomedical Research talks about a scientific computing environment to do performance benchmark workloads and large HPC clusters, including a 30,000-core environment for research in the fight against cancer, using the Cancer Genome Atlas (TCGA)."
Real-world Cloud HPC at Scale, for Production Workloads (BDT212) | AWS re:Invent 2013
1. Real-world Cloud HPC at Scale, for
Production Workloads
Jason A Stowe, Cycle Computing
November 15, 2013
© 2013 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc.
3. Goals for today
• See real-world use cases from 3 leading engineering and scientific computing users
  – Steve Phillpott, CIO, HGST, a Western Digital Company
  – Bill E. Williams, Director, The Aerospace Corporation
  – Michael Steeves, Sr. Systems Engineer, Novartis
• Understand the motivations, strategies, and lessons learned in running HPC / Big Data workloads in the cloud
• See the varying scales and application types that run well, including a 1.21 PetaFLOPS environment
5. Journey to the Cloud
Steve Phillpott
CIO
HGST, a Western Digital Company
6. Cloud & Datacenter
• Founded in 2003 through the combination of the hard drive businesses of IBM, the inventor of the hard drive, and HGST, Ltd. (+3 acquisitions)
• Acquired by Western Digital in 2012
• More than 4,200 active worldwide patents
• Headquartered in San Jose, California
• Approximately 41,000 employees worldwide
• Develops innovative, advanced hard disk drives, enterprise-class solid state drives, external storage solutions and services
• Delivers intelligent storage devices that tightly integrate hardware and software to maximize solution performance
Product lineup: Performance Enterprise (PCIe Enterprise SSD; SAS 10K & 15K HDDs); Capacity Enterprise (Ultrastar® 7200 RPM & CoolSpin HDDs; Ultrastar® & MegaScale DC™)
7. Zero to Cloud in 6+ Months
April 2013 to 31 Oct 2013:
✓ Cloud eMail – Microsoft Office 365
✓ Cloud eMail archiving/eDiscovery
✓ External Single Sign-On (off VPN)
✓ Cloud File/Collaboration – Box
✓ Cloud CRM – Salesforce.com (integrated to save files in Box)
✓ Cloud High Performance Computing (HPC) on Amazon AWS
✓ Cloud Big Data Platform on Amazon AWS
8. Responding to the Changing Business Model
• Where is our business model headed?
  – "New Age of Innovation" as a guide
  – N=1: Focus on the individual customer experience
  – R=G: Resources are global
• Implications
  – Increase in strategic partnering
  – Need for a high level of flexibility
  – Leveraging external expertise
Use of the cloud/SaaS aligns with the virtual business model:
• Variable cost model critically important
• Lightweight, scalable services
• Reduced up-front capital spend
• Accelerated provisioning
• Pay as you go
9. Paradigm Shift: Consumerization of IT
"I have better technology at home"
Consumer Web: a new paradigm in ease of use and reduced cost.
• The consumer web has been driven by a series of platforms, and these platforms are household brand names today
• When we use these platforms, it continually amazes us how easily and consistently they work
• A new set of services: DRM to iTunes
Yet our workplace applications are cumbersome, costly, difficult to navigate, and require extensive support.
– Workday, 2009
10. The Big Switch – The Box has Disappeared
The transformation of computing as we know it:
• Physical to virtual/digital move
  – Do you really care which computer processed your last Google search?
• Efficiency
  – Do not waste a CPU cycle or a byte of memory. Building a 4-story building and only using the 1st floor.
• Utility: IT as a service – plug it in and get it
  – Where the electricity industry has gone, computing is following
  – The computing shift is almost invisible to the end user
DATA is the value to the organization, not the "where".
11. Enabling the Virtual Organization
Reframing IT away from thinking of "the app":
• Business intelligence and analytics
• End-to-end business processes
• Enterprise data management
• New computing platforms
• Strategic outsourcing
• Software as a Service (SaaS)
New IT organizational structures: support and align to the "new business model".
12. Creating an Innovation Playground: Where to Start and How to Evolve
Foundations: IT supports business strategy; executive buy-in (CEO, CIO, InfoSec, etc.); reduce cap-ex and optimize datacenter usage.
Maturity phases (Awareness → Understanding → Transition → Commitment):
• Educate (build knowledge)
  – Team involvement
  – Conferences
  – Vendor briefings
  – Expert services
  – Best practices
• Experiment (play, learn, build expertise)
  – Team approach
  – Hands-on approach
  – Understand the value proposition
  – Understand constraints
• Migrate (implement)
  – Migrate dev/test environments
  – Migrate or launch new apps on the cloud
  – Identify apps fit for cloud computing
  – Define new processes
  – Collaborate with other companies
• Expand (outcome defined)
  – Embrace success
  – Showcase cost savings
  – Build an enterprise cloud strategy
  – Learn from each experience
  – Expand accordingly
13. Multiple Opportunities to Leverage Amazon Web Services (AWS)
AWS: ">5x the compute capacity of the next 14 providers combined" – Gartner, Aug 2013
• Access to massive compute and storage
• Billed by the hour – only pay for what is used
• HGST Japan Research Lab: using AWS for a higher-performance, lower-cost, faster-deployed solution vs. buying a huge on-site cluster
Develop AWS competency:
• Many opportunities: in-house and commercial HPC applications are "cloud ready"
• Provide computing when needed: reduce capital investment and risk, and increase flexibility
• Faster response to business needs: rapid prototyping to pilot new IT capabilities with a "PO process"; set up users, allocate compute and storage in minutes, load apps, and go
• AWS provides a great option for disaster recovery for our on-premises clusters and storage
14. HGST's Amazon HPC Platform
[Figure: molecular-simulation scaling chart, number of atoms (1.E+03 to 1.E+07) vs. number of cores (0–600), spanning basic molecular simulation to large-scale molecular simulation for HDI (lube molecules spreading onto COC). Case 1: relaxation time < 1 ns; Case 2: relaxation time 5 ns; Case 3: lube depletion in TAR (2D heat profile, 300,000 atoms, 36 nm heat spot).]
Simulation applications: Molecular Dynamics (MAGLAND); Read/Write Magnetics and Electro-Magnetic Fields (CST, Commercial LLG, Ansys HFSS); Mechanical (Ansys).
Base HPC platform:
• Scalable to thousands of instances to support numerous simultaneous simulations
• Pre- and post-processing server farms
• New G2 instances add visualization capabilities
15. Big Data's "3 V's"
Definition (from Snijders et al.): "Data sets so large and complex that they become awkward to work with using standard tools and techniques."
Three "V's" of Big Data, with the pragmatic trend for each:
• Variety – data sources, data types, applications: Structured → Unstructured, Semi-Structured & Structured
• Volume – data collected, analysis & metadata creation: Terabytes → Petabytes & Exabytes
• Velocity – data acquisition, analysis & action: Batch → Real-Time & Streaming
Implications & opportunities:
• Hardware and software optimization
• Architectural shifts: scale-out systems, distributed filesystems, tiered storage, Hadoop…
Key difference: data structure does not need to be defined before loading.
16. Data Sources / Big Data Platform
[Diagram: data sources (slider, wafer, media, substrate, HGA, HDD; SAP/DWs; field data; supplier data) feed all raw parametric, logistic, and vintage data into the Big Data Platform. Raw extracts drive parallelized batch analytics; enriched data flows into a new unified EDW, app-specific views, and end-to-end integrated data. Consumers: optimize/reduce testing, failure screen tests, proactive drift identification, customer FA via field data, ad hoc analysis, and new high-value parameters, using SAS, Compellon or other predictive analytic tools, Tableau, and other tools.]
17. Characteristics of a "Typical" Hadoop / Big Data Cluster
• Hadoop handles large data volumes and reliability in the software tier
  – Hadoop distributes data across the cluster and uses replication to ensure data reliability and fault tolerance
• Each machine in a Hadoop cluster both stores AND processes data, so machines must do both well; processing is sent directly to the machines storing the data
• Hadoop MapReduce compute-bound operations and workloads:
  – Clustering/classification
  – Complex text mining
  – Natural-language processing
  – Feature extraction
• Hadoop MapReduce I/O-bound operations and workloads:
  – Indexing
  – Grouping
  – Data importing and exporting
  – Data movement and transformation
Big data solutions must support a large variety of compute and I/O operations and storage needs… enter "the cloud".
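The MapReduce model the slide refers to can be sketched in a few lines: map emits (key, value) pairs, a shuffle groups values by key, and reduce aggregates each group. This is a minimal single-process illustration (word count), not Hadoop itself:

```python
from collections import defaultdict

def map_phase(line):
    # map: emit a (word, 1) pair for every word in the input line
    for word in line.split():
        yield word.lower(), 1

def reduce_phase(key, values):
    # reduce: aggregate all values emitted for one key
    return key, sum(values)

def mapreduce(lines):
    groups = defaultdict(list)          # shuffle: group values by key
    for line in lines:
        for key, value in map_phase(line):
            groups[key].append(value)
    return dict(reduce_phase(k, v) for k, v in groups.items())

counts = mapreduce(["the cloud", "the cluster stores the data"])
print(counts["the"])  # 3
```

A compute-bound workload (e.g. text mining) spends its time inside the map function; an I/O-bound one (e.g. indexing) spends it reading inputs and shuffling pairs, which is why the slide separates the two classes.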
18. AWS Big Data Platform Storage Services
• Amazon EBS
  – Block storage for elastic computing
  – Optimized for performance
  – SSD / 15K / 10K
  – Highly virtualized / SAN-based
• Amazon S3
  – "Generic" object storage
  – The bulk of AWS storage today
  – Virtualized or reserved use
  – Server/network-based
• Amazon Glacier
  – Cold/cool storage
  – Lowest-cost model for the "least"-used data
  – 3–5 hour latency / sequentialized
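The tier characteristics above suggest a simple decision rule. This is an illustrative sketch only; the thresholds are assumptions for illustration, not AWS guidance or pricing:

```python
# Pick a storage tier from the slide's rough characteristics:
# EBS for block devices attached to compute, Glacier for cold data
# that can tolerate a 3-5 hour retrieval, S3 for everything else.

def choose_tier(needs_block_device, accesses_per_month, max_wait_hours):
    if needs_block_device:
        return "EBS"           # block storage for a running instance
    if accesses_per_month < 1 and max_wait_hours >= 5:
        return "Glacier"       # cold storage, hours-long retrieval is fine
    return "S3"                # general-purpose object storage

print(choose_tier(False, 0.1, 12))  # Glacier
print(choose_tier(False, 100, 0))   # S3
print(choose_tier(True, 100, 0))    # EBS
```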
19. HGST's Other Amazon Use Cases/Capabilities
• Petabyte-scale data warehousing
• "Between Glacier & S3"
• Run data visualization tools in AWS
• Resource tracking tool
  – Includes a Tableau instance for reporting and visualization
More and more users are coming to IT asking how to leverage this new compute capability.
20. We Are Just Starting with the Cloud
• Current results are from a 6-month effort
• Re-aligning business group leadership
• Demand and use will grow and accelerate
Cloud + HGST IT = strong innovation and business partner
21. Cloud Computing @ Aerospace
Bill Williams, The Aerospace Corporation
22. Introduction and Background
• IT executive for The Aerospace Corporation (Aerospace)
• Manage HPC compute and cloud resources for the Aerospace corporation
• Career path has taken me through end-user support, system administration, and enterprise architecture
25. High Performance Computing @ Aerospace
• Allow engineers and scientists to focus on their discipline and research
• Reduce and eliminate complexity in using High Performance Computing (HPC) resources
• Supply and support centralized and networked HPC resources
27. Cloud Motivation
• Respond to increasing and variable demand
• Improve resource deployments and use
• Enhance provisioning
• Improve security posture
• Improve disaster recovery posture
• Greener
28. Where are we today?
• Successfully established elastic clusters in AWS GovCloud
  – Workload runs include Monte Carlo and array simulations
• Key features of the GovCloud clusters are auto-scaling and on-demand computing
• Compute instances are created as needed to meet job computational requirements
• Making strides toward mimicking internal clusters in GovCloud
29. What makes this work?
• AWS GovCloud
  – GovCloud is FedRAMP compliant
• Secure transport to and from Aerospace
  – VPC provides an additional layer of security while data is in transit
• Cycle Computing
  – Cycle provides cluster auto-scaling
30. Lessons Learned
• Enhanced analytics and business intelligence
• Customer success stories
• Standard images
• Demonstrated operational "agility"
31. Lessons Learned
• Domain space is dynamic
• Expertise required
• Layers of complexity
• Ensuring data security (in a hybrid deployment model)
32. Challenges
• Establishing a cloud storage infrastructure
• Determining appropriate bandwidth between Aerospace and GovCloud
• Library replication of internal systems
• System integration with internal authentication services
• Ensuring a seamless transition to hybrid services
35. Novartis Institutes for BioMedical Research (NIBR)
• Unique research strategy driven by patient needs
• World-class research organization with about 6,000 scientists globally
• Intensifying focus on molecular pathways shared by various diseases
• Integration of clinical insights with mechanistic understanding of disease
• Research-to-development transition redefined through fast and rigorous "proof-of-concept" trials
• Strategic alliances with academia and biotech strengthen the preclinical pipeline
36. Accelerating the Science
• Requirements
  – Large-scale computational chemistry simulation
  – Results in under a week
  – Ability to run multiple experiments on demand
• Challenges
  – Sustained access to 50,000+ compute cores
  – Ability to monitor and re-launch jobs
  – No additional capital expenditure
  – Internal HPCC already running at capacity
• Job profile
  – Embarrassingly parallel
  – CPU bound
  – Low I/O, memory, and network requirements
[Diagram: virtual screening – a target molecule's binding site (the "lock") is screened against compound molecules (the "keys").]
37. The Cloud: Flexible Science on Flexible Infrastructure
Engineering the right infrastructure for a workload:
• Software runs the same job many times across instance types
• It measures the throughput and determines the $ per job
• Use the instances that provide the best scientific ROI
• The CC2 instance (Intel Xeon® "Sandy Bridge") ran best for this workload
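The benchmarking idea above reduces to ranking instance types by dollars per job. This is a hypothetical sketch; the instance names, prices, and throughputs below are illustrative assumptions, not measured figures from the talk:

```python
# Rank candidate instance types by cost per job:
# cost_per_job = hourly price / measured jobs per hour.

def cost_per_job(hourly_price, jobs_per_hour):
    return hourly_price / jobs_per_hour

def best_instance(benchmarks):
    # benchmarks: {instance_type: (hourly_price_usd, measured_jobs_per_hour)}
    return min(benchmarks, key=lambda t: cost_per_job(*benchmarks[t]))

benchmarks = {
    "cc2.8xlarge": (2.40, 120.0),  # assumed price and throughput
    "m1.xlarge":   (0.48, 15.0),
    "c1.xlarge":   (0.58, 25.0),
}
print(best_instance(benchmarks))  # cc2.8xlarge
```

Note that the cheapest instance per hour is not necessarily the cheapest per job; here the assumed cc2.8xlarge wins on throughput despite the highest hourly price, mirroring the slide's conclusion.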
38. Super Computing in the Cloud
Metric / Count:
• Compute hours of science: 341,700 hours
• Compute days of science: 14,238 days
• Compute years of science: 39 years
• AWS instance count (CC2): 10,600 instances
Highlights:
• $44 million of infrastructure
• 10 million compounds screened
• 39 drug-design years in 11 hours, for a cost of $4,232
• 3 compounds identified and synthesized for screening
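The metrics above are internally consistent, as a quick arithmetic check shows:

```python
# Convert the slide's 341,700 compute hours into days and years, and
# derive the effective cost per compute hour from the quoted $4,232.

compute_hours = 341_700
days = compute_hours / 24
years = days / 365

print(round(days))   # 14238 (slide: 14,238 days)
print(round(years))  # 39    (slide: 39 years)

cost_usd = 4232
print(cost_usd / compute_hours)  # roughly $0.012 per compute hour
```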
39. Key Learnings / What's Next?
• The diversity of life sciences brings unique challenges
  – Spend the time analyzing and tuning
  – Flexibility, scalability, and performance
  – Time to rethink and retool
  – Challenge the science and the scientist
  – Collaboration
• Future plans
  – Chemical universe: 166 billion compounds (extreme-scale CPU)
  – Next-generation sequencing in the cloud (extreme CPU, memory, I/O)
  – "Disruptive" technologies – imaging (10x that of NGS!)
40. Using On-Demand and Spot Instances Together
• When task durations are longer than 1 hour, or tasks require multiple machines (MPI) for long periods, use on-demand
• Shorter workloads work great for Spot Instances
• If you want a guaranteed end time, also use on-demand, so the architecture looks like…
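The rule of thumb above can be written as a small decision function. The function and thresholds are an illustrative sketch of the heuristic, not a Cycle Computing or AWS API:

```python
# Choose between On-Demand and Spot purchasing per the slide's heuristic:
# long, multi-node (MPI), or deadline-bound work goes on-demand;
# short, interruption-tolerant tasks go to Spot.

def purchase_option(task_hours, needs_mpi, needs_guaranteed_finish):
    if needs_guaranteed_finish or needs_mpi or task_hours > 1:
        return "on-demand"
    return "spot"

print(purchase_option(0.5, False, False))  # spot
print(purchase_option(4.0, False, False))  # on-demand
print(purchase_option(0.5, True, False))   # on-demand
```

In practice the two are mixed: on-demand nodes guarantee the finish time while Spot nodes opportunistically accelerate the run, which is exactly the architecture the next slide diagrams.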
41. CycleCloud Deploys Secured, Auto-scaled HPC Clusters
[Architecture diagram: users scale from 150 to 150,000+ cores. CycleCloud checks the job load, calculates the ideal HPC cluster, properly prices the Spot bids, and manages Spot Instance loss (HPC orchestration to handle Spot Instance bidding and loss). The HPC cluster combines On-Demand execute nodes (guaranteed finish) with load-based Spot bidding on Spot Instance execute nodes (auto-started and auto-stopped when that makes the calculation faster/cheaper), backed by a shared FS / S3 and connected to legacy internal HPC.]
42. Other Production Use Cases
• Sequencing, genomics, life sciences
• MPI workloads for FEA, CFD, energy, utilities
• MATLAB and R applications for stats/modeling
• Win HPC Server cluster for finance
• Heat transfer and other FEA
• Insurance risk management
• Rendering/VFX
43. Designing Solar Materials
The challenge is efficiency: we need to efficiently turn photons from the sun into electricity.
The number of possible materials is limitless:
• We need to separate the right compounds from the useless ones
• If the 20th century was the century of silicon, the 21st will be all organic
How do we find the right material out of 205,000 without spending the entire 21st century looking for it?
EMBARGOED until Nov. 12, 2013 8 a.m. EST
51. Question and Answer
How does utility HPC apply to your organization?
Follow us: @cyclecomputing, @jasonastowe
Come to Cycle's booth: #1112
We're hiring: jointheteam@cyclecomputing.com
52. Please give us your feedback on this presentation (BDT212)
As a thank you, we will select prize winners daily for completed surveys!