Auto Scaling allows cloud resources to scale automatically in reaction to the dynamic needs of customers. This session will show how Auto Scaling offers an advantage to everyone – whether it’s basic fleet management to keep instances healthy as an EC2 best practice, or dynamic scaling to manage “extremes”. We’ll share examples of how Auto Scaling is helping customers of all sizes and industries unlock use cases and value. We’ll also discuss how Auto Scaling is evolving to scale different types of elastic AWS resources beyond EC2 instances. NASA Jet Propulsion Laboratory (JPL) / California Institute of Technology will share how Auto Scaling is used to scale science data processing of Interferometric Synthetic Aperture Radar (InSAR) data from earth-observing satellite missions, and to reduce response times during hazard response events such as earthquakes, floods, and volcanoes. JPL will also discuss how they are integrating their science data systems with the AWS ecosystem to expand into NASA’s next two large-scale missions with remote-sensing radar-based observations. Learn how Auto Scaling is being used at a global scale – and beyond!
2. • Availability with fleet management
• Keeping up with demand with dynamic scaling
• Future of Auto Scaling
3. Myth vs. Fact
• Myth: “My application doesn’t need scaling, so I don’t benefit from Auto Scaling”
• Myth: “It’s hard to use”
• Myth: “My instances are stateful or unique; I can’t use Auto Scaling”
4. Auto Scaling: fleet management and dynamic scaling
[Diagram: two Auto Scaling groups of EC2 instances behind an ELB. Fleet management keeps the instances healthy; dynamic scaling adjusts group size based on CPU utilization.]
5. Is Fleet Management for You?
• “I’ve got instances serving a business-impacting application”
• “If my instances become unhealthy, I’d like them replaced automatically”
• “I would like my instances distributed to maximize resilience”
8. Scaling on a Schedule
• Recurring scaling events
• Schedule individual events
[Diagram: an Auto Scaling group of EC2 instances behind an ELB grows and shrinks on a schedule.]
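The idea behind a recurring scheduled action is simply a mapping from the clock to a desired capacity. A minimal sketch, assuming a hypothetical weekday business-hours schedule (the hours and capacities here are illustrative, not from the talk):

```python
from datetime import datetime

# Hypothetical recurring schedule: scale out during weekday business hours,
# scale in overnight and on weekends.
BUSINESS_HOURS = range(9, 18)  # 09:00-17:59

def desired_capacity(now: datetime, base: int = 2, peak: int = 10) -> int:
    """Return the capacity a recurring scheduled action would set at `now`."""
    if now.weekday() < 5 and now.hour in BUSINESS_HOURS:
        return peak
    return base

print(desired_capacity(datetime(2016, 11, 30, 10, 0)))  # Wednesday 10:00 -> 10
print(desired_capacity(datetime(2016, 12, 3, 10, 0)))   # Saturday -> 2
```

In AWS this mapping would instead be expressed as scheduled actions with a cron-style recurrence on the Auto Scaling group, but the effect is the same: desired capacity follows the calendar rather than a metric.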
9. Reactive Scaling Policies
1. Collect metrics
2. Alarm fires when a threshold is crossed
3. The alarm is connected to an Auto Scaling policy
4. A scaling event is triggered
[Diagram: metrics from an Auto Scaling group of EC2 instances behind an ELB feed an alarm, which drives the scaling policy.]
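The four steps above can be sketched as a single evaluation loop. This is an illustrative simulation, not the AWS APIs; the threshold and adjustment values are assumptions for the example:

```python
# Minimal simulation of a reactive scaling policy: collect a metric,
# fire an alarm when the threshold is crossed, and let the attached
# policy trigger a scale-out event.
ALARM_THRESHOLD = 70.0   # e.g. average CPU utilization (%), illustrative
SCALING_ADJUSTMENT = 2   # instances the policy adds per scaling event

def evaluate(metric_samples, capacity):
    """Return the new group capacity after one alarm-evaluation period."""
    average = sum(metric_samples) / len(metric_samples)  # collect metrics
    alarm_fired = average > ALARM_THRESHOLD              # alarm fires on breach
    if alarm_fired:
        capacity += SCALING_ADJUSTMENT                   # policy triggers event
    return capacity

print(evaluate([85, 90, 80], capacity=4))  # breach: 85% avg -> 6
print(evaluate([40, 50, 45], capacity=4))  # no breach -> 4
```

In a real deployment the metric collection is CloudWatch, the alarm is a CloudWatch alarm, and the adjustment lives in the Auto Scaling policy attached to it.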
11. Challenge: Remote Sensing for Disaster Response
Automated data systems are required to analyze large quantities of data from NASA NISAR, other satellite missions, and rapidly expanding GPS networks.
[Chart: mean access time shrinking from ∞ to 4, 2, 1.3, and 1 days.]
Going from artisan to automation: use a systems engineering approach to translate specialized data analysis into operational capability.
[Diagram: end-to-end data flow from geodetic sensors to customer products.
I. Geodetic sensors: GPS stations and the GPS constellation, seismometers, radar sensors, and optical sensors (SPOT 1–5, Formosat-2, Quickbird, WorldView-1).
II. Data providers / archives: Global CMT, USGS NEIC, IRIS (waveforms), SOPAC, NSPO (SPOT), DigitalGlobe, NASA-JPL (1 Hz GDGPS), and many others.
III. Routine data processing: InSAR processing from Level-0 data (interferometric phase, coherence, time series over contiguous pairs, coherence change maps); GPS processing of RINEX and orbit data into daily, hourly, and sub-daily point positions; optical image processing into offset images; earthquake catalog and waveform handling.
IV. Triggered (event-driven) processing: kinematic and static modeling, stress change, fault geometry, AZO/MAI, small-earthquake modeling, and amplitude/arrival-time corrections, with Mw > 5.5 and Mw > 6.5 decision gates.
V. Customer products: coseismic interferograms, coseismic GPS vector maps, 3D surface displacement maps, damage assessment, earthquake catalogs and models, CTBT corrections, and temporal records and spatial maps of ground deformation.
Legend: ready to be used (> 70 % readiness); needs modification (30–70 % readiness); doesn’t exist (< 30 % readiness); latency / bottleneck; out of scope of R&TD.]
Demonstrate response to hazards with a standardized set of data products for decision and policy makers.
12. Advanced Rapid Imaging and Analysis (ARIA)
[Diagram: ARIA integrates space geodesy, seismology, and modeling. Inputs from radar sensors, GPS networks, seismic networks, and optical sensors feed automated data collection and processing. Outputs include building damage and inundation maps, permanent ground deformation, high-resolution hazard assessment from fault models, and monitoring with near-real-time assessment. Examples shown are from the 2011 M9.0 Tohoku-Oki (Japan) earthquake.]
“We have high hopes for ARIA.” – Keiko Saito, World Bank
“This is exactly the kind of product we are looking for.” – Anne Rosinski, Calif. Geol. Survey
14. Process Control & Management in AWS
• Sample pipelined PGE
• Each major PGE processing step is deployed in its own AWS Auto Scaling group (ASG)
[Diagram: compute workers run as instances in Auto Scaling groups spread across multiple Availability Zones, reading and writing data and metadata objects in an S3 bucket. Separate instances host the Process Control & Management services: a search catalog of datasets, resource management, and analytics/metrics, with products released to distribution data centers. The sample pipeline chains PGE steps – Peg, Dop, interferogram generation, tropoMap, and unwrap – under telemetry and job control.]
15. Auto Scaling Science Data System
The science data system’s compute fleet can automatically grow and shrink based on processing demand.
Auto Scaling group policies:
• Auto Scaling size = 21
• Metric alarm on queue size > 20
• Auto Scaling rest period (cooldown) of 60 seconds
ASG max set to 1,000 instances × 32 vCPUs, with Auto Scaling enabling runs of over 100,000 vCPUs.
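The policy above can be sketched as pure decision logic. This is an illustrative simulation, not AWS API code; `STEP = 21` is one possible reading of the slide’s “Auto Scaling size = 21”, and the function names are hypothetical:

```python
QUEUE_THRESHOLD = 20    # alarm: queued jobs > 20 (from the slide)
COOLDOWN_SECONDS = 60   # "rest period" between scaling activities
MAX_INSTANCES = 1000    # ASG maximum from the slide
STEP = 21               # assumed per-event capacity increase

def scale_out(queue_size, capacity, seconds_since_last_scale):
    """Decide the new ASG capacity for one evaluation of the queue alarm."""
    if seconds_since_last_scale < COOLDOWN_SECONDS:
        return capacity                            # still cooling down
    if queue_size > QUEUE_THRESHOLD:
        return min(capacity + STEP, MAX_INSTANCES) # scale out, capped at max
    return capacity

print(scale_out(50, 0, 120))    # backlog, out of cooldown -> 21
print(scale_out(50, 990, 120))  # capped at the ASG maximum -> 1000
```

The cooldown matters here: without it, a slowly draining queue would keep firing the alarm and over-provision the group before the new workers even start pulling jobs.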
16. Hybrid Cloud Science Data System
OCO-2 L2 full physics processing operational in AWS:
• Processing of L2 full physics data products in the Amazon cloud across multiple regions
• Scaled up to thousands of compute nodes
• Demonstrated higher internal data throughput rates than NISAR needs
[Charts: number of compute nodes over time; per-node transfer rate over time; scalable internal data throughput at 32,000 PGEs on 1,000 nodes.]
17. Optimizing for Scaling In/Out Events
Scaling up (scale out):
• Scaling group resets and timeout periods
• Scale up in group sizes that are multiples of the number of Availability Zones (AZs) to minimize AZ load-rebalancing terminations
Scaling down (scale in):
• What policy to set to scale down? E.g. CPU / network utilization
• Domain knowledge is only known within the compute instances
• Self-terminating instances (“harakiri” / “seppuku”)
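Both halves of this slide reduce to small decision rules. A minimal sketch, assuming a hypothetical three-AZ group and an invented idle limit; the function names are illustrative, not part of any AWS SDK:

```python
AZ_COUNT = 3  # assumed number of Availability Zones in the group

def round_up_to_az_multiple(wanted: int) -> int:
    """Scale out in multiples of the AZ count so the fleet stays balanced,
    avoiding AZ load-rebalancing terminations later."""
    return -(-wanted // AZ_COUNT) * AZ_COUNT  # ceiling division

def should_self_terminate(jobs_in_flight: int, idle_seconds: float,
                          idle_limit: float = 300.0) -> bool:
    """Scale-in via 'harakiri': only the worker itself knows whether it is
    truly idle, so it decides on its own whether to leave the group."""
    return jobs_in_flight == 0 and idle_seconds >= idle_limit

print(round_up_to_az_multiple(7))          # 9 (next multiple of 3)
print(should_self_terminate(0, 400.0))     # True: idle long enough
```

On a real instance, a `True` from the second check would be followed by terminating the instance with the decrement-desired-capacity option so the group does not immediately replace it.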
18. Spot Market
• Major cost savings (75%–90% over on-demand), if you can use spot instances
• On the spot market, AWS will terminate compute instances if the market price exceeds your bid threshold
• HySDS is able to self-recover from spot instance terminations by using Auto Scaling
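Self-recovery starts with noticing the reclaim before it happens. A spot instance can poll the EC2 instance metadata endpoint (the `spot/termination-time` path under `http://169.254.169.254/latest/meta-data/`) for a roughly two-minute warning. In this sketch the metadata fetch is injected as a callable so the logic runs anywhere; `checkpoint_and_drain` and `requeue_job` are hypothetical names:

```python
def termination_imminent(fetch) -> bool:
    """Return True if the spot termination notice is present.
    `fetch` would GET the spot termination-time metadata path on EC2;
    it raises OSError (e.g. a 404) when no notice has been posted."""
    try:
        body = fetch()
    except OSError:
        return False
    return bool(body and body.strip())

def checkpoint_and_drain(fetch, requeue_job):
    """If termination is imminent, push in-flight work back on the queue so
    a replacement instance launched by Auto Scaling can pick it up."""
    if termination_imminent(fetch):
        requeue_job()
        return True
    return False
```

This is the pattern that makes spot viable for the science data system: terminations are not failures to prevent, just events to drain through, with Auto Scaling backfilling the capacity.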
20. “Market Maker”
This OCO-2 data production run of 1,000 × 32-vCPU instances affected the spot market prices.
Strategy:
• Mitigate impact on the spot market
• Diversification of resources
• “Spot fleet”
21. “Thundering Herd”
A fleet of ASG compute instances calling the same services at the same time leads to “API rate limit exceeded” errors.
“Jittering” the API calls:
• Introduce randomization into the timing of API calls
• Distributes load on the infrastructure
[Diagram: many compute fleet instances calling a single service simultaneously.]
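A common way to implement the jittering described above is exponential backoff with “full jitter”: each retry waits a random amount up to an exponentially growing cap, so a fleet that failed together does not retry together. The base and cap values below are illustrative assumptions:

```python
import random

def jittered_delay(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """Full-jitter backoff: a uniform random wait in [0, min(cap, base * 2^attempt)].
    Spreading the waits decorrelates the herd's retries against the service."""
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))

# Each worker sleeps jittered_delay(attempt) before retrying an API call
# that returned "API rate limit exceeded", instead of retrying in lockstep.
for attempt in range(4):
    print(f"attempt {attempt}: wait up to {min(30.0, 0.5 * 2 ** attempt)}s")
```

The same idea applies to scheduled work: adding a random offset to each instance’s polling interval flattens the load curve the service sees.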
22. Next Generation Missions
[Chart: estimated daily data volume (TB) for OCO-2 (2009), SMAP (2015), and NISAR (2021), with NISAR by far the largest.]
• The volume of data produced is larger than previous missions
• Data storage, processing, movement, and costs are the biggest challenges
27. Summary
• Availability with fleet management
  • Resources on aws.amazon.com/autoscaling
• Keeping up with demand with dynamic scaling
  • Case studies on the Auto Scaling website
• Future of Auto Scaling
  • Blog posts on Application Auto Scaling
• Questions: autoscaling-reinvent2016-questions@amazon.com
28. Thank you!
Additional authors: Hook Hua1, Gerald Manipon1, Michael Starch1, Lan Dang1, Justin Linick1, Sang-Ho Yun1, Gian Franco Sacco1, Piyush Agram1, Susan Owen1, Brian Bue1, Eric Fielding1, Paul Lundgren1, Angelyn Moore1, Paul Rosen1, Zhen Liu1, Eric Gurrola1, Tom Farr1, Vincent Realmuto1, Frank Webb1, Mark Simons2, Pietro Milillo1
1 Jet Propulsion Laboratory
2 California Institute of Technology