AWS re:Invent 2016: Lessons Learned from a Year of Using Spot Fleet (CMP205)

© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Chunky Gupta, Software Engineer @Yelp
David Morrison, Software Engineer @Yelp
December 1, 2016

What to Expect from the Session
How Yelp is saving money by using Amazon EC2 Spot Fleet!

Outline
Seagull: Yelp’s Distributed System for Concurrent Task Execution
FleetMiser: Scaling Yelp’s Spot Fleet for Fun and Profit
Looking to the Future for Seagull and FleetMiser

Yelp’s Mission
Connecting people with great local businesses

Terminology
On Demand
Reserved
Spot Instances
Spot Market: us-west-2a (c3.8xlarge), us-west-2b (c3.8xlarge), us-west-2c (c3.8xlarge)
Resource Unit ≈ 1 vCPU
Spot Instance
• c3.8xlarge
• m4.10xlarge
• …
Cluster
Bundle/Executor

Seagull: Yelp’s Distributed System for Concurrent Task Execution

What kinds of tasks are we talking about?
Unit, integration, and acceptance tests (runs ~25 million tests/day)
Photo classification (runs classifier on tens of millions of photos in less than a day)
Other applications to come

Seagull is built on top of Apache Mesos
Scheduler 1 Scheduler 2 Scheduler n
Slave 1 Slave 2 Slave 3 Slave m

Where has Yelp’s Seagull Cluster lived?
May 2015 ($$$$)
July 2015 ($$$)
Dec 2015 ($$)
Feb 2016 ($)
[Figure: cluster composition at each stage, drawn from On-Demand (OD), Reserved (RI), and Spot (SI) instances]

Seagull’s infrastructure costs reduced by 85% in the last year
Seagull Infrastructure Cost
Timeline (May 2015-April 2016)
55% reduction in costs after initial transition to Spot Instances
Additional 60% savings after transition to Spot + Auto Scaling
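A quick check of how the two reductions on this slide compound. This is a sketch that assumes the additional 60% applies to the cost remaining after the first 55% cut. The slide figures are rounded, so the result lands near the 85% headline rather than exactly on it.

```python
def combined_reduction(*step_reductions):
    """Overall fractional reduction after applying each reduction in turn."""
    remaining = 1.0
    for reduction in step_reductions:
        remaining *= 1.0 - reduction  # each step cuts what is left, not the original
    return 1.0 - remaining


overall = combined_reduction(0.55, 0.60)
print(f"{overall:.0%}")  # 82%
```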

Why Spot Instances?
• On-Demand Instances
• Reserved Instances

Are Spot Instances actually cheaper?
• If used intelligently, they can save you a lot of money
• Be careful! Naive usage may end up costing more than on-demand!

How does Spot pricing actually work?
Available Spot Instances
User A
Bid: $10
User B
Bid: $5
User C
Bid: $1
Spot Bid Price $1

How does Spot pricing actually work?
Available Spot Instances
User A
Bid: $10
User B
Bid: $5
User C
Bid: $1
Spot Bid Price $1 → $5
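The two slides above can be read as a simple auction. The sketch below is a toy model, not AWS's actual pricing algorithm. It assumes each user wants one instance, that the highest bids win, and that the lowest winning bid sets the price for everyone. The function name and the idea that shrinking capacity causes the price change are assumptions for illustration.

```python
def clear_spot_market(bids, available):
    """bids: dict of user -> bid (one instance each). Returns (spot_price, winners)."""
    ranked = sorted(bids.items(), key=lambda item: item[1], reverse=True)
    winners = ranked[:available]
    if not winners:
        return None, []
    spot_price = winners[-1][1]  # lowest winning bid sets the price
    return spot_price, [user for user, _ in winners]


bids = {"User A": 10, "User B": 5, "User C": 1}
print(clear_spot_market(bids, available=3))  # (1, ['User A', 'User B', 'User C'])
print(clear_spot_market(bids, available=2))  # (5, ['User A', 'User B'])
```

In the second call User C is outbid and loses the instance, which is the event the rest of the talk is about surviving.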

Maintaining cluster stability in bidding wars
On-Demand Price

Step 1: Application level (Seagull) Fault Tolerance
Scheduled Tasks
Execution Time
Instances lost due to outbid events
Lost tasks rescheduled
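A minimal sketch of the rescheduling idea, assuming a round-based toy scheduler in which every live instance runs one task per round. Seagull's real scheduler runs on Mesos and is not shown in the slides, so the function and parameter names here are hypothetical.

```python
from collections import deque


def run_with_rescheduling(tasks, instances, lost_in_round):
    """Run tasks round by round, requeueing any task whose instance was lost.

    lost_in_round maps a round number to the set of instances outbid during
    that round. Returns a dict of task -> instance on which it finished.
    """
    pending = deque(tasks)
    alive = list(instances)
    finished = {}
    round_no = 0
    while pending:
        if not alive:
            raise RuntimeError("no instances left to run tasks")
        lost = lost_in_round.get(round_no, set())
        batch = min(len(alive), len(pending))
        assignments = [(pending.popleft(), alive[i]) for i in range(batch)]
        for task, instance in assignments:
            if instance in lost:
                pending.append(task)  # instance was outbid mid-run: reschedule
            else:
                finished[task] = instance
        alive = [i for i in alive if i not in lost]
        round_no += 1
    return finished


done = run_with_rescheduling(
    tasks=["t1", "t2", "t3", "t4", "t5"],
    instances=["i-a", "i-b", "i-c"],
    lost_in_round={0: {"i-b"}},
)
print(done["t2"])  # i-a
```

Task t2 starts on i-b, is lost with it, and completes later on a surviving instance.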

Step 2: Cluster-level Fault Tolerance
Amazon EC2 Spot Fleet

Step 2: Cluster-level Fault Tolerance
Amazon EC2 Spot Fleet
Spot Fleet: 9 Instances, 3 Markets (us-west-2a, us-west-2b, us-west-2c)
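A sketch of what a request for the fleet on this slide might look like, assuming c3.8xlarge in three Availability Zones, weighted at 32 units each (one unit per vCPU). The dict follows the shape of the `SpotFleetRequestConfig` argument to boto3's `request_spot_fleet`. The role ARN, AMI id, and price are placeholders, and nothing here calls AWS.

```python
def build_spot_fleet_config(fleet_role_arn, image_id, target_units, max_price_per_unit):
    """Build a SpotFleetRequestConfig dict spread across three Spot markets."""
    zones = ["us-west-2a", "us-west-2b", "us-west-2c"]
    units_per_instance = {"c3.8xlarge": 32}  # resource unit ≈ 1 vCPU
    launch_specs = [
        {
            "ImageId": image_id,
            "InstanceType": instance_type,
            "Placement": {"AvailabilityZone": zone},
            "WeightedCapacity": float(units),
        }
        for zone in zones
        for instance_type, units in units_per_instance.items()
    ]
    return {
        "IamFleetRole": fleet_role_arn,
        "AllocationStrategy": "diversified",  # spread capacity across markets
        "TargetCapacity": target_units,
        "SpotPrice": str(max_price_per_unit),  # maximum price per unit hour
        "LaunchSpecifications": launch_specs,
    }


config = build_spot_fleet_config(
    fleet_role_arn="arn:aws:iam::123456789012:role/example-fleet-role",  # placeholder
    image_id="ami-00000000",  # placeholder
    target_units=9 * 32,  # nine c3.8xlarge instances
    max_price_per_unit=0.05,  # placeholder
)
print(len(config["LaunchSpecifications"]))  # 3
```

With real credentials, the dict would be passed as `ec2.request_spot_fleet(SpotFleetRequestConfig=config)`.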

What if the bid price fluctuates?
[Figure: Spot prices in us-west-2a changing over time, shown against the On-Demand Price]

Challenges:
• Availability
• Reliability

How do you deal with churn?
Option 1: Move back to On-Demand and wait for fluctuation to stop
Seagull Infrastructure Cost
Timeline (June 2016-Sept 2016)
Seagull costs spiked by 250% when transitioning back to On-Demand Instances for a few days

How do you deal with churn?
Option 2: Diversify! Add more Spot markets to reduce impact
Number of units in cluster, grouped by Spot market
Getting outbid in three markets doesn’t impact the cluster
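A small worked example of why more markets soften the blow. The market names and unit counts are made up for illustration; only the principle comes from the slide.

```python
def fraction_lost(units_by_market, outbid_markets):
    """Share of total cluster units lost when the given markets are outbid."""
    total = sum(units_by_market.values())
    lost = sum(units_by_market[market] for market in outbid_markets)
    return lost / total


# The same 288 units, spread over 3 markets versus 12 markets.
narrow = {f"market-{i}": 96 for i in range(3)}
wide = {f"market-{i}": 24 for i in range(12)}

print(fraction_lost(narrow, ["market-0"]))
print(fraction_lost(wide, ["market-0", "market-1", "market-2"]))
```

Losing one of three markets removes a third of the cluster, while losing three of twelve removes a quarter.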

Diversification isn’t always easy
Is your application compatible with other instance sizes and types (e.g., EBS instances, GPU instances)?

Diversification isn’t always easy
How does your application perform on different instance types?
Execution Time
Scheduled Tasks
(color-coded by instance id)

How to use Spot Fleet most intelligently
Be simple and don’t bid too high
Diversify your Spot markets

FleetMiser:
Scaling Yelp’s Spot Fleet for Fun and Profit

Why do we need scaling at all?
Number of Seagull runs
Peak demand is between ~9am and ~7pm

FleetMiser: Yelp’s in-house scaling engine

What does scaling look like?
Number of units in cluster
Developers in Europe
Peak capacity is between ~12pm and ~7pm

FleetMiser uses a plugin-based architecture for scaling signals
autoscale_signals:
  ClusterOverutilizedSignal:
    priority: 2
    query_period: 10
    scale_up_threshold: 0.65
    units_to_add: 100
  ...
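A sketch of how a plugin-style signal could consume the configuration above. FleetMiser's real plugin interface is not shown in the slides, so the `evaluate` method, the registry, and the rule that a lower priority number is consulted first are all assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class ClusterOverutilizedSignal:
    priority: int
    query_period: int  # unit is not stated on the slide
    scale_up_threshold: float
    units_to_add: int

    def evaluate(self, utilization):
        """Return units to add, or 0 when the signal does not fire."""
        return self.units_to_add if utilization > self.scale_up_threshold else 0


# Hypothetical registry mapping config names to plugin classes.
SIGNAL_REGISTRY = {"ClusterOverutilizedSignal": ClusterOverutilizedSignal}


def load_signals(config):
    """Instantiate every configured signal from the registry."""
    return [
        SIGNAL_REGISTRY[name](**params)
        for name, params in config["autoscale_signals"].items()
    ]


def pick_delta(signals, utilization):
    """Consult signals in priority order; the first one that fires wins."""
    for signal in sorted(signals, key=lambda s: s.priority):
        delta = signal.evaluate(utilization)
        if delta:
            return delta
    return 0


config = {
    "autoscale_signals": {
        "ClusterOverutilizedSignal": {
            "priority": 2,
            "query_period": 10,
            "scale_up_threshold": 0.65,
            "units_to_add": 100,
        }
    }
}
signals = load_signals(config)
print(pick_delta(signals, utilization=0.80))  # 100
print(pick_delta(signals, utilization=0.40))  # 0
```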

Using metrics to control scaling
Cluster underutilized: scale down
Developers submitted batch jobs: maintain capacity/scale up
Cluster overutilized: scale up
(not shown) Historical usage indicates demand: scale up
Number of units in cluster

Scaling up uses the AWS diversification strategy

FleetMiser uses sophisticated scale-down logic to ensure cluster diversity is maintained

Scaling Down: How to terminate instances
Scale-down evenly distributed across all Spot markets
Number of units in cluster, grouped by Spot market
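One way to keep a scale-down balanced is to always take capacity from whichever market is currently largest. The sketch below assumes that approach and works in single units rather than whole instances. FleetMiser's actual logic is not described in the slides, and the unit counts are illustrative.

```python
def plan_scale_down(units_by_market, units_to_remove):
    """Return units to remove per market, always shrinking the largest market."""
    remaining = dict(units_by_market)
    removed = {market: 0 for market in remaining}
    for _ in range(units_to_remove):
        # Largest market first; ties are broken by market name.
        market = max(remaining, key=lambda m: (remaining[m], m))
        if remaining[market] == 0:
            break  # nothing left to remove anywhere
        remaining[market] -= 1
        removed[market] += 1
    return removed


plan = plan_scale_down(
    {"us-west-2a": 120, "us-west-2b": 100, "us-west-2c": 80},
    units_to_remove=60,
)
print(plan)  # {'us-west-2a': 40, 'us-west-2b': 20, 'us-west-2c': 0}
```

After this plan every market holds 80 units, so no single outbid event can take out a disproportionate share of the cluster.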

Comparison to AWS Auto Scaling for Spot Fleets
https://aws.amazon.com/blogs/aws/new-auto-scaling-for-ec2-spot-fleets/
Spot Fleet scaling
• Driven by CloudWatch metrics
• Policies can scale by constant, percentage, step function
• No custom scale-down logic
• An easy way to get your cluster autoscaling
FleetMiser scaling
• Custom signal plugins
• Scaling by arbitrary amounts (based on signal input)
• Specify instances to terminate
• Allows for more complicated functionality

Looking to the Future for Seagull and FleetMiser

Goal: Diversify our Spot Markets even further
53 bundles!

Goal: More advanced scaling logic for FleetMiser
Combine and control multiple Spot Fleets and Auto Scaling Groups at once


Goal: Better bundling of tasks for Seagull
task_requirements:
  TaskA:
    RAM: 100MB
    CPU: 3
    dependencies:
      - ServiceA
      - ServiceB
  TaskB:
    RAM: 10MB
    CPU: 1
    dependencies:
      - ServiceC
  ...
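A sketch of how requirements like these could drive bundling, assuming a first-fit-decreasing pack on CPU alone. The slides do not say how Seagull plans to bundle tasks, so the algorithm, the function name, and the `cpus_per_bundle` parameter are assumptions. RAM is carried in the data but ignored by this sketch.

```python
def bundle_tasks(task_requirements, cpus_per_bundle):
    """Pack tasks into bundles by CPU using first-fit decreasing."""
    bundles = []
    ordered = sorted(
        task_requirements.items(), key=lambda item: item[1]["CPU"], reverse=True
    )
    for name, req in ordered:
        if req["CPU"] > cpus_per_bundle:
            raise ValueError(f"{name} needs more CPU than a bundle provides")
        for bundle in bundles:
            if bundle["cpu"] + req["CPU"] <= cpus_per_bundle:
                break  # first existing bundle with room
        else:
            bundle = {"tasks": [], "cpu": 0, "dependencies": set()}
            bundles.append(bundle)
        bundle["tasks"].append(name)
        bundle["cpu"] += req["CPU"]
        bundle["dependencies"].update(req["dependencies"])
    return bundles


task_requirements = {
    "TaskA": {"RAM": "100MB", "CPU": 3, "dependencies": ["ServiceA", "ServiceB"]},
    "TaskB": {"RAM": "10MB", "CPU": 1, "dependencies": ["ServiceC"]},
}
for bundle in bundle_tasks(task_requirements, cpus_per_bundle=4):
    print(bundle["tasks"], bundle["cpu"], sorted(bundle["dependencies"]))
```

With four CPUs per bundle both tasks share one bundle, which then needs all three services. With three CPUs per bundle they are split.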

Yelp’s simple mantra for saving money on your compute costs
Use EC2 Spot Fleet with a fault-tolerant application

Yelp’s simple mantra for saving money on your compute costs
Use scaling to reduce off-hours capacity

@YelpEngineering
fb.com/YelpEngineers
engineeringblog.yelp.com
github.com/yelp

Remember to complete your evaluations!
