Over the last year, Yelp has transitioned its scalable and reliable parallel task execution system, Seagull, from On-Demand and Reserved Instances entirely to Spot Fleet. Seagull runs over 28 million tests per day, launches more than 2.5 million Docker containers per day, and uses over 10,000 vCPUs in Spot Fleet at peak capacity. To deal with rising infrastructure costs for Seagull, we have extended our in-house Auto Scaling Engine called FleetMiser to scale the Spot Fleet in response to demand. FleetMiser has reduced Seagull’s cluster costs by 60% in the past year and saved Yelp thousands of dollars every month.
In this session, we describe how Yelp uses Spot Fleet for Seagull and lessons we’ve learned over the past year, along with our recommendations on how to use it reliably (pro tip: don’t get outbid for your whole Spot Fleet). We conclude by looking at our future plans for extending Spot Fleet usage at Yelp.
2. What to Expect from the Session
How Yelp is saving money by using Amazon EC2 Spot Fleet!
3. Outline
Seagull: Yelp’s Distributed System for Concurrent Task Execution
FleetMiser: Scaling Yelp’s Spot Fleet for Fun and Profit
Looking to the Future for Seagull and FleetMiser
7. What kinds of tasks are we talking about?
Unit, integration and acceptance tests (runs ~25 million tests/day)
Photo classification (runs classifier on tens of millions of photos in less than a day)
Other applications to come
8. Seagull is built on top of Apache Mesos
Scheduler 1 Scheduler 2 Scheduler n
Slave 1 Slave 2 Slave 3 Slave m
10. Where has Yelp’s Seagull Cluster lived?
May 2015 ($$$$)
July 2015 ($$$)
Dec 2015 ($$)
Feb 2016 ($)
[Diagram: cluster composition at each date, built from On-Demand (OD), Reserved (RI), and Spot (SI) instances]
11. Seagull’s infrastructure costs reduced by 85% in the last year
[Chart: Seagull infrastructure cost, May 2015-April 2016]
55% reduction in costs after initial transition to Spot Instances
Additional 60% savings after transition to Spot + Auto Scaling
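As a rough check on how the two annotations combine, here is a minimal Python sketch. It assumes the "additional 60%" applies to the cost that remained after the first 55% reduction; the function name `combined_reduction` is hypothetical and the inputs are the rounded figures from the slide.

```python
def combined_reduction(*reductions):
    """Compound successive fractional cost reductions into one overall reduction."""
    remaining = 1.0
    for r in reductions:
        remaining *= (1.0 - r)  # each step cuts what is left after the previous one
    return 1.0 - remaining


# Rounded figures from the slide: 55% from Spot, then a further 60% from autoscaling.
total = combined_reduction(0.55, 0.60)
print(f"Overall reduction: {total:.0%}")
```

With these rounded inputs the result lands in the low 80s, in the same range as the 85% headline; the gap is plausibly due to rounding in the per-step figures.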
13. Are Spot Instances actually cheaper?
• If used intelligently, they can save you a lot of money
• Be careful! Naive usage may end up costing more than on-demand!
14. How does Spot pricing actually work?
Available Spot Instances
User A
Bid: $10
User B
Bid: $5
User C
Bid: $1
Spot Bid Price $1
15. How does Spot pricing actually work?
Available Spot Instances
User A
Bid: $10
User B
Bid: $5
User C
Bid: $1
Spot Bid Price $1
Spot Bid Price $5
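The two pricing slides can be read as a simple rule: under the bidding model shown, a user keeps their instances only while their bid is at or above the current Spot price. The following is a minimal Python sketch of that rule using the bids from the slide; the function name `users_still_running` is hypothetical.

```python
def users_still_running(bids, spot_price):
    """Return the users whose bid is at or above the current Spot price."""
    return sorted(user for user, bid in bids.items() if bid >= spot_price)


# Bids taken from the slide.
bids = {"User A": 10.0, "User B": 5.0, "User C": 1.0}

at_one_dollar = users_still_running(bids, spot_price=1.0)
at_five_dollars = users_still_running(bids, spot_price=5.0)

print(at_one_dollar)    # all three users keep their instances
print(at_five_dollars)  # User C is outbid
```

The sketch also shows why an extremely high bid is not free insurance: it keeps instances running through a spike, but the price charged rises along with the market.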
22. Spot Fleet: 9 Instances, 3 Markets
What if the bid price fluctuates?
[Diagram: "$" price markers for markets us-west-2a, us-west-2b, and us-west-2c]
23. Spot Fleet: 9 Instances, 3 Markets
What if the bid price fluctuates?
[Diagram: "$" price markers for markets us-west-2a, us-west-2b, and us-west-2c]
24. Spot Fleet: 9 Instances, 3 Markets
What if the bid price fluctuates?
[Diagram: "$" price markers for markets us-west-2a, us-west-2b, and us-west-2c]
25. What if the bid price fluctuates?
[Chart label: On-Demand Price]
Challenges:
• Availability
• Reliability
26. How do you deal with churn?
Option 1: Move back to On-Demand and wait for fluctuation to stop
[Chart: Seagull infrastructure cost, June 2016-Sept 2016]
Seagull costs spiked by 250% when transitioning back to On-Demand Instances for a few days
27. How do you deal with churn?
Option 2: Diversify! Add more Spot markets to reduce impact
[Chart: number of units in cluster, grouped by Spot market]
Getting outbid in three markets doesn’t impact the cluster
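To illustrate why adding markets reduces the impact of being outbid, here is a minimal Python sketch that computes the share of capacity lost when some markets are outbid at once. The market names, unit counts, and the function `capacity_lost` are hypothetical and are not taken from Yelp's cluster.

```python
def capacity_lost(units_per_market, outbid_markets):
    """Fraction of total units that sit in markets where the fleet was outbid."""
    total = sum(units_per_market.values())
    lost = sum(units for market, units in units_per_market.items()
               if market in outbid_markets)
    return lost / total


# The same 120 units spread over 3 markets versus 12 markets (hypothetical numbers).
three_markets = {f"market-{i}": 40 for i in range(1, 4)}
twelve_markets = {f"market-{i}": 10 for i in range(1, 13)}

loss_narrow = capacity_lost(three_markets, {"market-1"})
loss_wide = capacity_lost(twelve_markets, {"market-1", "market-2", "market-3"})

print(f"1 of 3 markets outbid:  {loss_narrow:.0%} of capacity lost")
print(f"3 of 12 markets outbid: {loss_wide:.0%} of capacity lost")
```

In this toy setup, losing three markets out of twelve removes less capacity than losing one market out of three, and the remaining markets leave more room to replace what was lost.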
28. Diversification isn’t always easy
Is your application compatible with other instance sizes and types (e.g., EBS instances, GPU instances)?
29. Diversification isn’t always easy
How does your application perform on different instance types?
[Chart: execution time of scheduled tasks, color-coded by instance id]
30. How to use Spot Fleet most intelligently
Be simple and don’t bid too high
Diversify your Spot markets
36. FleetMiser uses a plugin-based architecture for scaling signals
autoscale_signals:
  ClusterOverutilizedSignal:
    priority: 2
    query_period: 10
    scale_up_threshold: 0.65
    units_to_add: 100
  ...
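One way such a configuration could map onto signal plugins is sketched below in Python. This is not FleetMiser's actual code: the registry `SIGNAL_PLUGINS`, the `evaluate` method, the `load_signals` helper, and the assumption that a lower `priority` number is consulted first are all hypothetical. Only the signal name and its four parameters come from the slide.

```python
from dataclasses import dataclass


@dataclass
class ClusterOverutilizedSignal:
    """Hypothetical plugin: asks for more units when utilization is too high."""
    priority: int
    query_period: int
    scale_up_threshold: float
    units_to_add: int

    def evaluate(self, utilization):
        """Return how many units to add (0 means no change requested)."""
        return self.units_to_add if utilization > self.scale_up_threshold else 0


# Hypothetical registry mapping config keys to plugin classes.
SIGNAL_PLUGINS = {"ClusterOverutilizedSignal": ClusterOverutilizedSignal}


def load_signals(config):
    """Instantiate each configured signal, ordered by priority (assumed ascending)."""
    signals = [SIGNAL_PLUGINS[name](**params)
               for name, params in config["autoscale_signals"].items()]
    return sorted(signals, key=lambda s: s.priority)


# The values shown on the slide, expressed as a parsed config dictionary.
config = {
    "autoscale_signals": {
        "ClusterOverutilizedSignal": {
            "priority": 2,
            "query_period": 10,
            "scale_up_threshold": 0.65,
            "units_to_add": 100,
        }
    }
}

signals = load_signals(config)
busy_delta = signals[0].evaluate(utilization=0.80)
idle_delta = signals[0].evaluate(utilization=0.50)
print(busy_delta, idle_delta)
```

Keeping each signal behind a small common interface like this is what would let new signals be added through configuration alone.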
37. Using metrics to control scaling
[Chart: number of units in cluster over time, annotated with the following scaling events]
Cluster underutilized: scale down
Developers submitted batch jobs: maintain capacity/scale up
Cluster overutilized: scale up
(not shown) Historical usage indicates demand: scale up
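The annotations above amount to a small decision table. A minimal Python sketch of one possible encoding follows; the function `decide`, its argument names, and the 0.30 scale-down threshold are hypothetical, while the 0.65 scale-up threshold is borrowed from the configuration on slide 36.

```python
def decide(utilization, batch_jobs_pending, historical_demand_expected,
           scale_up_threshold=0.65, scale_down_threshold=0.30):
    """Map the metrics named on the slide to a scaling action."""
    if utilization > scale_up_threshold or historical_demand_expected:
        return "scale_up"
    if batch_jobs_pending:
        # Developers submitted batch jobs: hold capacity rather than shrink.
        return "maintain"
    if utilization < scale_down_threshold:
        return "scale_down"
    return "maintain"


overutilized = decide(0.80, batch_jobs_pending=False, historical_demand_expected=False)
batch_waiting = decide(0.20, batch_jobs_pending=True, historical_demand_expected=False)
underutilized = decide(0.20, batch_jobs_pending=False, historical_demand_expected=False)
print(overutilized, batch_waiting, underutilized)
```

Checking the scale-up conditions first means a busy cluster is never shrunk merely because another signal is quiet.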
41. Scaling Down: How to terminate instances
Scale-down evenly distributed across all Spot markets
[Chart: number of units in cluster, grouped by Spot market]
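One simple way to spread a scale-down evenly across markets is to always remove a unit from whichever market currently holds the most. The Python sketch below takes that approach; it is an assumption about how even distribution could be achieved, not FleetMiser's implementation, and the function name and instance ids are hypothetical.

```python
from collections import Counter


def pick_terminations(instances, units_to_remove):
    """Choose instance ids to terminate, always taking from the largest market.

    `instances` maps instance id -> Spot market name.
    """
    by_market = {}
    for instance_id, market in sorted(instances.items()):
        by_market.setdefault(market, []).append(instance_id)

    chosen = []
    for _ in range(min(units_to_remove, len(instances))):
        # Largest market first; ties are broken by market name for determinism.
        market = max(by_market, key=lambda m: (len(by_market[m]), m))
        chosen.append(by_market[market].pop())
    return chosen


# Hypothetical cluster: 4, 3, and 2 instances in three markets.
instances = {
    "i-a1": "us-west-2a", "i-a2": "us-west-2a", "i-a3": "us-west-2a", "i-a4": "us-west-2a",
    "i-b1": "us-west-2b", "i-b2": "us-west-2b", "i-b3": "us-west-2b",
    "i-c1": "us-west-2c", "i-c2": "us-west-2c",
}

to_terminate = pick_terminations(instances, units_to_remove=3)
remaining = Counter(market for iid, market in instances.items()
                    if iid not in to_terminate)
print(to_terminate, dict(remaining))
```

Shrinking the largest market first keeps the cluster diversified after the scale-down, so a later price spike in any single market affects a smaller share of capacity.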
42. Comparison to AWS Auto Scaling for Spot Fleets
https://aws.amazon.com/blogs/aws/new-auto-scaling-for-ec2-spot-fleets/
Spot Fleet scaling
• Driven by CloudWatch metrics
• Policies can scale by constant, percentage, step function
• No custom scale-down logic
• An easy way to get your cluster autoscaling
FleetMiser scaling
• Custom signal plugins
• Scaling by arbitrary amounts (based on signal input)
• Specify instances to terminate
• Allows for more complicated functionality