Amazon EC2 allows you to bid for and run spare EC2 capacity, known as Spot instances, in a dynamically priced market. On average, customers save 80% to 90% compared to On Demand prices by using Spot instances. Achieving these savings has historically required time and effort to find the best deals while managing compute capacity as supply and demand fluctuate.
This One Weird API Request Will Save You Thousands
1. This One Weird API Request Will
Save You Thousands
Joshua Burgin, General Manager EC2 Spot
Zeev Stolin, DevOps, Gett
2. On-Demand
Pay for compute
capacity by the hour
with no long-term
commitments
For spiky workloads,
or to define needs
AWS EC2 Consumption Models
Reserved
Make a low, one-time
payment and receive
a significant discount
on the hourly charge
For committed
utilization
Spot
Bid for unused capacity,
charged at a Spot Price
which fluctuates based
on supply and demand
For time-insensitive,
transient and cost
sensitive workloads
3. Spare capacity at scale
AWS has more than a
million active customers
in 190 countries.
On Average, every
week, AWS customers
are using more compute
capacity on EC2 Spot
Instances than
customers in 2012 were
running across all of
EC2.
4. With Spot the rules are simple
Markets where the price of
compute changes based on
supply and demand
You’ll never pay more than your
bid. When the market exceeds
your bid you get 2 minutes to
wrap up your work
5. C3
Size   1b      1c      1a      On Demand
8XL    $0.27   $0.29   $0.50   $1.76
4XL    $0.30   $0.16   $0.21   $0.88
2XL    $0.07   $0.08   $0.08   $0.44
XL     $0.05   $0.04   $0.04   $0.22
L      $0.01   $0.04   $0.01   $0.11
Show me the markets!
Each instance family
Each instance size
Each Availability Zone
In every region
Is a separate Spot Market
9. Why use Spot – customer examples
“The company has saved tens of thousands of dollars. That’s
between 20 and 30 percent of our total monthly AWS bill.”
Gal Aviv Research & Development Group Manager
10. Why use Spot – customer examples
The raw data from the CMS experiment in the Large Hadron
Collider (LHC) is recorded every 25 nanoseconds at a rate of
approximately 1 petabyte per second.
12. Spot Bid Advisor
1) We make this easy using the
Spot bid advisor
2) With deliberate pool
selection and bidding, you
will keep your Spot instance
as long as you need to.
3) And with new features like
Spot fleet diversified we do
the heavy lifting for you...
14. Spot fleet – fly like a pro
Launch Thousands of Spot Instances
with one RequestSpotFleet call.
Get Best Price
Find the lowest priced horsepower that works for you.
or
Get Diversified Resources
Diversify your fleet. Grow your availability.
And
Apply Custom Weighting
Create your own capacity unit based on your application
needs
15. Spot fleet – continued innovation
One-Time Fleets [May 2016]
CloudWatch Metrics for Spot Fleets [Mar 2016]
Modify Your Fleet [Oct 2015]
Distribute Your Fleet Across Multiple Capacity Pools [Sep 2015]
Weighted Bidding for EC2 Spot Instances [Aug 2015]
Spot instances in the lowest priced Availability Zone [Jul 2015]
17. An easy to use interface that
lets you launch Spot instances,
fleets & Spot blocks in seconds
Helps you select and bid on the
EC2 instances that meet your
application’s requirements
Simple to use dashboard lets
you modify and manage your
application’s compute capacity
Spot Console – New [June 2016]
25. Core nodes
Master
Node
Master instance group
Hadoop cluster
Core instance group
HDFS HDFS
Core nodes run
TaskTracker and
Datanode (HDFS)
Process Data with
mappers and
reducers, store data
with HDFS or
DataNode
30. Results - Hadoop
Requested 1000
vCores over 30 days
Minimum 848 vCores
Mode 1008 vCores
Average 1005 vCores
Average Price of
$0.0118 per vCore
Savings of over 81%
31. Capitalizing on two minute warning
When the Spot price exceeds
your bid price, the instance will
receive a two-minute warning
Check for the 2 minute spot
instance termination
notification every 5 seconds
leveraging a script invoked at
instance launch
32. Sample script – two minutes left!
1) Check for 2 minute warning
2) If YES, run shutdown scripts
3) OTHERWISE, do nothing
4) Then sleep for 5 seconds
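#!/bin/bash
# Poll the instance metadata every 5 seconds for a Spot termination notice.
while true
do
  if curl -s http://169.254.169.254/latest/meta-data/spot/termination-time | grep -q '.*T.*Z'; then
    # Termination notice found: run the shutdown scripts once, then stop polling.
    /env/bin/runterminationscripts.sh
    break
  fi
  # Spot instance not yet marked for termination.
  sleep 5
done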
33. • No need to scale HDFS
– Capacity
– Replication for durability
• Amazon S3 scales with your
data
– Both in IOPs and data storage
– Massively parallel
EMRFS - Amazon
S3 as HDFS
Spot block for HDFS
• For core nodes if HDFS
cluster lives for less than
6 hours
34. Hadoop on EC2 Spot – takeaways
Your Work
Run task nodes separately with EC2 Spot fleet
Spot blocks for core/HDFS clusters that live less than 6 hours
What EC2 Spot fleet does for you
Saves you money
Heterogeneous instance management
Scale on the unit that matters to you
Accelerate results (time is money)
36. Stateless Web Application
Elastic Load
Balancing
Stateless
Web Servers
(Spot)
Stateless
Web Servers
(Spot)
Session
State Data
Spot fleet
Availability Zone A
Availability Zone B
Stateless
Web Servers
(Spot)
Stateless
Web Servers
(Spot)
37. Diversification with EC2 Spot fleet
Multiple EC2 Spot instances
selected
Multiple Availability Zones
selected
Pick the instances with similar
performance characteristics e.g.
c3.large, m3.large, m4.large,
r3.large, c4.large.
40. Results - Web Application
50 instances requested,
over 30 days.
- Never dropped
below 45 instances
- 85% discount if you
wanted 50 and
could withstand
dropping to 45
0
0.02
0.04
0.06
0.08
0.1
0.12
30
35
40
45
50
55
Instances Average Price Per Instance
- If you only wanted
45 the discount is
still 83%
42. Since Spot fleet is configured to
span across multiple Availability
Zones, we highly recommend
enabling cross-zone load
balancing for the load balancer.
To allow in-flight requests to
complete when de-registering Spot
instances that are about to be
terminated, connection draining
can be enabled on the load
balancer with a timeout of 90
seconds.
Elastic Load Balancing
43. Capitalizing on two minute warning
When the Spot price exceeds
your bid price, the instance will
receive a two-minute warning
Check for the 2 minute spot
instance termination
notification every 5 seconds
leveraging a script invoked at
instance launch
44. Sample script – two minutes left!
1) Check for 2 minute
warning
2) If YES, detach instance
from ELB
3) OTHERWISE, do nothing
4) Sleep for 5 seconds
if curl -s http://169.254.169.254/latest/meta-data/spot/termination-time | grep -q '.*T.*Z'; then
  instance_id=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)
  aws elb deregister-instances-from-load-balancer \
    --load-balancer-name my-load-balancer \
    --instances "$instance_id"
  /env/bin/flushsessiontoDBonterminationscript.sh
fi
45. For those of you - Using Auto Scaling
Two Auto Scaling groups
• On-demand + Reserved for
base use
• Add an additional Auto Scaling
group with Spot
Both Auto Scaling groups behind
the same Elastic Load Balancer.
Use the bid advisor to select the
right instance type for your
application.
46. Web Application Architecture with Spot
Elastic Load
Balancing
Stateless
Web Servers
Stateless
Web Servers
On Demand Auto
Scaling group
Session
State Data
Stateless Web
Servers (Spot)
Stateless Web
Servers (Spot)
Spot Auto
Scaling group
Availability Zone A
Availability Zone B
On-Demand
ASG
Spot ASG
49. Gett is the largest and fastest
growing
on-demand mobility company in
EMEA
• 5,000+ corporate accounts
• 300% annual growth since
inception
• $500+ million in funding
• $500 million annual
revenue
• 50,000+ cabs globally
• 60 cities worldwide
• 30M+ passengers
50. Gett is the leader in On-
demand
mobility in Israel
• 7,500 vehicles
• National coverage
• 50,000+ rides daily
• 1.5 million users
• 1,700+ corporate
accounts
• 80% brand recognition
55. Why use spots?
… a traffic that requires much CPU Power and Memory to process:
In Production:
~300 EC2 Instances
For BI + Staging + Development:
~350-400 Additional EC2 Instances
56. In Production
We replaced ~70% of the On-Demand Instances with Spot
Instances. We left only 3 On-Demand Instances per service and the
rest are Spots
Result:
65% Cost Saving for production
Availability is improved (due to the additional HW redundancy)
Latency is improved (due to the additional HW resources)
57. In Production (numbers)
~300 servers for ~30 services
3 on demand instances per service ~ 90 for high availability
200 spots of m3.large, m3.xlarge, c3.2xlarge ~ 400k $ saved annually
58. In BI, Staging, and Development
We replaced almost all of the On-Demand Instances with Spot
Instances.
Result:
85% Cost Saving for these environments.
Spot allows us to run as many staging environments as we need.
The cost saving is tremendous!
59. In BI, Staging, and Development (numbers)
In order to use agile methodology we need a lot of staging environments
On demand : 15 environments 20 m3.large servers each. ($0.146 per Hour X
20 X 15 X 24 X 365 ~ 400k $ annually)
Spot: 15 environments 20 m3.large servers each. ($0.0211 per Hour X 20 X
15 X 24 X 365 ~ 50k $ annually)
~ 350k $ saved annually
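The annual figures on this slide can be reproduced with a few lines of shell. This is a minimal sketch using only the prices and counts quoted above; the function name `annual_cost` is illustrative, and the slide rounds the results to the nearest 50k.

```bash
#!/bin/bash
# annual_cost HOURLY_PRICE SERVERS_PER_ENV ENVIRONMENTS
# Prints the yearly cost in whole dollars, assuming 24 x 365 hours of use.
annual_cost() {
  awk -v p="$1" -v s="$2" -v e="$3" 'BEGIN { printf "%.0f\n", p * s * e * 24 * 365 }'
}

annual_cost 0.146 20 15    # On-Demand m3.large: prints 383688
annual_cost 0.0211 20 15   # Spot m3.large: prints 55451
```

The difference, roughly 328k, is what the slide rounds to ~350k saved annually.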
60. Other Notes
Using spot-request resources for each service (for persistence).
Smart Bid price mechanism based on Spot bid advisor.
→ Bid price is based on instance type.
We use fulfillment option of spot-requests for persistency.
We use Terraform (from HashiCorp) and Green/Blue deployment.
Slide: AWS Purchase Models
As shown by the previous slide, it is possible to launch significant amounts of compute power for a low cost. Customers have several models available when using Amazon EC2.
- Cover the three pricing models on the slide
On demand is the easiest way to get started with AWS. No commitment, pay as you go.
Reserved instances provide a significant discount in exchange for a commitment to use the services for some period of time, either 1 or 3 years. Reserved instances also come with an actual capacity reservation, which can be important for large enterprises who need a high level of assurance that computing resources will be available when they are needed.
Spot instances are a unique and powerful pricing model, in particular for HPC. With Spot, customers can bid on unused AWS capacity and are often able to launch instances on the cloud for as little as 10% of the equivalent On-Demand rate. The tradeoff for Spot is that if other customers are willing to pay more than you for the same AWS instance type, or capacity of that type becomes constrained, your running jobs may be terminated with only a two-minute warning. Jobs running on Spot therefore need to be fault-tolerant, or able to be restarted at a later time.
What spare capacity looks like at scale.
AWS has more than a million active customers in 190 countries.
Amazon EC2 instance usage has increased 93% YoY, comparing Q4 2014 and Q4 2013, not including Amazon use.
Amazon S3 holds trillions of objects and regularly peaks at millions of requests per second.
So with EC2 Spot the rules are actually really simple.
Rule 1: The Spot market is where the price of compute fluctuates based on supply and demand.
Rule 2: You’ll never pay more than your bid, in fact you’ll only ever pay the market price. When the market price exceeds your bid you get 2 minutes to wrap up.
Market price is on average 85% lower than On-Demand prices
What is in a market? This is one of the most important, and unfortunately most misunderstood, elements of how the Spot market works. While we say Spot market, there are actually hundreds of Spot markets available to all our customers. AWS has 11 (?) regions around the world; in each region there are multiple Availability Zones, multiple instance families and multiple instance sizes per family. (START CLICK THROUGH and READ). E.g. c3; e.g. large, xlarge, 8xlarge; e.g. US-West-2a, US-West-2b; e.g. Dublin Region, Oregon Region, Sydney Region.
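To make the idea of many separate markets concrete, here is a small sketch that lists every instance type and Availability Zone combination for the examples named above. The function name `list_pools` and the `type/zone` output format are illustrative only.

```bash
#!/bin/bash
# list_pools "TYPES" "ZONES"
# Prints one line per Spot market: each instance type paired with each zone.
list_pools() {
  local types="$1" zones="$2" t z
  for t in $types; do
    for z in $zones; do
      echo "$t/$z"
    done
  done
}

# Three c3 sizes across two zones in one region gives six separate markets.
list_pools "c3.large c3.xlarge c3.8xlarge" "us-west-2a us-west-2b"
```

Adding more families, sizes, zones and regions multiplies the count, which is how a single customer ends up with hundreds of markets to choose from.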
Now that we understand what a Spot market is, and that there are many, I’ll explain how we acquire the capacity. I’m going to pick just one market to highlight this. There are two numbers you care about with Spot.
Bid price. Think of this as the cap, the maximum you’re willing to pay for a given instance per hour.
Market price. This is the price you pay. Market price is set by periodic auctions.
The r3.4xlarge costs $1.4 under our On-Demand purchasing option.
See it in action via 3 bids. 25%, 50%, 75%. Single Zone.
At 25% you would have kept your instance for almost 7 days, being impacted during a few short periods. However, you only paid the market price, which was 86% off: just less than 20c per hour during the last week, only 14% of the OD price.
At 50% you would have been interrupted just once, for a very short period of time during the sixth day. Your average discount during the week is 85%, just 21c per hour, paying just 15% of OD.
At 75% you would not have been interrupted once, achieving an average discount of 85%, just 21c an hour, again paying just 15% of OD.
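The discount figures in these three scenarios follow from one formula: discount = 1 - (Spot price / On-Demand price). A minimal sketch, using the $1.40 On-Demand price and the roughly $0.20 and $0.21 hourly Spot prices quoted above; the function name `discount_pct` is illustrative.

```bash
#!/bin/bash
# discount_pct SPOT_PRICE ON_DEMAND_PRICE
# Prints the percentage saved versus On-Demand, rounded to a whole number.
discount_pct() {
  awk -v spot="$1" -v od="$2" 'BEGIN { printf "%.0f\n", (1 - spot / od) * 100 }'
}

discount_pct 0.20 1.40   # 25% bid scenario: prints 86
discount_pct 0.21 1.40   # 50% and 75% bid scenarios: prints 85
```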
Inneractive is a mobile ad exchange that provides technologies for the buying and selling of mobile advertising space. The company provides mobile app developers with access to an international portfolio of advertising networks, connecting brands to applications. They serve content to more than 450 million unique users a month.
The $9 experiment. Using AWS, the HEP Cloud project at FermiLab successfully demonstrated the ability to add 58,000 cores elastically to their on-premises facility for the CMS experiment. They did this because scientists are competitive.
1st - Check out the Spot Bid Advisor, which we launched earlier this year to guide customers in finding the resources, discount and instance lifetime they need.
The bid advisor has helped many new customers discover what some already knew: that with deliberate instance pool selection it can be straightforward to begin using Spot.
Take this snapshot I took from the tool last week: it shows that even at a 50% max bid there are many different Spot markets that would have gone uninterrupted for over a week, while getting an average discount of 80-90%!
Now you might realize, wouldn't it be great if I could automate using all the pools that suit my application? Let's not get ahead of ourselves. First we need to understand, what is a Spot market?
Hopefully many of you have come across the EC2 Spot fleet API. This one weird API makes it easy to:
Launch 1, 2 or 3000 Spot instances with one API call
You can select whether you’d like to put your capacity into the single cheapest market,
Or opt to diversify to minimize the impact of any individual Spot market
Finally, by introducing Weights you can now scale based on the metric that matters most to you. It might be cores, memory, instances or latency. It is your call.
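As a rough illustration of what that one API call carries, the sketch below prints a Spot fleet request config as JSON. The field names (`SpotPrice`, `TargetCapacity`, `AllocationStrategy`, `IamFleetRole`, `LaunchSpecifications`, `WeightedCapacity`) belong to the Spot fleet request config; the function name `fleet_config`, the AMI ID, the role ARN and the bid price are placeholders and would need real values.

```bash
#!/bin/bash
# fleet_config STRATEGY TARGET_CAPACITY
# Prints a Spot fleet request config. STRATEGY is lowestPrice or diversified.
# The AMI ID, IAM role ARN and bid price are placeholders, not real values.
fleet_config() {
  cat <<EOF
{
  "SpotPrice": "0.10",
  "TargetCapacity": $2,
  "AllocationStrategy": "$1",
  "IamFleetRole": "arn:aws:iam::123456789012:role/example-fleet-role",
  "LaunchSpecifications": [
    { "ImageId": "ami-00000000", "InstanceType": "c3.large", "WeightedCapacity": 1 },
    { "ImageId": "ami-00000000", "InstanceType": "m3.large", "WeightedCapacity": 1 }
  ]
}
EOF
}

fleet_config diversified 50
# To submit it (requires AWS credentials and real values in the config):
#   fleet_config diversified 50 > fleet-config.json
#   aws ec2 request-spot-fleet --spot-fleet-request-config file://fleet-config.json
```

Switching the first argument to `lowestPrice` asks for the single cheapest market instead, and changing the weights changes what one unit of capacity means.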
Just 5 months ago we launched fleet and have continued the AWS trend of rapid innovation based on customer feedback. We’ve launched 4 major features to spot fleet over the last 5 months and we’re nowhere near finished. We’ve also made it so easy!
We’ve made it so easy! I’m sure some of you can’t wait to dive into creating your first EC2 Spot fleet template via API or CLI! See how happy he is!
For those of you like me that just had a cold shiver, I’d like to introduce the EC2 Spot Console, that makes it easy for customers to launch 1, 2 or 3000 EC2 Spot instances!
1) An easy to use interface that lets you launch spare EC2 instances in seconds
2) Helps you select and bid on the EC2 instances that meet your application’s requirements
3) Simple to use dashboard lets you modify and manage your application’s compute capacity
EC2 Spot Console – [Launched Sept 30th!]
Two strategies:
1. Lowest Price - The Spot instances come from the pool with the lowest price. This is the default strategy.
2. Diversified - The Spot instances are distributed across all pools.
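The difference between the two strategies can be sketched in a few lines. This is a toy model, not the fleet's actual placement logic: the pool names and prices are made up, and the way the remainder is handed out in `diversified` is an assumption for illustration.

```bash
#!/bin/bash
# Each input line is "POOL PRICE". These prices are invented for the example.
pools="c3.large/a 0.020
m3.large/a 0.015
r3.large/b 0.030"

# lowest_price: all capacity goes to the single cheapest pool.
lowest_price() {
  sort -t' ' -k2,2n | head -n 1 | cut -d' ' -f1
}

# diversified TARGET: capacity is split evenly across all pools
# (assumption: any remainder goes to the first pools listed).
diversified() {
  awk -v target="$1" '{ name[NR] = $1 }
    END { for (i = 1; i <= NR; i++)
            print name[i], int(target / NR) + (i <= target % NR ? 1 : 0) }'
}

echo "$pools" | lowest_price      # prints m3.large/a
echo "$pools" | diversified 50    # prints 17, 17 and 16 instances per pool
```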
Blocks content.
You can add a time requirement for your Spot requests that you’d like to ensure stay alive.
Do we want a Cloudyn assessment? X% of people. Customers who’ve registered with Cloudyn fall into this category.
First we will do a rapid review of Best practices, then step you through the new ways to leverage Spot for Hadoop, web applications and batch processing.
We will first run through the ‘best practices’ for EC2. While these are not necessary, they’re what the most sophisticated customers do to get high performance, high availability and low costs.
Standard practice
Stateless
Fault tolerant
Multi-AZ
SOA/Loosely coupled design
Spot Practice
Be instance flexible
This can mean c3.large, c3.xlarge,..r3.large
Or m3.large, r3.large, c3.large (ELB)
No seriously, your application can work with other instances (use example, drive this message home hard).
You use c3.xlarge and you can’t AT all use c3.2xlarge? Really? Really? Even if we give you 70% off for twice the c3.xlarge specs?
A common Hadoop model has the concept of Core nodes. Core nodes run TaskTracker and Datanode. Core nodes are very similar to traditional Hadoop slave nodes. They can process data with mappers and reducers and can also store data with HDFS or Datanode.
It is relatively straightforward to scale core nodes up when you need more CPU, more memory or more HDFS space. But it is difficult to scale HDFS down on the fly; it can lead to HDFS corruption.
The introduction of YARN enabled the use of heterogeneous instances. Not all instances will be equal, and you still need to consider resource constraints, but as an example, if you can use the c3.2xlarge you can likely also use the c3.4xlarge and c3.8xlarge. And if you can use the c3.8xlarge, you can almost certainly use the r3.8xlarge.
1 - If, as an example, you often used the c3.2xlarge, you’ll almost certainly find you can use the c3.4xlarge and c3.8xlarge. You might also be able to use the c3.xlarge. These instances have the same chips and the same ratio of cores to memory; hence they are the c3 family.
2 – What a lot of customers then discover is that Spot pricing is NOT correlated with On-Demand prices; more often you’ll see it correlated with cores. Therefore a lot of our Hadoop and EMR customers using c3s will also go ahead and select the same-sized r3 instances.
3 – Finally, because a lot of data can be passed between the map and reduce phases, for the most cost-effective cluster we will select only a single Availability Zone.
As you can see, because we’ve chosen to scale on cores, the console has assigned each instance a weight equal to its core count. As I mentioned earlier, for those of you who prefer it, you can print out the CLI version the console has created.
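With weights equal to core counts, the number of instances needed to reach a target is the target divided by the weight, rounded up. A minimal sketch for the 1000 vCore target used in this example; the function name `instances_needed` is illustrative, and the weights are the vCPU counts of each c3 size.

```bash
#!/bin/bash
# instances_needed TARGET_UNITS WEIGHT
# Number of instances of one type needed to reach the target, rounding up.
instances_needed() {
  echo $(( ($1 + $2 - 1) / $2 ))
}

# Weights equal to vCPU counts: c3.2xlarge = 8, c3.4xlarge = 16, c3.8xlarge = 32.
instances_needed 1000 8    # c3.2xlarge: prints 125
instances_needed 1000 16   # c3.4xlarge: prints 63
instances_needed 1000 32   # c3.8xlarge: prints 32
```

Rounding up is why a fleet built from large instances can sit slightly above its target, for example 32 x 32 = 1024 vCores.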
1000 vCores, at an average saving of 81% off On-Demand. While some capacity fluctuated we had our desired capacity of 1000 for over 99% of the time. During the 30 days we were never more than 16%, or 152 cores below our desired capacity while maintaining an average of 1005 cores.
Hadoop: c3.2xlarge c3.4xlarge c3.8xlarge cc2.8xlarge cr1.8xlarge r3.2xlarge r3.4xlarge r3.8xlarge in a single AZ
We’ve already architected the application to be resilient to instance termination. However, while we might have minimized the impact of an instance termination, we can use the two-minute warning to take it a step further. This time we will capitalize on the two-minute warning by invoking your shutdown scripts.
A script like the following can be placed in a loop and run on startup (e.g. via systemd or rc.local) to detect Spot instance termination. It can then update job task state in DynamoDB and re-insert the job task into the queue if required. We recommend that applications poll for the termination notice at five-second intervals.
If your HDFS cluster lives for less than 6 hours, Spot blocks are an option to save up to 50% with just one additional parameter to the EC2 Spot API.
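In the AWS CLI that additional parameter is `--block-duration-minutes` on `aws ec2 request-spot-instances`, given in whole hours up to 6. The sketch below only validates a duration and prints the command rather than running it; the helper name `valid_block_minutes`, the bid price and the `spec.json` launch specification file are placeholders.

```bash
#!/bin/bash
# valid_block_minutes MINUTES
# Succeeds if MINUTES is a whole number of hours between 1 and 6.
valid_block_minutes() {
  [ "$1" -ge 60 ] && [ "$1" -le 360 ] && [ $(( $1 % 60 )) -eq 0 ]
}

minutes=360
if valid_block_minutes "$minutes"; then
  # Printed, not executed: the bid and launch specification are placeholders.
  echo "aws ec2 request-spot-instances --block-duration-minutes $minutes --spot-price 0.10 --launch-specification file://spec.json"
fi
```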
Another option is to use EMR’s built in file system to write directly to S3. You don’t need to scale HDFS nodes, pay only for the storage you use and it is massively parallel.
To begin saving significantly with your Hadoop cluster, all you need to do is:
Run task nodes separately with EC2 Spot fleet
Consider Spot blocks for core/HDFS nodes
EC2 Spot fleet takes the complexity out of
Heterogeneous instance management
Scale on the unit that matters to you
Accelerate results, time is money: there is no opportunity cost to scaling up and finishing early. Get your jobs done faster and still save up to 90%.
With cloud infrastructure time is money. For your guys faster is better, they’re more valuable. Now the only bottleneck is your code. Spot also opens the opportunity to kick through your low priority queue.
Whether it is stateless web servers, API tiers, micro-services etc.
I think most in the audience would be familiar with this simplified web architecture. However, we’re using fleet here so this group of nodes has different characteristics than an ASG for example.
Why does it have to be different from a normal ASG? As we’ve discussed, there are multiple independent markets available in Spot. These markets are NOT correlated. Customers have for a long time followed a diversification strategy for time-sensitive, mission-critical workloads. With fleet you can scale, and with Spot fleet we’ve made it easy. E.g. if you can use the 5 instance types above across 2 Availability Zones, we know that any one price fluctuation will only impact 1/10 of our capacity, or 10%. Much like an index fund.
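The share of capacity exposed to any single market is simply one divided by the number of pools, assuming capacity is spread evenly. A small sketch; the function name `pool_share_pct` is illustrative.

```bash
#!/bin/bash
# pool_share_pct TYPES ZONES
# With capacity spread evenly, each pool holds 100 / (TYPES * ZONES) percent.
pool_share_pct() {
  awk -v t="$1" -v z="$2" 'BEGIN { printf "%.1f\n", 100 / (t * z) }'
}

pool_share_pct 5 2   # 5 instance types across 2 zones: prints 10.0
pool_share_pct 4 2   # 4 instance types across 2 zones: prints 12.5
```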
A lot of customers running behind an ELB prefer to maintain similar resources on servers. If we take the c3.large instance as our ‘base’, we can get a total of 5 instance types just by going across families: c3.large, c4.large, m3.large, m4.large and r3.large. As we’re running a website and have designed for Multi-AZ, we will use all available zones. Here we have selected them so we can specify the subnets we’re running our service in.
As you can see, because we’ve selected instances with similar performance characteristics, we’ve assigned them all a weight of one. In this example we are scaling on instances. As I mentioned earlier, for those of you who prefer it, you can print out the CLI version the console has created.
If I requested 50 instances, with fleet enabled to run with c3.large, m3.large, m4.large, c4.large, r3.large across two availability zones.
Capacity never dropped below 45 and stayed pretty close to 50. In fact it was running 50 instances over 99.9% of the time over 30 days.
Average price per core is 0.0098 – 85% off on-demand
Distributing capacity across large number of pools provides great results even without any active management strategies – This is a case where you might normally be using on-demand, this is set-up and forget it.
Over the last 90 days, how many instances would I have had at any point in time, and what is the price?
Some additional considerations I’ll cover briefly.
Options for shifting state off web/app servers
Load balancing a fault tolerant application with ELB
Capitalizing on the Two Minute Warning
Cross-zone load balancing - Cross-zone load balancing reduces the need to maintain equivalent numbers of back-end instances in each Availability Zone, and improves your application's ability to handle the loss of one or more back-end instances. However, we still recommend that you maintain approximately equivalent numbers of instances in each Availability Zone for higher fault tolerance.
Because there is a 2 minute warning on Spot we recommend establishing a timeout of 90 seconds for connection draining.
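Both recommendations are load balancer attributes and can be set together. The sketch below prints the attributes JSON and shows, in comments, how it might be passed to `aws elb modify-load-balancer-attributes`; the function name `elb_attributes` and the load balancer name `my-load-balancer` are placeholders.

```bash
#!/bin/bash
# elb_attributes TIMEOUT_SECONDS
# Prints load balancer attributes as JSON: cross-zone load balancing on,
# connection draining on with the given timeout.
elb_attributes() {
  printf '{"CrossZoneLoadBalancing":{"Enabled":true},"ConnectionDraining":{"Enabled":true,"Timeout":%d}}\n' "$1"
}

elb_attributes 90
# To apply it (requires AWS credentials; the load balancer name is a placeholder):
#   aws elb modify-load-balancer-attributes \
#     --load-balancer-name my-load-balancer \
#     --load-balancer-attributes "$(elb_attributes 90)"
```

A 90 second timeout leaves about 30 seconds of the two-minute warning for detection and deregistration.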
In the current iteration of fleet, Spot instances must be attached manually using user data.
We’ve already architected the application to be resilient to instance termination. However, while we might have minimized the impact of an instance termination, we can use the two-minute warning to take it a step further. As I mentioned, we can capitalize on the two-minute warning by detaching the instance from an ELB set to drain connections. To do that we recommend checking the instance metadata regularly, about every 5 seconds, for the two-minute warning.
Here is a simple sample of what some customers will bake into their AMI or bootstrap actions. This small script checks for an instance termination notice (a 404 will be returned if you aren’t in the two-minute warning), then detaches the instance from the ELB if the two-minute warning is active.
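Because a 404 response still returns a body, it can help to match the termination time more strictly than a loose `T...Z` pattern. The following is a sketch of one way to do that, assuming the notice is a UTC timestamp of the form 2015-01-05T18:02:00Z; the function name `is_termination_notice` is illustrative.

```bash
#!/bin/bash
# is_termination_notice BODY
# Succeeds only if BODY looks like a UTC timestamp such as 2015-01-05T18:02:00Z.
# Any other body (for example a 404 error page) is treated as "no notice".
is_termination_notice() {
  printf '%s\n' "$1" | grep -Eq '^[0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9]{2}:[0-9]{2}:[0-9]{2}Z$'
}

is_termination_notice "2015-01-05T18:02:00Z" && echo "terminating"
is_termination_notice "404 - Not Found" || echo "not terminating"
```

In a polling script the argument would be the output of the `curl -s` call against the `spot/termination-time` metadata path.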
If you’re currently using ASG and would like to scale your capacity using Spot, you’re not alone. There are many reasons to use Spot as part of a broader ASG strategy that includes RIs and OD: for example, to scale to meet peak load, or to address a marginal cost vs. benefit equation, such as when you have 2 servers that strain under heavy load but the load doesn’t warrant the cost of a 3rd running On-Demand.
It is actually as easy as a tick box in the console during setup, but first we check the EC2 Spot Bid Advisor, then select a bid price. I’ve selected the EU-West-1 region and placed a bid at the On-Demand price. You can see the 3 markets (one per AZ) for c3.large are approximately 83% cheaper than OD. You can also copy another ASG config and simply add ‘spot’ to it.
And that’s exactly what we see within every large-scale organization. A single company can have many different units. There is the data science or quant team, discovering new opportunities to optimize the supply chain using 1c per core-hour compute via EC2 Spot instances. There are teams building net-new applications, trying to build the next million (or billion) dollar business unit; here is one that may have been costing just $100 a week, but then one new line takes off rapidly and capacity dynamically scales to meet customer demand. Of course there is the traditional IT arm, delivering internal services like HR systems and making heavy use of Reserved Instances. Finally there is test and development, as you go through the process to deliver new versions, making use of Spot, On-Demand and Reserved Instances.
There are 3 different ways to get started today, you can do all three but please do at least one.
Try out bid advisor
Try the NEW Spot Console
If you’ve not used Spot at all, block some time and save up to 50%