EC2 Spot Instances let you save up to 90% compared to On-Demand prices by bidding on spare Amazon Elastic Compute Cloud (Amazon EC2) capacity. We will review Amazon EC2 Spot Instances, their benefits, key new features, and best practices for getting started and optimizing your cost savings. We will walk through popular EC2 Spot and AWS Lambda use cases, such as batch processing and web architectures.
Learning Objectives:
• Understand EC2 Spot features, including recent product updates
• Set up a stateless web tier with EC2 Spot while maintaining high availability
• Set up a large-scale batch processing architecture with EC2 Spot
• Identify sample use cases, best practices, and tips for using EC2 Spot
Who Should Attend:
• Security Administrators, IT Auditors, DevOps Engineers, and Developers
2. What is Spot?
Name your own price for EC2 compute
• A market where the price of compute changes based upon supply and demand
• When your bid price exceeds the Spot market price, the instance is launched
• The instance is terminated (with a 2-minute warning) if the market price exceeds your bid price
• Unused On-Demand instances
3. About Spot…
• Spot prices are determined by supply and demand
• There are hundreds of uncorrelated Spot markets
• Prices can, but often don't, fluctuate wildly
4. Spot is not one market
A Spot market exists for each combination of:
• Type: General-purpose (M1, M3, T2); Compute-optimized (C1, CC2, C3, C4); Memory-optimized (M2, CR1, R3, M4); Dense-storage (HS1, D2); I/O-optimized (HI1, I2); GPU (CG1, G2); Micro (T1, T2)
• Size: .micro, .medium, .large, .xlarge, .2xlarge, .4xlarge, .8xlarge
• OS: Linux, Windows
• AZ: -1a, -1b, -1c, …
5. Uncorrelated pools of Spot capacity
Each instance family (r3) and size (4xlarge), in each Availability Zone (us-east-1b)
11. Check the Price History
Describe Spot Price History API:
• Provides historical prices on a per-pool basis
• Goes back 90 days (3 months)
• Popular instance types tend to have somewhat more volatile Spot prices
• Older generations (including c1.xlarge, m1.small, cr1.8xlarge, and cc2.8xlarge) tend to be much more stable and generally cheaper
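The price-history check above can be sketched with boto3. The ranking heuristic (standard deviation of observed prices), the region, and the pool names are illustrative assumptions, not part of the AWS API:

```python
# Sketch: rank Spot capacity pools by price stability from
# DescribeSpotPriceHistory-shaped records.
from statistics import pstdev

def rank_pools_by_stability(records):
    """records: dicts with 'AvailabilityZone', 'InstanceType', 'SpotPrice'.
    Returns (az, instance_type) pools sorted least to most volatile."""
    pools = {}
    for r in records:
        key = (r["AvailabilityZone"], r["InstanceType"])
        pools.setdefault(key, []).append(float(r["SpotPrice"]))
    # Population std dev of observed prices as a simple volatility measure
    return sorted(pools, key=lambda k: pstdev(pools[k]) if len(pools[k]) > 1 else 0.0)

def fetch_history(instance_types, days=90):
    """Pull up to 90 days of history (requires boto3 and AWS credentials)."""
    import boto3
    from datetime import datetime, timedelta
    ec2 = boto3.client("ec2", region_name="us-east-1")
    resp = ec2.describe_spot_price_history(
        InstanceTypes=instance_types,
        ProductDescriptions=["Linux/UNIX"],
        StartTime=datetime.utcnow() - timedelta(days=days),
    )
    return resp["SpotPriceHistory"]
```

`rank_pools_by_stability(fetch_history(["c3.large"]))` would then surface the calmest pools first.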
12. Capacity pools
Set of EC2 instances of the same properties:
• Availability zone
• Product/Operating system (Linux/Unix or Windows)
• EC2 instance type
Each EC2 capacity pool has its own:
• Availability – number of Spot instances
• Price – based on supply and demand
13. Use Multiple Capacity Pools
• Run applications across multiple capacity pools to
reduce your application’s sensitivity to price spikes that
affect a pool
• In general, there is very little correlation between prices
in different capacity pools.
• For example, if you run in five different pools, your price swings and interruptions can be cut by 80%.
14. Use Multiple Capacity Pools
Run across multiple Availability Zones, using:
• Auto Scaling
• Spot Fleet API
Run application across different sizes of instances within
the same family
• Amazon EMR takes this approach
Your application could figure out how many vCPUs it is
running on, and then launch enough worker threads to
keep all of them occupied.
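The "launch enough worker threads to keep every vCPU occupied" idea from the last bullet can be sketched in a few lines of Python; the worker function and task list are placeholders:

```python
# Size the worker pool to however many vCPUs this instance happens to have,
# so the same code runs efficiently on any instance size in the family.
import os
from concurrent.futures import ThreadPoolExecutor

def run_on_all_vcpus(tasks, worker):
    """Fan tasks out over one worker thread per vCPU."""
    vcpus = os.cpu_count() or 1
    with ThreadPoolExecutor(max_workers=vcpus) as pool:
        return list(pool.map(worker, tasks))
```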
15. Use normalized pools of compute
CPU and cores
• What kind of performance does your application require? How many cores does your application need?
Memory/core
• How much memory per core does your application need?
Networking
• Does your application need high, moderate, or low network bandwidth?
Disk
• How much local disk does your application need?
16. What about bidding strategy?
• You pay only the current market price, but bid what you are willing to pay
• You pay the price in effect as you enter the hour, and are billed at the end of the hour
• If you get interrupted, you don't pay for that hour
• Bid only what you are willing to pay (by default, bids are limited to 10x the On-Demand price)
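A minimal sketch of those bidding rules as code — bid what you are willing to pay, subject to the default 10x On-Demand cap (the prices used below are illustrative):

```python
def clamp_bid(willing_to_pay, on_demand_price, cap_multiple=10):
    """Return a bid no higher than the default Spot bid ceiling
    (cap_multiple * On-Demand price)."""
    return min(willing_to_pay, cap_multiple * on_demand_price)
```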
17. AWS Spot Labs
Finding the best pools of compute capacity
• https://github.com/awslabs/aws-spot-labs
• Helps find capacity pools (defined as instance type and AZ) with lower price volatility by ordering pools by how long it has been since the Spot price last exceeded the bid price. It uses the AWS CLI to obtain Spot price history data programmatically.
18. Using the Spot Tools Lab
python get_spot_duration.py \
  --region us-east-1 \
  --product-description 'Linux/UNIX' \
  --bids c3.xlarge:0.105,c3.2xlarge:0.21,c3.4xlarge:0.42,c3.8xlarge:0.84,c4.xlarge:0.110,c4.2xlarge:0.220,c4.4xlarge:0.440,c4.8xlarge:0.880,cc2.8xlarge:1.000,c1.xlarge:0.26 \
  --hours 168
Note:
• Prices as of 8/15/2015
• AZ mappings may differ
• 168 hours = 1 week
• In this example, bidding the On-Demand price
19. Build stateless, distributed, scalable applications
Choose which instance types fit your workload the best
Ingest price feed data for AZs and regions
Make run time decisions on which Spot pools to launch in based on
price and volatility
Manage interruptions
Monitor and manage market prices across AZs and instance types
Manage the capacity footprint in the fleet
And all of this while you don’t know where the capacity is
Serve your customers
Helping with the undifferentiated heavy lifting
20. Instead of writing all that code to manage Spot Instances,
simply specify:
Target Capacity - The number of EC2 instances that you want in
your fleet.
Maximum Bid Price - The maximum bid price that you are willing
to pay.
Launch Specifications - number and types of instances, AMI ID, VPC, subnets or AZs, etc.
IAM Fleet Role - The name of an IAM role. It must allow EC2 to
launch and terminate instances on your behalf.
Introducing Spot Fleet
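The four inputs above map directly onto a RequestSpotFleet call. A hedged boto3 sketch — the AMI ID, instance type, and IAM role ARN below are placeholders you must replace with your own:

```python
# Assemble the Spot Fleet request config from the four inputs the slide lists.
def build_fleet_request(target_capacity, bid, launch_specs, fleet_role_arn):
    return {
        "TargetCapacity": target_capacity,   # instances you want in the fleet
        "SpotPrice": str(bid),               # maximum bid price per instance-hour
        "LaunchSpecifications": launch_specs,
        "IamFleetRole": fleet_role_arn,      # lets EC2 launch/terminate for you
    }

def request_fleet(config):
    """Submit the fleet request (requires boto3 and AWS credentials)."""
    import boto3
    return boto3.client("ec2").request_spot_fleet(SpotFleetRequestConfig=config)
```

For example, `build_fleet_request(10, 0.105, [{"ImageId": "ami-12345678", "InstanceType": "c3.xlarge"}], "arn:aws:iam::123456789012:role/fleet-role")` yields a config ready to submit.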
25. Stateless Web/App/API Architecture with Spot
[Diagram: Elastic Load Balancing in front of Availability Zones A and B; an On-Demand Auto Scaling group of stateless web servers plus a Spot Auto Scaling group of stateless web servers in each AZ, all sharing externally stored session state data]
26. Web Application - Auto Scaling
Multiple Auto Scaling groups
• On-demand instances for fallback.
• Multiple EC2 Spot instance Auto Scaling groups
• Each Spot Auto Scaling group using a different capacity pool
(e.g. AZ, bid, Instance size, Instance type)
Auto Scaling groups behind the same Elastic Load
Balancer.
Pick the right instance type for the job based on the price history.
27. Auto Scaling Policies
Aggressive scaling policies for Spot Auto Scaling groups
(e.g., scale up at 75% CPU utilization and scale down at 25% CPU utilization, with a large capacity range)
More conservative scaling policies for On-Demand Auto Scaling groups.
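One way to express that asymmetry as data. The 75%/25% Spot thresholds come from the slide; the conservative On-Demand numbers and the step sizes are illustrative assumptions:

```python
# Aggressive thresholds for Spot groups, conservative ones for On-Demand,
# ready to feed into CloudWatch alarms / scaling policies.
def scaling_thresholds(group_kind):
    if group_kind == "spot":
        # Scale up early and in big steps; shed capacity quickly when idle.
        return {"scale_up_cpu": 75, "scale_down_cpu": 25, "step": 4}
    # On-Demand fallback: react late and in small steps (assumed values).
    return {"scale_up_cpu": 85, "scale_down_cpu": 15, "step": 1}
```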
28. Session state for the web application can be stored in DynamoDB.
• Data replicated across Availability Zones.
You can also choose other databases to maintain state in your architecture.
• Amazon RDS using Multi-AZ deployments
• Amazon ElastiCache
Where to store the state?
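A minimal sketch of the DynamoDB option, using the low-level item format that the Lambda code later in this deck also uses. The table and attribute names are assumptions:

```python
# Shape a web session record for dynamodb.put_item so any web server
# (Spot or On-Demand) can recover the session after an interruption.
def session_item(session_id, state, table="web-sessions"):
    return {
        "TableName": table,
        "Item": {
            "session_id": {"S": session_id},  # hash key (assumed schema)
            "state": {"S": state},            # serialized session payload
        },
    }

def save_session(item):
    """Write the session record (requires boto3 and AWS credentials)."""
    import boto3
    boto3.client("dynamodb").put_item(**item)
```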
29. Spot termination considerations
Availability of Spot instances can vary based on supply and
demand
Architect application to be resilient to instance termination
When the Spot price exceeds the price you named (i.e., the bid price), the instance receives a two-minute warning that it will be terminated
30. Spot termination considerations
Check for the two-minute Spot instance termination notice every 5 seconds, using a script invoked at instance launch. Upon notification:
• Place any session information into DynamoDB
• Use IAM roles so that the Spot instances can deregister themselves from the ELB upon termination notice
31. Since the Auto Scaling groups span across multiple
availability zones, we highly recommend enabling cross-
zone load balancing for the load balancer.
To allow in-flight requests to complete when de-registering
Spot instances that are about to be terminated, connection
draining can be enabled on the load balancer with a
timeout of 90 seconds.
Elastic Load Balancing
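Both settings can be applied with a single call against the classic ELB API. A boto3 sketch; the load balancer name is a placeholder:

```python
# Cross-zone load balancing + connection draining (90 s) for a classic ELB.
def elb_attributes(drain_timeout=90):
    return {
        "CrossZoneLoadBalancing": {"Enabled": True},
        "ConnectionDraining": {"Enabled": True, "Timeout": drain_timeout},
    }

def apply_elb_attributes(lb_name, attrs):
    """Apply the attributes (requires boto3 and AWS credentials)."""
    import boto3
    boto3.client("elb").modify_load_balancer_attributes(
        LoadBalancerName=lb_name,
        LoadBalancerAttributes=attrs,
    )
```

`apply_elb_attributes("my-load-balancer", elb_attributes())` would configure the load balancer used in the sample script on the next slide.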
32. Sample script
#!/bin/bash
while true
do
  if curl -s http://169.254.169.254/latest/meta-data/spot/termination-time | grep -q ".*T.*Z"; then
    # Marked for termination: deregister from the ELB and flush session state.
    instance_id=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)
    aws elb deregister-instances-from-load-balancer \
      --load-balancer-name my-load-balancer \
      --instances $instance_id
    /env/bin/flushsessiontoDBonterminationscript.sh
  else
    # Spot instance not yet marked for termination.
    sleep 5
  fi
done
33. Web Application Architecture with Spot
[Diagram: same layout as slide 25 — Elastic Load Balancing across Availability Zones A and B, an On-Demand Auto Scaling group and per-AZ Spot Auto Scaling groups of stateless web servers, with shared session state data]
36. Batch Processing with Amazon EC2 Spot
Batch-oriented applications can leverage on-demand processing using EC2 Spot to save up to 90% on cost:
• Claims processing
• Large-scale transformation
• Media processing
• Multi-part data processing work
You can also leverage Amazon EMR with Spot instances.
37. Batch Processing with Amazon EC2 Spot
• Multi-part job processing architecture
• Auto Scaling groups set up a heterogeneous, scalable "grid" of EC2 Spot instances, drawn from multiple capacity pools, as worker nodes
• Use S3 to invoke AWS Lambda upon object upload
• Use SQS for decoupling
• Use DynamoDB for tracking job status
• Complete large batch processing tasks in parallel
38. About Lambda and SQS
AWS Lambda is a compute service that runs your code in
response to events and automatically manages the
compute resources for you, making it easy to build
applications that respond quickly to new information.
Amazon Simple Queue Service (SQS) is a fast, reliable,
scalable, fully managed message queuing service to
decouple components.
Depending on the application’s needs, multiple SQS queues
might be required for functions and priorities.
39. Batch Processing with Amazon EC2 Spot
[Diagram: an object uploaded to the input S3 bucket triggers a Lambda function that puts jobs into the job SQS queue and DynamoDB. An EC2 worker fleet — an On-Demand Auto Scaling group plus Spot Auto Scaling groups 1 and 2 across Availability Zones A and B — checks for jobs in the queue, updates job status (start time, SLA end time, etc.) in DynamoDB, shares files via EFS, and writes results to the output S3 bucket. The Auto Scaling groups scale up based on queue depth and scale down based on CPU utilization CloudWatch metrics.]
41. AWS Lambda function for SQS and DynamoDB updates

// dependencies
var AWS = require('aws-sdk');

// get reference to clients
var s3 = new AWS.S3();
var sqs = new AWS.SQS();
var dynamodb = new AWS.DynamoDB();

console.log('Loading function');

exports.handler = function(event, context) {
    // Read options from the event.
    var srcBucket = event.Records[0].s3.bucket.name;
    // Object key may have spaces or unicode non-ASCII characters.
    var srcKey = decodeURIComponent(event.Records[0].s3.object.key.replace(/\+/g, " "));

    // prepare SQS message
    var params = {
        MessageBody: 'object ' + srcKey + ' ',
        QueueUrl: 'https://sqs.us-east-1.amazonaws.com//demojobqueue',
        DelaySeconds: 0
    };

    // send SQS message
    sqs.sendMessage(params, function (err, data) {
        if (err) {
            // an error occurred
            console.error('Unable to put object ' + srcKey + ' into SQS queue due to an error: ' + err);
            context.fail(srcKey, 'Unable to send message to SQS');
        } else {
            // define DynamoDB table variables
            var tableName = "demojobtable";
            var datetime = new Date().getTime().toString();

            // Put item into DynamoDB table where srcKey is the hash key and datetime is the range key
            dynamodb.putItem({
                "TableName": tableName,
                "Item": {
                    "srcKey": {"S": srcKey},
                    "datetime": {"S": datetime}
                }
            }, function(err, data) {
                if (err) {
                    console.error('Unable to put object ' + srcKey + ' into DynamoDB table due to an error: ' + err);
                    context.fail(srcKey, 'Unable to put data to DynamoDB Table');
                } else {
                    console.log('Successfully put object ' + srcKey + ' into SQS and DynamoDB');
                    context.succeed(srcKey, 'Data put into SQS and DynamoDB');
                }
            });
        }
    });
};
44. Batch Processing with Amazon EC2 Spot
• Worker nodes get job parts from SQS and perform single tasks based on the job task state in DynamoDB
• Store the input objects in a file system such as Amazon Elastic File System (Amazon EFS), local instance store, or Amazon Elastic Block Store (EBS)
• Each job can be further split into multiple sub-parts if there is a mechanism to stitch the outputs together
• Once completed, the objects are uploaded back to S3 using multipart upload.
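The worker-node loop described above can be sketched like this. The queue URL and the process() callback are placeholders, and the boto3 calls are kept separate so the message-handling logic stands on its own:

```python
# Worker skeleton: pull a job from SQS, do the work, delete the message.
def handle_messages(messages, process):
    """Process a batch of SQS-style messages; return receipt handles to delete."""
    done = []
    for m in messages:
        process(m["Body"])                  # do the actual unit of work
        done.append(m["ReceiptHandle"])     # delete only after success
    return done

def poll_forever(queue_url, process):
    """Long-poll the job queue (requires boto3 and AWS credentials)."""
    import boto3
    sqs = boto3.client("sqs")
    while True:
        resp = sqs.receive_message(QueueUrl=queue_url,
                                   MaxNumberOfMessages=1,
                                   WaitTimeSeconds=20)  # long polling
        for handle in handle_messages(resp.get("Messages", []), process):
            sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=handle)
```

Deleting the message only after processing means an interrupted Spot worker simply lets the message reappear for another worker after the visibility timeout.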
45. Batch Processing with Amazon EC2 Spot
[Diagram: same architecture as slide 39 — an S3 upload triggers Lambda to put jobs into SQS and DynamoDB; On-Demand and Spot worker Auto Scaling groups across Availability Zones A and B check the queue, update job status in DynamoDB, use EFS, and write results to the output S3 bucket.]
46. More automation?
Use a Lambda function to dynamically manage Auto
Scaling groups based on the Spot market
• The Lambda function could periodically invoke the EC2 Spot
APIs to assess market prices and availability and respond by
creating new Auto Scaling launch configurations and groups
automatically.
• This function could also delete any Spot Auto Scaling groups
and launch configurations that have no instances.
AWS Data Pipeline can be used to invoke the Lambda function (via the AWS CLI) at regular intervals by scheduling pipelines.
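The clean-up half of that Lambda function might look like this sketch. The "spot-" name prefix used to identify dynamically created groups is an assumption, as is the convention that a group and its launch configuration share a name:

```python
# Delete Spot Auto Scaling groups (and their launch configurations)
# that no longer hold any instances.
def empty_spot_groups(groups, prefix="spot-"):
    """Filter describe_auto_scaling_groups output down to deletable groups."""
    return [g["AutoScalingGroupName"] for g in groups
            if g["AutoScalingGroupName"].startswith(prefix)
            and not g.get("Instances")]

def clean_up(prefix="spot-"):
    """Run the clean-up (requires boto3 and AWS credentials)."""
    import boto3
    asg = boto3.client("autoscaling")
    groups = asg.describe_auto_scaling_groups()["AutoScalingGroups"]
    for name in empty_spot_groups(groups, prefix):
        asg.delete_auto_scaling_group(AutoScalingGroupName=name)
        # Assumes the launch configuration was created with the group's name.
        asg.delete_launch_configuration(LaunchConfigurationName=name)
```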
47. Automated Batch Architecture with Spot
[Diagram: the batch architecture of slide 39, with AWS Data Pipeline added — it invokes a Lambda function on a schedule, and that function manages the Spot Auto Scaling groups based on the Spot market. Uploads to S3 still trigger a Lambda function that puts jobs into DynamoDB and SQS; On-Demand and Spot worker groups across Availability Zones A and B check the queue, update job status (start time, SLA end time, etc.) in DynamoDB, and write output to the S3 bucket, with EFS for shared storage.]
48. Further cost optimization with Trusted Advisor
Save money on AWS by eliminating unused and idle resources
Cost Optimization TA Checks:
• Amazon EC2 Reserved Instances Optimization
• Low Utilization Amazon EC2 Instances
• Idle Load Balancers
• Underutilized Amazon EBS Volumes
• Unassociated Elastic IP Addresses
• Amazon RDS Idle DB Instances
49. AWS re:Invent 2015 – October 6-9
AWS re:Invent is the largest annual gathering of the global cloud community. Whether you are an existing customer or new
to the cloud, AWS re:Invent will provide you with the knowledge and skills to refine your cloud strategy, improve developer
productivity, increase application performance and security, and reduce infrastructure costs.
Though AWS re:Invent tickets are sold out, you can still register to view the Live Stream Broadcasts of the keynote
addresses and select technical sessions on October 7 and October 8. Register now.
Details:
Wednesday, October 7
9:00am - 10:30am PT: Andrew Jassy, Sr. Vice President, AWS
11:00am - 5:15pm PT: 5 of the most popular breakout sessions (to be announced)
Thursday, October 8
9:00am - 10:30am PT: Dr. Werner Vogels, CTO, Amazon
11:00am - 6:15pm PT: 6 of the most popular breakout sessions (to be announced)
Register now for the Live Stream Broadcast by submitting your email where prompted on the AWS re:Invent home page.
Stay Connected: Follow event activities on Twitter @awsreinvent (#reinvent), or like us on Facebook.
52. World's Largest F500 Cloud Run
Transforming drive design to store the world's data
• Workload: new drive head design — submit jobs, orchestrate HPC clusters over VPC; encrypt, route data to AWS, return results
• Cluster: 70,908 cores (729 TFLOPS) on c3 and r3 Spot Instances with Intel E5-2670 v2, plus EBS
• Ran 1 million drive head designs (70.75 core-years) in 8 hours, not 30 days — 90x throughput
• 3 days from idea to running
• Cost: $5,594
53. AWS Delivered Unheard-of Processing
• 39 years of science
• 10,600 AWS instances
• Saved the equivalent of $40M in infrastructure
• 10 million compounds screened
• 39 drug-design years in 11 hours for a cost of… $4,232
• 3 promising compounds identified
54. Scaling Hadoop Jobs with Spot
http://engineering.bloomreach.com/strategies-for-reducing-your-amazon-emr-costs/
Bloomreach launches 1,500 to 2,000 Amazon EMR clusters and runs 6,000 Hadoop jobs every day.
55. Continuous Integration & Testing with Spot
• Tapjoy - Premier Mobile Ad Network Across iOS & Android
• Global Network (435 Million Monthly Reach)
• Jenkins + Spot Instances
• https://github.com/bwall/ec2-plugin (thanks to an RIT senior project)
• Go wide during business hours, scale back in the evenings.
Automatically kicks online at 06:00ET
• Workers scale horizontally to support dozens of simultaneous regression
tests spread out over dozens of workers
• Jenkins automatically guards against Spot termination
56. Queue-based media transcoding
Ooyala
• Video technology platform that serves ESPN, Bloomberg, ...
• Uses a combo of On-Demand/RI/Spot to ensure it can cover predicted volumes while keeping costs low
• http://aws.amazon.com/solutions/case-studies/ooyala/
Vevo
• Library of over 75,000 HD videos
• Must be able to rapidly transcode the library to a new screen format
• Can spin up hundreds of Spot instances to transcode the entire library in a matter of days (instead of weeks)