2. Spot 101
Customers can bid on unused AWS capacity.
The tradeoff is that if capacity becomes constrained, your Spot
Instances may be terminated after a two-minute warning.
Workloads running on Spot therefore need to be fault-tolerant, or
able to be restarted at a later time.
3. Spot 101
Each instance family, each instance size, each availability zone in
each region is a separate spot market.
The Spot market is where the price of compute fluctuates based on
supply and demand.
You’ll never pay more than your bid - you’ll only ever pay the market
price. When the market price exceeds your bid, you get two minutes
to wrap up.
6. A Review of Spot Fleet
Launch many spot instances with one call.
Select whether you want instances in the cheapest market, or opt to
diversify to reduce the impact of market variability.
This diversification option is what we are using to maximise availability!
Weighting allows you to scale based on cores, memory, latency,
etc.
7. Challenge 1 – ELB Registration
Q: How do we register a Spot Fleet Instance to an ELB/ALB?
A: Use EC2 User Data.
instance_id=$(curl -s http://169.254.169.254/latest/meta-data/instance-id);
aws elb register-instances-with-load-balancer \
--load-balancer-name my-load-balancer --instances "$instance_id";
As there is no mechanism to automatically register a Spot Instance provisioned via a Spot Fleet Request
with an ELB/ALB, this needs to be implemented to distribute load across the fleet.
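The command above targets a Classic ELB. For an ALB, instances are registered with a target group instead. A minimal sketch, assuming the target group ARN and instance ID are passed in by the caller; the function names are illustrative and not from the original deck:

```shell
#!/bin/sh
AWS="${AWS:-aws}"   # set AWS=echo to print the calls instead of running them

# Register an instance with an ALB target group.
# $1 = target group ARN (placeholder), $2 = instance ID
register_with_alb() {
    $AWS elbv2 register-targets \
        --target-group-arn "$1" \
        --targets "Id=$2"
}

# Remove an instance from an ALB target group.
deregister_from_alb() {
    $AWS elbv2 deregister-targets \
        --target-group-arn "$1" \
        --targets "Id=$2"
}
```

In EC2 User Data, register_with_alb would be called with the instance ID read from the instance metadata.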
8. Challenge 2 – EIP Attachment
Q: How do we associate an Elastic IP Address to a Spot Fleet Instance?
A: Use EC2 User Data.
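instance_id=$(curl -s http://169.254.169.254/latest/meta-data/instance-id);
eip_allocation_id=$(aws ec2 allocate-address --domain vpc \
--query AllocationId --output text);
aws ec2 associate-address --instance-id "$instance_id" \
--allocation-id "$eip_allocation_id";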
This will be required for services that need direct connectivity to the Internet such as NAT hosts and
proxy servers.
But it’s a bit more complicated…
An alternative is to enable automatic public IP addressing – but this is a VPC-wide setting.
This use case has been raised with the service team and a feature request has been created.
9. Challenge 3 – De-registration from an ELB
Q: How do we de-register a terminated Spot Instance from an ELB/ALB?
A: Run a script that polls the instance metadata for a termination time.
# Poll the instance metadata every 5 seconds for a Spot termination time.
while true
do
  if curl -s http://169.254.169.254/latest/meta-data/spot/termination-time |
    grep -q '.*T.*Z';
  then
    instance_id=$(curl -s http://169.254.169.254/latest/meta-data/instance-id);
    aws elb deregister-instances-from-load-balancer \
      --load-balancer-name my-load-balancer \
      --instances "$instance_id";
    # Stop polling once the instance has been deregistered.
    break
  else
    sleep 5
  fi
done;
We need a mechanism to ensure that ELB traffic is not routed through to an instance that’s about to be
terminated.
This is a well-documented process.
10. Challenge 3 – De-registration from an ELB
A (v2.0): Use a Lambda function to deregister the instance upon termination.
The previous solution does not cater for instances that are manually terminated. Also, it cannot be tested.
Finally, we can “think bigger” and build a solution that applies to ALL instances – on-demand or spot.
When an EC2 instance is terminated, an EC2 Instance State-change Notification event is raised in
CloudWatch Events. When an EC2 Instance changes state, a Lambda function can be executed.
This function would first check that the instance was terminated, then would de-register the instance.
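The event wiring described above can be sketched with the AWS CLI. This is a minimal sketch, not the deck's actual implementation; the rule name and Lambda ARN are placeholders supplied by the caller:

```shell
#!/bin/sh
AWS="${AWS:-aws}"   # set AWS=echo to print the calls instead of running them

# Event pattern matching EC2 instances that enter the "terminated" state.
EVENT_PATTERN='{"source":["aws.ec2"],"detail-type":["EC2 Instance State-change Notification"],"detail":{"state":["terminated"]}}'

# Create the CloudWatch Events rule and point it at a Lambda function.
# $1 = rule name (placeholder), $2 = Lambda function ARN (placeholder)
create_termination_rule() {
    rule_name="$1"
    lambda_arn="$2"
    $AWS events put-rule --name "$rule_name" --event-pattern "$EVENT_PATTERN"
    $AWS events put-targets --rule "$rule_name" --targets "Id=1,Arn=$lambda_arn"
}
```

The Lambda function also needs a resource policy that allows CloudWatch Events to invoke it, which this sketch does not set up.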
Q: But how do we ensure that requests don’t get directed to an instance that’s
about to be terminated?
Utilise existing health check functionality within the ELB/ALB.
When our termination notice is posted to our Spot Instance, poison the health check!
Alternatively, you can use a scheduled CloudWatch Event to routinely initiate the Lambda function.
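One way to poison the health check is to have the ELB/ALB request a file served by the instance, and delete that file when the termination notice appears. A minimal sketch, assuming the health check targets a hypothetical file at /var/www/html/health.html; the path and function names are illustrative:

```shell
#!/bin/sh
# Hypothetical path of the file that the ELB/ALB health check requests.
HEALTH_FILE="${HEALTH_FILE:-/var/www/html/health.html}"
METADATA_URL="http://169.254.169.254/latest/meta-data/spot/termination-time"

# Succeeds if the given text looks like a termination timestamp,
# e.g. 2017-03-01T12:34:56Z. Any other response body is rejected.
is_termination_time() {
    printf '%s\n' "$1" | grep -Eq '^[0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9:]+Z$'
}

# Remove the health check file so the load balancer marks the instance unhealthy.
poison_health_check() {
    rm -f "$HEALTH_FILE"
}

# Poll every 5 seconds and poison the health check once a notice appears.
watch_for_termination() {
    while true; do
        notice=$(curl -s "$METADATA_URL")
        if is_termination_time "$notice"; then
            poison_health_check
            break
        fi
        sleep 5
    done
}
```

Running watch_for_termination in the background from EC2 User Data would let the load balancer drain the instance within the two-minute warning, provided the health check interval and unhealthy threshold are short enough.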
11. Challenge 4 – Spot Price Variability
Q: What happens if the Spot market price goes up beyond our bid price?
A: Handle the outage or run on-demand instances in parallel.
This needs to be considered if we are to have any guarantee of service, especially for production
environments.
Deploying diversified Spot Fleets helps greatly here, but there is still a chance that ALL markets in the
fleet could be outbid… Running on-demand instances in parallel is the typical answer.
But surely there is a better way?
12. Challenge 4 – Spot Price Variability
A (v2.0): Pre-empt the market.
Each Spot Fleet request has an associated CloudWatch Metric called “EligibleInstancePoolCount”, which
counts the pools that the Fleet Request can currently fulfil instances from.
We can configure a CloudWatch Alarm that triggers when the number of eligible pools drops below a
certain threshold – say, 2 pools.
Our On-Demand instances running in parallel can now be replaced with an AutoScaling group that
typically has no instances running. When the alarm triggers, a Lambda function is invoked to manipulate
the AutoScaling group configuration and provision On-Demand instances.
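The alarm and the scaling action can be sketched with the AWS CLI. This is a rough equivalent of what the Lambda function would do, not the deck's actual code. It assumes the alarm notifies an SNS topic that the Lambda function subscribes to; the alarm name, period and statistic are illustrative choices:

```shell
#!/bin/sh
AWS="${AWS:-aws}"   # set AWS=echo to print the calls instead of running them

# Alarm when the fleet's eligible pool count drops below a threshold.
# $1 = Spot Fleet request ID, $2 = threshold, $3 = SNS topic ARN (assumption)
create_pool_alarm() {
    $AWS cloudwatch put-metric-alarm \
        --alarm-name "spot-fleet-low-pools-$1" \
        --namespace AWS/EC2Spot \
        --metric-name EligibleInstancePoolCount \
        --dimensions "Name=FleetRequestId,Value=$1" \
        --statistic Minimum \
        --period 60 \
        --evaluation-periods 1 \
        --comparison-operator LessThanThreshold \
        --threshold "$2" \
        --alarm-actions "$3"
}

# Resize the normally empty On-Demand AutoScaling group.
# $1 = AutoScaling group name, $2 = number of On-Demand instances
scale_on_demand() {
    $AWS autoscaling update-auto-scaling-group \
        --auto-scaling-group-name "$1" \
        --max-size "$2" \
        --desired-capacity "$2"
}
```

Calling scale_on_demand with 0 once the alarm clears returns the group to its idle state.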
13. Challenge 5 – T2 Instances
Q: What happens if my workload runs on T2 Instance types?
A: Use larger instance types.
There are currently no Spot markets for T2 instance types, meaning that workloads may have to run on
m3.medium or larger instance types to take advantage of Spot.
Deploying Spot Fleets across a diverse range of pools means that we don’t need to be constrained by a
particular instance type. It’s very rare that a workload has adverse performance with more resources!
14. Challenge 5 – T2 Instances
Let’s look at an example. A t2.small instance type has an hourly rate of $0.032. This instance type
features a single (burstable) vCPU and 2GiB of memory.
Compare this with an m3.medium instance. This instance type gives you a single (dedicated)
vCPU and 3.75GiB of memory.
In all three AZs in the Sydney region, the market price of an m3.medium instance type has not
exceeded $0.020 over the last 3 months...
… and our average hourly rate in the most expensive AZ still ended up being $0.0128.
More predictable performance and more memory at 40% of the cost!
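The 40% figure follows from dividing the average m3.medium Spot rate by the t2.small hourly rate. A small worked calculation; the function name is illustrative:

```shell
#!/bin/sh
# Print a Spot price as a whole-number percentage of an On-Demand price.
# $1 = Spot hourly rate, $2 = On-Demand hourly rate
percent_of() {
    awk -v spot="$1" -v ondemand="$2" \
        'BEGIN { printf "%.0f\n", spot / ondemand * 100 }'
}

# Average m3.medium Spot rate vs the t2.small hourly rate: prints 40
percent_of 0.0128 0.032
```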
15. Challenge 5 – T2 Instances
Now compare a t2.small with an m3.large instance.
The m3.large instance type gives you two (dedicated) vCPUs and 7.50GiB of memory.
In all three AZs in the Sydney region, the market price of an m3.large instance type has not
exceeded $0.0318 over the last 3 months...
We can further diversify our fleet to also utilize m3.large instances in the Spot market, and still make
savings over what would have been charged if we were running t2.small instances!
16. Challenge 6 – Automation
Q: How do we automate all of this?
A: In steps the Automation team.
This needs to be considered if we are to have any guarantee of service, especially for production
environments.
The advantage of using CloudWatch Events and Alarms as triggers for Lambda functions is that the
Lambda functions should be able to be implemented in a single account and invoked by each stack.
That said, the team have spent a LOT of time working through the complexities of building Spot Fleet
requests into CloudFormation stacks.
Development teams are still provided with the same baseline CloudFormation templates (utilizing ASGs) that
have been provided in the past. However, a new tool has been written by the Automation team that takes
these baseline templates and converts them for use with Spot Fleets.
Given the recent launch of CloudFormation StackSets, we’re about to start looking at ways in which this
can be further simplified.
17. The Solution Going Forward: Deploy
1. A deployment plan has been initiated.
We start with a standard ASG-based template (ASG, ELB, Baked AMI-ID, SecurityGroups, Subnets etc).
18. The Solution Going Forward: Deploy
2. A Lambda function is used to convert this ASG template into a skeleton Spot
Fleet resource request.
Diagram: Lambda Function (Conversion tool)
19. The Solution Going Forward: Deploy
3. A Lambda function is triggered, generating a list of instance types similar to
the one provided. It also calculates appropriate bid prices for those instance
types.
Diagram: Lambda Function (Conversion tool); Lambda Function (InstanceTypes & Bids List)
20. The Solution Going Forward: Deploy
4. The provided list is pushed to a third Lambda function, which dynamically
creates the Spot Fleet template and uploads it to S3.
Diagram: Lambda Function (Conversion tool); Lambda Function (InstanceTypes & Bids List);
Lambda Function (Dynamic Spot Fleet Template)
21. The Solution Going Forward: Deploy
5. The application CloudFormation stack is created.
Diagram: Lambda Function (Conversion tool); Lambda Function (InstanceTypes & Bids List);
Lambda Function (Dynamic Spot Fleet Template); CloudFormation (Application Stack)
22. The Solution Going Forward: Stack
1. The CloudFormation stack provisions an ELB/ALB and a Spot Fleet
Request is made.
Diagram: Spot Fleet; Elastic Load Balancer
23. The Solution Going Forward: Stack
2. The Spot Fleet Request is fulfilled and Spot Instances register with their
ELB via EC2 User Data.
Diagram: Spot Fleet; Elastic Load Balancer
24. The Solution Going Forward: Stack
3. If the market price for a Spot Instance exceeds the bid price, the instance
is flagged for termination. The health check on the host is poisoned and the
instance is marked as offline by the ELB.
Diagram: Spot Fleet; Elastic Load Balancer
25. The Solution Going Forward: Stack
4. After two minutes, the Spot Instance is terminated. A scheduled CloudWatch Event
is triggered, which initiates a Lambda function that ensures that unhealthy
instances are terminated and deregisters terminated instances from the ELB.
Diagram: Spot Fleet; Elastic Load Balancer; CloudWatch Event (1 minute Scheduled Rule);
Lambda Function (TerminateEC2Instance)
26. The Solution Going Forward: Stack
5. A replacement Spot Instance is provisioned. Again, this Spot Instance
registers itself with the ELB.
Diagram: Spot Fleet; Elastic Load Balancer; CloudWatch Event (1 minute Scheduled Rule);
Lambda Function (TerminateEC2Instance)
27. The Solution Going Forward: Stack
6. If the number of pools that Spot Fleet can fulfil instances from gets low…
Diagram: Spot Fleet; Elastic Load Balancer; CloudWatch Event (1 minute Scheduled Rule);
Lambda Function (TerminateEC2Instance)
28. The Solution Going Forward: Stack
7. … a CloudWatch Alarm will trigger a Lambda function that manipulates an
On-Demand AutoScaling group, which will commence provisioning On-Demand
EC2 Instances to maintain capacity for the workload.
Diagram: Spot Fleet; On-Demand Fleet; Elastic Load Balancer;
CloudWatch Alarm (EligibleInstancePoolCount); CloudWatch Event (1 minute Scheduled Rule);
Lambda Function (TerminateEC2Instance); Lambda Function (ModifyOnDemandCapacity)
29. The Solution Going Forward: Stack
8. If a Spot Instance has not been able to be provisioned for more than 5
minutes, an On-Demand instance is also added.
Diagram: Spot Fleet; On-Demand Fleet; Elastic Load Balancer;
CloudWatch Alarm (EligibleInstancePoolCount); CloudWatch Alarm (Pending Capacity > 0 for > 5 min);
CloudWatch Event (1 minute Scheduled Rule); Lambda Function (TerminateEC2Instance);
Lambda Function (ModifyOnDemandCapacity)
30. Development Recommendations
Build Stateless Applications
If you can’t, persist state outside of the EC2 Instance using services such as DynamoDB,
RDS, Aurora, ElastiCache, EFS or S3.
Poison Application Health Checks within the 2-minute warning period
Detect when a Spot Instance is scheduled for termination and cause the ELB/ALB to
think that the workload is out of service. That server will then be removed from the pool
of Healthy servers, allowing your application to gracefully handle a termination event.
Set your bid price appropriately
A bid price that is too low will introduce volatility into your workload and reduce the
number of Spot markets that you can draw instances from.
31. Reference Material
• EC2 Spot Instances - http://aws.amazon.com/ec2/spot/
• Spot Bid Advisor - http://aws.amazon.com/ec2/spot/bid-advisor/
• Getting Started with Spot - http://aws.amazon.com/ec2/spot/getting-started/
• Spot FAQs - http://aws.amazon.com/ec2/spot/faqs/
• Spot Testimonials - http://aws.amazon.com/ec2/spot/testimonials/
• Documentation: Using Spot Instances - http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-spot-instances.html
• Documentation: Spot Fleet - http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/spot-fleet.html
Any Questions?