Slide: AWS Purchase Models
As shown by the previous slide, it is possible to launch significant amounts of compute power for a low cost. Customers have several purchase models available when using Amazon EC2.
- Cover the three pricing models on the slide
On-Demand is the easiest way to get started with AWS: no commitment, pay as you go.
Reserved instances provide a significant discount in exchange for a commitment to use the services for some period of time, either 1 or 3 years. Reserved instances also come with an actual capacity reservation, which can be important for large enterprises who need a high level of assurance that computing resources will be available when they are needed.
Spot instances are a unique and powerful pricing model, in particular for HPC. With Spot, customers can bid on unused AWS capacity and are often able to launch instances on the cloud for as little as 10% of the equivalent On-Demand rate. The tradeoff is that if other customers are willing to pay more than you for the same instance type, or capacity of that type becomes constrained, your running instances may be terminated with only a short warning. Jobs running on Spot therefore need to be fault-tolerant, or able to be restarted at a later time.
What spare capacity looks like at scale.
AWS has more than a million active customers in 190 countries.
Amazon EC2 instance usage has increased 93% YoY, comparing Q4 2014 and Q4 2013, not including Amazon use.
Amazon S3 holds trillions of objects and regularly peaks at millions of requests per second.
So with EC2 Spot the rules are actually really simple.
Rule 1: The Spot market is where the price of compute fluctuates based on supply and demand.
Rule 2: You’ll never pay more than your bid; in fact, you’ll only ever pay the market price. When the market price exceeds your bid, you get 2 minutes to wrap up.
Market price is on average 85% lower than On-Demand prices
What is in a market? This is one of the most important, and unfortunately most misunderstood, elements of how the Spot market works. While we say “the Spot market,” there are actually hundreds of Spot markets available to all our customers. AWS has 11 (?) regions around the world; in each region there are multiple Availability Zones, and multiple instance families with multiple instance sizes per family. (START CLICK THROUGH and READ) E.g. c3; e.g. large, xlarge, 8xlarge; e.g. us-west-2a, us-west-2b; e.g. Dublin Region, Oregon Region, Sydney Region.
Now that we understand what a Spot market is, and that there are many, I’ll explain how we acquire the capacity. I’m going to pick just one market to highlight this. There are two numbers you care about with Spot.
Bid price. Think of this as the cap: the maximum you’re willing to pay for a given instance per hour.
Market price. This is the price you actually pay. The market price is set by periodic auctions.
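The two rules can be sketched in a few lines of shell. This is my own illustration, not an AWS API; the function names are hypothetical. It shows that while the market price stays at or below your bid you run and are charged the market price, and the moment the market exceeds your bid you get the two-minute warning.

```shell
# Sketch of the Spot billing rules (names are mine, not an AWS API):
# you keep running while market <= bid, and you pay the market price,
# never the bid itself.
hourly_outcome() {
  bid=$1; market=$2
  if awk -v b="$bid" -v m="$market" 'BEGIN { exit !(m <= b) }'; then
    echo "$market"        # charged the market price
  else
    echo "interrupted"    # market exceeded the bid: two-minute warning
  fi
}

hourly_outcome 0.35 0.21   # → 0.21 (you pay market, not your 35c bid)
hourly_outcome 0.10 0.21   # → interrupted
```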
The r3.4xlarge costs $1.40 per hour under our On-Demand purchasing option.
See it in action via 3 bids. 25%, 50%, 75%. Single Zone.
At 25%, you kept your instance for almost 7 days, being interrupted during a few short periods. However, you only paid the market price, which averaged 86% off: just under 20c per hour during the last week, only 14% of the On-Demand price.
At 50%, you would have been interrupted just once, for a very short period during the sixth day. Your average discount during the week is 85%: just 21c per hour, paying just 15% of On-Demand.
At 75%, you would not have been interrupted once, achieving an average discount of 85%: again 21c an hour, again paying just 15% of On-Demand.
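The discount arithmetic in these scenarios is easy to verify. A quick helper (my own, for illustration) against the r3.4xlarge On-Demand rate of $1.40/hour:

```shell
# Check the scenario arithmetic: discount = (1 - spot/on-demand) * 100.
discount_pct() {
  awk -v od="$1" -v spot="$2" 'BEGIN { printf "%.0f", (1 - spot/od) * 100 }'
}

discount_pct 1.40 0.21    # → 85   (21c/hour vs $1.40 On-Demand)
discount_pct 1.40 0.196   # → 86   (just under 20c/hour)
```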
1st - Check out the Spot Bid Advisor, which we launched earlier this year to guide customers in finding the resources, discounts, and instance lifetimes they need.
The Bid Advisor has helped many new customers discover what some already knew: that with deliberate instance pool selection, it can be straightforward to begin using Spot.
This is a snapshot I took from the tool last week, and it shows that even at a 50% max bid there are many different Spot markets that would have gone uninterrupted for over a week, while achieving an average discount of 80-90%!
Now you might be thinking: wouldn't it be great if I could automate using all the pools that suit my application? Let's not get ahead of ourselves. First we need to understand: what is a Spot market?
We will first run through the ‘best practices’ for EC2 Spot. While these are not strictly necessary, they’re what the most sophisticated customers do to get high performance, high availability, and low costs.
Standard practice
Stateless
Fault tolerant
Multi-AZ
SOA/Loosely coupled design
Spot Practice
Be instance flexible
This can mean c3.large, c3.xlarge,..r3.large
Or m3.large, r3.large, c3.large (ELB)
No, seriously: your application can work with other instance types (use an example; drive this message home hard).
You use c3.xlarge and you can’t use c3.2xlarge at ALL? Really? Really? Even if we give you 70% off for twice the c3.xlarge specs?
Lyft: saved $15K per month with 4 lines of code. After using Spot in CI/CD, Lyft recognized the stability of the platform, and the opportunity arose to leverage it as part of their Hadoop stack (run by Qubole). They’ve since been able to shift more than a third of their Qubole-managed Hadoop cluster onto EC2 Spot, saving even further.
Brookhaven Labs: the ATLAS experiment needed instances to live for as much as 24 hours in order to add value, since some software simply cannot checkpoint. They needed the equivalent of 50,000 physical cores to meet the demand for resources from 1,500 scientific researchers. It takes a trillion proton collisions in the collider to produce evidence of a single Higgs boson particle’s decay. Over 5 days, less than 1% of instances were terminated, leaving them with a significant margin of safety. Instead of building a 50,000-core data center, they were able to successfully use AWS Spot for 5 days and pay just $45,000.
ATLAS - The experiment is designed to take advantage of the unprecedented energy available at the LHC and observe phenomena that involve highly massive particles which were not observable using earlier lower-energy accelerators. It is hoped that it will shed light on new theories of particle physics beyond the Standard Model.
Spot is a powerful economic reward for fault-tolerant, cloud-first architectures. How powerful? Some examples:
Novartis: 39 years of drug research re-processed, using over 80,000 cores, in 9 hours for $4,232.
Lyft: saved $15K per month with 4 lines of code
Adroll: Petabyte-Scale Data Pipelines with Docker, Luigi and Elastic Spot Instances
Hopefully many of you have come across the EC2 Spot fleet API. This one weird API makes it easy to:
Launch 1, 2, or 3,000 Spot instances with one API call
You can select whether you’d like to put your capacity into the single cheapest market,
Or opt to diversify to minimize the impact of any individual Spot market
Finally, by introducing weights you can now scale based on the metric that matters most to you. It might be cores, memory, instances, latency... It’s your call.
Why does it have to be different from a normal ASG? As we’ve discussed, there are multiple independent markets available in Spot, and these markets are NOT correlated. Customers have long followed a diversification strategy for time-sensitive, mission-critical workloads. With fleet you can scale; with Spot fleet we’ve made it easy. E.g., if we can use the instance types above across 2 Availability Zones, we know that any one price fluctuation will only impact 1/8 of our capacity, or 12.5%. Much like an index fund.
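A diversified, weighted fleet request looks roughly like this. This is a sketch of a `SpotFleetRequestConfigData` document as passed to `aws ec2 request-spot-fleet --spot-fleet-request-config file://config.json`; the IAM role ARN, AMI ID, and subnet IDs are placeholders. The weights here assume we are scaling on vCPUs (4 for r3.xlarge, 8 for r3.2xlarge), so a target capacity of 32 means 32 cores however the fleet chooses to fill them.

```json
{
  "SpotPrice": "0.70",
  "TargetCapacity": 32,
  "AllocationStrategy": "diversified",
  "IamFleetRole": "arn:aws:iam::123456789012:role/my-spot-fleet-role",
  "LaunchSpecifications": [
    {
      "ImageId": "ami-12345678",
      "InstanceType": "r3.xlarge",
      "SubnetId": "subnet-aaaa1111",
      "WeightedCapacity": 4
    },
    {
      "ImageId": "ami-12345678",
      "InstanceType": "r3.2xlarge",
      "SubnetId": "subnet-bbbb2222",
      "WeightedCapacity": 8
    }
  ]
}
```

With `"AllocationStrategy": "diversified"`, the fleet spreads capacity evenly across the listed pools instead of piling into the single cheapest one, which is exactly the index-fund behavior described above.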
Just 5 months ago we launched fleet, and we have continued the AWS trend of rapid innovation based on customer feedback. We’ve launched 4 major features for Spot fleet over the last 5 months, and we’re nowhere near finished. We’ve also made it easy to use!
We’re already architected the application to be resilient to instance termination. However, while we might have minimize the impact of an instance termination we can use the two minute warning to take it a step further. As I mentioned we can capitalize on the two minute warning by detaching it from an ELB set to drain connections. To do that we recommend checking the instance meta data regularly, about every 5 seconds, for the two minute warning.. Then.
Here is a simple sample of what some customers will back into their AMI, or bootstrap actions. This small script checks for a instance termination notice (404 will be returned if you aren’t in the two minute warning) then detaches itself from the ELB if that two minute warning is active.
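A minimal sketch of such a script, assuming a Classic ELB with connection draining enabled; the load balancer name `my-elb` is a placeholder. It uses the real instance metadata endpoint for the termination notice, which answers 404 until the two-minute warning is active.

```shell
#!/bin/bash
# Poll instance metadata for the Spot two-minute warning, then drain.
METADATA="http://169.254.169.254/latest/meta-data"

# spot/termination-time answers 200 only once the warning is active;
# before that the metadata service returns 404.
warning_active() {
  [ "$1" = "200" ]
}

drain_on_warning() {
  while true; do
    code=$(curl -s -o /dev/null -w '%{http_code}' \
      "$METADATA/spot/termination-time")
    if warning_active "$code"; then
      instance_id=$(curl -s "$METADATA/instance-id")
      # Deregister from the ELB; with connection draining enabled the
      # ELB stops sending new requests and lets in-flight ones finish.
      aws elb deregister-instances-from-load-balancer \
        --load-balancer-name my-elb --instances "$instance_id"
      break
    fi
    sleep 5   # check roughly every 5 seconds, per the recommendation above
  done
}
```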
First we will do a rapid review of Best practices, then step you through the new ways to leverage Spot for Hadoop, web applications and batch processing.
Batch has long been in the wheelhouse for Spot usage. Customers have been using Spot for:
Monte Carlo simulations in risk analytics for insurance and finserv (Ufora)
Molecular modeling (Novartis)
Media rendering Animation and FX rendering, and batch image processing pipeline (FinDesign)
High energy simulations (Brookhaven)
They’ve found it valuable for accelerating processing and results: running simulations that are otherwise cost-prohibitive, training algorithms at the lowest possible price, and achieving the scale they need. For example, an engineer running electromagnetic simulations could run larger numbers of parametric sweeps than would otherwise be practical by using very large numbers of Amazon EC2 Spot Instances (and/or On-Demand instances) and using automation to launch independent, parallel simulation jobs.
There are numerous batch-oriented applications in place today that can leverage this style of on-demand processing, including claims processing, large-scale transformation, media processing, and multi-part data processing. There are many different approaches to batch processing architecture, so while the components here are certainly useful as a guide, they are not the only way. At a high level, however, there are some common methods using AWS services.
1,000 vCores, at an average saving of 80% off On-Demand. While some capacity fluctuated, we had our desired capacity of 1,000 for over 98% of the time. During the 30 days we were never more than 4%, or 40 cores, below our desired capacity, while maintaining an average of 1,012 cores.
Instances used: c3.2xlarge, c3.4xlarge, c3.8xlarge, cc2.8xlarge, cr1.8xlarge, r3.2xlarge, r3.4xlarge, r3.8xlarge, in all AZs.
I think most in the audience will be familiar with this simplified web architecture. However, we’re using fleet here, so this group of nodes has different characteristics than, for example, an ASG.
If I requested 50 instances, with fleet enabled to run c3.large, m3.large, m4.large, c4.large, and r3.large across two Availability Zones:
Capacity never dropped below 45 and stayed pretty close to 50. In fact it was running 50 instances over 99.9% of the time over 30 days.
Average price per core is $0.0098, 85% off On-Demand.
Distributing capacity across a large number of pools provides great results even without any active management strategies. This is a case where you might normally use On-Demand; here it is set-it-and-forget-it.
Over the last 90 days, how many instances would I have had at any point in time, and at what price?
Some additional considerations I’ll cover briefly.
Options for shifting state off web/app servers
Load balancing a fault tolerant application with ELB
Capitalizing on the Two Minute Warning
I mentioned Novartis at the beginning, who back in 2013 ran a project that involved virtually screening 10 million compounds against a common cancer target in less than a week. They calculated that it would take 50,000 cores and close to a $40 million investment if they wanted to run the experiment internally. Partnering with Cycle Computing and Amazon Web Services (AWS), Novartis built a platform leveraging Amazon Simple Storage Service (Amazon S3), Amazon Elastic Block Store (Amazon EBS), and four Availability Zones. The project ran across 10,600 Spot Instances (approximately 87,000 compute cores) and allowed Novartis to conduct 39 years of computational chemistry in 9 hours for a cost of $4,232. Out of the 10 million compounds screened, three were successfully identified.
Schrodinger, in their quest to apply computational chemistry to better solar power, stood up a 156,314-core cluster. The estimated computation time to process the 205,000 organic compounds was 264 years, but it was completed in 18 hours. They achieved 1.21 petaFLOPS (Rpeak) for just $33,000, or 16¢ per molecule.