6. Cloud Computing Benefits
No Up-Front Capital Expense
Low Cost
Pay Only for What You Use
Self-Service Infrastructure
Easily Scale Up and Down
Improve Agility & Time-to-Market
7. Cloud Computing Fault-Tolerance Benefits
No Up-Front HA Capital Expense
Low Cost Backups
Pay for DR Only When You Use it
Self-Service DR Infrastructure
Easily Deliver Fault-Tolerant Applications
Improve Agility & Time-to-Recovery
8. AWS Cloud allows Overcast Redundancy
Have the shadow duplicate of your infrastructure ready to go when you need it…
…but only pay for what you actually use
9. Old Barriers to HA are now Surmountable
Cost
Complexity
Expertise
10. AWS Building Blocks: Two Strategies
Inherently fault-tolerant services:
S3
SimpleDB
DynamoDB
CloudFront
SWF, SQS, SNS, SES
Route53
Elastic Load Balancer
Elastic Beanstalk
ElastiCache
Elastic MapReduce
IAM
Services that are fault-tolerant with the right architecture:
Amazon EC2
VPC
EBS
RDS
11. Resources
The Stack:
Deployment
Management
Configuration
Networking
Facilities
Geographies
12. The Stack:
EC2 Instances
Amazon Machine Images
CW Alarms - AutoScaling
CloudFormation - Beanstalk
Route53 – ElasticIP – ELB
Availability Zones
Regions
13. Regional Diversity
Use Regions for:
Latency
• Customers
• Data Vendors
• Staff
Compliance
Disaster Recovery
… and Fault Tolerance!
31. Storage Gateway
[Diagram: On-premises application servers in your datacenter, backed by direct-attached or Storage Area Network disks, connect to an AWS Storage Gateway VM running on an on-premises host. The gateway uploads data over SSL, via the Internet or Direct Connect, to Amazon Simple Storage Service (S3), where it can be restored as Amazon Elastic Block Storage (EBS) volumes for Amazon Elastic Compute Cloud (EC2) instances.]
32. Test! Use a Chaos Monkey!
Prudent
Conservative
Professional
Open source
…and all the cool kids are doing it
http://techblog.netflix.com/2010/12/5-lessons-weve-learned-using-aws.html
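In the spirit of Netflix's Chaos Monkey — this is only a sketch of the idea, not Netflix's tool, and the Auto Scaling group name is illustrative — a few lines of boto3 are enough to terminate a random instance and watch whether the system shrugs it off:

```python
import random
import boto3

autoscaling = boto3.client("autoscaling", region_name="us-east-1")
ec2 = boto3.client("ec2", region_name="us-east-1")

# Pick a random instance from the group and kill it.
groups = autoscaling.describe_auto_scaling_groups(
    AutoScalingGroupNames=["web-asg"]  # illustrative group name
)
instances = groups["AutoScalingGroups"][0]["Instances"]
victim = random.choice(instances)["InstanceId"]

ec2.terminate_instances(InstanceIds=[victim])
print(f"Chaos Monkey terminated {victim}; the ASG should replace it.")
```

If the application keeps serving traffic while the group launches a replacement, the design passed the test; if not, you found a weakness on your schedule rather than at 3 a.m.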
We are going to talk today about building fault-tolerant systems, and more specifically look at how AWS enables the cost-effective and scalable design of these systems in ways that simply cannot be done otherwise.
So what types of faults are we trying to survive? If we stop and think about most applications, there is really a wide array of ways they can fail. The facilities themselves could have failures ranging from something extremely catastrophic, like the building catching fire, to something as simple as a power outage. Inside the facilities we are relying on a number of systems to be operating: a network stack with routers, switches, and firewalls, as well as servers and storage devices, all of which very much have the ability to fail, either through hardware failures or configuration errors. And all of that is before we even get to the code for your applications and the people that manage it, both of which are also potential areas where failures can occur.
So what does fault tolerant mean? First, it's important to point out this isn't an absolute. There isn't a magic easy button for this, nor is there a one-size-fits-all approach to building applications that can survive every possible failure. Generally speaking, there are costs associated with mitigating the risk of different types of failures, as well as likelihoods of those failures occurring, so the design of these applications becomes an exercise in risk mitigation. Take hard drives, for example: the risk of a hard drive failing is quite high compared to an entire datacenter being destroyed; luckily, the cost of mitigating against a failed hard drive is also far lower than building duplicate datacenters. The second bullet up there is very important: given that people, or human error, is probably the most common cause of application failures, to be truly fault tolerant applications must leverage automation in the case of failure. This not only makes recovery much faster but also assures it happens in a known and controlled manner. And lastly, if you don't test your design, you won't know if it works.
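As a minimal sketch of what "automation in the case of failure" can look like (the instance ID and AMI below are hypothetical placeholders), a watchdog might poll EC2 status checks and launch a known-good replacement instead of waiting for a human:

```python
import time
import boto3

# Hypothetical IDs; substitute your own instance and AMI.
INSTANCE_ID = "i-0123456789abcdef0"
REPLACEMENT_AMI = "ami-0123456789abcdef0"

ec2 = boto3.client("ec2", region_name="us-east-1")

def instance_impaired(instance_id):
    """Return True if EC2 status checks report the instance as impaired."""
    resp = ec2.describe_instance_status(
        InstanceIds=[instance_id], IncludeAllInstances=True
    )
    statuses = resp["InstanceStatuses"]
    if not statuses:
        return True  # not found or not running
    return statuses[0]["InstanceStatus"]["Status"] == "impaired"

while True:
    if instance_impaired(INSTANCE_ID):
        # Recover automatically: terminate the bad instance and launch
        # a replacement from a pre-built AMI.
        ec2.terminate_instances(InstanceIds=[INSTANCE_ID])
        ec2.run_instances(
            ImageId=REPLACEMENT_AMI, InstanceType="t3.micro",
            MinCount=1, MaxCount=1,
        )
        break
    time.sleep(60)  # poll once a minute
```

In practice an Auto Scaling group gives you this behavior out of the box; the point is that recovery is codified, not manual.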
So here's how we used to implement fault tolerance. It was really simple: build two of everything. Now there are some significant problems with this approach. The obvious one is cost, since your application just got 100% more expensive, and here in Brazil, which already has much higher server hardware costs, that can make the cost of mitigating against many types of failures impractical for many applications. So what ends up happening is, again going back to the risk mitigation idea, someone has to look at the cost of purchasing, maintaining, and operating a second instance of the application and decide if it's worth the cost.
I'm sure people are familiar with a lot of the commonly discussed benefits of cloud computing, from removing up-front capital costs to time-to-market and agility, and in the area of fault tolerance these benefits all translate very well.
The up-front capital cost of adding a second server or mirroring storage is gone for HA. Backups are far simpler to use and extremely cost effective, and today, with our release of Glacier, which will revolutionize the way businesses back up and archive data, that's never been more true. From a DR perspective, you can stage infrastructure and only launch and pay for it when it's actually needed, versus paying 24/7 for infrastructure you hope to never use. Services that are part of AWS greatly simplify making your applications highly available and fault tolerant, often at a much lower cost.
With DR this becomes very evident. Think of how often you actually use your DR site; hopefully you're thinking of a really small number. Now think of how you're paying for it. With AWS we have massive amounts of infrastructure in 8 different regions around the world, and the ability to stage and programmatically deploy infrastructure to any of those regions. So you can stage your DR site and have it ready to spring into action if needed, but only pay for what you're actually using.
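A minimal sketch of that staging pattern (the region names and AMI ID are illustrative): copy your golden AMI to a standby region ahead of time, then launch instances there only when you actually fail over:

```python
import boto3

PRIMARY_REGION = "sa-east-1"   # illustrative choices
DR_REGION = "us-east-1"
GOLDEN_AMI = "ami-0123456789abcdef0"  # hypothetical AMI in the primary region

# Stage: copy the AMI to the DR region ahead of time (cheap: you pay
# only for snapshot storage, not for running instances).
dr_ec2 = boto3.client("ec2", region_name=DR_REGION)
copy = dr_ec2.copy_image(
    SourceRegion=PRIMARY_REGION,
    SourceImageId=GOLDEN_AMI,
    Name="dr-standby-web",
)
dr_ami = copy["ImageId"]

def fail_over(count=4):
    """Only at failover time do we launch (and start paying for) instances."""
    return dr_ec2.run_instances(
        ImageId=dr_ami,
        InstanceType="t3.micro",
        MinCount=count,
        MaxCount=count,
    )
```

Until `fail_over` runs, the DR site costs you a few snapshots, not a second datacenter.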
The next evolution beyond DR is HA, and HA has traditionally had a number of barriers that limited the applications that could be deployed in an HA manner, the first being cost, but also complexity. Now, this is different than DR, where something is broken and you need some method of getting it back online, for example by using a second location. With HA you want components to be able to fail while the system still operates normally, because there are multiple servers that can perform that function, or multiple online replicas of the data. In the traditional datacenter this can be very difficult, complex, and costly, but with AWS we have built HA services that you can leverage, which not only bends the cost curve but also makes it extremely simple to do.
As you can see here, many of the services we provide are inherently fault-tolerant; we've done all the work to create them in a fashion that is resilient to failure and highly durable, so you don't have to. So now, if you need a fault-tolerant NoSQL DB, you don't have to worry about how to architect that; you can simply use DynamoDB. With the right design, and by leveraging the services we provide that are inherently fault-tolerant, you can focus on building your application rather than the infrastructure. Some of the services you see on the right are fault tolerant with the right architecture, and what we mean by that is we give you options on how you're going to architect and deploy those services. RDS with MySQL, for example, is fault-tolerant when Multi-AZ deployments are used, since it will be replicating the data to multiple datacenters.
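To illustrate the "inherently fault-tolerant" side (the table and key names are made up for the example), creating a DynamoDB table is all it takes; replication across multiple facilities is handled by the service:

```python
import boto3

dynamodb = boto3.client("dynamodb", region_name="us-east-1")

# One API call; durability and replication across multiple facilities
# are the service's problem, not yours.
dynamodb.create_table(
    TableName="sessions",  # illustrative name
    AttributeDefinitions=[
        {"AttributeName": "session_id", "AttributeType": "S"},
    ],
    KeySchema=[
        {"AttributeName": "session_id", "KeyType": "HASH"},
    ],
    ProvisionedThroughput={"ReadCapacityUnits": 5, "WriteCapacityUnits": 5},
)
```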
So we know there are opportunities for failure at every layer of the stack, from disasters that affect entire geographies or individual buildings, all the way up to the server your application is running on. Now let's see how this translates in AWS and look at the services we provide that deliver fault tolerance.
At AWS we’ve built fault tolerant systems at every level of that stack.
Fault Separation Amazon EC2 provides customers the flexibility to place instances within multiple geographic regions as well as across multiple Availability Zones. Each Availability Zone is designed with fault separation. This means that Availability Zones are physically separated within a typical metropolitan region, on different flood plains, in seismically stable areas. In addition to discrete uninterruptible power supply (UPS) and onsite backup generation facilities, they are each fed via different grids from independent utilities to further reduce single points of failure. They are all redundantly connected to multiple tier-1 transit providers. It should be noted that although traffic flowing across the private networks between Availability Zones in a single region is on AWS-controlled infrastructure, all communications between regions is across public Internet infrastructure, so appropriate encryption methods should be used to protect sensitive data. Data are not replicated between regions unless proactively done so by the customer.
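A small sketch of using that fault separation (the AMI ID is a placeholder): explicitly spread instances across two Availability Zones in one region, so the loss of a single facility doesn't take the whole tier down:

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")
AMI = "ami-0123456789abcdef0"  # placeholder

# Launch one instance in each of two Availability Zones.
for az in ("us-east-1a", "us-east-1b"):
    ec2.run_instances(
        ImageId=AMI,
        InstanceType="t3.micro",
        MinCount=1,
        MaxCount=1,
        Placement={"AvailabilityZone": az},
    )
```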
Distinct physical locations
Low-latency network connections between AZs
Independent power, cooling, network, security
Always partition app stacks across 2 or more AZs
Elastic Load Balance across instances in multiple AZs (see the sketch below)
Don't confuse AZs with Regions!
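A minimal sketch of that last point (the load balancer name, ports, and health-check path are illustrative), using the classic Elastic Load Balancer that was current at the time of this talk:

```python
import boto3

elb = boto3.client("elb", region_name="us-east-1")

# A classic ELB that balances across instances in two AZs; if one
# zone fails, traffic shifts to the healthy zone automatically.
elb.create_load_balancer(
    LoadBalancerName="web-elb",  # illustrative
    Listeners=[{
        "Protocol": "HTTP",
        "LoadBalancerPort": 80,
        "InstancePort": 80,
    }],
    AvailabilityZones=["us-east-1a", "us-east-1b"],
)

# Health checks so the ELB stops routing to failed instances.
elb.configure_health_check(
    LoadBalancerName="web-elb",
    HealthCheck={
        "Target": "HTTP:80/health",
        "Interval": 30,
        "Timeout": 5,
        "UnhealthyThreshold": 2,
        "HealthyThreshold": 2,
    },
)
```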
Note: the question is not "do you need to automate your deployment?" or "should I use automation when I'm using the cloud?" The answer to that is YES! The question is: if you're using fully standard PHP or Java stacks, why manage it? Beanstalk does that great, with zero lock-in. If what you need is more complex, perhaps CloudFormation (note: you can do BOTH!)
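As a sketch of the CloudFormation route (the stack name and single-bucket template are toy examples), the whole deployment becomes one repeatable API call:

```python
import json
import boto3

cfn = boto3.client("cloudformation", region_name="us-east-1")

# A deliberately tiny template; real stacks would declare the full
# app (ELB, Auto Scaling, RDS, alarms) in the same declarative way.
template = {
    "AWSTemplateFormatVersion": "2010-09-09",
    "Resources": {
        "BackupBucket": {"Type": "AWS::S3::Bucket"}
    },
}

cfn.create_stack(
    StackName="demo-stack",  # illustrative name
    TemplateBody=json.dumps(template),
)
```

Because the stack is described in a template rather than built by hand, recreating it in another region for DR is the same one call with a different client.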
Three-Tier Web App has been "fork-lifted" to the cloud
Everything in a single Availability Zone
Load balanced at the Web tier and App tier using software load balancers
Master and Standby database
Elastic IP on front end load balancer only
S3 used as DB backup instead of tape
How can you use AWS features to make this app more highly available?
Avoid single points of failure
Assume everything fails, and design backwards
Goal: Applications should continue to function even if the underlying physical hardware fails or is removed or replaced.
Design your recovery process
Trade off business needs vs. cost of high-availability
Multiple DNS Targets
Load Balanced across Availability Zones
Auto-scaled web-cache servers with health checks
Auto-scaled web-servers with health checks (see the sketch below)
Comprehensive config, data, and AMI backup
Monitoring, alarming and logging
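A minimal sketch of the auto-scaled, health-checked web tier (names, AMI, and sizes are illustrative; "web-elb" is the load balancer from the earlier sketch):

```python
import boto3

autoscaling = boto3.client("autoscaling", region_name="us-east-1")

# What to launch when an instance needs replacing or scaling out.
autoscaling.create_launch_configuration(
    LaunchConfigurationName="web-lc",   # illustrative
    ImageId="ami-0123456789abcdef0",    # placeholder AMI
    InstanceType="t3.micro",
)

# Spread the group across two AZs and use ELB health checks, so
# failed instances are terminated and replaced automatically.
autoscaling.create_auto_scaling_group(
    AutoScalingGroupName="web-asg",
    LaunchConfigurationName="web-lc",
    MinSize=2,
    MaxSize=8,
    AvailabilityZones=["us-east-1a", "us-east-1b"],
    LoadBalancerNames=["web-elb"],
    HealthCheckType="ELB",
    HealthCheckGracePeriod=300,
)
```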
DB-Tier Load Balancing or Queueing
Auto-scaled Database cache servers with health checks
Redundant Relational Database systems (see the read-replica sketch below)
• Mirrored, log-shipped, async or sync replicated
• Designed to scale horizontally (sharding)
Durable NoSQL or KV-store Data Systems
• No SPOF design
• Supports automatic re-balancing, replication, and fault-recovery
Monitoring, alarming and logging
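One way to get that relational redundancy on AWS (the instance identifiers are made up) is an RDS read replica, which uses asynchronous replication; this is a swapped-in concrete example of the "async replicated" bullet, not the only option:

```python
import boto3

rds = boto3.client("rds", region_name="us-east-1")

# Asynchronously replicated copy of the primary; offloads reads and
# can be promoted if the source is lost.
rds.create_db_instance_read_replica(
    DBInstanceIdentifier="appdb-replica",        # illustrative
    SourceDBInstanceIdentifier="appdb-primary",  # illustrative
)
```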
Multi-AZ Deployments (see the sketch below)
• Synchronous replication across AZs
• Automatic fail-over to standby replica
Automated Backups
• Enables point-in-time recovery of the DB instance
• Retention period configurable
Snapshots
• User initiated full backup of DB
• New DB can be created from snapshots
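Tying these together in a sketch (all identifiers and credentials are placeholders): create a Multi-AZ MySQL instance with automated backups enabled, take a user-initiated snapshot, and create a new DB from it:

```python
import boto3

rds = boto3.client("rds", region_name="us-east-1")

# Multi-AZ: synchronous standby in another AZ with automatic failover.
# BackupRetentionPeriod > 0 turns on automated backups and
# point-in-time recovery.
rds.create_db_instance(
    DBInstanceIdentifier="appdb-primary",   # illustrative
    Engine="mysql",
    DBInstanceClass="db.m5.large",
    AllocatedStorage=100,
    MasterUsername="admin",
    MasterUserPassword="change-me-please",  # placeholder
    MultiAZ=True,
    BackupRetentionPeriod=7,                # days of point-in-time recovery
)

# User-initiated snapshot, and a new DB created from it.
rds.create_db_snapshot(
    DBSnapshotIdentifier="appdb-before-upgrade",
    DBInstanceIdentifier="appdb-primary",
)
rds.restore_db_instance_from_db_snapshot(
    DBInstanceIdentifier="appdb-restored",
    DBSnapshotIdentifier="appdb-before-upgrade",
)
```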