1. AWS 201 : Breakout Track Singapore
“Design for Failure”
HA and DR Best practices
Harish Ganesan
Co-founder & CTO
8KMiles
www.twitter.com/harish11g
http://www.linkedin.com/in/harishganesan
2. Agenda
• Explain HA architecture with a real customer case
• Understand how to architect a web app in AWS with
– High Availability
– DR
– Scalability
• Why AWS?
3. About the Customer
• Online ecommerce company
• NASDAQ Listed
• Application consumed by online users, mobile and web services
4. Requirements
• High Availability on all tiers with no SPOF (single point of failure)
• Auto-scalable and elastic infrastructure
• Ability to serve millions of requests per day
• Serve peak HTTP traffic of 8000+ reqs/sec
• Serve peak HTTPS traffic of 2500+ reqs/sec
• 65% of the business is done during the holiday season, so no downtime can be afforded
• Monitoring, backup and ease of deployment
• Optimal DR setup (cost vs RTO/RPO)
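The two peak figures above can be turned into a rough fleet size. The following is a minimal sketch, not the sizing method used in the project: the per-instance throughput of 500 reqs/sec and the 30% headroom are made-up assumptions for illustration, and `instances_needed` is a hypothetical helper.

```python
import math


def instances_needed(peak_rps: float, per_instance_rps: float, headroom: float = 0.3) -> int:
    """Instances required to serve peak_rps while keeping spare headroom."""
    if per_instance_rps <= 0:
        raise ValueError("per_instance_rps must be positive")
    return math.ceil(peak_rps * (1 + headroom) / per_instance_rps)


peak_http = 8000   # reqs/sec, from the requirements slide
peak_https = 2500  # reqs/sec, from the requirements slide
total_peak = peak_http + peak_https

# ASSUMPTION: 500 reqs/sec per instance is an illustrative figure, not a measurement.
fleet = instances_needed(total_peak, per_instance_rps=500)
print(f"combined peak: {total_peak} reqs/sec, instances: {fleet}")
```

In practice the per-instance figure would come from a load test of the actual application.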
5. Technology and Tiers
• Multi-tiered Linux, Apache, Java web site on AWS
• Database tier using MySQL
• Cache tier
• Integration tier with queues and background programs
• HTTP and HTTPS protocols
6. What did 8KMiles do?
• Consulting: Architected the entire website infra on AWS
• Implementation:
– Configured the infra on AWS
– Developed custom DevOps scripts on AWS
• Support: Supported the site during the Thanksgiving and holiday season
• Cloud Development Partner:
– Currently re-engineering the customer app to leverage more AWS services
8. A simple LAMJ Architecture
[Diagram: one Availability Zone (US-EAST-1a) inside AWS Security Groups, containing a single Web/App/Cache Server, Integration Services, a MySQL DB and CloudWatch]
1 Web/App Server interacts with MySQL for Queries and Transactions
10. A simple LAMJ Architecture
[Diagram: one Availability Zone (US-EAST-1a) inside AWS Security Groups, containing a single Web/App/Cache Server, Integration Services, a MySQL DB and CloudWatch]
Single Point of Failure at multiple tiers
Not a Highly Available Architecture
11. How to avoid SPOF and build a robust architecture?
12. Step 1: Distribute the Application to Multiple Tiers
[Diagram: US-EAST-1a inside AWS Security Groups; Web/App Server, Integration Service tier and MySQL DB on separate instances; CloudWatch]
1 Separate out the individual tiers into separate EC2 instances
13. Step 2: Add Multiple Servers in each layer
[Diagram: US-EAST-1a inside AWS Security Groups; Web/App Server, Integration Service tier and MySQL DB; CloudWatch]
1 Add Multiple EC2 instances in every tier
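Why adding servers to every tier helps can be shown with basic availability arithmetic. This is a minimal sketch that assumes failures are independent and that each instance is available 99% of the time; both are illustrative assumptions, not figures from the talk. Tiers in series multiply their availabilities, while n redundant instances in one tier fail only if all n fail.

```python
from math import prod


def serial(*availabilities: float) -> float:
    """Availability of tiers that must all be up (a chain of tiers)."""
    return prod(availabilities)


def redundant(availability: float, n: int) -> float:
    """Availability of n interchangeable instances where one is enough."""
    return 1 - (1 - availability) ** n


# ASSUMPTION: 0.99 per instance, independent failures.
single = serial(0.99, 0.99, 0.99)          # one server per tier, three tiers
doubled = serial(*[redundant(0.99, 2)] * 3)  # two servers per tier
print(f"one per tier: {single:.6f}, two per tier: {doubled:.6f}")
```

Real failures are often correlated (same AZ, same deployment), which is why the later slides spread the tiers across Availability Zones.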
16. Why AWS ELB?
• AWS ELB provides a load balancing service that can have thousands of EC2 servers behind it
• AWS ELB automatically scales the load balancing servers in the backend up and down
• In theory, AWS ELB has no fixed upper limit on its response rate
• It can handle 20000+ concurrent requests easily (RightScale benchmark)
• AWS ELB works seamlessly with AWS Auto Scaling
17. Why AWS ELB ?
• AWS ELB is well integrated with other AWS services
• No maintenance
• Pay as you go
18. Load balancing Layer
[Diagram: Online / Web / Mobile clients -> AWS Elastic Load balancer -> Web/App Server and MySQL DB in US-EAST-1a, inside AWS Security Groups]
1 Simple Round Robin Algorithm
2 Health Checks, SSL termination
3 ELB is a Highly Available Service with No SPOF
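The combination of round robin and health checks can be illustrated with a toy balancer. This is a sketch of the general technique only, not of how ELB is implemented internally; `RoundRobinBalancer` and the backend names are hypothetical.

```python
class RoundRobinBalancer:
    """Rotates over backends in order, skipping ones marked unhealthy."""

    def __init__(self, backends):
        self._backends = list(backends)
        self._healthy = set(self._backends)
        self._next = 0

    def mark(self, backend, healthy):
        """Record the outcome of a health check for one backend."""
        if healthy:
            self._healthy.add(backend)
        else:
            self._healthy.discard(backend)

    def pick(self):
        """Return the next healthy backend in rotation."""
        count = len(self._backends)
        for _ in range(count):
            backend = self._backends[self._next]
            self._next = (self._next + 1) % count
            if backend in self._healthy:
                return backend
        raise RuntimeError("no healthy backends")


lb = RoundRobinBalancer(["web-1", "web-2", "web-3"])
print([lb.pick() for _ in range(4)])
lb.mark("web-2", False)  # a failed health check takes web-2 out of rotation
print([lb.pick() for _ in range(3)])
```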
20. High Availability @ Web/App tier
[Diagram: AWS Elastic Load balancer in front of Auto Scaling Web/App Servers in US-EAST-1a, inside AWS Security Groups; S3 and Puppet; Integration Service Tier; MySQL DB]
1 Add AWS Auto Scaling to Web / App tier
2 Tie AWS Auto Scaling with AWS ELB
3 Deploy the app using Puppet
21. Designing HA @ Web/App Tier
• AWS Auto Scaling will detect and replace unhealthy EC2 instances
• AWS Auto Scaling will ensure that a minimum number of Web/App EC2 instances is always running
• In the event of a failure, new instances will be launched automatically within 30-120 seconds
• ELB traffic is seamlessly routed to the Auto Scaled EC2 instances
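The "keep a minimum number of healthy instances" behaviour can be modelled as a reconcile step. This is an in-memory simulation under stated assumptions, not the Auto Scaling API: `Instance`, `launch` and `reconcile` are hypothetical names, and a real launch takes time rather than returning immediately.

```python
from dataclasses import dataclass
from itertools import count


@dataclass
class Instance:
    instance_id: str
    healthy: bool = True


_ids = count(1)


def launch() -> Instance:
    """Stand-in for launching a fresh instance from the group's image."""
    return Instance(f"i-{next(_ids):04d}")


def reconcile(group, min_size):
    """Drop unhealthy instances, then launch replacements up to min_size."""
    survivors = [inst for inst in group if inst.healthy]
    while len(survivors) < min_size:
        survivors.append(launch())
    return survivors


group = [launch(), launch(), launch()]
group[1].healthy = False  # simulate a failed health check
group = reconcile(group, min_size=3)
print([inst.instance_id for inst in group])
```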
22. Designing HA @ Web/App Tier
• Deploy the application / patches in the Auto Scaling environment using Puppet / S3 scripts
• Choose the right EC2 instance type
– Large (less CPU intensive, HEAP 5.5 GB RAM)
– High CPU Extra Large (more CPU intensive, HEAP 5.5 GB RAM, Concurrent GC)
• Points to remember
– Do not store the session in the memory of the web/app server
– Rotate and move the log files to S3 periodically
– Move the uploaded data files and images to S3 or GlusterFS
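The first point (no in-memory sessions) matters because the load balancer may send consecutive requests from one user to different servers, and Auto Scaling may terminate any server. A minimal sketch follows; `SessionStore` is a hypothetical stand-in for a shared store such as the cache tier or database, backed here by a plain dict so the example runs on its own.

```python
class SessionStore:
    """Stand-in for a shared session store that lives outside the web servers."""

    def __init__(self):
        self._data = {}

    def save(self, session_id, session):
        self._data[session_id] = dict(session)

    def load(self, session_id):
        return dict(self._data.get(session_id, {}))


class WebServer:
    """Keeps no session state of its own; every request goes to the shared store."""

    def __init__(self, name, store):
        self.name = name
        self.store = store

    def add_to_cart(self, session_id, item):
        session = self.store.load(session_id)
        cart = session.get("cart", []) + [item]
        self.store.save(session_id, {"cart": cart})
        return cart


store = SessionStore()
server_a = WebServer("web-a", store)
server_b = WebServer("web-b", store)
server_a.add_to_cart("session-1", "book")
# The next request lands on a different server and still sees the cart.
print(server_b.add_to_cart("session-1", "pen"))
```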
23. What happens when US-EAST-1a AZ fails ?
Solution : Leverage AWS Multi-AZ architecture
25. [Diagram: HTTP/S requests from the browser or mobile devices hit the AWS Elastic Load balancer, which fronts Auto Scaling groups of Web/App EC2 instances in AZ US-EAST-1a and AZ US-EAST-1b, inside AWS Security Groups]
1 Infrastructure is spread across Multiple AZs of AWS inside a Region
2 AWS Elastic Load balancer directs requests to EC2 instances across Multiple AZs
3 Amazon AutoScaling automatically launches new EC2 instances across Multiple AZs
4 No Code Changes required to leverage Multi-AZ
26. High Availability @ Web/App/DEX layer
• AZs are connected by a low-latency network
• AZs are insulated from failures in other Availability Zones *
• AWS Auto Scaling can manage EC2 instances across AZs
• AWS ELB can direct load to EC2 instances across AZs
• AWS CloudWatch can monitor the EC2 instance availability across AZs
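Spreading instances across AZs only protects the site if the surviving AZs can carry the full load. A minimal sketch of that arithmetic, assuming instances are spread evenly and all have equal capacity; `spread`, `surviving_capacity` and `size_for_az_loss` are hypothetical helpers.

```python
import math


def spread(total, zones):
    """Distribute `total` instances across zones as evenly as possible."""
    base, extra = divmod(total, len(zones))
    return {zone: base + (1 if i < extra else 0) for i, zone in enumerate(zones)}


def surviving_capacity(placement, failed_zone):
    """Fraction of the fleet still running after one zone is lost."""
    total = sum(placement.values())
    return (total - placement[failed_zone]) / total


def size_for_az_loss(needed, zone_count):
    """Fleet size such that losing any one zone still leaves `needed` instances."""
    if zone_count < 2:
        raise ValueError("need at least two zones to survive an AZ failure")
    per_zone = math.ceil(needed / (zone_count - 1))
    return per_zone * zone_count


placement = spread(10, ["us-east-1a", "us-east-1b"])
print(placement, surviving_capacity(placement, "us-east-1a"))
print(size_for_az_loss(10, 2), size_for_az_loss(10, 3))
```

With two AZs the fleet has to be doubled to survive an AZ loss at full capacity; with three AZs the overhead drops to 50%.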
28. Database Tier
• Options
– MySQL Master-Slave replication
– MySQL NDB Cluster
– RDS MySQL Master – Standby
– RDS MySQL Master – Standby + Read Replicas
29. High Availability @ DB Layer
[Diagram: AWS Elastic Load Balancer in front of Auto Scaling Web/App EC2 instances in US-EAST-1a and US-EAST-1b, inside AWS Security groups; RDS Master, RDS Standby and two Read Replicas; S3; CloudWatch]
1 Read Replicas launched in Multiple AZs for HA
2 RDS Standby will be launched on a different AZ from the RDS master for HA
3 Web/App hosted on Amazon EC2 will transact with the RDS master and read from the Read Replicas
30. High Availability @ DB Layer
• RDS Master and RDS Standby in multiple AZs for HA
• Read Replicas in multiple AZs for HA
• Offers no SPOF at the AZ level
• Read Replicas can be launched/terminated without affecting the RDS Master availability
• In the event of an RDS master failure, the RDS Standby will be promoted automatically
• Promotion takes <180 seconds and needs no changes in the application
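Even with no configuration change, the application will see connection errors while the standby is being promoted. One common way to ride out that window is to retry with a capped backoff. The sketch below is an assumption about how an application might do this, not something described in the talk; `call_with_failover_retry` is a hypothetical helper and the 180-second budget simply mirrors the promotion time quoted above.

```python
import time


def call_with_failover_retry(operation, max_wait=180.0, base_delay=1.0,
                             max_delay=15.0, sleep=time.sleep, clock=time.monotonic):
    """Retry `operation` on ConnectionError until it succeeds or max_wait elapses."""
    deadline = clock() + max_wait
    delay = base_delay
    while True:
        try:
            return operation()
        except ConnectionError:
            if clock() + delay > deadline:
                raise  # the failover budget is used up; surface the error
            sleep(delay)
            delay = min(delay * 2, max_delay)


# Simulated database that is unreachable for the first 60 "seconds".
now = [0.0]


def fake_sleep(seconds):
    now[0] += seconds


def query():
    if now[0] < 60:
        raise ConnectionError("master is failing over")
    return "row"


print(call_with_failover_retry(query, sleep=fake_sleep, clock=lambda: now[0]))
```

The `sleep` and `clock` parameters exist only so the example can run instantly with a simulated clock.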
31. High Availability @ DB Layer
• DB snapshots and MySQL dump facilities are available
• Automatic full backups at configured backup windows
• Point-in-time recovery up to the last minute
• Recovery might require app-layer configuration changes
32. High Availability @ DB Layer
• Points to remember
– RDS supports only the MySQL InnoDB engine
– Give more memory to the RDS Master
  • Use Extra Large or High Memory instance types
– Keep your Read Replicas and RDS Master the same size
– Multiple Read Replicas can be load balanced using HAProxy
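The split between "transact with the master" and "read from the replicas" can be sketched as a small router. This is an application-level illustration under simplifying assumptions, not the HAProxy setup used in the project: `QueryRouter` and the endpoint names are hypothetical, and classifying a statement by its first keyword is deliberately naive.

```python
import itertools


class QueryRouter:
    """Sends plain SELECTs to read replicas in rotation, everything else to the master."""

    def __init__(self, master, replicas):
        self.master = master
        self._replicas = itertools.cycle(replicas) if replicas else None

    def route(self, sql):
        words = sql.split(None, 1)
        keyword = words[0].upper() if words else ""
        if keyword == "SELECT" and self._replicas is not None:
            return next(self._replicas)
        return self.master


router = QueryRouter("rds-master", ["replica-1", "replica-2"])
print(router.route("SELECT * FROM orders"))
print(router.route("UPDATE orders SET status = 'paid' WHERE id = 7"))
```

Because replicas are updated asynchronously, a read that must see a write made a moment ago (or a `SELECT ... FOR UPDATE`) should still go to the master.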
34. Use AWS Building blocks
• AWS building blocks come with built-in
– Inherent fault tolerance
– HA and scalability
• The following building blocks were used
– S3, CloudFront, Route 53, CloudWatch, SNS, SQS, SES, ELB, EIP, EBS
35. Application Architecture in AWS
[Diagram: Browser / Web Services / Mobile -> Route 53 -> AWS CloudFront CDN and Elastic Load balancer -> Auto Scaling Amazon EC2 Servers in AZ US-EAST-1a and AZ US-EAST-1b, inside AWS Security Groups; ElastiCache; DB Master and DB Standby with Read Slave 1 and Read Slave 2; S3; Puppet; SQS; AWS Simple Email Service; AWS Simple Notification Service (Alerts); CloudWatch]
36. How is it used in the project?
• ELB – Load balancing
• Route 53 – DNS mappings, round-robin (RR) algorithm
• CloudFront – Assets, HTML, CSS, JS, images
• S3 – Logs, snapshots, images
• CloudWatch – Monitoring of CPU, ELB, RDS and custom metrics
• SNS – System alerts
• SES – Emails (password, activation, app alerts)
• EBS – EBS-backed AMI for the Web/App tier
• EIP – Elastic IP for the Puppet server
37. What happens if the entire AWS region is affected?
Solution : Design HA/DR across Regions
38. High Availability across AWS Regions
[Diagram: Main web site is hosted in the AWS Singapore region; DR web site is hosted in AWS Tokyo]
39. DR / HA Options in AWS
Option       Downtime       Cost
Cold DR      > Few hours    $
Warm DR      > 1-2 hours    $$
Hot DR       In minutes     $$$
Hot Active   No downtime    $$$$
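The trade-off on this slide is cost against downtime: pick the cheapest option whose expected downtime still fits the RTO target. A minimal sketch of that selection follows. The numeric downtime values are assumptions chosen to match the slide's wording ("in minutes" taken as 15, "1-2 hours" as 120, "few hours" as 480), and `cheapest_option` is a hypothetical helper.

```python
from typing import NamedTuple


class DrOption(NamedTuple):
    name: str
    downtime_minutes: int  # ASSUMPTION: illustrative values, see lead-in
    cost_rank: int         # 1 = "$" ... 4 = "$$$$"


DR_OPTIONS = [
    DrOption("Cold DR", 480, 1),
    DrOption("Warm DR", 120, 2),
    DrOption("Hot DR", 15, 3),
    DrOption("Hot Active", 0, 4),
]


def cheapest_option(rto_target_minutes: float) -> str:
    """Return the lowest-cost option whose downtime fits the RTO target."""
    if rto_target_minutes < 0:
        raise ValueError("RTO target cannot be negative")
    fitting = [o for o in DR_OPTIONS if o.downtime_minutes <= rto_target_minutes]
    return min(fitting, key=lambda o: o.cost_rank).name


for target in (0, 30, 180, 24 * 60):
    print(target, "->", cheapest_option(target))
```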
40. Cold DR
[Diagram: Amazon Route 53 in front of two sites. Active site in AWS Singapore and passive site in AWS Tokyo, each with ELB, Web/App EC2 instances and a Database Layer (Master, Standby); Puppet; a Sync link between the sites]
Sync DB Snapshots / Dumps every X hours
41. Cold DR
• When the primary is down, the entire secondary site is manually activated
• RTO: more than a few hours to get the secondary site up and running
• RPO: data loss is acceptable
• CloudFormation templates can be configured on the primary and secondary sites
• AMIs, app and DB data are synced periodically
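With periodic syncs, the worst-case data loss is bounded by how old the newest copy at the secondary site can be. A minimal sketch of that arithmetic, assuming snapshots are taken at a fixed interval and each takes a fixed time to reach the secondary region; the 6-hour interval and 1-hour transfer time are illustrative values, not figures from the talk.

```python
def worst_case_rpo_hours(sync_interval_hours: float, transfer_hours: float) -> float:
    """Upper bound on data loss with periodic snapshot shipping.

    Just before a new snapshot finishes arriving, the newest usable copy
    was taken one full interval plus one transfer time ago.
    """
    if sync_interval_hours <= 0 or transfer_hours < 0:
        raise ValueError("interval must be positive and transfer time non-negative")
    return sync_interval_hours + transfer_hours


# ASSUMPTION: snapshots every 6 hours, 1 hour to copy across regions.
print(worst_case_rpo_hours(6, 1), "hours of data at risk in the worst case")
```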
42. Cold DR
• EIP problem – Integration Services (FTP, Web Services)
• Cost effective
• Most common
43. Warm DR
[Diagram: Amazon Route 53 in front of two sites. Active site in AWS Singapore and passive site in AWS Tokyo, each with ELB, Web/App EC2 instances, Database Layer (Master, Standby) and Puppet; a Sync link between the sites]
Asynchronous Replication of databases between AWS regions
44. Warm DR
• When the primary is down, the secondary site is manually activated
• RTO: more than 1 hour to get the secondary site up and running
• RPO: minimal data loss is acceptable
• CloudFormation templates can be configured on the primary and secondary sites
• DB data is replicated asynchronously
• Only the DB and Puppet servers are ready and running
45. Warm DR
• AMIs, application patches and deployments are managed through Puppet
• EIP problem – Integration Services (FTP, Web Services)
• Costlier than Cold DR
• Recommended in many use cases
46. Hot DR
[Diagram: Amazon Route 53 in front of two sites. Active site in AWS Singapore and passive site in AWS Tokyo, each with ELB, Web/App EC2 instances, Database Layer (Master, Standby) and Puppet; a Sync link between the sites]
Asynchronous Replication of databases between AWS regions
47. Hot DR
• When the primary is down, the secondary site is activated
• RTO: a few minutes to get the secondary site up and running
• RPO: very minimal data loss is acceptable
• CloudFormation templates can be configured on the primary and secondary sites
• All the tiers are ready and running in the secondary site but not active with live transactions
48. Hot DR
• DB data is replicated asynchronously
• AMIs, application patches and deployments are managed through Puppet
• EIP problem – Integration Services (FTP, Web Services)
• Costlier than Warm DR
• Rare usage
49. Hot Active
[Diagram: Directional DNS / Traffic through Amazon Route 53 in front of two active sites, AWS Singapore and AWS Tokyo, each with ELB, Web/App EC2 instances, Database Layer (Master, Standby) and Puppet; a Sync link between the sites]
2-way Asynchronous Replication of databases between AWS regions
50. Hot Active-Active
• Both the primary and secondary sites are active
• RTO: a few seconds to direct the traffic from the primary to the secondary site
• RPO: negligible data loss
• A managed DNS server provides automatic failover at the DNS level in case of an outage at the primary website location
• Transparent switch between the websites hosted in AWS Singapore and AWS Tokyo within 30-60 seconds during an outage
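A DNS-level switch time like the one quoted above is roughly the time to detect the outage plus the time for cached DNS answers to expire. The sketch below shows that arithmetic; the health-check interval, failure threshold and TTL are illustrative assumptions, not settings from the talk, and `dns_failover_seconds` is a hypothetical helper.

```python
def dns_failover_seconds(check_interval: float, failure_threshold: int, dns_ttl: float) -> float:
    """Approximate worst-case time before clients reach the healthy site.

    Detection takes `failure_threshold` consecutive failed checks, after
    which clients may keep a cached answer for up to one TTL.
    """
    return check_interval * failure_threshold + dns_ttl


# ASSUMPTION: checks every 10 s, 3 failures to declare an outage, 30 s TTL.
print(dns_failover_seconds(check_interval=10, failure_threshold=3, dns_ttl=30), "seconds")
```

Some resolvers and clients cache longer than the TTL allows, so the real switch can take longer for a fraction of users.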
51. Hot Active-Active
• Automatic traffic diversion to the nearest site location
• Managed/Directional DNS servers are a globally distributed and highly available service
• Persistent data is replicated asynchronously (2-way)
• AMIs, application patches and deployments are managed through distributed Puppet
• EIP problem – Integration Services (FTP, Web Services)
• Use case specific
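Directional routing with failover amounts to "send each user to the nearest site that is currently healthy". A minimal sketch of that decision follows; it is not the API of any managed DNS product, and `pick_site`, the site names and the latency numbers are hypothetical.

```python
def pick_site(latency_ms, healthy):
    """Return the healthy site with the lowest latency for this user."""
    candidates = {site: ms for site, ms in latency_ms.items() if healthy.get(site, False)}
    if not candidates:
        raise RuntimeError("no healthy site available")
    return min(candidates, key=candidates.get)


# ASSUMPTION: illustrative latencies as seen by one user.
latency = {"singapore": 20, "tokyo": 90}
print(pick_site(latency, {"singapore": True, "tokyo": True}))   # nearest site
print(pick_site(latency, {"singapore": False, "tokyo": True}))  # failover
```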
52. Hot Active-Active
• Websites deployed in both regions can scale and shrink according to load
• Cost effective for large server farm deployments
• Low latency achieved through traffic direction
• No customers are lost because of load or availability problems. Ops are happy!
53. Hot Active-Active
• Technically complex and intricate setup
• Costlier to build and operate (sophistication comes at a cost)
• No unified infra management currently exists for this architecture; separate consoles are needed, for example:
– Directional DNS console
– AWS console
– Puppet console
54. Summary
• Understood how to architect HA on AWS for the LAMJ website case
• Understood AWS building blocks for HA and fault tolerance
• How to achieve High Availability across AWS Availability Zones (AZs)
• How to achieve High Availability across AWS regions
55. Need help architecting High Availability solutions on AWS?
56. Leave it to the experts, we will handle this
Cloud Architecture Consulting
Cloud Application Development
Cloud Migration & Implementation
Cloud Adoption Strategy
“Let's get the job done”
57. Q&A
“All you need is an idea and the cloud will execute it for you.” (Structure 2010 event)
- Dr Werner Vogels , CTO of Amazon on 8KMiles
Contact :
cloud@8KMiles.com
harish@8KMiles.com
www.twitter.com/harish11g
http://www.linkedin.com/in/harishganesan