AWS Sydney Summit 2013 - Architecting for High Availability
1. Architecting for High Availability
Joseph Ziegler
AWS Technical Evangelist @jiyosub
Guest presenter:
Alexander Courtis
Solutions Architect
SilverQuest Consulting
2. High Availability Principles
Design for reliable, affordable, fault-tolerant systems
that operate with a minimal amount of human
interaction from day one
3. Agenda
• Objective
– Review services and approaches to build a highly available architecture on AWS
• Sections
– High Availability Overview
– Relevant AWS Features and Services
– Principles in Practice
• Customer Case Study
– Carsguide
5. What is High Availability (HA)?
• Availability: Percentage of time an application operates during its work cycle.
• Loss of availability is known as an outage or downtime.
– App is offline, unreachable or partially available.
– App is slow to use.
– Planned and unplanned.
• Goal
– No downtime.
– Always available.
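The availability definition above can be turned into a downtime budget. The following is a minimal sketch, not part of the original deck: the class and method names are hypothetical, and it assumes a 365-day year with the work cycle running around the clock.

```java
import java.util.Locale;

class AvailabilityBudget {
    // Minutes in a 365-day year (assumption: leap years are ignored).
    static final double MINUTES_PER_YEAR = 365.0 * 24 * 60;

    /** Minutes of downtime per year permitted by an availability target given in percent. */
    static double downtimeMinutesPerYear(double availabilityPercent) {
        if (availabilityPercent < 0 || availabilityPercent > 100) {
            throw new IllegalArgumentException("availability must be between 0 and 100");
        }
        return MINUTES_PER_YEAR * (100.0 - availabilityPercent) / 100.0;
    }

    public static void main(String[] args) {
        // Print the yearly downtime budget for a few common targets.
        for (double target : new double[] {99.0, 99.9, 99.99}) {
            System.out.printf(Locale.ROOT, "%.2f%% -> %.1f minutes/year%n",
                    target, downtimeMinutesPerYear(target));
        }
    }
}
```

Under these assumptions a 99.9% target leaves about 525.6 minutes (roughly 8.8 hours) of downtime per year, planned and unplanned combined.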
6. HA is related to …
• Scalability
– Ability of an application to accommodate growth without changing design.
– If app cannot scale, then availability will be impacted.
– Scalability doesn’t guarantee availability.
• Fault Tolerance
– Built-in redundancy so apps can continue functioning when components fail.
– FT is crucial to HA.
• Disaster Recovery
– The process, policies and procedures related to restoring service after a catastrophic
event.
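To illustrate why built-in redundancy matters for HA, here is a rough estimate of the availability of several redundant copies of a component. This is a sketch under a strong assumption that is not from the deck: the copies fail independently, which real deployments only approximate. All names are hypothetical.

```java
class RedundancyEstimate {
    /**
     * Estimated availability (as a fraction between 0 and 1) of a group of
     * redundant copies, where the group is up if at least one copy is up.
     * Assumption: copies fail independently of each other.
     */
    static double combinedAvailability(double single, int copies) {
        if (single < 0.0 || single > 1.0) {
            throw new IllegalArgumentException("single must be between 0 and 1");
        }
        if (copies < 1) {
            throw new IllegalArgumentException("copies must be at least 1");
        }
        // The group is down only when every copy is down at the same time.
        return 1.0 - Math.pow(1.0 - single, copies);
    }

    public static void main(String[] args) {
        for (int copies = 1; copies <= 3; copies++) {
            System.out.println(copies + " copies: " + combinedAvailability(0.99, copies));
        }
    }
}
```

With these assumptions, two copies that are each 99% available come out at about 99.99% together. Correlated failures make the real figure lower.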
7. Automation
• “Everything is an API” philosophy enables automation of AWS resources.
• AWS is literally a programmable data center.
• Provisioning resources is a web service call away.
• Many different ways to automate:
– AWS CloudFormation
– Numerous SDKs: Java, .NET, Python, Ruby, PHP
– Command line tools
• Automation is one of the key differentiators between AWS and traditional
infrastructure.
• Automation assists with HA.
10. US-WEST (Oregon)
EU-WEST (Ireland)
ASIA PAC (Tokyo)
ASIA PAC (Singapore)
US-WEST (N. California)
SOUTH AMERICA (Sao Paulo)
US-EAST (Virginia)
GOV CLOUD
ASIA PAC (Sydney)
12. AWS BUILDING BLOCKS
• Inherently Highly Available and Fault Tolerant Services
– Amazon S3
– Amazon DynamoDB
– Amazon CloudFront
– Amazon Route53
– Elastic Load Balancing
– Amazon SQS
– Amazon SNS
– Amazon SES
– Amazon SWF
– …
• Highly Available with the right architecture
– Amazon EC2
– Amazon EBS
– Amazon RDS
– Amazon VPC
13. Relevant Features of AWS
• Leverage FT services whenever possible.
• Use multiple AZs
• Use abstract machine and system representations
– Build images from recipes, stacks from CloudFormation
• Implement elasticity
– Bootstrapping, load balancing, Auto Scaling, etc…
– Instance asks: “Who am I and what is my role?”
15. Principles of HA
1. DESIGN FOR FAILURE
2. MULTIPLE AVAILABILITY ZONES
3. SCALING
4. SELF-HEALING
5. LOOSE COUPLING
90. DECIDERS
COORDINATION LOGIC
1. Poll for work on a decision list
Long polling: 60 seconds
2. Evaluate workflow execution history
SWF sends full history in JSON format
3. Return decision to Amazon SWF
Usually scheduling another task
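The decider steps above can be sketched without the SWF SDK. This is a simplified illustration, not the presenters' code: the event and decision names are hypothetical, and the history is reduced to a list of event names instead of the JSON history SWF sends. The point it shows is that the decider keeps no state of its own and derives the next decision purely from the history it is handed.

```java
import java.util.Arrays;
import java.util.List;

class DeciderSketch {
    // Hypothetical decisions for a two-step workflow: validate, then notify.
    enum Decision { SCHEDULE_VALIDATE, SCHEDULE_NOTIFY, COMPLETE_WORKFLOW, FAIL_WORKFLOW }

    /** Pure function of the workflow history: same history in, same decision out. */
    static Decision decide(List<String> history) {
        if (history.contains("ValidateFailed") || history.contains("NotifyFailed")) {
            return Decision.FAIL_WORKFLOW;
        }
        if (!history.contains("ValidateCompleted")) {
            return Decision.SCHEDULE_VALIDATE;
        }
        if (!history.contains("NotifyCompleted")) {
            return Decision.SCHEDULE_NOTIFY;
        }
        return Decision.COMPLETE_WORKFLOW;
    }

    public static void main(String[] args) {
        List<String> history = Arrays.asList("WorkflowStarted", "ValidateCompleted");
        System.out.println(decide(history));
    }
}
```

Because the function holds no state, any decider instance can pick up the next decision task for the same workflow execution.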
91. Workers
COORDINATION LOGIC
1. Poll for work on a specific task list
Long polling: 60 seconds
2. Execute work, send heartbeats
SWF sends input data from deciders
3. Return success / failure
Detailed data can be provided to deciders
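The worker steps above can be sketched in the same spirit. This is an in-memory stand-in, not the SWF API: a BlockingQueue plays the role of the task list, a timed poll plays the role of long polling, and all class and method names are hypothetical. Heartbeats are left out to keep the sketch short.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.function.Function;

class WorkerLoopSketch {
    /** Outcome reported back for one task. */
    static final class Result {
        final boolean success;
        final String detail;

        Result(boolean success, String detail) {
            this.success = success;
            this.detail = detail;
        }
    }

    /**
     * Polls the task list once, waiting up to timeoutMillis for work.
     * Returns null when the poll times out with no task (an empty poll).
     */
    static Result pollAndExecute(BlockingQueue<String> taskList, long timeoutMillis,
                                 Function<String, String> activity) {
        String input;
        try {
            input = taskList.poll(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
            return null;
        }
        if (input == null) {
            return null; // nothing to do; a real worker would simply poll again
        }
        try {
            return new Result(true, activity.apply(input));
        } catch (RuntimeException e) {
            return new Result(false, String.valueOf(e.getMessage()));
        }
    }

    public static void main(String[] args) {
        BlockingQueue<String> taskList = new LinkedBlockingQueue<>();
        taskList.add("lead-42");
        Result result = pollAndExecute(taskList, 100, String::toUpperCase);
        System.out.println(result.success + " " + result.detail);
    }
}
```

A real worker would run this in a loop with the 60-second poll timeout from the slide and report each Result back to the service.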
95. NO NEW LANGUAGE TO LEARN
YOUR CODE IS YOUR WORKFLOW LANGUAGE
SWF MAINTAINS STATE
108. Alex On Software Engineering: Principle #4
• The Best Developers Are The Laziest
• Avoid Inventing Octagonal Wheels
• Work Very Hard Avoiding Future Work
– Automate Testing
– Production Requires Little To No Maintenance
• Break Into Small, Independent Chunks
109. carsguide.com.au – Lead Tracker
• Requirements
• Architecture
• Development Approach
• Technologies
113. Development
• Don’t Start With SWF
• Build Stateless, Standalone Services
• Unit / Integration Test Services
• Wrap Services As SWF Workers
• Build SWF Deciders For Repeatable Workflows
• Build A Single “Master” Decider
114. Artifacts
• 2 Artifacts
– Client JAR, used by external application servers to start the process
– Master JAR, containing SWF deciders/workers and services
• Why have a single Master JAR?
– To make bootstrapping as simple as possible: every server instance is identical, and you just select a “flavour”, i.e. Decider or Worker
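The “flavour” selection could look roughly like the sketch below. It is an illustration only: the deck does not show how the flavour is chosen, so the command-line argument, class name, and messages are all assumptions.

```java
import java.util.Locale;

class MasterJarSketch {
    // The two roles named on the slide.
    enum Flavour { DECIDER, WORKER }

    /** Parses the flavour name, ignoring case and surrounding whitespace. */
    static Flavour parseFlavour(String raw) {
        if (raw == null || raw.trim().isEmpty()) {
            throw new IllegalArgumentException("flavour is required: DECIDER or WORKER");
        }
        // valueOf throws IllegalArgumentException for an unknown name.
        return Flavour.valueOf(raw.trim().toUpperCase(Locale.ROOT));
    }

    /** Describes what an instance of the given flavour would start doing. */
    static String describe(Flavour flavour) {
        switch (flavour) {
            case DECIDER:
                return "polling for decision tasks";
            case WORKER:
                return "polling for activity tasks";
            default:
                throw new IllegalStateException("unhandled flavour: " + flavour);
        }
    }

    public static void main(String[] args) {
        if (args.length == 0) {
            System.out.println("usage: MasterJarSketch <decider|worker>");
            return;
        }
        Flavour flavour = parseFlavour(args[0]);
        System.out.println("Starting as " + flavour + ": " + describe(flavour));
    }
}
```

Keeping both roles in one artifact means the same image can boot as either, with the role supplied at launch time.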
117. Lead Persistence
• Well Structured, Fixed Schema Data
• Transactional
– Relational Database
Spring Data JPA + Amazon RDS
118. Audit Persistence
• Important
• Variable Format, Unstructured Data
• Write Often, Read Rarely
– NoSQL
– Document Data Store
Spring Data + Amazon DynamoDB
119. Invoking SWF
• SWF is invoked via a simple JSON web service call
– Roll your own
– Java SDK client
• Suit yourself
• We used the Java SDK client
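121. Worker Example
import com.amazonaws.services.simpleworkflow.flow.annotations.Activities;
import com.amazonaws.services.simpleworkflow.flow.annotations.ActivityRegistrationOptions;
import com.amazonaws.services.simpleworkflow.flow.common.FlowConstants;

// Activities contract for the AWS Flow Framework; all timeouts are in seconds.
@Activities(version = "1.0")
@ActivityRegistrationOptions(
    defaultTaskHeartbeatTimeoutSeconds = FlowConstants.NONE,
    defaultTaskScheduleToCloseTimeoutSeconds = 180,
    defaultTaskScheduleToStartTimeoutSeconds = 60,
    defaultTaskStartToCloseTimeoutSeconds = 60
)
public interface MyFancyActivities {
    /**
     * Post something that is worthy
     *
     * @param wowFancy mandatory; must be fancy
     * @return populated log indicating success or failure
     */
    FancyLog postFancy(FancyThing wowFancy);
    // ...
}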
122. Deciders
• No GUI or unmanageable “code”
• Synchronous code, using Promises
• Orchestrates workers and other decider workflows
• Executes many times
– Stateless
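123. Decider Implementation Example
import com.amazonaws.services.simpleworkflow.flow.core.Promise;

public class RogerDeciderImpl {
    ...
    @Override
    public void decide(final Stuff bigStuff) {
        // Both calls return Promises immediately, so the two tasks can run in parallel.
        Promise<StanDecision> stan = stanClient.decide(bigStuff);
        Promise<FranDecision> fran = franClient.decide(bigStuff);
        // Runs only once both stan and fran are ready.
        Promise<EarthDestroDecision> decision = rogerClient.decide(stan, fran);
        klausClient.audit(decision);
        mothershipClient.blowUp(decision);
    }
}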
124. Deployment
• EC2 instances managed via Puppet
• Apache Maven does everything from source code management to running the processes
• Is there a better way to bootstrap?
Amazon Elastic Beanstalk + pom.xml = Alex’s Amazing Elastic Mavenstalk™
HA means different things to different people, so let’s agree on some fundamental definitions. HA is also implemented differently based on app architecture and workload. Does HA mean that the app is simply alive or reachable? Or that it is servicing requests within an acceptable level of performance? Typically a higher HA percentage means more cost. The higher the level of HA, the less room there is for human intervention. http://en.wikipedia.org/wiki/High_availability
Scalability is important to availability. If an application cannot handle growth, then it will be overwhelmed and will affect availability. But a scalable app doesn’t guarantee HA.
Monitoring typically uses a combination of systems:
1) SQS buffers building up
2) Launching transcoding
3) Overprovisioning to catch up
4) Back to normal
Writing a decider requires you to review the state of the workflow. The decider itself is stateless, but SWF keeps the state and tells the decider what has happened. [Point out that a decider can return several decisions in the same call. This allows for parallel processing.] To write workers and deciders you can use the SWF SDK (provided for Java, .NET, PHP) or call the API directly, but to make this easier [CUE NEXT SLIDE]