Infrastructure as code, automation, monitoring, disaster recovery, security, scaling and cost tracking are all easily accessible subjects, yet they are too often overlooked until it is already too late. In this session Cotap will share what AWS offers to help them stay ahead of the curve. By following 4 simple rules, they will show how Cotap's Engineering team has been able to run for the past 12 months with over four nines of availability. They deploy 3 to 5 times a day, run in 2 regions / 6 AZs, and still keep AWS costs below the monthly salary of an engineer.
24. Monitoring & Alerting
● Cost of
o Interruptions
o Waking somebody up
● Channels
● Self-healing infrastructure
● External monitoring
● Page only when critical
25. Monitoring & Alerting
Situation                        | Channel                  | Page
Disk full 60%                    | Chat, Email              | ✗
Disk full 90%                    | Chat, Email, PagerDuty   | ✓
Chef not running for > 30m       | Chat, Email              | ✗
Redis not running for > 3 x 5s   | Chat, Email, PagerDuty   | ✓
ElasticSearch N-1                | Chat, Email              | ✗
ElasticSearch N-2                | Chat, Email, PagerDuty   | ✓
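A policy like the table above can be encoded as a small routing function. This is a minimal sketch of the idea, not Cotap's actual code; the channel names and the shape of the return value are assumptions.

```python
# Sketch of an alert-routing policy in the spirit of the table above.
# Channel names and the return structure are illustrative assumptions.

def route_alert(situation: str, critical: bool) -> dict:
    """Return the channels an alert goes to; page only when critical."""
    channels = ["chat", "email"]       # every alert is visible somewhere
    if critical:
        channels.append("pagerduty")   # wake somebody up only when it matters
    return {"situation": situation, "channels": channels, "page": critical}

# Disk 60% full: visible, but nobody gets paged.
warn = route_alert("disk full 60%", critical=False)
# Disk 90% full: now it is worth the interruption.
crit = route_alert("disk full 90%", critical=True)
```

The point of the single `critical` flag is that the cost of an interruption is decided once, in one place, rather than ad hoc per alert.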
28. Platform to fail
● Easy creation of temporary “Stacks”
● Branches can get their own hardware
● Clients can talk to a branch
● QA happens on Sandbox
● Exact copy of Production
● Scale up/down based on needs
● Different Region (us-east-1)
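Giving each branch its own temporary stack starts with a deterministic name. Here is a small sketch of how branch names might map to CloudFormation-safe stack names; the naming convention is a hypothetical example, not Cotap's actual scheme.

```python
import re

def stack_name_for_branch(branch: str, env: str = "sandbox") -> str:
    """Derive a CloudFormation-safe stack name from a git branch.

    CloudFormation stack names may only contain letters, digits and
    hyphens, so everything else in the branch name is squashed.
    (The env prefix and scheme are hypothetical examples.)
    """
    slug = re.sub(r"[^A-Za-z0-9]+", "-", branch).strip("-").lower()
    return f"{env}-{slug}"

print(stack_name_for_branch("feature/new-search"))  # sandbox-feature-new-search
```

With names derived mechanically, tearing a branch stack down after merge is just a lookup, which keeps temporary hardware actually temporary.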
31. Rule #3
All changes have to go through Sandbox.
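Rule #3 can be enforced mechanically in a deploy script rather than by convention. A toy sketch of the gate, assuming a simple in-memory record of what has been deployed where (the environment names and tracking structure are assumptions):

```python
# Sketch: refuse a production deploy unless the same revision has
# already been through Sandbox. Names and structure are assumptions.

deployed = {"sandbox": set(), "production": set()}

def deploy(env: str, revision: str) -> None:
    """Record a deploy, rejecting production deploys of unseen revisions."""
    if env == "production" and revision not in deployed["sandbox"]:
        raise RuntimeError(f"{revision} was never deployed to sandbox")
    deployed[env].add(revision)

deploy("sandbox", "abc123")
deploy("production", "abc123")  # fine: it went through Sandbox first
```

In practice the "has this revision been to Sandbox" check would live in a deploy tool or CI pipeline, but the invariant is the same one sentence long.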
45. Cost Control
● Tags
o Role
o Environment
● Cost explorer
● Threshold alerting
● Share monthly
● Export to CSV
● Right-Scale (ASG)
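Once everything is tagged with Role and Environment, a monthly CSV export can be summarized with a few lines of code. A minimal sketch; the column names here ("Role", "Environment", "Cost") are assumptions about the export format and should be adjusted to match the actual file.

```python
import csv
import io
from collections import defaultdict

def cost_by_tag(csv_text: str, tag: str = "Role") -> dict:
    """Sum the Cost column grouped by a tag column.

    Column names ('Role', 'Environment', 'Cost') are assumed here;
    adjust them to match your actual Cost Explorer export.
    """
    totals = defaultdict(float)
    for row in csv.DictReader(io.StringIO(csv_text)):
        totals[row[tag]] += float(row["Cost"])
    return dict(totals)

export = """Role,Environment,Cost
web,production,120.50
web,sandbox,30.00
redis,production,45.25
"""
print(cost_by_tag(export))                      # cost per role
print(cost_by_tag(export, tag="Environment"))   # cost per environment
```

Grouping by Environment is what makes "Sandbox costs X% of Production" a number you can share monthly rather than a guess.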
46. 4 rules of 5 nines.
● All changes have to be under version control
● No instance should be launched manually
● All changes are deployed to Sandbox first
● Production is just a more powerful Sandbox
A breakthrough came in April 1913. A production engineer in the flywheel magneto assembly area tried a new way to put this component's parts together. The operation was divided into 29 separate steps. Workers placed only one part in the assembly before pushing the flywheel down the line to the next employee.
Previously, it had taken one employee about 20 minutes to assemble a flywheel magneto. Divided among 29 men, the job took 13 minutes. It was eventually trimmed to five minutes. This approach was applied gradually to the construction of the engine and other parts.
Give people a platform to fail.
Code from the assembly line goes to Sandbox.
It is reviewed, tested and used internally.
Identify problems in Sandbox early on, before pushing them out to the public.
People can actually build their own stacks (thanks to automation) and try their own stuff.
We also do a bi-weekly catastrophe scenario, a manual Chaos Monkey if you will.
There is a fine balance between productivity and keeping systems running: if you are constantly fixing your systems, you are not shipping code.
Have rules and procedures in place to deploy code that won't break your infrastructure.
If you have applied most of the previous rules, then easy scaling comes at a low cost on Amazon.
CloudFormation handles creating AutoScaling groups easily.
Here we can talk about routing traffic from one AZ to another on the day of launch, because our instances in AZ1 had the wrong MTU.
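The AutoScaling piece of a CloudFormation template boils down to a small resource block. This sketch generates one as a Python dict; the names, sizes and AZs are illustrative, and a real template also needs the referenced launch configuration and the rest of the stack.

```python
import json

def autoscaling_group(name: str, min_size: int, max_size: int,
                      azs: list) -> dict:
    """Build an AWS::AutoScaling::AutoScalingGroup resource block for a
    CloudFormation template. 'AppLaunchConfig' is a placeholder Ref to a
    launch configuration defined elsewhere in the (omitted) template.
    """
    return {
        name: {
            "Type": "AWS::AutoScaling::AutoScalingGroup",
            "Properties": {
                # CloudFormation takes MinSize/MaxSize as strings
                "MinSize": str(min_size),
                "MaxSize": str(max_size),
                "AvailabilityZones": azs,
                "LaunchConfigurationName": {"Ref": "AppLaunchConfig"},
            },
        }
    }

resource = autoscaling_group("WebAsg", 2, 10, ["us-west-1a", "us-west-1b"])
print(json.dumps(resource, indent=2))
```

Because the group spans multiple AZs, shifting capacity away from a bad zone (like the wrong-MTU AZ1 story above) is a template change, not an emergency manual procedure.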
4 types of scaling:
Preemptive: you know traffic is coming (press, a conference, etc.). How quickly can you scale up your applications?
Automatic: scaling up, and back down, when necessary.
Vertical: can you change instance type easily? With CloudFormation we are able to rotate an entire cluster to a new instance type without manual intervention.
Horizontal: add machines and grow your cluster.
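The "automatic" case above is ultimately a threshold comparison. A toy sketch of the decision, clamped to the group's bounds; the 70%/20% thresholds are illustrative assumptions, and in practice CloudWatch alarms drive the AutoScaling group rather than hand-rolled code.

```python
def desired_capacity(current: int, cpu_pct: float,
                     minimum: int = 2, maximum: int = 10) -> int:
    """Scale up when hot, back down when idle, clamped to [min, max].

    Thresholds (70% / 20%) are illustrative, not a recommendation.
    """
    if cpu_pct > 70:
        current += 1       # hot: add an instance
    elif cpu_pct < 20:
        current -= 1       # idle: shed an instance
    return max(minimum, min(maximum, current))

print(desired_capacity(4, 85.0))  # 5  (scale up)
print(desired_capacity(4, 10.0))  # 3  (scale back down)
print(desired_capacity(2, 10.0))  # 2  (never below the floor)
```

The clamp is the important part: the floor keeps you able to serve baseline traffic, and the ceiling keeps an alerting bug from scaling you into a surprise bill.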