"Architecting for Failure in AWS" by Jos Boumans, VP of Operations, Krux Digital.
Presentation Overview: Krux is an infrastructure provider for many of the websites you use online today, like NYTimes.com, WSJ.com, Wikia and NBCU. For every request to those properties, Krux receives one or more requests as well. We grew from zero traffic to several billion requests per day in the span of two years, and we did so exclusively in AWS. As anyone using AWS will tell you, there are good parts and there are bad ones. This is the story of the pitfalls we encountered and how, through architecture, convention and common sense, we built an infrastructure that is "Always Up" from the end user's perspective and incredibly economical to build, scale & operate.
Speaker Bio: Jos is the VP of Operations at Krux, supporting a platform with over 4 billion requests per day with a tiny Ops team. Every bit of the AWS stack is automated, monitored & graphed, with maximized resilience and minimized cost. In a previous life he ran the Ubuntu Server group at Canonical and the Database group at RIPE, which is responsible for all the authoritative IP address data in Europe, the Middle East & Asia. Jos is a regular speaker at conferences like OSCON, Devoxx, PuppetConf, etc., where he mostly speaks on dealing with AWS Operations from all angles.
16. AWS OUTAGE = YOUR OUTAGE
http://it.mario.wikia.com/wiki/Lakitu
Thursday 22 August 13
17. RESILIENCE @ SCALE
Embrace Failure: Hardware will fail. Humans will make errors.
Nature will produce thunderstorms.
http://blabitcanada.com/category/twitter-2/
18. DEFINE 'AVAILABLE'
Things will break, so choose your degraded state.
http://libcom.org/library/occupied-wall-street-some-tactical-thoughts-malcolm-harris
19. BASIC API CALL
3 potential points of failure
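The slide does not name the three points, so the sketch below simply treats them as three components that must all be up for the call to succeed. It assumes they fail independently, and the 99.9% figures are illustrative, not numbers from the talk.

```python
from math import prod


def serial_availability(component_availabilities):
    """Availability of a call that needs every listed component to be up.

    Assumes the components fail independently of each other
    (an assumption made for this sketch, not a claim from the talk).
    """
    return prod(component_availabilities)


# Hypothetical figures: three components, each available 99.9% of the time.
combined = serial_availability([0.999, 0.999, 0.999])
print(f"combined availability: {combined:.4%}")
```

Under these assumptions the call as a whole is available roughly 99.7% of the time, which is worse than any single component. Each extra dependency in the request path lowers the ceiling.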
20. FALLBACK PATTERNS
The cost of resilience should be accuracy or latency
http://redis.io/
http://memcached.org/
http://varnish-cache.org/
27. MANY SMALL NODES VERSUS A FEW LARGER NODES
The benefits of the many outweigh the benefits of the few
http://www.stealingfaith.com/2012/07/08/throw-off-the-tiny-ropes/
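A quick way to see the benefit of many small nodes is to compute how much capacity survives a single node failure. The sketch below assumes equally sized nodes and evenly spread load; the cluster sizes are illustrative.

```python
def capacity_after_failure(node_count, failed=1):
    """Fraction of total capacity left after `failed` equally sized nodes are lost."""
    if node_count <= 0 or not 0 <= failed <= node_count:
        raise ValueError("need node_count > 0 and 0 <= failed <= node_count")
    return (node_count - failed) / node_count


# Same total capacity, split across different numbers of nodes.
for nodes in (2, 4, 20):
    remaining = capacity_after_failure(nodes)
    print(f"{nodes:>2} nodes, 1 lost: {remaining:.0%} capacity left")
```

Losing one of 2 nodes halves capacity, while losing one of 20 leaves 95%.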
28. DATABASES
CAP Theorem applies.
Your choice: sacrifice availability or consistency. Orange is a lie.
RDBMS
BigTable Based
Master / Slave based
CouchDB
Dynamo Based
http://ferd.ca/beating-the-cap-theorem-checklist.html
29. SIMPLE STORAGE SERVICE
S3: Arguably AWS' best feature
http://www.iwallpaper.us/gold-star-fo-christmas-wallpaper-140/
http://aws.amazon.com/s3/
https://forums.aws.amazon.com/message.jspa?messageID=182919#182919
30. CACHE WHAT YOU CAN
HTTP Responses, DB Queries, User content
Browsers have caches too!
http://cruncht.com/95/drupal-caching/
http://redis.io/
http://memcached.org/
http://varnish-cache.org/
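To make use of browser caches, a response has to say how long it may be reused. A minimal sketch of building the standard `Cache-Control` response header follows. The helper name `cache_headers` and the five-minute lifetime are assumptions for illustration.

```python
def cache_headers(max_age_seconds, shared=True):
    """Build response headers that allow a response to be reused from cache.

    shared=True marks the response "public" (browsers and shared caches such
    as Varnish may store it); shared=False marks it "private" (browser only).
    """
    if max_age_seconds < 0:
        raise ValueError("max_age_seconds must be non-negative")
    scope = "public" if shared else "private"
    return {"Cache-Control": f"{scope}, max-age={int(max_age_seconds)}"}


# Hypothetical policy: static content may be reused for five minutes.
print(cache_headers(300))
# Per-user content: only the user's own browser should keep a copy.
print(cache_headers(60, shared=False))
```

A request that is answered from the browser cache never reaches the infrastructure, so it cannot fail there.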
31. CLIENT SIDE STORAGE
Keep a copy of your users' data locally
http://www.w3.org/2001/tag/2010/09/ClientSideStorage.html
http://www.wired.com/gadgetlab/2012/03/badass-gadget-ammo-lunch-box/
32. USE ELASTIC LOAD BALANCERS
They will save you more than once
http://wallpapers5.com/wallpaper/Balance-Green-Tree-Frog/
33. USE GLOBAL LOAD BALANCING
Fail over to the closest data center on region failure
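The failover rule on this slide can be sketched as a selection function: among the regions that are currently healthy, pick the one closest to the client. In practice this decision is made by the global load balancing or DNS provider; the function below only illustrates the logic. The distance table is made up, and `pick_region` is a name invented for this example.

```python
# Hypothetical relative distances between regions (smaller is closer).
DISTANCE = {
    "us-east-1": {"us-east-1": 0, "us-west-2": 70, "eu-west-1": 80},
    "us-west-2": {"us-west-2": 0, "us-east-1": 70, "eu-west-1": 140},
    "eu-west-1": {"eu-west-1": 0, "us-east-1": 80, "us-west-2": 140},
}


def pick_region(client_region, healthy_regions, distance=DISTANCE):
    """Return the healthy region closest to the client, or None if all are down."""
    candidates = [r for r in distance[client_region] if r in healthy_regions]
    if not candidates:
        return None
    return min(candidates, key=lambda r: distance[client_region][r])


all_up = {"us-east-1", "us-west-2", "eu-west-1"}
print(pick_region("us-east-1", all_up))                  # normal operation
print(pick_region("us-east-1", all_up - {"us-east-1"}))  # home region is down
```

With every region healthy, a client is served from its own region. When that region fails, traffic moves to the next closest one rather than to an arbitrary survivor.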
34. SHOUT OUT: DYN
DNS for Bit.ly, Quora, Twitter, Wikia, Fastly, etc.
http://dyn.com
35. USE IAM ROLES FOR ACCESS
Humans make mistakes, including your humans
36. COST @ SCALE
Scaling without breaking the bank
http://mgx.com/blogs/wp-content/uploads/2013/07/piggybank.jpg
37. EMR + SPOT INSTANCES
On demand rate: $0.165 / hour
http://aws.amazon.com/ec2/spot-instances/
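The slide gives only the on-demand rate, so the comparison below takes the spot rate as an input. The spot rate, cluster size and job length in the example are hypothetical, and the sketch ignores any additional EMR service fee.

```python
ON_DEMAND_RATE = 0.165  # USD per instance-hour, the rate quoted on the slide


def cluster_cost(rate_per_hour, instances, hours):
    """Instance cost of running `instances` nodes for `hours` at a flat hourly rate."""
    return rate_per_hour * instances * hours


def spot_savings(spot_rate, instances, hours, on_demand_rate=ON_DEMAND_RATE):
    """Absolute saving of running the same job on spot instead of on demand."""
    return (cluster_cost(on_demand_rate, instances, hours)
            - cluster_cost(spot_rate, instances, hours))


# Hypothetical job: 20 nodes for 10 hours at an assumed spot rate of $0.02/hour.
saving = spot_savings(spot_rate=0.02, instances=20, hours=10)
print(f"saved ${saving:.2f} on a ${cluster_cost(ON_DEMAND_RATE, 20, 10):.2f} on-demand bill")
```

Spot capacity can be reclaimed at any time, so this trade works best for batch jobs that can be retried.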
38. AMAZON REDSHIFT
Economical Business Intelligence
Scales with data size
http://www.flitemedia.com/music.php
http://aws.amazon.com/redshift
http://www.tableausoftware.com/
39. AMAZON GLACIER
"Tapes for the Cloud Era"
Writes vastly cheaper than reads
http://aws.amazon.com/glacier/
http://www.gorp.com/parks-guide/glacier-national-park-outdoor-pp2-guide-cid350021.html
40. AWS SIMPLE EMAIL SERVICE
Dealing with email is boring and time consuming
http://aws.amazon.com/ses/
http://bfsdaniels.copycop.com/blog/all-about-printing/hypertargeting-with-direct-mail/
41. AWS SIMPLE QUEUE SERVICE
Excellent for latency-insensitive, low-volume queues
http://www.toledoblade.com/Retail/2013/01/13/Disney-s-magic-bracelet-new-key-to-its-kingdom.html
http://aws.amazon.com/sqs/
http://colby.id.au/benchmarking-sqs
43. AWS DYNAMO DB
Excellent for small keys & high read rates
at known & consistent IOPS
http://hlbike.en.ecplaza.net/2.jpg
http://aws.amazon.com/dynamodb/
44. MAXIMIZE IOPS
RAID 0 Ephemeral drives
Use m1.xlarge or c1.xlarge, or use SSDs if you need >20k IOPS
http://calculator.s3.amazonaws.com/calc5.html
http://blog.scalyr.com/2012/10/16/a-systematic-look-at-ec2-io/
http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/InstanceStorage.html#disk-performance
45. RED FLAGS
Anti-patterns to watch out for
http://grandprix247.com/2012/09/03/spa-pile-up-renews-focus-on-formula-1-safety-matters/
46. PROVISIONED IOPS EBS
Ephemeral storage on c1/m1.xlarge or SSD is better
If you must: m*large or c1.xlarge for dedicated NIC
http://www.slideshare.net/AmazonWebServices/ebs-mongo-dbwebinarfinal-nn
http://techblog.netflix.com/2012/07/benchmarking-high-performance-io-with.html
http://navidoo.ru/interest/Nasha_jizn/17676.html
47. AWS DYNAMO DB
A poor fit for high write rates or large/variable keys
http://aws.amazon.com/dynamodb/
http://www.walltowall.co.uk/program/standing-tall-worlds-tallest-people_93.aspx
48. HIGH IO/DISK/RAM NODES
Use them deliberately
http://elledecoration.co.za/2010/07/gigantic/
49. AWS CLOUDWATCH
Metric collection, Amazon style
Cost prohibitive & resolution too low
http://www.flickr.com/photos/65683080@N08/6893582132/
http://aws.amazon.com/cloudwatch/
50. LOWER COST PER METRIC
Use graphite & statsd
http://graphite.wikidot.com/
https://github.com/etsy/statsd
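Part of what keeps the cost per metric low is how cheap it is to emit one: statsd accepts short plain-text lines of the form `name:value|type` over UDP. The sketch below formats and sends such a line using only the standard library. The metric names are made up, and the host and port are assumptions (8125 is the port statsd listens on by default).

```python
import socket


def format_metric(name, value, metric_type="c"):
    """Format one metric in the statsd line protocol: name:value|type.

    Common types are "c" (counter), "ms" (timer) and "g" (gauge).
    """
    return f"{name}:{value}|{metric_type}"


def send_metric(name, value, metric_type="c", host="127.0.0.1", port=8125):
    """Fire-and-forget one metric over UDP.

    UDP gives no delivery guarantee, so a lost metric is simply dropped
    and the caller is never blocked waiting on the metrics pipeline.
    """
    payload = format_metric(name, value, metric_type).encode("ascii")
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))


# Hypothetical metric names, shown as the lines that would go on the wire.
print(format_metric("api.requests", 1))            # count one request
print(format_metric("api.latency", 42, "ms"))      # record a 42 ms timing
```

Calling `send_metric("api.requests", 1)` from application code would then increment a counter that statsd aggregates and flushes to graphite.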
51. HOSTED ALTERNATIVES
Circonus: All the insights you ever wanted
StackDriver: Optimized for AWS
http://circonus.com
http://stackdriver.com
52. AWS CLOUDFORMATION
Templatize your entire stack
Harder to use as complexity increases
http://aws.amazon.com/cloudwatch/
http://fullnfenil7.blogspot.com/2012/05/amazing-cloud-shapes-photos.html#.UhKrZmRgZHg
53. RDS FOR ANALYTICS/REPORTS
Paying OLTP prices for BI usage
Sharding will be a matter of time
http://nerds.airbnb.com/redshift-performance-cost
http://business901.com/blog1/understanding-your-customer-problem/