Five Years of EC2 Distilled
1. Five years of EC2
distilled
Grig Gheorghiu
Silicon Valley Cloud Computing Meetup, Feb. 19th 2013
@griggheo
agiletesting.blogspot.com
2. whoami
• Dir of Technology at Reliam (managed
hosting)
• Sr Sys Architect at OpenX
• VP Technical Ops at Evite
• VP Technical Ops at Nasty Gal
3. EC2 creds
• Started with personal m1.small instance in
2008
• Still around!
• UPTIME:
• 5:13:52 up 438 days, 23:33, 1 user, load average:
0.03, 0.09, 0.08
4. EC2 at OpenX
• end of 2008
• 100s then 1000s of instances
• one of largest AWS customers at the time
• NAMING is very important
• terminated DB server by mistake
• in ideal world naming doesn’t matter
5. EC2 at OpenX (cont.)
• Failures are very frequent at scale
• Forced to architect for failure and
horizontal scaling
• Hard to scale at all layers at the same time
(scaling app server layer can overwhelm DB
layer; play whack-a-mole)
• Elasticity: easier to scale out than scale back
6. EC2 at OpenX (cont.)
• Automation and configuration management
become critical
• Used little-known tool - ‘slack’
• Rolled own EC2 management tool in
Python, wrapped around EC2 Java API
• Testing deployments is critical (one
mistake can get propagated everywhere)
7. EC2 at OpenX (cont.)
• Hard to scale at the DB layer (MySQL)
• mysql-proxy for r/w split
• slaves behind HAProxy for reads
• HAProxy for LB, then ELB
• ELB melted initially, had to be gradually
warmed up
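The r/w split above was done in the proxy layer (mysql-proxy, HAProxy), not in application code, but the routing logic it performs can be sketched in a few lines of Python. Class and host names here are hypothetical:

```python
from itertools import cycle

class ReadWriteRouter:
    """Route writes to the master and reads round-robin across slaves.

    A minimal sketch of what mysql-proxy did for the r/w split;
    the host names are placeholders, not from the talk.
    """

    def __init__(self, master, slaves):
        self.master = master
        self._slaves = cycle(slaves)

    def route(self, sql):
        # SELECTs go to a slave; anything else (INSERT/UPDATE/DDL) to the master.
        if sql.lstrip().upper().startswith("SELECT"):
            return next(self._slaves)
        return self.master

router = ReadWriteRouter("db-master", ["db-slave1", "db-slave2"])
print(router.route("SELECT * FROM users"))      # goes to a slave
print(router.route("UPDATE users SET ..."))     # goes to the master
```

In practice the slaves also sat behind HAProxy health checks, so a dead slave dropped out of the rotation automatically.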
8. EC2 at Evite
• Sharded MySQL at DB layer; application
very write-intensive
• Didn’t do proper capacity planning/dark
launching; had to move quickly from data
center to EC2 to scale horizontally
• Engaged Percona at the same time
9. EC2 at Evite (cont.)
• Started with EBS volumes (separate for
data, transaction logs, temp files)
• EBS horror stories
• CPU Wait up to 100%, instances AWOL
• I/O very inconsistent, unpredictable
• Striped EBS volumes in RAID0 helps with
performance but not with reliability
10. EC2 at Evite (cont.)
• EBS apocalypse in April 2011
• Hit us even with masters and slaves in diff.
availability zones (but all in single region -
mistake!)
• IMPORTANT: rebuilding redundancy into your
system is HARD
• For DB servers, reloading data on new server is
a lengthy process
11. EC2 at Evite (cont.)
• General operation: very frequent failures
(once a week); nightmare for pager duty
• Got very good at disaster recovery!
• Failover of master to slave
• Rebuilding of slave from master (xtrabackup)
• Local disks striped in RAID0 better than
EBS
12. EC2 at Evite (cont.)
• Ended up moving DB servers back to data
center
• Bare metal (Dell C2100, 144 GB RAM,
RAID10); 2 MySQL instances per server
• Lots of tuning help from Percona
• BUT: EC2 was great for capacity planning!
(Zynga does the same)
13. EC2 at Evite (cont.)
• Relational databases are not ready for the
cloud (reliability, I/O performance)
• Still keep MySQL slaves in EC2 for DR
• Ryan Mack (Facebook): “We chose well-understood
technologies so we could better predict capacity
needs and rely on our existing monitoring and
operational tool kits.”
14. EC2 at Evite (cont.)
• Didn’t use provisioned IOPS for EBS
• Didn’t use VPC
• Great experience with Elastic Map Reduce,
S3, Route 53 DNS
• Not so great experience with DynamoDB
• ELB OK but still need HAProxy behind it
15. EC2 at Nasty Gal
• VPC - really good idea!
• Extension of data center infrastructure
• Currently using it for dev/staging + some
internal backend production
• Challenging to set up VPN tunnels to
various firewall vendors (Cisco, Fortinet)
- not much debugging on VPC side
16. Interacting with AWS
• AWS API (mostly Java based, but also Ruby
and Python)
• Multi-cloud libraries: jclouds (Java), libcloud
(Python), deltacloud (Ruby)
• Chef knife
• Vagrant EC2 provider
• Roll your own
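As an illustration of the “roll your own” option: any homegrown EC2 client from this era had to implement AWS Signature Version 2 query signing itself. A stdlib-only sketch (the access key, timestamp, and API version are placeholders; boto or libcloud handle all of this for you):

```python
import base64
import hashlib
import hmac
from urllib.parse import quote

def sign_ec2_request(secret_key, host, params, verb="GET", path="/"):
    """Compute an AWS Signature Version 2 for an EC2 Query API call."""
    # Canonical query string: params sorted by name, RFC 3986 percent-encoded.
    canonical = "&".join(
        "%s=%s" % (quote(k, safe="-_.~"), quote(str(v), safe="-_.~"))
        for k, v in sorted(params.items())
    )
    # StringToSign: verb, lowercased host, path, canonical query string.
    string_to_sign = "\n".join([verb, host.lower(), path, canonical])
    digest = hmac.new(secret_key.encode(), string_to_sign.encode(),
                      hashlib.sha256).digest()
    return base64.b64encode(digest).decode()

params = {
    "Action": "DescribeInstances",
    "AWSAccessKeyId": "AKIDEXAMPLE",       # placeholder credential
    "SignatureMethod": "HmacSHA256",
    "SignatureVersion": "2",
    "Timestamp": "2013-02-19T00:00:00Z",
    "Version": "2012-12-01",
}
sig = sign_ec2_request("secret", "ec2.amazonaws.com", params)
print(sig)
```

The signature then goes on the request as a `Signature` query parameter; getting the canonicalization exactly right is most of the work, which is why the multi-cloud libraries are usually the better choice.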
17. Proper infrastructure care
and feeding
• Monitoring - alerting, logging, graphing
• It’s not in production if it’s not monitored
and graphed
• Monitoring is for ops what testing is for
dev
• Great way to learn a new infrastructure
• Dev and ops on pager
18. Proper infrastructure care
and feeding
• Going from #monitoringsucks to
#monitoringlove and @monitorama
• Modern monitoring/graphing/logging tools
• Sensu, Graphite, Boundary, Server
Density, New Relic, Papertrail, Pingdom,
Dead Man’s Snitch
19. Proper infrastructure care
and feeding
• Dashboards!
• Mission Control page with graphs based on
Graphite and Google Visualization API
• Correlate spikes and dips in graphs with errors
(external and internal monitoring)
• Akamai HTTP 500 alerts correlated with Web
server 500 errors and DB server I/O wait
increase
20. Proper infrastructure care
and feeding
• HTTP 500 errors as a percentage of all HTTP
requests across all app servers in the last 60
minutes
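That dashboard metric is cheap to compute from access-log samples. A sketch, where the `(timestamp, status)` pairs and the 60-minute window stand in for whatever the app servers actually log:

```python
from datetime import datetime, timedelta

def error_rate(entries, now, window_minutes=60):
    """Percentage of HTTP 500s among all requests in the last window.

    `entries` is a list of (timestamp, status_code) pairs -- a stand-in
    for parsed access-log lines.
    """
    cutoff = now - timedelta(minutes=window_minutes)
    recent = [(ts, code) for ts, code in entries if ts >= cutoff]
    if not recent:
        return 0.0
    errors = sum(1 for _, code in recent if code == 500)
    return 100.0 * errors / len(recent)

now = datetime(2013, 2, 19, 12, 0)
entries = [
    (now - timedelta(minutes=5), 200),
    (now - timedelta(minutes=10), 500),
    (now - timedelta(minutes=30), 200),
    (now - timedelta(minutes=90), 500),  # outside the 60-min window, ignored
]
print(error_rate(entries, now))  # one 500 out of three recent requests
```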
21. Proper infrastructure care
and feeding
• Expect failures and recover quickly
• Capacity planning
• Dark launching
• Measure baselines
• Correlate external symptoms (HTTP 500) with
metrics (CPU I/O Wait) then keep metrics
under certain thresholds by adding resources
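The last point reduces to a simple check once the baseline metrics exist: compare each metric against its threshold and add resources to whatever layer is over. A sketch with illustrative threshold values (not from the talk):

```python
def needs_capacity(metrics, thresholds):
    """Return the metrics currently over their thresholds.

    Threshold values are illustrative; in practice they come from
    measured baselines correlated with user-visible errors.
    """
    return {name: value
            for name, value in metrics.items()
            if name in thresholds and value > thresholds[name]}

thresholds = {"cpu_iowait_pct": 20.0, "http_500_pct": 1.0}
current = {"cpu_iowait_pct": 35.0, "http_500_pct": 0.4}
print(needs_capacity(current, thresholds))  # iowait is over: add DB capacity
```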
22. Proper infrastructure care
and feeding
• Automate, automate, automate! - Chef, Puppet,
CFEngine, Jenkins, Capistrano, Fabric
• Chef - can be single source of truth for
infrastructure
• Running chef-client continuously on nodes
requires discipline
• Logging into remote node is anti-pattern (hard!)
23. Proper infrastructure care
and feeding
• Chef best practices
• Use knife - no snowflakes!
• Deploy new nodes, don’t do massive updates
in place
• BUT! beware of OS monoculture
• kernel bug after 200+ days
• leapocalypse
24. Is the cloud worth the
hype?
• It’s a game changer, but it’s not magical; try before
you buy! (benchmarks could surprise you)
• Cloud expert? Carry pager or STFU
• Forces you to think about failure recovery,
horizontal scalability, automation
• Something to be said for abstracting away the
physical network - the most obscure bugs are
network-related (ARP caching, routing tables)
25. So...when should I use
the cloud?
• Great for dev/staging/testing
• Great for layers of infrastructure that
contain many identical nodes and that are
forgiving of node failures (web farms,
Hadoop nodes, distributed databases)
• Not great for ‘snowflake’-type systems
• Not great for RDBMS (esp. write-intensive)
26. If you still want to use
the cloud
• Watch that monthly bill!
• Use multiple cloud vendors
• Design your infrastructure to scale horizontally
and to be portable across cloud vendors
• Shared nothing
• No SAN, NAS
27. If you still want to use
the cloud
• Don’t get locked into vendor-proprietary
services
• EC2, S3, Route 53, EMR are OK
• Data stores are not OK (DynamoDB)
• OpsWorks - debatable (based on Chef, but still
locks you in)
• Wrap services in your own RESTful endpoints
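The wrapping advice above can be sketched as a thin interface your code depends on, with the vendor service as just one pluggable backend. Class and method names are hypothetical; the in-memory backend stands in for an S3-backed one:

```python
import abc

class BlobStore(abc.ABC):
    """Your own storage interface; callers never see the vendor API."""

    @abc.abstractmethod
    def put(self, key, data): ...

    @abc.abstractmethod
    def get(self, key): ...

class InMemoryStore(BlobStore):
    """Stand-in backend; an S3Store would implement the same interface,
    so swapping vendors touches one class, not every caller."""

    def __init__(self):
        self._data = {}

    def put(self, key, data):
        self._data[key] = data

    def get(self, key):
        return self._data[key]

store = InMemoryStore()            # could be S3Store() behind the same API
store.put("invite/123", b"payload")
print(store.get("invite/123"))
```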
28. Does EC2 have rivals?
• No (or at least not yet)
• Anybody use GCE?
• Other public clouds are either toys or
smaller, with fewer features (no names named)
• Perception matters - not a contender unless
featured on High Scalability blog
• APIs matter less (can use multi-cloud libs)
29. Does EC2 have rivals?
• OpenStack, CloudStack, Eucalyptus all seem
promising
• Good approach: private infrastructure (bare
metal, private cloud) for performance/
reliability + extension into public cloud for
elasticity/agility (EC2 VPC, Rack Connect)
• How about PaaS?
• Personally: too hard to relinquish control