Building a cloud service on a cloud infrastructure. Also, cloud.
1. Building a cloud service on a cloud infrastructure at
Also, cloud.
Mikhail Panchenko, Surge 2011
2. Who Am I?
Pancakes
Infrastructure Engineer at SimpleGeo
Backend Engineer at Flickr before that
Backend and Frontend Engineer at Yahoo!
Ops/Tools before that
Philosophy, Economics, and French major
before that
@mihasya
pancakes@simplegeo.com
3. Tools for mobile/geo developers
Primarily focused on services, some
data-oriented APIs
PaaS, I guess? I've lost track a bit
Availability, redundancy part of brand
Our outage = your outage
No pressure
4. Agenda
Goals
A little bit of theory
Challenges in The Cloud
General Architecture
Implementation Details
11. "Complex interactions are those of unfamiliar
sequences, or unplanned and unexpected
sequences, and either not visible or not
immediately comprehensible."
Charles Perrow. Normal Accidents: Living with High-Risk Technologies (p. 78). Kindle Edition.
12. "The notion of baffling interactions is increasingly
familiar to all of us. [...] As systems grow in size and
in the number of diverse functions they serve, and
are built to function in ever more hostile
environments, increasing their ties to other systems,
they experience more and more incomprehensible
or unexpected interactions. They become more
vulnerable to unavoidable system accidents."
Charles Perrow. Normal Accidents: Living with High-Risk Technologies (p. 72). Kindle Edition.
18. Three Mile Island
"... they found that radioactive water was not
traveling to the tank they intended, but because of
complex flow and pressure interactions, was going
to a different, wrong tank, which also overflowed,
this time in the auxiliary building."
Charles Perrow. Normal Accidents: Living with High-Risk Technologies (pp. 22-23). Kindle Edition.
19. Amazon Web Services
"The traffic shift was executed incorrectly and
rather than routing the traffic to the other router on
the primary network, the traffic was routed onto the
lower capacity redundant EBS network."
"Summary of the Amazon EC2 and Amazon RDS Service Disruption in the US East Region"
http://aws.amazon.com/message/65648/
20. Common Theme
Previously independent systems become
coupled as a result of unanticipated
interactions, leading to fundamentally
surprising results
31. "The notion of baffling interactions is increasingly
familiar to all of us. [...] As systems grow in size and
in the number of diverse functions they serve, and
are built to function in ever more hostile
environments, increasing their ties to other
systems, they experience more and more
incomprehensible or unexpected interactions. They
become more vulnerable to unavoidable system
accidents."
Charles Perrow. Normal Accidents: Living with High-Risk Technologies (p. 72). Kindle Edition.
34. Decouple Your Subsystems
Shared resources are the most common
source of unexpected interaction
Resist temptation to double up on roles
Use queues, caches as buffers
NOTE: those are complex
subsystems of their own
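A minimal sketch of the "queues as buffers" idea, assuming a single in-process bounded queue from Python's standard library stands in for whatever queueing system sits between two subsystems. The names `offer` and `drain` are hypothetical helpers for illustration, not part of any SimpleGeo code. The point is that the producer gets an explicit "full" signal instead of dragging the consumer down with it, and the consumer works at its own pace.

```python
import queue


def offer(buffer, item):
    """Producer side: enqueue without blocking; False means the buffer is full."""
    try:
        buffer.put_nowait(item)
        return True
    except queue.Full:
        # The caller decides what to do: drop, retry later, or shed load.
        return False


def drain(buffer, handle, max_items):
    """Consumer side: process at most max_items, then return the count."""
    processed = 0
    while processed < max_items:
        try:
            item = buffer.get_nowait()
        except queue.Empty:
            break
        handle(item)
        processed += 1
    return processed


# Hypothetical burst: five items arrive, but the buffer only holds three.
buffer = queue.Queue(maxsize=3)
accepted = [offer(buffer, n) for n in range(5)]

seen = []
drained = drain(buffer, seen.append, max_items=10)
```

The bounded size is the design choice that matters here: an unbounded buffer only hides the overload until memory runs out, which is one way the buffer becomes a complex subsystem of its own.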
35. Decouple Your Subsystems
Explicit Decoupling
CPU Affinity
Webserver on 1-7; SSH etc on 8
Crude, but gets the job done
More robust solutions - containers
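The CPU affinity split above could look roughly like this on Linux. This is a sketch under assumptions: `split_cpus` and `pin_current_process` are hypothetical helper names, the box is assumed to have 8 cores, and Linux numbers CPUs from 0, so "1-7 and 8" on the slide becomes 0-6 and 7 here. `os.sched_setaffinity` is only available on platforms that support it, so the helper checks for it first.

```python
import os


def split_cpus(cpus, reserved=1):
    """Split CPU ids into (service_cpus, admin_cpus).

    The highest-numbered `reserved` CPUs are kept for admin tasks
    (SSH etc.); everything else goes to the main service.
    """
    ordered = sorted(cpus)
    if reserved < 1 or len(ordered) <= reserved:
        raise ValueError("reserved must be >= 1 and smaller than the CPU count")
    return set(ordered[:-reserved]), set(ordered[-reserved:])


def pin_current_process(cpus):
    """Pin the calling process to `cpus`; return False where unsupported."""
    if not hasattr(os, "sched_setaffinity"):
        return False
    os.sched_setaffinity(0, cpus)  # pid 0 means "the calling process"
    return True


# Assumed 8-core machine, CPU ids 0..7.
service_cpus, admin_cpus = split_cpus(range(8))
```

A webserver would call `pin_current_process(service_cpus)` at startup, and the admin processes would be pinned to `admin_cpus` the same way (or with `taskset` from the shell). Child processes inherit the affinity mask, which is what makes this crude approach workable.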
36. Decouple Your Functionality
Service architecture
Each service does one thing well
Easier to measure, understand, and
accommodate resource demands
Reduce potential for interactions,
cross-functional failure
37. Decouple from Your Environment with Configuration
Management
Decouple from your platform (OS/kernel)
Easy to test/bench potential candidates
Easy to migrate if you find a winner
This is especially important when dealing with cloud
Automate as much of deploy/bootstrap
process as possible
Probably won't help much during a provider outage
due to stampede
BUT: DirectConnect
You might not always be in the cloud...
38. Decouple Your Datacenters
Most robust redundancy mechanism
Hot-hot keeps you on your toes
Simplifies, not just for the cloud
Yahoo! now forgoing datacenter
features like HVAC
"If it gets too hot in Washington,
turn that DC off for a while"
I'm sure they're not the only ones
39. Decouple Your Datacenters
"AZ" (Availability Zone) - Basic building block for EC2
This is the level they (theoretically)
decouple at
They are probably thinking along the
same lines we are - must be able to turn
off one AZ without impact in the other
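"Turn off one AZ without impact" implies a headroom budget, which is easy to work out. The sketch below is an illustration only: the function names and load numbers are hypothetical, and it assumes that load from a failed zone redistributes perfectly across the survivors, which real traffic rarely does.

```python
def max_safe_utilization(zones):
    """Highest per-zone utilization at which losing one zone still fits.

    Assumes identical zones and perfectly even redistribution of load.
    """
    if zones < 2:
        raise ValueError("need at least two zones to survive losing one")
    return (zones - 1) / zones


def survives_zone_loss(load_per_zone, capacity_per_zone):
    """True if, for every single-zone failure, the rest can carry the total load."""
    if len(load_per_zone) != len(capacity_per_zone):
        raise ValueError("one load and one capacity value per zone")
    total_load = sum(load_per_zone)
    total_capacity = sum(capacity_per_zone)
    for failed_capacity in capacity_per_zone:
        if total_load > total_capacity - failed_capacity:
            return False
    return True


# Hypothetical numbers, in arbitrary units of capacity.
two_zone_ceiling = max_safe_utilization(2)
three_zones_ok = survives_zone_loss([60, 60, 60], [100, 100, 100])
two_zones_ok = survives_zone_loss([60, 60], [100, 100])
```

With two zones, each must stay at or below 50% utilization; with three, roughly 67%. The same 60% load that is safe across three zones is not safe across two, which is one argument for running hot-hot in more than two places.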
64. Services - Pick Your Own Adventure
Node.js and Python
Some people just hate Node.js
Can be anything, as long as Gate can
talk to it
( another reason to decouple )
Highly specialized
65. RabbitMQ
A grenade for our knife-fight
Very flexible - more than we need
Simplification candidate
New persister in >= 1.3 - degradation
over failure
See talk at 1:30PM