28. Dynamism
• Disintermediation
• Developers can freely experiment
This is what you are paying for
• Isolation
• Applications safely co-exist
• Utilization
• Best use of expensive resources
32. You are not that BIG
• LAMP can scale on generic architecture
• 2008 - Facebook has over 800 memcached servers, with 28 terabytes
of RAM
• 2010 - GitHub has 16 physical machines, 128 cores, 288 GB RAM
• Don’t design for A Million Users
• Ship early, Ship ugly, Ship often!
34. EC2 Design Principles
• Minimize management footprint
• Run in VMs just like customers.
• Forced to analyze what must run in
privileged space
• “Harden everything” means separate
network traffic inside the datacenter –
customers and management run there
• True multi-tenancy - Customers run side-
by-side
• Design by Fight Club
• “You are not a beautiful and unique
snowflake”
• “On a large enough time line, the survival
rate for everyone will drop to zero.”
http://www.flickr.com/photos/europedistrict/4058066840/
47. Nobody ever imagined a band of
Orcs would steal a database table
Charles Stross - Halting State
49. MTTF & MTTR
Understanding how, when and
why things fail is great ... but
If your Mean Time to Recover exceeds the
time value of your data, your business is
DEAD
http://www.flickr.com/photos/dierken/948171048/sizes/z/
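The relationship between MTTF and MTTR can be made concrete with the standard steady-state availability formula, MTTF / (MTTF + MTTR). The sketch below is illustrative only: the function names and the sample hours are assumptions, not figures from the talk.

```python
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: fraction of time the service is up."""
    return mttf_hours / (mttf_hours + mttr_hours)


def expected_downtime_hours(mttf_hours: float, mttr_hours: float,
                            period_hours: float = 24 * 365) -> float:
    """Expected downtime over a period (default: one 365-day year)."""
    return (1.0 - availability(mttf_hours, mttr_hours)) * period_hours


if __name__ == "__main__":
    # Hypothetical numbers: the same failure rate with two recovery times.
    for mttr in (1.0, 24.0):
        print(mttr, availability(999.0, mttr),
              expected_downtime_hours(999.0, mttr))
```

With the failure rate held constant, recovery time alone drives the downtime figure, which is why the slide treats MTTR as the number that decides whether the business survives.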
50. Testing
• Test with production-like dataset and
performance
• Don’t do “Design by Laptop”
• A/B Testing
• API versioning
51. Pull the Plug
• Create test environment
• Pull the plug
• Document
• Pull the plug again!
http://www.flickr.com/photos/rosipaw/5033284534/sizes/m/in/photostream/
58. Free your mind...
• Vertical vs Horizontal Scale
• Availability
• Reliability
• 99% vs 99.x% per unit?
Theo vs Morpheus
You are not Theo. You’re probably not Morpheus either.
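The "99% vs 99.x% per unit" question can be worked through with basic probability. This is a minimal sketch that assumes unit failures are independent (often untrue in practice, since units share power, network, and software bugs); the function names are illustrative.

```python
def parallel_availability(unit: float, n: int) -> float:
    """Availability of n redundant units where any one can serve.

    Assumes independent failures: the system is down only when all
    n units are down at the same time.
    """
    return 1.0 - (1.0 - unit) ** n


def serial_availability(unit: float, n: int) -> float:
    """Availability of a chain of n units that must all be up."""
    return unit ** n


if __name__ == "__main__":
    print(parallel_availability(0.99, 2))  # two cheap units side by side
    print(serial_availability(0.999, 10))  # ten good units in a chain
```

Under the independence assumption, two 99% units in parallel beat a single 99.9% unit, while chaining many highly available units in series erodes the total. That is the tension behind horizontal versus vertical scale.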
59. Availability
• For a distributed system to be continuously
available, every request received by a non-failing
node in the system must result in a response.
• “Read globally, write locally” with inconsistent
cache
• Service Level Agreements, even (especially?)
internally
60. Think Globally,
Act Locally
• Global but inconsistent aggregate view
• Local action where data is authoritative
• Autonomy
• “Rightsizing” your failure domain
http://www.flickr.com/photos/28634332@N05/3872137437/sizes/m/in/photostream/
61. Distributed Systems Design
• Avoid execution caching
• “Don’t lie, don’t retry”
• Embrace failure
• Don’t block the client
• Avoid internal policy
• Ensure the system makes forward
progress
64. Admission Control
• Distributed Throttling
• Staged / Pipeline with back pressure
• Measure scalability at each stage
• Degraded performance
• Make progress for admitted requests
• At odds with “stateless” / session-less
http://www.flickr.com/photos/jayneandd/4450623309/sizes/m/in/photostream/
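One way to picture admission control is a gate that caps in-flight work and rejects the rest immediately, so admitted requests keep making progress. The sketch below is a single-process illustration, not the distributed throttling the slide refers to; the class name and the limit are assumptions.

```python
import threading


class AdmissionController:
    """Caps concurrent requests; excess requests are rejected, not queued."""

    def __init__(self, max_in_flight: int) -> None:
        self._max = max_in_flight
        self._in_flight = 0
        self._lock = threading.Lock()

    def try_admit(self) -> bool:
        """Return True and take a slot if capacity remains, else False."""
        with self._lock:
            if self._in_flight >= self._max:
                return False  # back pressure: the caller should shed load
            self._in_flight += 1
            return True

    def release(self) -> None:
        """Give the slot back once the admitted request has finished."""
        with self._lock:
            if self._in_flight > 0:
                self._in_flight -= 1


if __name__ == "__main__":
    gate = AdmissionController(max_in_flight=2)
    print([gate.try_admit() for _ in range(3)])
```

Note that the gate holds state (the in-flight count), which illustrates why the slide calls admission control "at odds with stateless".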
66. Make Forward Progress
• MVCC, vector clocks, & reconciliation
• Don’t resurrect objects
• always go forward, never go back
• "name" is a property of an object, not its
unique key
• Break the link, garbage collect later
• Model “degraded service” performance
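As a rough illustration of the vector clock bullet above, the sketch below compares two clocks and merges them during reconciliation. Clocks are modeled as plain dictionaries from node name to counter; this representation and the function names are assumptions for the example, not something the talk specifies.

```python
def compare(a: dict, b: dict) -> str:
    """Return 'before', 'after', 'equal', or 'concurrent' for a vs b."""
    nodes = a.keys() | b.keys()
    a_le_b = all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
    b_le_a = all(b.get(n, 0) <= a.get(n, 0) for n in nodes)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"
    if b_le_a:
        return "after"
    return "concurrent"  # neither dominates: needs reconciliation


def merge(a: dict, b: dict) -> dict:
    """Element-wise maximum; the merged clock never moves backwards."""
    return {n: max(a.get(n, 0), b.get(n, 0)) for n in a.keys() | b.keys()}


if __name__ == "__main__":
    x = {"node1": 2, "node2": 0}
    y = {"node1": 1, "node2": 1}
    print(compare(x, y), merge(x, y))
```

Because merging takes the maximum of every counter, a reconciled version always dominates both inputs, which is one concrete reading of "always go forward, never go back".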
67. Request Signing
• Stateless - no session tracking to lose or to
purge later
• X.509 - only public information on front-end
boxes. More secure against exploit
• Shared secret - faster, smaller signature but
requires secret info close to request front-end
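The shared-secret variant can be sketched with an HMAC over the parts of the request that matter. This is a generic illustration using Python's standard `hmac` module; the string-to-sign layout, field names, and secret are assumptions and do not reproduce the actual AWS signature format.

```python
import hashlib
import hmac


def sign_request(secret: bytes, method: str, path: str, timestamp: str) -> str:
    """Return a hex HMAC-SHA256 signature over a simple string-to-sign."""
    message = "\n".join([method, path, timestamp]).encode("utf-8")
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def verify_request(secret: bytes, method: str, path: str,
                   timestamp: str, signature: str) -> bool:
    """Recompute the signature server-side; no session state is needed."""
    expected = sign_request(secret, method, path, timestamp)
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(expected, signature)


if __name__ == "__main__":
    secret = b"hypothetical-shared-secret"
    sig = sign_request(secret, "GET", "/instances", "2010-01-01T00:00:00Z")
    print(verify_request(secret, "GET", "/instances",
                         "2010-01-01T00:00:00Z", sig))
```

Verification needs the secret on the box doing the check, which is the trade-off the slide names: the X.509 approach keeps only public material on the front end, at the cost of larger signatures and slower verification.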
69. Control Chart
• Day over Day
• Same Day, Year over Year
• Confidence Intervals
“Shewhart stressed that bringing a production process into a state of statistical control, where there is
only common-cause variation, and keeping it in control, is necessary to predict future output and to
manage a process economically.”
• http://en.wikipedia.org/wiki/Control_chart
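A control chart boils down to a center line and limits derived from a baseline period. The sketch below is a simplification that uses mean plus or minus three standard deviations of the baseline; real Shewhart charts usually estimate the spread from subgroup ranges or moving ranges. The function names and sample values are illustrative.

```python
import statistics


def control_limits(baseline, sigmas: float = 3.0):
    """Return (lower, center, upper) limits computed from baseline samples."""
    center = statistics.mean(baseline)
    spread = statistics.pstdev(baseline)
    return center - sigmas * spread, center, center + sigmas * spread


def out_of_control(baseline, observations, sigmas: float = 3.0):
    """Return the observations that fall outside the control limits."""
    lower, _, upper = control_limits(baseline, sigmas)
    return [x for x in observations if x < lower or x > upper]


if __name__ == "__main__":
    # Hypothetical metric, e.g. requests per minute on the same weekday.
    last_week = [100, 102, 98, 101, 99]
    today = [101, 110, 95]
    print(control_limits(last_week))
    print(out_of_control(last_week, today))
```

Choosing the baseline is where "day over day" and "same day, year over year" come in: the comparison period should share the seasonality of the data being judged.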
74. Performance
• Call length
• Cyclomatic Complexity
• Request ID flow
• Vertical vs Horizontal Scale
• tension between unit performance and
scalability
77. Successes
• Sharable “AMI”s
• Metadata (Simple and open again)
• Open API (think Eucalyptus)
• No API throttling
• Primitives
• Pay-as-you-go
• Free traffic between S3 and EC2
• Data and Compute together
78. Failures
• SOAP makes little girls cry
• Amazon Web Services, circa 2006 was > 75%
REST or Query
• SOAP well supported by commercial vendors,
with their libraries
• Still *Way* too hard to use.
• Commodity business. Driving the bottom out of
cost causes quality to suffer.
• API vs UI? User experience in general
• IaaS (Infrastructure as a Service) is insufficient by
itself
a hangman's noose. EC2, and the other offerings,