Overview
⢠Structure
⢠Failure in the Cloud
⢠Scaling Service Capacity
Data centers in the cloud
⢠A cloud provider (AWS, Google, Microsoft) maintains
data centers around the world.
⢠Each data center has ~100,000 computers.
⢠Limited by power and cooling considerations.
Organization of data centers
⢠Each data center has independent power supply,
independent fire control, independent security, etc
⢠Data centers are collected into availability zones and
availability zones are collected into geographic
regions.
⢠AWS currently has 20 regions each with several
availability zones.
Allocating a virtual machine in AWS - 1
• A user wishes to allocate a virtual machine in AWS
• The user specifies
  • A region
  • Availability zone
  • Image to load into the virtual machine
  • …
Allocating a virtual machine in AWS - 2
• AWS management software
  • Finds a server in that region and availability zone with spare capacity
  • Allocates a virtual machine on that server
  • Assigns a (public) IP address to that virtual machine
  • Loads the image into that virtual machine
• The VM can then send and receive messages
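As a rough illustration of this flow, here is a minimal sketch using Python and the boto3 SDK. It assumes boto3 is installed and AWS credentials are already configured; the region, availability zone, AMI id, and instance type below are placeholder values, not values from the slides.

import boto3

REGION = "us-east-1"                     # illustrative region
ZONE = "us-east-1a"                      # illustrative availability zone
IMAGE_ID = "ami-0123456789abcdef0"       # illustrative image to load into the VM

ec2 = boto3.client("ec2", region_name=REGION)

# Ask AWS to find capacity in the requested zone and start one VM there.
response = ec2.run_instances(
    ImageId=IMAGE_ID,
    InstanceType="t3.micro",
    MinCount=1,
    MaxCount=1,
    Placement={"AvailabilityZone": ZONE},
)

instance = response["Instances"][0]
# The public IP is typically assigned shortly after launch, not in this response.
print(instance["InstanceId"], instance.get("PublicIpAddress", "public IP assigned later"))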
Overview
⢠Structure
⢠Failure in the Cloud
⢠Scaling Service Capacity
A distributed system is one in which the failure of a
computer you didn't even know existed can render your
own computer unusable.
Leslie Lamport
Failures in the cloud
⢠Cloud failures large and small
⢠The Long Tail
⢠Techniques for dealing with the long tail
Sometimes the whole cloud fails
Selected Cloud Outages - 2018
⢠Feb 15, Google Cloud Datastore
⢠March 2, AWS Eastern region networking problems
⢠April 6, Microsoft Office 365. Users locked out of their accounts
⢠May 31, AWS Eastern Region hardware failure caused by
âpower eventâ
⢠June 17, Microsoft Azure. Heatwave in Europe and
malfunctioning air conditioning
⢠July 17, Google. Load balancer issue affected services globally
⢠July 16, Amazon Prime. Not AWS but Amazon sales
⢠Sept 5. Microsoft Office 365. Botched update to Azure
authentication
And sometimes just a part of it fails
…
A year in the life of a Google datacenter
⢠Typical first year for a new cluster:
⢠~0.5 overheating (power down most machines in <5 mins, ~1-2 days to
recover)
⢠~1 PDU failure (~500-1000 machines suddenly disappear, ~6 hours to
come back)
⢠~20 rack failures (40-80 machines instantly disappear, 1-6 hours to get
back)
⢠~5 racks go wonky (40-80 machines see 50% packetloss)
⢠~8 network maintenances (4 might cause ~30-minute random connectivity
losses)
⢠~12 router reloads (takes out DNS and external vips for a couple minutes)
⢠~3 router failures (have to immediately pull traffic for an hour)
⢠~dozens of minor 30-second blips for dns
⢠~1000 individual machine failures
⢠~thousands of hard drive failures
slow disks, bad memory, misconfigured machines, flaky machines, dead
horses, âŚ
Amazon failure statistics
⢠In a data center with ~64,000 servers with 2 disks
each
~5 servers and ~17 disks fail every day.
Failures in the cloud
⢠Cloud failures large and small
⢠The Long Tail
⢠Techniques for dealing with the long tail
Short digression into probability
⢠A distribution describes the probability than any given
reading will have a particular value.
⢠Many phenomenon in nature are ânormally distributedâ.
⢠Most values will cluster
around the mean with
progressively smaller
numbers of values going
toward the edges.
⢠In a normal distribution
the mean is equal to the
median
⢠In a long tail distribution, there are some
values far from the mean.
⢠These values are sufficient to influence
the mean.
⢠The mean and the
median are
dramatically
different in a long
tail distribution.
Long Tail
⢠If there is a partial failure of the cloud some
activities will take a long time to complete and
exhibit a long tail.
⢠The figure shows
distribution of
1000 AWS
âlaunch instanceâ
calls.
⢠4.5% of calls were
âlong tailâ
What does this mean?
Failures in the cloud
⢠Cloud failures large and small
⢠The Long Tail
⢠Techniques for dealing with the long tail
What can you do to prevent long tail problems?
• Recognize that failure has occurred
• Recover from failure
• Mask failure
Recognizing failure
⢠In a distributed system, the only
communication between distinct computers is
through messages.
⢠A request (message) to another machine
goes unanswered.
⢠The failure is recognized by the requestor
making the request asynchronously and
setting a timeout.
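A minimal sketch of this timeout-based detection using only the Python standard library; the URL and the timeout values are illustrative, not from the slides.

from concurrent.futures import ThreadPoolExecutor, TimeoutError
from urllib.request import urlopen

def call_remote(url):
    # The request to the other machine; blocks until a reply or a socket error.
    with urlopen(url, timeout=2.0) as reply:
        return reply.read()

executor = ThreadPoolExecutor(max_workers=1)
# Hypothetical service URL, used only for illustration.
future = executor.submit(call_remote, "http://service.example.com/health")

try:
    data = future.result(timeout=3.0)    # issue the request asynchronously, then wait
except TimeoutError:
    # No answer within the deadline: treat the remote machine as failed.
    print("request timed out; assuming the server has failed")
except OSError as err:
    print("request failed outright:", err)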
Health Check
⢠Health check. Requires one machine to act as a
monitor for other machines.
⢠Machine sends a message to a monitor
periodically saying âI am aliveâ
⢠Ping/echo. Monitor sends a message to the
machine being monitored and expects a reply
within specified period.
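A sketch of the heartbeat flavor of a health check: monitored machines report "I am alive" periodically, and the monitor flags any machine whose last report is too old. The interval, threshold, and machine id are illustrative.

import time

HEARTBEAT_TIMEOUT = 10.0      # seconds without a heartbeat before a machine is suspect
last_heartbeat = {}           # machine id -> time of most recent "I am alive" message

def record_heartbeat(machine_id):
    # Called whenever a monitored machine sends its periodic "I am alive" message.
    last_heartbeat[machine_id] = time.monotonic()

def failed_machines():
    # Any machine that has not reported recently is considered failed.
    now = time.monotonic()
    return [m for m, t in last_heartbeat.items() if now - t > HEARTBEAT_TIMEOUT]

record_heartbeat("server-1")
print(failed_machines())      # [] while server-1 keeps reporting in time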
Recovering from failure
⢠Retry
⢠Instantiate another instance of failed service
⢠Graceful degradation
⢠Set âcircuit breakerâ to prevent trying to use service
again.
⢠Netflix Open Source Hystrix library provides some
useful functions
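Hystrix itself is a Java library; the following is only a language-neutral sketch, in Python, of the circuit-breaker idea: after repeated failures, stop calling the failed service for a cool-down period instead of retrying immediately. The thresholds are illustrative.

import time

class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after=30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after    # seconds to wait before trying the service again
        self.failures = 0
        self.opened_at = None             # time the breaker tripped, if any

    def call(self, func, *args):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: not calling the failed service")
            self.opened_at = None         # cool-down elapsed, allow one trial call
            self.failures = 0
        try:
            result = func(*args)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()   # trip the breaker
            raise
        self.failures = 0
        return result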
Masking failure
• "Hedged" request. Suppose you wish to launch 10 instances. Issue 11 requests. Terminate the request that has not completed when 10 are completed.
• "Alternative" request. In the above scenario, issue 10 requests. When 8 requests have completed, issue 2 more. Cancel the last 2 to respond.
• Using these techniques reduces the time of the longest of the 1000 launch instance requests from 202 sec to 51 sec.
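A sketch of the "hedged" variant using Python's concurrent.futures: issue one more request than needed, keep the first ten that finish, and cancel the straggler. launch_instance is a stand-in for the real call, not an AWS API.

from concurrent.futures import ThreadPoolExecutor, as_completed
import random
import time

def launch_instance(i):
    # Stand-in for the real "launch instance" call; a few land in the long tail.
    time.sleep(random.choice([0.1, 0.1, 0.1, 5.0]))
    return "instance-%d" % i

NEEDED = 10
pool = ThreadPoolExecutor(max_workers=NEEDED + 1)
futures = [pool.submit(launch_instance, i) for i in range(NEEDED + 1)]  # hedge: 11 requests
done = []
for fut in as_completed(futures):
    done.append(fut.result())
    if len(done) == NEEDED:
        break
for fut in futures:
    fut.cancel()              # best effort: in a thread pool this only stops work not yet started
pool.shutdown(wait=False)
print(done)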
Overview
⢠Structure
⢠Failure in the Cloud
⢠Scaling Service Capacity
Load Balancer
⢠One server may not suffice for all of the requests for
a given service.
⢠Have multiple servers supplying the same
service.
⢠Use âload balancerâ to distribute requests.
⢠Server is registered with load balancer upon
initialization
⢠Load balancer monitors health of servers and knows
which ones are healthy.
⢠Load balancer IP is returned from DNS server when
client requests URL of service.
Message sequence - client makes a request
(Diagram: clients, load balancer, servers)
Message sequence - request arrives at load balancer
(Diagram: clients, load balancer, servers)
Message sequence - request is sent to one server
(Diagram: clients, load balancer, servers)
Message sequence - reply goes directly back to sender
(Diagram: clients, load balancer, servers)
Note IP manipulation
⢠Server always sends message back to what it thinks
is the sender.
⢠Load balancer changes destination IP but not the
source. Then reply goes directly back to client
⢠Load balancer (now acting as a proxy) can change
origin as well. In this case, reply goes back to load
balancer which must change destination (of reply)
back to original client.
Routing algorithms
⢠Load balancers use variety of algorithms to choose
instance for message
⢠Round robin. Rotate requests evenly
⢠Weighted round robin. Rotate requests according
to some weighting.
⢠Hashing â IP address of source to determine
instance. Means that a request from a particular
client always sent to same instance as long as it
is still in service.
⢠Note that these algorithms do not require knowledge
of an instanceâs load.
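Minimal sketches of the three selection rules, assuming a fixed list of healthy instances; the instance addresses and weights are made up, and none of the rules needs to know an instance's current load.

import itertools
import hashlib

instances = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]   # healthy servers, illustrative

# Round robin: rotate through the instances evenly.
rr = itertools.cycle(instances)
def round_robin():
    return next(rr)

# Weighted round robin: rotate according to a weighting (here 3:1:1).
weights = {"10.0.0.1": 3, "10.0.0.2": 1, "10.0.0.3": 1}
wrr = itertools.cycle([ip for ip, w in weights.items() for _ in range(w)])
def weighted_round_robin():
    return next(wrr)

# Hashing: the source IP determines the instance, so the same client
# keeps hitting the same instance while it stays in service.
def by_source_hash(client_ip):
    digest = hashlib.sha256(client_ip.encode()).digest()
    return instances[digest[0] % len(instances)]

print(round_robin(), weighted_round_robin(), by_source_hash("203.0.113.7"))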
Suppose the Load Balancer Becomes Overloaded
=> Load balance the load balancers
Combining pictures
⢠IP payload address is
modified multiple times
before message gets to
server
⢠Return message from
instance to client may go
through the gateway (proxy)
or may go directly back to
the client.
Cloud gatew
Client
Summary
⢠The cloud consists of regions with availability zones
⢠A new VM is placed in an availability zone in a region
⢠Instances in the cloud can fail.
⢠Must be detected
⢠Long tail is a cloud phenomena that developers
must be aware of.
⢠Load balancers distribute requests among identical
servers.
Suppose servers become overloaded
• As load grows, existing resources may not be sufficient.
• Autoscaling is a mechanism for creating new instances of a server.
• Set up a collection of rules that determine
  • Under what conditions new servers are added
  • Under what conditions servers are deleted
First there were three servers
(Diagram: load balancer and three servers)
Now there are four
Issues:
• What makes the decision to add a new server?
• How does the new server get loaded with software?
• How does the load balancer know about the new server?
Making the decision
(Diagram: servers reporting CPU usage to a monitor)
• Each server reports its CPU usage to a monitor.
• The monitor has a collection of rules to determine whether to add a new server, e.g. one server is over 80% utilization for 15 minutes (see the sketch below).
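A sketch of the kind of rule the monitor might evaluate, using the example from the slide (add a server when one server has been over 80% CPU for 15 minutes). The data structures and function names are illustrative, not a real autoscaling API.

import time

WINDOW = 15 * 60            # 15 minutes
THRESHOLD = 80.0            # percent CPU

reports = {}                # server id -> list of (timestamp, cpu_percent), newest last

def record(server_id, cpu_percent):
    reports.setdefault(server_id, []).append((time.time(), cpu_percent))

def should_add_server():
    now = time.time()
    for samples in reports.values():
        if not samples or now - samples[0][0] < WINDOW:
            continue                                    # not enough history yet
        recent = [cpu for ts, cpu in samples if now - ts <= WINDOW]
        if recent and all(cpu > THRESHOLD for cpu in recent):
            return True                                 # this server has been hot for 15 minutes
    return False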
Loading the new server with software
• When a new instance is created, it is created based on a launch configuration, security groups, etc.
(Diagram: configuration specifications)
Making the load balancer aware of the new server
• The new server is registered with the load balancer by the monitor.
Atomicity
⢠Some operations must be done atomically (or appear
to be atomic).
⢠Race conditions result from non atomic operations.
⢠Data base updates may result in inconsistent data
if not atomic
⢠Resources (e.g. memory) may be overwritten if not
atomic
⢠Interrupt handlers must be atomic
⢠etc
How do I achieve (or appear to achieve) atomicity?
• Time stamps?
  • Many protocols involve putting a time stamp on messages for error detection and ordering purposes.
  • Time stamps are often used to identify log messages used for debugging problems.
  • Couldn't time stamps be used to order activities?
• The problem is that clocks drift, so clock time may be different on different computers.
Synchronizing clocks over networks
⢠Suppose two different computers are connected via a
network. How do they synchronize their clocks?
⢠If one computer sends its time reading to another, it takes
time for the message to arrive.
⢠NTP (Network Time Protocol) can be used to synchronize
time on a collection of computers.
⢠Accurate to around 1 millisecond in local area networks
⢠Accurate to around 10 milliseconds over public internet
⢠Congestion can cause errors of 100 milliseconds or
more.
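The core of how NTP copes with the message delay: it records four timestamps per exchange and estimates both the round-trip delay and the offset between the two clocks. A sketch of that calculation; the timestamp values in the example are made up.

def ntp_offset_and_delay(t0, t1, t2, t3):
    """t0: client send time, t1: server receive time,
    t2: server send time, t3: client receive time (seconds)."""
    delay = (t3 - t0) - (t2 - t1)            # round-trip time minus server processing time
    offset = ((t1 - t0) + (t2 - t3)) / 2     # estimated difference between the two clocks
    return offset, delay

# Example: the client clock is ~5 ms behind the server, 20 ms round trip.
print(ntp_offset_and_delay(100.000, 100.015, 100.016, 100.021))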
Suppose NTP is insufficiently accurate
• The financial industry is spending hundreds of millions of dollars to reduce latency between Chicago and New York by 3 milliseconds.
  • Well within the error range of NTP
• GPS time is accurate to within
  • 14 nanoseconds (theoretically)
  • 100 nanoseconds (mostly)
  • Timestamp messages with GPS time
  • Used by electric companies to measure phase angle
• Atomic clocks
  • Used by Google to coordinate time across all of their distributed systems.
How about using locks?
⢠One technique to synchronize is to lock critical resources.
⢠Can lead to deadlock â two processes waiting for each
other to release critical resources
⢠Process one gets a lock on row 1 of a data base
⢠Process two gets a lock on row 2.
⢠Process one waits for process 2 to release its lock on
row 2
⢠Process two waits for process 1 to release its lock on
row 1
⢠No progress.
More problems with locks
⢠Locks are logical structures maintained in software
or in persistent storage.
⢠Getting a lock across distributed systems is not an
atomic operation.
⢠It is possible that while requesting a lock another
process can acquire the lock. This can go on for a
long time (it is called livelock if there is no
possibility of ever acquiring a lock)
⢠Suppose the virtual machine holding the lock fails.
Then the owner of the lock can never release it.
Locks in distributed systems
⢠Utilizing locks in a distributed system has a more
fundamental problem.
⢠It takes time to send a message (create a lock, for
example) from one computer to another.
How much time do operations in distributed systems take?
• Main memory reference: 100 ns
• Send 1 KB over a 1 Gbps network: 0.01 ms
⢠Read 4K randomly from SSD. 15 ms
⢠Read 1 MB sequentially from memory 0.25 ms
⢠Round trip within same datacenter 0.5 ms
⢠Read 1 MB sequentially from SSD 1 ms
(4X memory)
⢠Disk seek
10 ms
(20x datacenter roundtrip)
⢠Read 1 MB sequentially from disk 20 ms
(80x memory, 20X SSD)
⢠Send packet CA->Netherlands->CA 150 ms
* dean-keynote-ladis2009_scalable_distributed_google_system
Implications of latency numbers
• State stored in persistent storage (disk or SSD) will take longer to fetch than state stored in memory.
• State stored in a different datacenter will take longer to access than state stored locally, especially across continents.
• Persistent storage is typically replicated, both for performance (latency) reasons and for availability (failure) reasons.
• => Keeping data consistent across these replicas is important but difficult.
Is there a solution?
⢠The general problem is that you want to manage
synchronization of data across a distributed set of
servers where up to half of the servers can fail.
⢠Paxos is a family of algorithms that use consensus
to manage concurrency. Complicated and difficult
to implement.
⢠An example of the implementation difficulty
⢠Choose one server as the master that keeps the
âauthoritativeâ state.
⢠Now that server fails. Need to
⢠Find a new master
⢠Make sure it is up to data with the authoritative state.
Luckily
⢠Several open source systems are now available that
⢠Implement Paxos or an alternative consensus
algorithm
⢠Are reasonably easy to use.
⢠Sample systems
⢠Zookeeper
⢠Consul
⢠etcd
Zookeeper as an example
⢠Zookeeper provides a guaranteed consistent
(mostly) data structure for every instance of a
distributed application.
⢠Definition of âmostlyâ is within eventual
consistency lag (but this is small).
⢠Zookeeper keeps data in memory. This is why it
is fast.
⢠Zookeeper deals with managing failure as well as
consistency.
⢠Done using consensus algorithm.
Zookeeper API
Function        Type
create          write
delete          write
exists          read
get children    read
get data        read
set data        write
+ others
• All calls return atomic views of state: they either succeed or fail, and no partial state is returned. Writes are also atomic: they either succeed or fail, and if they fail there are no side effects.
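Zookeeper's API is language-neutral; a rough sketch of the calls in the table above using kazoo, a commonly used Python client. It assumes kazoo is installed and an ensemble is reachable at 127.0.0.1:2181, which is an illustrative address; the znode paths and values are also illustrative.

from kazoo.client import KazooClient

zk = KazooClient(hosts="127.0.0.1:2181")   # illustrative ensemble address
zk.start()

zk.create("/config/db_url", b"postgres://primary", makepath=True)   # write
print(zk.exists("/config/db_url") is not None)                      # read
data, stat = zk.get("/config/db_url")                               # read
print(zk.get_children("/config"))                                   # read
zk.set("/config/db_url", b"postgres://replica")                     # write
zk.delete("/config/db_url")                                         # write

zk.stop()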
Example - Lock creation
• Locks have names.
• Client N
  • Creates Lock1
  • If success, then the client owns the lock.
  • If failure, then it is placed on the wait list for Lock1 and control is returned to Client N
• When the client finishes, it deletes Lock1
  • If there is a wait list, then those clients are informed about the Lock1 deletion and they try again to create Lock1
• If Client N fails, Zookeeper will delete Lock1 and the waitlisted clients are informed of the deletion
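kazoo packages a similar pattern as a lock recipe: the lock is backed by an ephemeral znode, so if the client holding it dies, its session expires and Zookeeper removes the node, letting waiters proceed. A sketch under those assumptions; the connection address and lock path are illustrative, and note that the recipe blocks by default rather than returning control as the slide describes (acquire(blocking=False) gives the non-blocking behavior).

from kazoo.client import KazooClient

zk = KazooClient(hosts="127.0.0.1:2181")   # illustrative ensemble address
zk.start()

lock = zk.Lock("/locks/Lock1", identifier="client-N")   # named lock, as on the slide

with lock:      # blocks until this client owns Lock1
    pass        # do the work that must be done while holding the lock
# Leaving the block deletes the underlying znode, waking any waiting clients.

zk.stop()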
Other use cases
⢠Leader election
⢠Groups membership
⢠Synchronization
⢠Configuration
Summary
⢠Autoscaling creates new instances based on utilization
of existing instances
⢠Synchronization of data across a distributed system id
complicated
⢠It is difficult because of
⢠Timing delays across distributed systems
⢠Possible failures of nodes of various types
⢠Open source systems implement complicated
consensus algorithms in a fashion that is relatively
easy to use.