SlideShare ist ein Scribd-Unternehmen logo
1 von 60
Downloaden Sie, um offline zu lesen
DOSE: Deployment and Operations
for Software Engineers
The Cloud
Š Len Bass 2019 2
Overview
• Structure
• Failure in the Cloud
• Scaling Service Capacity
Š Len Bass 2019 3
Data centers in the cloud
• A cloud provider (AWS, Google, Microsoft) maintains
data centers around the world.
• Each data center has ~100,000 computers.
• Limited by power and cooling considerations.
Š Len Bass 2019 4
Data Center
Š Len Bass 2019 5
Organization of data centers
• Each data center has independent power supply,
independent fire control, independent security, etc
• Data centers are collected into availability zones and
availability zones are collected into geographic
regions.
• AWS currently has 20 regions each with several
availability zones.
Š Len Bass 2019 6
Allocating a virtual machine in
AWS - 1
• A user wishes to allocate a virtual machine in AWS
• The user specifies
• A region
• Availability zone
• Image to load into virtual machine
• …
Š Len Bass 2019 7
Allocating a virtual machine in
AWS - 2
• AWS management software
• Finds a server in that region and
availability zone with spare capacity
• Allocates a virtual machine in that server
• Assigns IP address to that virtual machine
(public)
• Loads image into that virtual machine
• VM can then send and receive
messages
Š Len Bass 2019 8
Overview
• Structure
• Failure in the Cloud
• Scaling Service Capacity
Š Len Bass 2019 9
A distributed system is one in which the failure of a
computer you didn't even know existed can render your
own computer unusable.
Leslie Lamport
Š Len Bass 2019 10
Failures in the cloud
• Cloud failures large and small
• The Long Tail
• Techniques for dealing with the long tail
Š Len Bass 2019 11
Sometimes the whole cloud fails
Selected Cloud Outages - 2018
• Feb 15, Google Cloud Datastore
• March 2, AWS Eastern region networking problems
• April 6, Microsoft Office 365. Users locked out of their accounts
• May 31, AWS Eastern Region hardware failure caused by
“power event”
• June 17, Microsoft Azure. Heatwave in Europe and
malfunctioning air conditioning
• July 17, Google. Load balancer issue affected services globally
• July 16, Amazon Prime. Not AWS but Amazon sales
• Sept 5. Microsoft Office 365. Botched update to Azure
authentication
Š Len Bass 2019 12
And sometimes just a part of it fails
…
Š Len Bass 2019 13
A year in the life of a Google datacenter
• Typical first year for a new cluster:
• ~0.5 overheating (power down most machines in <5 mins, ~1-2 days to
recover)
• ~1 PDU failure (~500-1000 machines suddenly disappear, ~6 hours to
come back)
• ~20 rack failures (40-80 machines instantly disappear, 1-6 hours to get
back)
• ~5 racks go wonky (40-80 machines see 50% packetloss)
• ~8 network maintenances (4 might cause ~30-minute random connectivity
losses)
• ~12 router reloads (takes out DNS and external vips for a couple minutes)
• ~3 router failures (have to immediately pull traffic for an hour)
• ~dozens of minor 30-second blips for dns
• ~1000 individual machine failures
• ~thousands of hard drive failures
slow disks, bad memory, misconfigured machines, flaky machines, dead
horses, …
Š Len Bass 2019 14
Amazon failure statistics
• In a data center with ~64,000 servers with 2 disks
each
~5 servers and ~17 disks fail every day.
Š Len Bass 2019 15
Failures in the cloud
• Cloud failures large and small
• The Long Tail
• Techniques for dealing with the long tail
Š Len Bass 2019 16
Short digression into probability
• A distribution describes the probability than any given
reading will have a particular value.
• Many phenomenon in nature are “normally distributed”.
• Most values will cluster
around the mean with
progressively smaller
numbers of values going
toward the edges.
• In a normal distribution
the mean is equal to the
median
Š Len Bass 2019 17
• In a long tail distribution, there are some
values far from the mean.
• These values are sufficient to influence
the mean.
• The mean and the
median are
dramatically
different in a long
tail distribution.
Long Tail
Š Len Bass 2019 18
• If there is a partial failure of the cloud some
activities will take a long time to complete and
exhibit a long tail.
• The figure shows
distribution of
1000 AWS
“launch instance”
calls.
• 4.5% of calls were
“long tail”
What does this mean?
Š Len Bass 2019 19
Failures in the cloud
• Cloud failures large and small
• The Long Tail
• Techniques for dealing with the long tail
Š Len Bass 2019 20
What can you do to prevent
long tail problems?
• Recognize that failure has occurred
• Recover from failure
• Mask failure
Š Len Bass 2019 21
Recognizing failure
• In a distributed system, the only
communication between distinct computers is
through messages.
• A request (message) to another machine
goes unanswered.
• The failure is recognized by the requestor
making the request asynchronously and
setting a timeout.
Š Len Bass 2019 22
Health Check
• Health check. Requires one machine to act as a
monitor for other machines.
• Machine sends a message to a monitor
periodically saying “I am alive”
• Ping/echo. Monitor sends a message to the
machine being monitored and expects a reply
within specified period.
Š Len Bass 2019 23
Recovering from failure
• Retry
• Instantiate another instance of failed service
• Graceful degradation
• Set “circuit breaker” to prevent trying to use service
again.
• Netflix Open Source Hystrix library provides some
useful functions
Š Len Bass 2019 24
Masking failure
• “Hedged” request. Suppose you wish to launch 10
instances. Issue 11 requests. Terminate the request
that has not completed when 10 are completed.
• “Alternative” request. In the above scenario, issue 10
requests. When 8 requests have completed issue 2
more. Cancel the last 2 to respond.
• Using these techniques reduces the time of the
longest of the 1000 launch instance requests from
202 sec to 51 sec.
Š Len Bass 2019 25
Overview
• Structure
• Failure in the Cloud
• Scaling Service Capacity
Š Len Bass 2019 26
Load Balancer
• One server may not suffice for all of the requests for
a given service.
• Have multiple servers supplying the same
service.
• Use “load balancer” to distribute requests.
• Server is registered with load balancer upon
initialization
• Load balancer monitors health of servers and knows
which ones are healthy.
• Load balancer IP is returned from DNS server when
client requests URL of service.
Š Len Bass 2019 27
Message sequence – client
makes a request
Servers
Clients
Load
Balance
r
Š Len Bass 2019 28
Message sequence- request
arrives at load balancer
Servers
Clients
Load
Balancer
Š Len Bass 2019 29
Message sequence – request is
send to one server
Servers
Clients
Load
Balance
r
Š Len Bass 2019 30
Message sequence – reply goes
directly back to sender
Servers
Clients
Load
Balance
r
Š Len Bass 2019 31
Note IP manipulation
• Server always sends message back to what it thinks
is the sender.
• Load balancer changes destination IP but not the
source. Then reply goes directly back to client
• Load balancer (now acting as a proxy) can change
origin as well. In this case, reply goes back to load
balancer which must change destination (of reply)
back to original client.
Š Len Bass 2019 32
Routing algorithms
• Load balancers use variety of algorithms to choose
instance for message
• Round robin. Rotate requests evenly
• Weighted round robin. Rotate requests according
to some weighting.
• Hashing – IP address of source to determine
instance. Means that a request from a particular
client always sent to same instance as long as it
is still in service.
• Note that these algorithms do not require knowledge
of an instance’s load.
Š Len Bass 2019 33
Suppose Load Balancer
Becomes Overloaded
– Load Balance the Load Balancers
Š Len Bass 2019 34
Combining pictures
• IP payload address is
modified multiple times
before message gets to
server
• Return message from
instance to client may go
through the gateway (proxy)
or may go directly back to
the client.
Cloud gatew
Client
Š Len Bass 2019 35
Summary
• The cloud consists of regions with availability zones
• A new VM is placed in an availability zone in a region
• Instances in the cloud can fail.
• Must be detected
• Long tail is a cloud phenomena that developers
must be aware of.
• Load balancers distribute requests among identical
servers.
Š Len Bass 2019 36
Overview
• Autoscaling
• Distributed Coordination
Š Len Bass 2019 37
Suppose servers become
overloaded
• As load grows, existing resources may not be
sufficient.
• Autoscaling is a mechanism for creating new
instances of a server.
• Set up a collection of rules that determine
• Under what conditions are new servers added
• Under what conditions are servers deleted
Š Len Bass 2019 38
First there were three servers
Servers
Load
Balancer
Š Len Bass 2019 39
Now there are four
Issues:
• What makes the decision to add a new server?
• How does the new server get loaded with software?
• How does the load balancer know about the new server?
Š Len Bass 2019 40
Making the decision
Monitor
Each server reports its CPU usage to a monitor
Monitor has a collection of rules to determine whether to
add new server. E.g. one server is over 80% utilization for
15 minutes
Š Len Bass 2019 41
When a new instance is created, it is
created based on a launch
configuration, security groups, etc
Loading the new server with
software
Configuration
specifications
Š Len Bass 2019 42
Making the load balancer new
server aware
The new server is registered with a load
balancer by the monitor.
Š Len Bass 2019 43
Overview
• Autoscaling
• Distributed Coordination
• Atomicity
• Locking
• Consensus algorithms
Š Len Bass 2019 44
Atomicity
• Some operations must be done atomically (or appear
to be atomic).
• Race conditions result from non atomic operations.
• Data base updates may result in inconsistent data
if not atomic
• Resources (e.g. memory) may be overwritten if not
atomic
• Interrupt handlers must be atomic
• etc
Š Len Bass 2019 45
How do I achieve (or appear to
achieve) atomicity?
• Time stamps?
• Many protocols involve putting a time stamp on
messages for error detection and ordering
purposes.
• Time stamps are often used to identify log
messages used for debugging problems.
• Couldn’t time stamps be used to order activities?
• Problem is that clocks drift and so clock time may be
different on different computers.
Š Len Bass 2019 46
Synchronizing clocks over networks
• Suppose two different computers are connected via a
network. How do they synchronize their clocks?
• If one computer sends its time reading to another, it takes
time for the message to arrive.
• NTP (Network Time Protocol) can be used to synchronize
time on a collection of computers.
• Accurate to around 1 millisecond in local area networks
• Accurate to around 10 milliseconds over public internet
• Congestion can cause errors of 100 milliseconds or
more.
Š Len Bass 2019 47
Suppose NTP is insufficiently
accurate
• Financial industry is spending 100s of millions of dollars
to reduce latency between Chicago and New York by 3
milliseconds.
• Well within error range of NTP
• GPS time is accurate within
• 14 nanoseconds (theoretically)
• 100 nanoseconds (mostly)Atomi
• Timestamp messages with GPS time
• Used by electric companies to measure phase angle
• Atomic clocks
• Used by Google to coordinate time across all of their
distributed systems.
Š Len Bass 2019 48
How about using locks?
• One technique to synchronize is to lock critical resources.
• Can lead to deadlock – two processes waiting for each
other to release critical resources
• Process one gets a lock on row 1 of a data base
• Process two gets a lock on row 2.
• Process one waits for process 2 to release its lock on
row 2
• Process two waits for process 1 to release its lock on
row 1
• No progress.
Š Len Bass 2019 49
More problems with locks
• Locks are logical structures maintained in software
or in persistent storage.
• Getting a lock across distributed systems is not an
atomic operation.
• It is possible that while requesting a lock another
process can acquire the lock. This can go on for a
long time (it is called livelock if there is no
possibility of ever acquiring a lock)
• Suppose the virtual machine holding the lock fails.
Then the owner of the lock can never release it.
Š Len Bass 2019 50
Locks in distributed systems
• Utilizing locks in a distributed system has a more
fundamental problem.
• It takes time to send a message (create a lock, for
example) from one computer to another.
Š Len Bass 2019 51
How much time do operations
in distributed systems take
• Main memory reference 100 ns
• Send 1K bytes over 1 Gbps network 0.01 ms
• Read 4K randomly from SSD. 15 ms
• Read 1 MB sequentially from memory 0.25 ms
• Round trip within same datacenter 0.5 ms
• Read 1 MB sequentially from SSD 1 ms
(4X memory)
• Disk seek
10 ms
(20x datacenter roundtrip)
• Read 1 MB sequentially from disk 20 ms
(80x memory, 20X SSD)
• Send packet CA->Netherlands->CA 150 ms
* dean-keynote-ladis2009_scalable_distributed_google_system
Š Len Bass 2019 52
Implications of latency
numbers
• State stored in persistent storage (disk or
SSD) will take longer to fetch than state
stored in memory.
• State stored in a different datacenter will take
longer to access than state stored locally,
especially across continents.
• Persistent store is typically replicated both for
performance (latency) reasons and for
availability (failure) reasons.
• => keeping data consistent across different
occurrences is important but difficult.
Š Len Bass 2019 53
Is there a solution?
• The general problem is that you want to manage
synchronization of data across a distributed set of
servers where up to half of the servers can fail.
• Paxos is a family of algorithms that use consensus
to manage concurrency. Complicated and difficult
to implement.
• An example of the implementation difficulty
• Choose one server as the master that keeps the
“authoritative” state.
• Now that server fails. Need to
• Find a new master
• Make sure it is up to data with the authoritative state.
Š Len Bass 2019 54
Luckily
• Several open source systems are now available that
• Implement Paxos or an alternative consensus
algorithm
• Are reasonably easy to use.
• Sample systems
• Zookeeper
• Consul
• etcd
Š Len Bass 2019 55
Zookeeper as an example
• Zookeeper provides a guaranteed consistent
(mostly) data structure for every instance of a
distributed application.
• Definition of “mostly” is within eventual
consistency lag (but this is small).
• Zookeeper keeps data in memory. This is why it
is fast.
• Zookeeper deals with managing failure as well as
consistency.
• Done using consensus algorithm.
Š Len Bass 2019 56
Zookeeper znode structure
/
<data>
/b1
<data>
/b1/c1
<data>
/b1/c2
<data>
/b2
<data>
/b2/c1
<data>
Š Len Bass 2019 57
Zookeeper API
Function Type
create write
delete write
Exists read
Get children Read
Get data Read
Set data write
+ others
• All calls return atomic views of state – either succeed or
fail. No partial state returned. Writes also are atomic.
Either succeed or fail. If they fail, no side effects.
Š Len Bass 2019 58
Example – Lock creation
• Locks have names.
• Client N
• Create Lock1
• If success then client owns lock.
• If failure, then it is placed on waitlist for lock 1 and control is
returned to Client N
• When client finishes, it deletes Lock1
• If there is a waitlist, then those clients are informed
about Lock 1 deletion and they try again to create
Lock 1
• If Client N fails, Zookeeper will delete Lock 1
and waitlist clients are informed of deletion
Š Len Bass 2019 59
Other use cases
• Leader election
• Groups membership
• Synchronization
• Configuration
Š Len Bass 2019 60
Summary
• Autoscaling creates new instances based on utilization
of existing instances
• Synchronization of data across a distributed system id
complicated
• It is difficult because of
• Timing delays across distributed systems
• Possible failures of nodes of various types
• Open source systems implement complicated
consensus algorithms in a fashion that is relatively
easy to use.

Weitere ähnliche Inhalte

Was ist angesagt?

11 secure development
11  secure development 11  secure development
11 secure development Len Bass
 
5 infrastructure security
5 infrastructure security5 infrastructure security
5 infrastructure securityLen Bass
 
Evolution of unix environments and the road to faster deployments
Evolution of unix environments and the road to faster deploymentsEvolution of unix environments and the road to faster deployments
Evolution of unix environments and the road to faster deploymentsRakuten Group, Inc.
 
Transforming Legacy Applications Into Dynamically Scalable Web Services
Transforming Legacy Applications Into Dynamically Scalable Web ServicesTransforming Legacy Applications Into Dynamically Scalable Web Services
Transforming Legacy Applications Into Dynamically Scalable Web ServicesAdam Takvam
 
DevCon13 System Administration Basics
DevCon13 System Administration BasicsDevCon13 System Administration Basics
DevCon13 System Administration Basicssysnickm
 
Multi-Cloud Global Server Load Balancing (GSLB)
Multi-Cloud Global Server Load Balancing (GSLB)Multi-Cloud Global Server Load Balancing (GSLB)
Multi-Cloud Global Server Load Balancing (GSLB)Avi Networks
 
VMworld 2013: VMware Horizon Mirage Image Deployment Deep Dive
VMworld 2013: VMware Horizon Mirage Image Deployment Deep DiveVMworld 2013: VMware Horizon Mirage Image Deployment Deep Dive
VMworld 2013: VMware Horizon Mirage Image Deployment Deep DiveVMworld
 
XPDDS19 Keynote: Secret-free Hypervisor: Now and Future - Wei Liu, Software E...
XPDDS19 Keynote: Secret-free Hypervisor: Now and Future - Wei Liu, Software E...XPDDS19 Keynote: Secret-free Hypervisor: Now and Future - Wei Liu, Software E...
XPDDS19 Keynote: Secret-free Hypervisor: Now and Future - Wei Liu, Software E...The Linux Foundation
 
Pragmatic Container Security (Sponsored by Trend Micro) - AWS Summit Sydney
Pragmatic Container Security (Sponsored by Trend Micro) - AWS Summit SydneyPragmatic Container Security (Sponsored by Trend Micro) - AWS Summit Sydney
Pragmatic Container Security (Sponsored by Trend Micro) - AWS Summit SydneyAmazon Web Services
 
Highly available cloud_foundry
Highly available cloud_foundryHighly available cloud_foundry
Highly available cloud_foundryHenry Sinclair
 
SDN in the Public Cloud: Windows Azure
SDN in the Public Cloud: Windows AzureSDN in the Public Cloud: Windows Azure
SDN in the Public Cloud: Windows AzureOpen Networking Summits
 
V mware thin app 4.5 what_s new presentation
V mware thin app 4.5 what_s new presentationV mware thin app 4.5 what_s new presentation
V mware thin app 4.5 what_s new presentationsolarisyourep
 
[WSO2Con EU 2017] Jump to the Next Curve with DevOps
[WSO2Con EU 2017] Jump to the Next Curve with DevOps[WSO2Con EU 2017] Jump to the Next Curve with DevOps
[WSO2Con EU 2017] Jump to the Next Curve with DevOpsWSO2
 
Latency - The King of the Mobile Experience
Latency - The King of the Mobile Experience Latency - The King of the Mobile Experience
Latency - The King of the Mobile Experience WardTechTalent
 
SAP ASE Migration Lessons Learned
SAP ASE Migration Lessons LearnedSAP ASE Migration Lessons Learned
SAP ASE Migration Lessons LearnedAliter Consulting
 
PMIx: Bridging the Container Boundary
PMIx: Bridging the Container BoundaryPMIx: Bridging the Container Boundary
PMIx: Bridging the Container Boundaryrcastain
 

Was ist angesagt? (20)

11 secure development
11  secure development 11  secure development
11 secure development
 
5 infrastructure security
5 infrastructure security5 infrastructure security
5 infrastructure security
 
Evolution of unix environments and the road to faster deployments
Evolution of unix environments and the road to faster deploymentsEvolution of unix environments and the road to faster deployments
Evolution of unix environments and the road to faster deployments
 
Transforming Legacy Applications Into Dynamically Scalable Web Services
Transforming Legacy Applications Into Dynamically Scalable Web ServicesTransforming Legacy Applications Into Dynamically Scalable Web Services
Transforming Legacy Applications Into Dynamically Scalable Web Services
 
DevCon13 System Administration Basics
DevCon13 System Administration BasicsDevCon13 System Administration Basics
DevCon13 System Administration Basics
 
Chapter08
Chapter08Chapter08
Chapter08
 
Bluetube
BluetubeBluetube
Bluetube
 
Chapter09
Chapter09Chapter09
Chapter09
 
Multi-Cloud Global Server Load Balancing (GSLB)
Multi-Cloud Global Server Load Balancing (GSLB)Multi-Cloud Global Server Load Balancing (GSLB)
Multi-Cloud Global Server Load Balancing (GSLB)
 
Load balancing
Load balancingLoad balancing
Load balancing
 
VMworld 2013: VMware Horizon Mirage Image Deployment Deep Dive
VMworld 2013: VMware Horizon Mirage Image Deployment Deep DiveVMworld 2013: VMware Horizon Mirage Image Deployment Deep Dive
VMworld 2013: VMware Horizon Mirage Image Deployment Deep Dive
 
XPDDS19 Keynote: Secret-free Hypervisor: Now and Future - Wei Liu, Software E...
XPDDS19 Keynote: Secret-free Hypervisor: Now and Future - Wei Liu, Software E...XPDDS19 Keynote: Secret-free Hypervisor: Now and Future - Wei Liu, Software E...
XPDDS19 Keynote: Secret-free Hypervisor: Now and Future - Wei Liu, Software E...
 
Pragmatic Container Security (Sponsored by Trend Micro) - AWS Summit Sydney
Pragmatic Container Security (Sponsored by Trend Micro) - AWS Summit SydneyPragmatic Container Security (Sponsored by Trend Micro) - AWS Summit Sydney
Pragmatic Container Security (Sponsored by Trend Micro) - AWS Summit Sydney
 
Highly available cloud_foundry
Highly available cloud_foundryHighly available cloud_foundry
Highly available cloud_foundry
 
SDN in the Public Cloud: Windows Azure
SDN in the Public Cloud: Windows AzureSDN in the Public Cloud: Windows Azure
SDN in the Public Cloud: Windows Azure
 
V mware thin app 4.5 what_s new presentation
V mware thin app 4.5 what_s new presentationV mware thin app 4.5 what_s new presentation
V mware thin app 4.5 what_s new presentation
 
[WSO2Con EU 2017] Jump to the Next Curve with DevOps
[WSO2Con EU 2017] Jump to the Next Curve with DevOps[WSO2Con EU 2017] Jump to the Next Curve with DevOps
[WSO2Con EU 2017] Jump to the Next Curve with DevOps
 
Latency - The King of the Mobile Experience
Latency - The King of the Mobile Experience Latency - The King of the Mobile Experience
Latency - The King of the Mobile Experience
 
SAP ASE Migration Lessons Learned
SAP ASE Migration Lessons LearnedSAP ASE Migration Lessons Learned
SAP ASE Migration Lessons Learned
 
PMIx: Bridging the Container Boundary
PMIx: Bridging the Container BoundaryPMIx: Bridging the Container Boundary
PMIx: Bridging the Container Boundary
 

Ähnlich wie 3 the cloud

clustering and load balancing
clustering and load balancingclustering and load balancing
clustering and load balancingPrabhat gangwar
 
High Availability in the Cloud - Architectural Best Practices
High Availability in the Cloud - Architectural Best PracticesHigh Availability in the Cloud - Architectural Best Practices
High Availability in the Cloud - Architectural Best PracticesRightScale
 
Release it! - Takeaways
Release it! - TakeawaysRelease it! - Takeaways
Release it! - TakeawaysManuela Grindei
 
More Nines for Your Dimes: Improving Availability and Lowering Costs using Au...
More Nines for Your Dimes: Improving Availability and Lowering Costs using Au...More Nines for Your Dimes: Improving Availability and Lowering Costs using Au...
More Nines for Your Dimes: Improving Availability and Lowering Costs using Au...Amazon Web Services
 
The Need of Cloud-Native Application
The Need of Cloud-Native ApplicationThe Need of Cloud-Native Application
The Need of Cloud-Native ApplicationEmiliano Pecis
 
The impact of cloud NSBCon NY by Yves Goeleven
The impact of cloud NSBCon NY by Yves GoelevenThe impact of cloud NSBCon NY by Yves Goeleven
The impact of cloud NSBCon NY by Yves GoelevenParticular Software
 
LOAD BALANCING ALGORITHMS
LOAD BALANCING ALGORITHMSLOAD BALANCING ALGORITHMS
LOAD BALANCING ALGORITHMStanmayshah95
 
Architecting for the cloud scability-availability
Architecting for the cloud scability-availabilityArchitecting for the cloud scability-availability
Architecting for the cloud scability-availabilityLen Bass
 
More Nines for Your Dimes: Improving Availability and Lowering Costs using Au...
More Nines for Your Dimes: Improving Availability and Lowering Costs using Au...More Nines for Your Dimes: Improving Availability and Lowering Costs using Au...
More Nines for Your Dimes: Improving Availability and Lowering Costs using Au...Amazon Web Services
 
Microservices Development - ICP Workshop Batch II
Microservices Development - ICP Workshop Batch IIMicroservices Development - ICP Workshop Batch II
Microservices Development - ICP Workshop Batch IIPT Datacomm Diangraha
 
AWS Summit London 2014 | Introduction to Amazon EC2 (100)
AWS Summit London 2014 | Introduction to Amazon EC2 (100)AWS Summit London 2014 | Introduction to Amazon EC2 (100)
AWS Summit London 2014 | Introduction to Amazon EC2 (100)Amazon Web Services
 
EC2 BY RASHMI GR.pptx
EC2  BY RASHMI GR.pptxEC2  BY RASHMI GR.pptx
EC2 BY RASHMI GR.pptxRashmiR685840
 
Architecting for the cloud elasticity security
Architecting for the cloud elasticity securityArchitecting for the cloud elasticity security
Architecting for the cloud elasticity securityLen Bass
 
Building a highly scalable and available cloud application
Building a highly scalable and available cloud applicationBuilding a highly scalable and available cloud application
Building a highly scalable and available cloud applicationNoam Sheffer
 
Virtualizing Tier One Applications - Varrow
Virtualizing Tier One Applications - VarrowVirtualizing Tier One Applications - Varrow
Virtualizing Tier One Applications - VarrowAndrew Miller
 
Probabilistic consolidation of virtual machines in self organizing cloud data...
Probabilistic consolidation of virtual machines in self organizing cloud data...Probabilistic consolidation of virtual machines in self organizing cloud data...
Probabilistic consolidation of virtual machines in self organizing cloud data...WMLab,NCU
 
Load Balancing in Cloud Computing.pptx
Load Balancing in Cloud Computing.pptxLoad Balancing in Cloud Computing.pptx
Load Balancing in Cloud Computing.pptxPradipPoudel4
 
Introduction to Amazon EC2
Introduction to Amazon EC2Introduction to Amazon EC2
Introduction to Amazon EC2Amazon Web Services
 

Ähnlich wie 3 the cloud (20)

clustering and load balancing
clustering and load balancingclustering and load balancing
clustering and load balancing
 
High Availability in the Cloud - Architectural Best Practices
High Availability in the Cloud - Architectural Best PracticesHigh Availability in the Cloud - Architectural Best Practices
High Availability in the Cloud - Architectural Best Practices
 
Release it! - Takeaways
Release it! - TakeawaysRelease it! - Takeaways
Release it! - Takeaways
 
More Nines for Your Dimes: Improving Availability and Lowering Costs using Au...
More Nines for Your Dimes: Improving Availability and Lowering Costs using Au...More Nines for Your Dimes: Improving Availability and Lowering Costs using Au...
More Nines for Your Dimes: Improving Availability and Lowering Costs using Au...
 
The Need of Cloud-Native Application
The Need of Cloud-Native ApplicationThe Need of Cloud-Native Application
The Need of Cloud-Native Application
 
The impact of cloud NSBCon NY by Yves Goeleven
The impact of cloud NSBCon NY by Yves GoelevenThe impact of cloud NSBCon NY by Yves Goeleven
The impact of cloud NSBCon NY by Yves Goeleven
 
Cloud computing
Cloud computingCloud computing
Cloud computing
 
LOAD BALANCING ALGORITHMS
LOAD BALANCING ALGORITHMSLOAD BALANCING ALGORITHMS
LOAD BALANCING ALGORITHMS
 
Architecting for the cloud scability-availability
Architecting for the cloud scability-availabilityArchitecting for the cloud scability-availability
Architecting for the cloud scability-availability
 
More Nines for Your Dimes: Improving Availability and Lowering Costs using Au...
More Nines for Your Dimes: Improving Availability and Lowering Costs using Au...More Nines for Your Dimes: Improving Availability and Lowering Costs using Au...
More Nines for Your Dimes: Improving Availability and Lowering Costs using Au...
 
Microservices Development - ICP Workshop Batch II
Microservices Development - ICP Workshop Batch IIMicroservices Development - ICP Workshop Batch II
Microservices Development - ICP Workshop Batch II
 
AWS Summit London 2014 | Introduction to Amazon EC2 (100)
AWS Summit London 2014 | Introduction to Amazon EC2 (100)AWS Summit London 2014 | Introduction to Amazon EC2 (100)
AWS Summit London 2014 | Introduction to Amazon EC2 (100)
 
EC2 BY RASHMI GR.pptx
EC2  BY RASHMI GR.pptxEC2  BY RASHMI GR.pptx
EC2 BY RASHMI GR.pptx
 
Architecting for the cloud elasticity security
Architecting for the cloud elasticity securityArchitecting for the cloud elasticity security
Architecting for the cloud elasticity security
 
Building a highly scalable and available cloud application
Building a highly scalable and available cloud applicationBuilding a highly scalable and available cloud application
Building a highly scalable and available cloud application
 
Virtualizing Tier One Applications - Varrow
Virtualizing Tier One Applications - VarrowVirtualizing Tier One Applications - Varrow
Virtualizing Tier One Applications - Varrow
 
Probabilistic consolidation of virtual machines in self organizing cloud data...
Probabilistic consolidation of virtual machines in self organizing cloud data...Probabilistic consolidation of virtual machines in self organizing cloud data...
Probabilistic consolidation of virtual machines in self organizing cloud data...
 
Load Balancing in Cloud Computing.pptx
Load Balancing in Cloud Computing.pptxLoad Balancing in Cloud Computing.pptx
Load Balancing in Cloud Computing.pptx
 
02-WhyCloud.pdf
02-WhyCloud.pdf02-WhyCloud.pdf
02-WhyCloud.pdf
 
Introduction to Amazon EC2
Introduction to Amazon EC2Introduction to Amazon EC2
Introduction to Amazon EC2
 

Mehr von Len Bass

Devops syllabus
Devops syllabusDevops syllabus
Devops syllabusLen Bass
 
DevOps Syllabus summer 2020
DevOps Syllabus summer 2020DevOps Syllabus summer 2020
DevOps Syllabus summer 2020Len Bass
 
Quantum talk
Quantum talkQuantum talk
Quantum talkLen Bass
 
Icsa2018 blockchain tutorial
Icsa2018 blockchain tutorialIcsa2018 blockchain tutorial
Icsa2018 blockchain tutorialLen Bass
 
Experience in teaching devops
Experience in teaching devopsExperience in teaching devops
Experience in teaching devopsLen Bass
 
Understanding blockchains
Understanding blockchainsUnderstanding blockchains
Understanding blockchainsLen Bass
 
What is a blockchain
What is a blockchainWhat is a blockchain
What is a blockchainLen Bass
 
Dev ops and safety critical systems
Dev ops and safety critical systemsDev ops and safety critical systems
Dev ops and safety critical systemsLen Bass
 
My first deployment pipeline
My first deployment pipelineMy first deployment pipeline
My first deployment pipelineLen Bass
 
Packaging tool options
Packaging tool optionsPackaging tool options
Packaging tool optionsLen Bass
 
Introduction to dev ops
Introduction to dev opsIntroduction to dev ops
Introduction to dev opsLen Bass
 
Securing deployment pipeline
Securing deployment pipelineSecuring deployment pipeline
Securing deployment pipelineLen Bass
 
Deployability
DeployabilityDeployability
DeployabilityLen Bass
 
Architecture for the cloud deployment case study future
Architecture for the cloud deployment case study futureArchitecture for the cloud deployment case study future
Architecture for the cloud deployment case study futureLen Bass
 
Architecting for the cloud cloud providers
Architecting for the cloud cloud providersArchitecting for the cloud cloud providers
Architecting for the cloud cloud providersLen Bass
 
Architecting for the cloud storage build test
Architecting for the cloud storage build testArchitecting for the cloud storage build test
Architecting for the cloud storage build testLen Bass
 
Architecting for the cloud map reduce creating
Architecting for the cloud   map reduce creatingArchitecting for the cloud   map reduce creating
Architecting for the cloud map reduce creatingLen Bass
 
Architecting for the cloud storage misc topics
Architecting for the cloud storage misc topicsArchitecting for the cloud storage misc topics
Architecting for the cloud storage misc topicsLen Bass
 

Mehr von Len Bass (18)

Devops syllabus
Devops syllabusDevops syllabus
Devops syllabus
 
DevOps Syllabus summer 2020
DevOps Syllabus summer 2020DevOps Syllabus summer 2020
DevOps Syllabus summer 2020
 
Quantum talk
Quantum talkQuantum talk
Quantum talk
 
Icsa2018 blockchain tutorial
Icsa2018 blockchain tutorialIcsa2018 blockchain tutorial
Icsa2018 blockchain tutorial
 
Experience in teaching devops
Experience in teaching devopsExperience in teaching devops
Experience in teaching devops
 
Understanding blockchains
Understanding blockchainsUnderstanding blockchains
Understanding blockchains
 
What is a blockchain
What is a blockchainWhat is a blockchain
What is a blockchain
 
Dev ops and safety critical systems
Dev ops and safety critical systemsDev ops and safety critical systems
Dev ops and safety critical systems
 
My first deployment pipeline
My first deployment pipelineMy first deployment pipeline
My first deployment pipeline
 
Packaging tool options
Packaging tool optionsPackaging tool options
Packaging tool options
 
Introduction to dev ops
Introduction to dev opsIntroduction to dev ops
Introduction to dev ops
 
Securing deployment pipeline
Securing deployment pipelineSecuring deployment pipeline
Securing deployment pipeline
 
Deployability
DeployabilityDeployability
Deployability
 
Architecture for the cloud deployment case study future
Architecture for the cloud deployment case study futureArchitecture for the cloud deployment case study future
Architecture for the cloud deployment case study future
 
Architecting for the cloud cloud providers
Architecting for the cloud cloud providersArchitecting for the cloud cloud providers
Architecting for the cloud cloud providers
 
Architecting for the cloud storage build test
Architecting for the cloud storage build testArchitecting for the cloud storage build test
Architecting for the cloud storage build test
 
Architecting for the cloud map reduce creating
Architecting for the cloud   map reduce creatingArchitecting for the cloud   map reduce creating
Architecting for the cloud map reduce creating
 
Architecting for the cloud storage misc topics
Architecting for the cloud storage misc topicsArchitecting for the cloud storage misc topics
Architecting for the cloud storage misc topics
 

KĂźrzlich hochgeladen

The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...ICS
 
Microsoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdfMicrosoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdfWilly Marroquin (WillyDevNET)
 
5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdfWave PLM
 
How To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.jsHow To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.jsAndolasoft Inc
 
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...harshavardhanraghave
 
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️Delhi Call girls
 
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...MyIntelliSource, Inc.
 
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsUnveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsAlberto GonzĂĄlez Trastoy
 
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...Health
 
Software Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsSoftware Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsArshad QA
 
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️anilsa9823
 
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...Steffen Staab
 
TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providermohitmore19
 
CALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female service
CALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female serviceCALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female service
CALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female serviceanilsa9823
 
Right Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsRight Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsJhone kinadey
 
Optimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVOptimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVshikhaohhpro
 
Unlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language ModelsUnlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language Modelsaagamshah0812
 
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AISyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AIABDERRAOUF MEHENNI
 
Diamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with PrecisionDiamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with PrecisionSolGuruz
 
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...OnePlan Solutions
 

KĂźrzlich hochgeladen (20)

The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
 
Microsoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdfMicrosoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdf
 
5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf
 
How To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.jsHow To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.js
 
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
 
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
 
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
 
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsUnveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
 
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
 
Software Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsSoftware Quality Assurance Interview Questions
Software Quality Assurance Interview Questions
 
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️
 
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
 
TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service provider
 
CALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female service
CALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female serviceCALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female service
CALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female service
 
Right Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsRight Money Management App For Your Financial Goals
Right Money Management App For Your Financial Goals
 
Optimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVOptimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTV
 
Unlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language ModelsUnlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language Models
 
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AISyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
 
Diamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with PrecisionDiamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with Precision
 
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
 

3 the cloud

  • 1. DOSE: Deployment and Operations for Software Engineers The Cloud
  • 2. Š Len Bass 2019 2 Overview • Structure • Failure in the Cloud • Scaling Service Capacity
  • 3. Š Len Bass 2019 3 Data centers in the cloud • A cloud provider (AWS, Google, Microsoft) maintains data centers around the world. • Each data center has ~100,000 computers. • Limited by power and cooling considerations.
  • 4. Š Len Bass 2019 4 Data Center
  • 5. Š Len Bass 2019 5 Organization of data centers • Each data center has independent power supply, independent fire control, independent security, etc • Data centers are collected into availability zones and availability zones are collected into geographic regions. • AWS currently has 20 regions each with several availability zones.
  • 6. Š Len Bass 2019 6 Allocating a virtual machine in AWS - 1 • A user wishes to allocate a virtual machine in AWS • The user specifies • A region • Availability zone • Image to load into virtual machine • …
  • 7. Š Len Bass 2019 7 Allocating a virtual machine in AWS - 2 • AWS management software • Finds a server in that region and availability zone with spare capacity • Allocates a virtual machine in that server • Assigns IP address to that virtual machine (public) • Loads image into that virtual machine • VM can then send and receive messages
  • 8. Š Len Bass 2019 8 Overview • Structure • Failure in the Cloud • Scaling Service Capacity
  • 9. Š Len Bass 2019 9 A distributed system is one in which the failure of a computer you didn't even know existed can render your own computer unusable. Leslie Lamport
  • 10. Š Len Bass 2019 10 Failures in the cloud • Cloud failures large and small • The Long Tail • Techniques for dealing with the long tail
  • 11. Š Len Bass 2019 11 Sometimes the whole cloud fails Selected Cloud Outages - 2018 • Feb 15, Google Cloud Datastore • March 2, AWS Eastern region networking problems • April 6, Microsoft Office 365. Users locked out of their accounts • May 31, AWS Eastern Region hardware failure caused by “power event” • June 17, Microsoft Azure. Heatwave in Europe and malfunctioning air conditioning • July 17, Google. Load balancer issue affected services globally • July 16, Amazon Prime. Not AWS but Amazon sales • Sept 5. Microsoft Office 365. Botched update to Azure authentication
  • 12. Š Len Bass 2019 12 And sometimes just a part of it fails …
  • 13. Š Len Bass 2019 13 A year in the life of a Google datacenter • Typical first year for a new cluster: • ~0.5 overheating (power down most machines in <5 mins, ~1-2 days to recover) • ~1 PDU failure (~500-1000 machines suddenly disappear, ~6 hours to come back) • ~20 rack failures (40-80 machines instantly disappear, 1-6 hours to get back) • ~5 racks go wonky (40-80 machines see 50% packetloss) • ~8 network maintenances (4 might cause ~30-minute random connectivity losses) • ~12 router reloads (takes out DNS and external vips for a couple minutes) • ~3 router failures (have to immediately pull traffic for an hour) • ~dozens of minor 30-second blips for dns • ~1000 individual machine failures • ~thousands of hard drive failures slow disks, bad memory, misconfigured machines, flaky machines, dead horses, …
  • 14. Š Len Bass 2019 14 Amazon failure statistics • In a data center with ~64,000 servers with 2 disks each ~5 servers and ~17 disks fail every day.
  • 15. Š Len Bass 2019 15 Failures in the cloud • Cloud failures large and small • The Long Tail • Techniques for dealing with the long tail
  • 16. Š Len Bass 2019 16 Short digression into probability • A distribution describes the probability than any given reading will have a particular value. • Many phenomenon in nature are “normally distributed”. • Most values will cluster around the mean with progressively smaller numbers of values going toward the edges. • In a normal distribution the mean is equal to the median
  • 17. Š Len Bass 2019 17 • In a long tail distribution, there are some values far from the mean. • These values are sufficient to influence the mean. • The mean and the median are dramatically different in a long tail distribution. Long Tail
  • 18. Š Len Bass 2019 18 • If there is a partial failure of the cloud some activities will take a long time to complete and exhibit a long tail. • The figure shows distribution of 1000 AWS “launch instance” calls. • 4.5% of calls were “long tail” What does this mean?
  • 19. Š Len Bass 2019 19 Failures in the cloud • Cloud failures large and small • The Long Tail • Techniques for dealing with the long tail
  • 20. Š Len Bass 2019 20 What can you do to prevent long tail problems? • Recognize that failure has occurred • Recover from failure • Mask failure
  • 21. Š Len Bass 2019 21 Recognizing failure • In a distributed system, the only communication between distinct computers is through messages. • A request (message) to another machine goes unanswered. • The failure is recognized by the requestor making the request asynchronously and setting a timeout.
  • 22. Š Len Bass 2019 22 Health Check • Health check. Requires one machine to act as a monitor for other machines. • Machine sends a message to a monitor periodically saying “I am alive” • Ping/echo. Monitor sends a message to the machine being monitored and expects a reply within specified period.
  • 23. Š Len Bass 2019 23 Recovering from failure • Retry • Instantiate another instance of failed service • Graceful degradation • Set “circuit breaker” to prevent trying to use service again. • Netflix Open Source Hystrix library provides some useful functions
  • 24. Š Len Bass 2019 24 Masking failure • “Hedged” request. Suppose you wish to launch 10 instances. Issue 11 requests. Terminate the request that has not completed when 10 are completed. • “Alternative” request. In the above scenario, issue 10 requests. When 8 requests have completed issue 2 more. Cancel the last 2 to respond. • Using these techniques reduces the time of the longest of the 1000 launch instance requests from 202 sec to 51 sec.
  • 25. Š Len Bass 2019 25 Overview • Structure • Failure in the Cloud • Scaling Service Capacity
  • 26. Š Len Bass 2019 26 Load Balancer • One server may not suffice for all of the requests for a given service. • Have multiple servers supplying the same service. • Use “load balancer” to distribute requests. • Server is registered with load balancer upon initialization • Load balancer monitors health of servers and knows which ones are healthy. • Load balancer IP is returned from DNS server when client requests URL of service.
  • 27. Š Len Bass 2019 27 Message sequence – client makes a request Servers Clients Load Balance r
  • 28. Š Len Bass 2019 28 Message sequence- request arrives at load balancer Servers Clients Load Balancer
  • 29. Š Len Bass 2019 29 Message sequence – request is send to one server Servers Clients Load Balance r
  • 30. Š Len Bass 2019 30 Message sequence – reply goes directly back to sender Servers Clients Load Balance r
  • 31. Š Len Bass 2019 31 Note IP manipulation • Server always sends message back to what it thinks is the sender. • Load balancer changes destination IP but not the source. Then reply goes directly back to client • Load balancer (now acting as a proxy) can change origin as well. In this case, reply goes back to load balancer which must change destination (of reply) back to original client.
  • 32. Š Len Bass 2019 32 Routing algorithms • Load balancers use variety of algorithms to choose instance for message • Round robin. Rotate requests evenly • Weighted round robin. Rotate requests according to some weighting. • Hashing – IP address of source to determine instance. Means that a request from a particular client always sent to same instance as long as it is still in service. • Note that these algorithms do not require knowledge of an instance’s load.
  • 33. Š Len Bass 2019 33 Suppose Load Balancer Becomes Overloaded – Load Balance the Load Balancers
  • 34. Š Len Bass 2019 34 Combining pictures • IP payload address is modified multiple times before message gets to server • Return message from instance to client may go through the gateway (proxy) or may go directly back to the client. Cloud gatew Client
  • 35. Š Len Bass 2019 35 Summary • The cloud consists of regions with availability zones • A new VM is placed in an availability zone in a region • Instances in the cloud can fail. • Must be detected • Long tail is a cloud phenomena that developers must be aware of. • Load balancers distribute requests among identical servers.
  • 36. Š Len Bass 2019 36 Overview • Autoscaling • Distributed Coordination
  • 37. Š Len Bass 2019 37 Suppose servers become overloaded • As load grows, existing resources may not be sufficient. • Autoscaling is a mechanism for creating new instances of a server. • Set up a collection of rules that determine • Under what conditions are new servers added • Under what conditions are servers deleted
  • 38. Š Len Bass 2019 38 First there were three servers Servers Load Balancer
  • 39. Š Len Bass 2019 39 Now there are four Issues: • What makes the decision to add a new server? • How does the new server get loaded with software? • How does the load balancer know about the new server?
  • 40. Š Len Bass 2019 40 Making the decision Monitor Each server reports its CPU usage to a monitor Monitor has a collection of rules to determine whether to add new server. E.g. one server is over 80% utilization for 15 minutes
  • 41. Š Len Bass 2019 41 When a new instance is created, it is created based on a launch configuration, security groups, etc Loading the new server with software Configuration specifications
  • 42. Š Len Bass 2019 42 Making the load balancer new server aware The new server is registered with a load balancer by the monitor.
  • 43. Š Len Bass 2019 43 Overview • Autoscaling • Distributed Coordination • Atomicity • Locking • Consensus algorithms
  • 44. Š Len Bass 2019 44 Atomicity • Some operations must be done atomically (or appear to be atomic). • Race conditions result from non atomic operations. • Data base updates may result in inconsistent data if not atomic • Resources (e.g. memory) may be overwritten if not atomic • Interrupt handlers must be atomic • etc
  • 45. Š Len Bass 2019 45 How do I achieve (or appear to achieve) atomicity? • Time stamps? • Many protocols involve putting a time stamp on messages for error detection and ordering purposes. • Time stamps are often used to identify log messages used for debugging problems. • Couldn’t time stamps be used to order activities? • Problem is that clocks drift and so clock time may be different on different computers.
  • 46. Š Len Bass 2019 46 Synchronizing clocks over networks • Suppose two different computers are connected via a network. How do they synchronize their clocks? • If one computer sends its time reading to another, it takes time for the message to arrive. • NTP (Network Time Protocol) can be used to synchronize time on a collection of computers. • Accurate to around 1 millisecond in local area networks • Accurate to around 10 milliseconds over public internet • Congestion can cause errors of 100 milliseconds or more.
  • 47. Š Len Bass 2019 47 Suppose NTP is insufficiently accurate • Financial industry is spending 100s of millions of dollars to reduce latency between Chicago and New York by 3 milliseconds. • Well within error range of NTP • GPS time is accurate within • 14 nanoseconds (theoretically) • 100 nanoseconds (mostly)Atomi • Timestamp messages with GPS time • Used by electric companies to measure phase angle • Atomic clocks • Used by Google to coordinate time across all of their distributed systems.
  • 48. Š Len Bass 2019 48 How about using locks? • One technique to synchronize is to lock critical resources. • Can lead to deadlock – two processes waiting for each other to release critical resources • Process one gets a lock on row 1 of a data base • Process two gets a lock on row 2. • Process one waits for process 2 to release its lock on row 2 • Process two waits for process 1 to release its lock on row 1 • No progress.
  • 49. Š Len Bass 2019 49 More problems with locks • Locks are logical structures maintained in software or in persistent storage. • Getting a lock across distributed systems is not an atomic operation. • It is possible that while requesting a lock another process can acquire the lock. This can go on for a long time (it is called livelock if there is no possibility of ever acquiring a lock) • Suppose the virtual machine holding the lock fails. Then the owner of the lock can never release it.
  • 50. Š Len Bass 2019 50 Locks in distributed systems • Utilizing locks in a distributed system has a more fundamental problem. • It takes time to send a message (create a lock, for example) from one computer to another.
  • 51. Š Len Bass 2019 51 How much time do operations in distributed systems take • Main memory reference 100 ns • Send 1K bytes over 1 Gbps network 0.01 ms • Read 4K randomly from SSD. 15 ms • Read 1 MB sequentially from memory 0.25 ms • Round trip within same datacenter 0.5 ms • Read 1 MB sequentially from SSD 1 ms (4X memory) • Disk seek 10 ms (20x datacenter roundtrip) • Read 1 MB sequentially from disk 20 ms (80x memory, 20X SSD) • Send packet CA->Netherlands->CA 150 ms * dean-keynote-ladis2009_scalable_distributed_google_system
  • 52. Š Len Bass 2019 52 Implications of latency numbers • State stored in persistent storage (disk or SSD) will take longer to fetch than state stored in memory. • State stored in a different datacenter will take longer to access than state stored locally, especially across continents. • Persistent store is typically replicated both for performance (latency) reasons and for availability (failure) reasons. • => keeping data consistent across different occurrences is important but difficult.
  • 53. Š Len Bass 2019 53 Is there a solution? • The general problem is that you want to manage synchronization of data across a distributed set of servers where up to half of the servers can fail. • Paxos is a family of algorithms that use consensus to manage concurrency. Complicated and difficult to implement. • An example of the implementation difficulty • Choose one server as the master that keeps the “authoritative” state. • Now that server fails. Need to • Find a new master • Make sure it is up to data with the authoritative state.
  • 54. Š Len Bass 2019 54 Luckily • Several open source systems are now available that • Implement Paxos or an alternative consensus algorithm • Are reasonably easy to use. • Sample systems • Zookeeper • Consul • etcd
  • 55. Š Len Bass 2019 55 Zookeeper as an example • Zookeeper provides a guaranteed consistent (mostly) data structure for every instance of a distributed application. • Definition of “mostly” is within eventual consistency lag (but this is small). • Zookeeper keeps data in memory. This is why it is fast. • Zookeeper deals with managing failure as well as consistency. • Done using consensus algorithm.
  • 56. Š Len Bass 2019 56 Zookeeper znode structure / <data> /b1 <data> /b1/c1 <data> /b1/c2 <data> /b2 <data> /b2/c1 <data>
  • 57. Š Len Bass 2019 57 Zookeeper API Function Type create write delete write Exists read Get children Read Get data Read Set data write + others • All calls return atomic views of state – either succeed or fail. No partial state returned. Writes also are atomic. Either succeed or fail. If they fail, no side effects.
  • 58. Š Len Bass 2019 58 Example – Lock creation • Locks have names. • Client N • Create Lock1 • If success then client owns lock. • If failure, then it is placed on waitlist for lock 1 and control is returned to Client N • When client finishes, it deletes Lock1 • If there is a waitlist, then those clients are informed about Lock 1 deletion and they try again to create Lock 1 • If Client N fails, Zookeeper will delete Lock 1 and waitlist clients are informed of deletion
  • 59. Š Len Bass 2019 59 Other use cases • Leader election • Groups membership • Synchronization • Configuration
  • 60. Š Len Bass 2019 60 Summary • Autoscaling creates new instances based on utilization of existing instances • Synchronization of data across a distributed system id complicated • It is difficult because of • Timing delays across distributed systems • Possible failures of nodes of various types • Open source systems implement complicated consensus algorithms in a fashion that is relatively easy to use.