Sistemas Distribuidos

distributed systems
diego souza @ infra-dev

agenda
● the basics
● models
● practical aspects

the basics
what is a distributed system?
● a distributed system is a piece of software
that ensures that a collection of
independent computers appears to its
users as a single coherent system;

the basics
what is a distributed system? (cont.)
● a distributed system is a software system in
which components located on networked
computers communicate and coordinate
their actions by passing messages;

the basics
what is a distributed system? (cont.)
● a distributed system is one in which the
failure of a computer you didn't even know
existed can render your own computer
unusable [Lamport];

the basics
fallacies of distributed computing
1. the network is reliable;
2. latency is zero;
3. bandwidth is infinite;
4. the network is secure;
5. topology doesn't change;
6. there is one administrator;
7. transport cost is zero;
8. the network is homogeneous;

the basics
examples:
● cassandra
● hadoop
● www
● internet
● etc.

the basics
why?
● things no longer fit in a single machine;
● scalability [size, geographic, organizational];
● availability;
● fault tolerance;
● performance;

the basics
scalability
● is the ability of a system, network, or
process, to handle a growing amount of
work in a capable manner or its ability to be
enlarged to accommodate that growth;

the basics
performance
● depends on the context and what we want
to achieve:
○ response time/low latency;
○ throughput;
○ utilization of computer resources;

the basics
latency
● the state of being latent; delay, a period
between the initiation of something and
the occurrence;
● a wise man once said:
○ Bandwidth is easy. Engineers build bandwidth. But
latency is hard. Only God gives us latency;

the basics
availability
● the proportion of time a system is in a
functioning condition. If a user cannot
access the system, it is said to be
unavailable;

the basics
fault tolerance
● ability of a system to behave in a well-defined
manner once faults occur;

models
availability metrics
availability = uptime / (uptime + downtime)
availability = mtbf / (mtbf + mttr)
mtbf: mean time between failures
mttr: mean time to repair
● q: is every second the same?
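
To make the mtbf/mttr formula concrete, here is a minimal sketch. The function names and the numbers (999 hours between failures, 1 hour to repair) are illustrative assumptions, not measurements from any real system.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """availability = mtbf / (mtbf + mttr), as on the slide."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def downtime_per_year_hours(avail: float) -> float:
    """Expected downtime over a 365-day year for a given availability."""
    return (1.0 - avail) * 365 * 24


# Illustrative values only: one failure every ~999 hours, one hour to repair.
a = availability(mtbf_hours=999.0, mttr_hours=1.0)
print(f"availability: {a:.3%}")
print(f"downtime per year: {downtime_per_year_hours(a):.2f} hours")
```

With these assumed inputs the result is "three nines", which still allows several hours of downtime per year. The formula treats every hour of downtime as equal, which is what the question above challenges.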

models
yield = successes / requests
● a: very unlikely!

models
harvest = data_available / total_data
● how complete is the answer [think of
web search]?
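
Yield and harvest can be computed the same way as availability. The sketch below uses hypothetical function names and made-up numbers: a service that answered 9,950 of 10,000 requests, and a search index where 3 of 4 shards were reachable.

```python
def yield_fraction(successes: int, requests: int) -> float:
    """yield = successes / requests: how many requests were answered at all."""
    return successes / requests


def harvest(data_available: int, total_data: int) -> float:
    """harvest = data_available / total_data: how complete each answer was."""
    return data_available / total_data


# Illustrative numbers only.
print(yield_fraction(successes=9_950, requests=10_000))
print(harvest(data_available=3, total_data=4))
```

A system may choose to trade harvest for yield: answering every query from the shards that are still up instead of failing the request.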

models
distributing the dataset
● partition
● replication

models
partition
● improves performance [reduces dataset];
● improves availability [partial failures];
● usually application specific [random, time,
user];
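
As one possible illustration of partitioning by user, the sketch below hashes a key to pick a partition. The function name, the key format, and the choice of SHA-256 are assumptions made for the example; real systems pick the scheme that fits their access patterns.

```python
import hashlib


def partition_for(key: str, num_partitions: int) -> int:
    """Map a key to a partition index using a stable hash.

    Python's built-in hash() is randomized per process for strings,
    so a digest is used to keep the mapping stable across nodes.
    """
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


# Group some hypothetical user ids into 4 partitions.
partitions = {}
for user in ["alice", "bob", "carol", "dave", "erin"]:
    partitions.setdefault(partition_for(user, 4), []).append(user)
print(partitions)
```

Note that with plain modulo hashing, changing the number of partitions moves most keys; consistent hashing is one common way to limit that.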

models
replication
● improves performance [full copy];
● improves availability [full copy, reed-solomon
codes];
○ synchronous, asynchronous;
○ single copy, multi-master
○ crdts
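
To give a feel for crdts, here is a minimal sketch of a grow-only counter (G-Counter), one of the simplest examples. The class and method names are chosen for this illustration and do not come from any particular library.

```python
class GCounter:
    """Grow-only counter: each replica increments only its own slot."""

    def __init__(self, node_id: str):
        self.node_id = node_id
        self.counts = {}  # node id -> increments observed from that node

    def increment(self, amount: int = 1) -> None:
        self.counts[self.node_id] = self.counts.get(self.node_id, 0) + amount

    def value(self) -> int:
        return sum(self.counts.values())

    def merge(self, other: "GCounter") -> None:
        """Element-wise max: commutative, associative and idempotent."""
        for node, count in other.counts.items():
            self.counts[node] = max(self.counts.get(node, 0), count)


# Two replicas accept writes independently, then exchange state.
a, b = GCounter("a"), GCounter("b")
a.increment(2)
b.increment(3)
a.merge(b)
b.merge(a)
print(a.value(), b.value())
```

Because merge is an element-wise max, replicas can exchange state in any order, any number of times, and still converge on the same value.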

models
replication [strong consistency]
● primary/copy [e.g. mysql master]
● 2pc [e.g. mysql cluster]
● paxos, zab, raft

models
replication [weak consistency]
● amazon dynamo
○ consistent hashing [partitioning]
○ partial quorums
○ failure detection and read repair
○ gossip protocol
● note: r + w > n does not by itself give you strong consistency
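
The consistent hashing idea can be sketched as a ring of hashed points. This is a simplified illustration, not Dynamo's actual implementation: the `HashRing` class, the virtual node count and the node names are all assumptions, and the ring is assumed to be non-empty.

```python
import bisect
import hashlib


def _point(value: str) -> int:
    """Position of a string on the ring (first 8 bytes of a SHA-256 digest)."""
    return int.from_bytes(hashlib.sha256(value.encode("utf-8")).digest()[:8], "big")


class HashRing:
    def __init__(self, nodes, vnodes: int = 64):
        # Each node is placed on the ring several times (virtual nodes)
        # to spread load more evenly.
        self._ring = sorted(
            (_point(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def preference_list(self, key: str, n: int):
        """First n distinct nodes found walking clockwise from the key."""
        start = bisect.bisect_right(self._points, _point(key)) % len(self._ring)
        result = []
        for offset in range(len(self._ring)):
            node = self._ring[(start + offset) % len(self._ring)][1]
            if node not in result:
                result.append(node)
                if len(result) == n:
                    break
        return result


ring = HashRing(["node-a", "node-b", "node-c", "node-d"])
print(ring.preference_list("user:42", 3))
```

The useful property is that removing a node only reassigns the keys that node owned; every other key keeps its owner.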

models
time
● global clock [ntp, total order]
● local clock [partial order]
● logical clock [partial order; lamport clock,
vector clocks]
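
The partial order given by vector clocks can be shown in a few lines. This is a minimal sketch in which clocks are plain dictionaries from node id to counter; the helper names are made up for the example.

```python
def vc_increment(clock: dict, node: str) -> dict:
    """Return a copy of the clock with this node's counter advanced by one."""
    new = dict(clock)
    new[node] = new.get(node, 0) + 1
    return new


def vc_merge(a: dict, b: dict) -> dict:
    """Element-wise max, applied when a node receives a message."""
    return {n: max(a.get(n, 0), b.get(n, 0)) for n in a.keys() | b.keys()}


def _leq(a: dict, b: dict) -> bool:
    return all(a.get(n, 0) <= b.get(n, 0) for n in a.keys() | b.keys())


def happened_before(a: dict, b: dict) -> bool:
    return _leq(a, b) and not _leq(b, a)


def concurrent(a: dict, b: dict) -> bool:
    """Neither clock dominates the other: the events are unordered."""
    return not _leq(a, b) and not _leq(b, a)


a1 = vc_increment({}, "a")                # event on node a
b1 = vc_increment({}, "b")                # independent event on node b
b2 = vc_increment(vc_merge(b1, a1), "b")  # b receives a's message
print(concurrent(a1, b1), happened_before(a1, b2))
```

Here `a1` and `b1` cannot be ordered, while `a1` is known to precede `b2` because b had seen a's clock before producing it.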

models
consensus & atomic broadcast
● consensus: vote & agreement;
● atomic broadcast: reliable message
transmission and order guarantees;
● they are equivalent

models
flp impossibility
● there is no deterministic algorithm that
solves the consensus problem in an
asynchronous system subject to failures,
even if messages are never lost, at most
one process may fail, and it can only fail
by crashing;
● note: it's not that bad! :)

models
cap: [note: pick only two is misleading]
● consistency: all nodes see the same data at
the same time;
● availability;
● partition tolerance: continues to operate
despite message loss [network or node
failure];

I find latency one of the most important
aspects of performance

distributed systems are hard to develop and
even harder to operate: they are not
unbreakable

what to do in the presence of failures?

think about backpressure mechanisms
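
One simple form of backpressure is a bounded queue that refuses new work when full, so the caller has to slow down, retry later or shed load. The sketch below is only an illustration; `BoundedInbox` and its methods are hypothetical names built on the standard library `queue` module.

```python
import queue


class BoundedInbox:
    """Accepts work up to a fixed capacity and rejects the rest."""

    def __init__(self, capacity: int):
        self._queue = queue.Queue(maxsize=capacity)

    def offer(self, item) -> bool:
        """Try to enqueue without blocking; False tells the producer to back off."""
        try:
            self._queue.put_nowait(item)
            return True
        except queue.Full:
            return False

    def take(self):
        """Remove the oldest item; raises queue.Empty if there is none."""
        return self._queue.get_nowait()


inbox = BoundedInbox(capacity=2)
print([inbox.offer(i) for i in range(3)])  # the third offer is rejected
inbox.take()
print(inbox.offer("retry"))                # space is available again
```

Rejecting early keeps latency bounded for the work that was accepted, instead of letting an unbounded queue grow until the process fails.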

feature flags as a deploy mechanism
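
One way to use a flag as a deploy mechanism is a percentage rollout: the code ships dark and is enabled for a growing share of users. The sketch below is an assumption about how that could look, not a description of any specific tool; `is_enabled` and the flag name are hypothetical.

```python
import hashlib


def is_enabled(flag: str, user_id: str, rollout_percent: int) -> bool:
    """Deterministically place each user in a bucket from 0 to 99.

    The same user always lands in the same bucket for a given flag,
    so raising rollout_percent only ever adds users.
    """
    digest = hashlib.sha256(f"{flag}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return bucket < rollout_percent


# Hypothetical flag name; start at 10% and raise it as confidence grows.
enabled = sum(is_enabled("new-storage-path", f"user-{i}", 10) for i in range(1000))
print(f"{enabled} of 1000 users see the new code path")
```

Setting the percentage back to 0 turns the feature off without a new deploy, which is useful when something fails in production.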

thanks :)
questions or comments?

appendix: what we have here
● cassandra
● zookeeper
● ceph
● etcd
● consul
● leela

links
● http://book.mixu.net/distsys/
