Shared-data systems strive to satisfy data consistency, system availability and tolerance to network partitions.
In a distributed system it is impossible to provide all three guarantees simultaneously at any given moment in time.
The purpose of the talk is to show the mechanisms used by data storage systems such as Dynamo and BigTable to satisfy two of these guarantees at a time.
3. WHAT A DISTRIBUTED SYSTEM IS
“A distributed system is a software system in which
components located on networked computers communicate
and coordinate their actions by passing messages”
4. DISTRIBUTED SYSTEMS
EXAMPLES
5. DISTRIBUTED SYSTEMS
REPLICATION
6. REPLICATED SERVICE
PROPERTIES
CONSISTENCY
AVAILABILITY
7. CONSISTENCY
The result of operations will be predictable
9. CONSISTENCY
Strong consistency
all replicas return the same value for the same object
Weak consistency
different replicas can return different values for the same object
11. STRONG VS WEAK
CONSISTENCY
Strong consistency
ACID database: Atomic, Consistent, Isolated, Durable
Weak consistency
BASE database: Basically Available, Soft state, Eventual consistency
12. EXAMPLE
CONSISTENCY
put(price, 10)
13. EXAMPLE
CONSISTENCY
get(price)
price = 10
17. PARTITION TOLERANCE
continue to operate even in the presence of partitions
18. PARTITION TOLERANCE
Network failure
nodes split into groups on each side of a faulty network element (switch, backbone)
Process failure
the system splits into two groups: correct nodes and crashed nodes
19. CAP THEOREM
“Of three properties of shared-data systems
(data consistency, system availability and
tolerance to network partitions) only two can
be achieved at any given moment in time.”
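20. THE PROOF
CAP THEOREM
[Diagram: replicas split into partition 1 and partition 2; every replica initially holds price = 0]
t1: put(price, 10) is served on one side of the partition
t2: get(price) is served on the other side
‣ answering price = 0: not consistent
‣ no response: not available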
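The argument on the slide can be replayed with a toy model. This is only an illustrative sketch: the `Replica` class, the `partitioned` flag and the `mode` names are hypothetical and do not come from the talk. It shows that, while the network is partitioned, a read on the far side either answers with the stale value or does not answer at all.

```python
class Replica:
    """One copy of the data (illustrative name, not from the talk)."""

    def __init__(self):
        self.store = {"price": 0}  # every replica starts with price = 0


def put(replicas, key, value, partitioned):
    """Write to the first replica; replicate only if the network is healthy."""
    replicas[0].store[key] = value
    if not partitioned:
        for replica in replicas[1:]:
            replica.store[key] = value


def get(replica, key, partitioned, mode):
    """Read from a replica on the far side of the partition.

    mode="AP": always answer, possibly with a stale value (not consistent).
    mode="CP": refuse to answer while partitioned (not available).
    """
    if partitioned and mode == "CP":
        return None  # no response
    return replica.store[key]


a, b = Replica(), Replica()
put([a, b], "price", 10, partitioned=True)   # t1: the write reaches only a
stale = get(b, "price", True, mode="AP")     # t2: b still answers with 0
silent = get(b, "price", True, mode="CP")    # t2: b gives no answer (None)
print(stale, silent)
```

With `partitioned=False` the write reaches both replicas and either mode returns 10, which is why the trade-off only appears during a partition.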
26. REQUIREMENTS
DYNAMO
“customers should be able to view and add items
to their shopping cart even if disks are failing,
network routes are flapping, or data centers are
being destroyed by tornados.”
➡ reliable
➡ highly scalable
➡ always available
27. SIMPLE INTERFACE
DYNAMO
get(key)
locates the object replicas associated with the key and
returns a single object or a list of objects with conflicting
versions, along with a context.
put(key, context, object)
determines where the replicas of the object should be
placed based on the associated key. The context
includes information such as the version of the object.
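To make the role of the context concrete, here is a minimal in-memory sketch of the two calls. It is not Dynamo's implementation: `TinyStore`, the integer version numbers and the `"seen"` field are assumptions made for illustration, and everything lives on a single machine.

```python
class TinyStore:
    """In-memory stand-in for the get/put interface (hypothetical class)."""

    def __init__(self):
        self._data = {}  # key -> list of (version, object)

    def get(self, key):
        """Return (objects, context); the context records the versions read."""
        versions = self._data.get(key, [])
        objects = [obj for _, obj in versions]
        context = {"seen": [ver for ver, _ in versions]}
        return objects, context

    def put(self, key, context, obj):
        """Store obj, replacing only the versions named in the context."""
        seen = set(context["seen"])
        current = self._data.get(key, [])
        kept = [(ver, old) for ver, old in current if ver not in seen]
        next_version = max([ver for ver, _ in current], default=0) + 1
        self._data[key] = kept + [(next_version, obj)]


store = TinyStore()
_, ctx = store.get("cart")
store.put("cart", ctx, ["book"])
store.put("cart", ctx, ["pen"])        # stale context: a conflict is kept
conflicting, ctx = store.get("cart")   # both versions come back
store.put("cart", ctx, ["book", "pen"])  # the fresh context resolves them
```

A write made with a stale context does not overwrite what it never saw, so the next `get` returns a list of conflicting objects, which matches the behaviour described on the slide.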
28. REPLICATION: THE CHOICE
DYNAMO
Synchronous replica coordination
‣ strong consistency
‣ reduced availability
Optimistic replication technique
‣ high availability
‣ possible conflicts
29. CONFLICTS: WHEN
DYNAMO
At write time
‣ writes may be rejected
At read time
‣ “always writable” datastore
30. CONFLICTS: WHO
DYNAMO
The data store
‣ e.g. “last write wins” policy
The application
‣ resolution as implementation detail
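The difference between the two choices can be sketched on a shopping cart. The `(timestamp, cart)` pairs and both function names are assumptions made for this example; they only illustrate that a generic data-store policy may drop an update that an application-aware merge keeps.

```python
def last_write_wins(versions):
    """Data-store policy: keep the value with the highest timestamp."""
    return max(versions, key=lambda version: version[0])[1]


def merge_carts(versions):
    """Application policy: union of all items, so no added item is lost."""
    merged = set()
    for _, cart in versions:
        merged |= cart
    return merged


# Two conflicting versions of the same cart, as (timestamp, cart) pairs.
conflicting = [(1, {"book"}), (2, {"pen"})]
print(last_write_wins(conflicting))  # the "book" update is lost
print(merge_carts(conflicting))      # both items survive
```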
31. A RING TO RULE THEM ALL
DYNAMO
32. PARTITIONING: THE RING
DYNAMO
[Diagram: ring of nodes A, B, C, D, E, F, G; hash(DATA) selects a position on the ring]
33. REPLICATION
DYNAMO
[Diagram: ring of nodes A, B, C, D, E, F, G; hash(DATA) selects a position on the ring]
N = 3: D will store keys in the ranges (A, B], (B, C], (C, D]
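The ring and the N = 3 replication rule can be sketched in a few lines. This is a simplified illustration, not Dynamo's code: the `Ring` class and `ring_position` helper are hypothetical, node positions are derived by hashing node names (an assumption made here for determinism), and virtual nodes are left out.

```python
import bisect
import hashlib


def ring_position(name):
    """Map a string to a position on the ring using an MD5 digest."""
    return int(hashlib.md5(name.encode()).hexdigest(), 16)


class Ring:
    def __init__(self, nodes, n_replicas=3):
        self.n = n_replicas
        # Nodes sorted by their position, i.e. in clockwise order.
        self.points = sorted((ring_position(node), node) for node in nodes)

    def preference_list(self, key):
        """First node clockwise from hash(key), plus its N-1 successors."""
        positions = [pos for pos, _ in self.points]
        start = bisect.bisect_left(positions, ring_position(key))
        return [self.points[(start + i) % len(self.points)][1]
                for i in range(self.n)]


ring = Ring(["A", "B", "C", "D", "E", "F", "G"])
replicas = ring.preference_list("price")  # three distinct, adjacent nodes
print(replicas)
```

Because each key is stored on its owner and the next N-1 nodes, every node ends up holding the key ranges of its N-1 predecessors as well as its own, which is the statement made for node D on the slide.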
34. DATA VERSIONING
DYNAMO
put()
may return before the update has been propagated to
all replicas.
get()
a subsequent get() may return an object that does not
have the latest update
36. RECONCILIATION
DYNAMO
Syntactic reconciliation
‣ the new version subsumes the previous one, so the system can resolve it
Semantic reconciliation
‣ conflicting versions of the same object, which the client must reconcile
43. VECTOR CLOCK
DYNAMO
Definition
‣ list of (node, counter) pairs
D1 ([Sx,1]): write handled by Sx
D2 ([Sx,2]): write handled by Sx
D3 ([Sx,2], [Sy,1]): write handled by Sy
D4 ([Sx,2], [Sz,1]): write handled by Sz
D5 ([Sx,3], [Sy,1], [Sz,1]): reconciled and written by Sx
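The D1 to D5 sequence can be reproduced with a small vector-clock sketch. The function names below are illustrative choices, and clocks are modelled as plain dictionaries from node name to counter rather than lists of pairs.

```python
def increment(clock, node):
    """Return a copy of clock with node's counter advanced by one."""
    new = dict(clock)
    new[node] = new.get(node, 0) + 1
    return new


def descends(a, b):
    """True if clock a has seen everything clock b has (a subsumes b)."""
    return all(a.get(node, 0) >= count for node, count in b.items())


def concurrent(a, b):
    """True if neither clock subsumes the other: a real conflict."""
    return not descends(a, b) and not descends(b, a)


def merge(a, b):
    """Element-wise maximum, used when reconciling divergent versions."""
    return {n: max(a.get(n, 0), b.get(n, 0)) for n in a.keys() | b.keys()}


d1 = increment({}, "Sx")             # [Sx,1]
d2 = increment(d1, "Sx")             # [Sx,2]
d3 = increment(d2, "Sy")             # [Sx,2], [Sy,1]
d4 = increment(d2, "Sz")             # [Sx,2], [Sz,1]
d5 = increment(merge(d3, d4), "Sx")  # [Sx,3], [Sy,1], [Sz,1]
```

D2 descends from D1, so that pair needs only syntactic reconciliation. D3 and D4 are concurrent, so they need semantic reconciliation, after which D5 descends from both.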
44. PUT() AND GET()
DYNAMO
R
‣ minimum number of nodes that must participate
in a successful read operation
W
‣ minimum number of nodes that must participate
in a successful write operation
45. PUT() AND GET()
DYNAMO
put()
‣ the coordinator generates the vector clock for the new version and
writes the new version locally
‣ the new version is sent to N nodes
‣ the write is successful if at least W-1 nodes respond
get()
‣ the coordinator requests all existing versions of the data
‣ the coordinator waits for R responses before returning the result
‣ the coordinator returns all the versions that are causally unrelated
‣ the divergent versions are reconciled and written back
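The interplay of N, R and W can be sketched as follows. This is a deliberately simplified model and not Dynamo's protocol: `Node`, `put` and `get` are hypothetical names, an integer version stands in for the vector clock, and the coordinator is counted as one of the W acknowledging nodes instead of being handled separately.

```python
class Node:
    def __init__(self, name, up=True):
        self.name, self.up, self.data = name, up, {}


def put(nodes, key, value, version, w):
    """Send the write to all N nodes; succeed if at least W stored it."""
    acks = 0
    for node in nodes:
        if node.up:
            node.data[key] = (version, value)
            acks += 1
    return acks >= w


def get(nodes, key, r):
    """Ask every reachable node; need at least R replies to answer."""
    replies = [node.data.get(key, (0, None)) for node in nodes if node.up]
    if len(replies) < r:
        return None
    # Simplification: the highest integer version wins.
    return max(replies, key=lambda reply: reply[0])[1]


nodes = [Node("A"), Node("B"), Node("C")]          # N = 3
first = put(nodes, "price", 10, version=1, w=2)    # stored on A, B, C
nodes[2].up = False
second = put(nodes, "price", 20, version=2, w=2)   # stored on A, B only
nodes[2].up, nodes[0].up = True, False
latest = get(nodes, "price", r=2)                  # B and C reply
```

With R = 2 and W = 2 on N = 3 nodes, the read set always overlaps the write set in at least one node, so `latest` is 20 even though C still holds the old value.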
46. SLOPPY QUORUM
DYNAMO
[Diagram: ring of nodes A, B, C, D, E, F, G with N = 3]
47. WHY IS IT AP?
DYNAMO
‣ requests are served even if some replicas are not available
‣ if a node is down, the write is stored on another node
‣ consistency conflicts are resolved at read time or in the
background
‣ eventually, all the replicas will converge
‣ concurrent read/write operations can make distinct clients
see distinct versions of the same key
49. REQUIREMENTS
GOOGLE BIGTABLE
‣ scale to petabytes of data
‣ thousands of machines
‣ high availability
‣ high performance
50. DATA MODEL
GOOGLE BIGTABLE
‣ sparse, distributed, persistent multi-dimensional
sorted map
(row: string, column: string, time: int64) → string
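The map signature above can be illustrated with a toy, single-process table. `TinyTable` and its methods are hypothetical and do not reflect the Bigtable API; the sketch only shows that a cell is addressed by (row, column, timestamp) and that scans follow row-key order.

```python
class TinyTable:
    """Toy sorted map: (row, column, timestamp) -> string."""

    def __init__(self):
        self.cells = {}

    def put(self, row, column, timestamp, value):
        self.cells[(row, column, timestamp)] = value

    def get(self, row, column):
        """Return the most recent version of a cell, or None."""
        versions = [(ts, value) for (r, c, ts), value in self.cells.items()
                    if r == row and c == column]
        return max(versions)[1] if versions else None

    def scan(self):
        """Yield cells by row key, then column, newest version first."""
        for key in sorted(self.cells, key=lambda k: (k[0], k[1], -k[2])):
            yield key, self.cells[key]


table = TinyTable()
table.put("com.example", "contents:", 1, "<html>v1")
table.put("com.example", "contents:", 2, "<html>v2")
table.put("com.cnn.www", "language:", 1, "en")
newest = table.get("com.example", "contents:")  # "<html>v2"
```

The table is sparse in the sense that only cells that were written occupy space; a row has no fixed set of columns.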
51. ROWS
GOOGLE BIGTABLE
‣ row keys are arbitrary strings
‣ every read or write of data under a single row key is atomic
‣ data is maintained in lexicographic order by row key
‣ each row range is called a tablet
maps.google.com → com.google.maps
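The reversed-hostname convention shown above is easy to sketch. The `row_key` helper is a hypothetical name; it assumes the input is a host optionally followed by a path, with no scheme or port.

```python
def row_key(host_and_path):
    """Reverse the hostname components so pages of one domain sort together."""
    host, _, path = host_and_path.partition("/")
    reversed_host = ".".join(reversed(host.split(".")))
    return reversed_host + ("/" + path if path else "")


keys = sorted(row_key(url) for url in
              ["maps.google.com", "www.cnn.com", "mail.google.com"])
print(keys)
```

Since rows are kept in lexicographic order, reversing the hostname places `com.google.mail` and `com.google.maps` next to each other, so pages of the same domain tend to land in the same tablet or in adjacent ones.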
52. COLUMNS
GOOGLE BIGTABLE
‣ column keys are grouped into sets: column families
‣ a column family must be created before data can be
stored under any column key in that family
‣ a column key is named family:qualifier
‣ access control and both disk and memory
accounting are performed at the column-family level
53. TIMESTAMPS
GOOGLE BIGTABLE
[Diagram: row com.example, column contents:, holding two versions of <html>… at timestamps t1 and t2]
54. DATA MODEL: EXAMPLE
GOOGLE BIGTABLE
row key         | language: | contents:               | anchor:cnnsi.com | anchor:mylook.ca
com.cnn.www     | en        | <!DOCTYPE html PUBLIC … | "cnn"            | "cnn.com"
com.cnn.www/foo | en        | <!DOCTYPE html PUBLIC … |                  |
com.example     | en        | <!DOCTYPE html PUBLIC … |                  |
(rows are sorted by row key; language:, contents: and anchor: are column families)
55. DIFFERENCES WITH RDBMS
GOOGLE BIGTABLE
RDBMS            | BIGTABLE
query language   | specific API
joins            | no referential integrity
explicit sorting | sorting defined a priori in the column family
56. ARCHITECTURE
GOOGLE BIGTABLE
Google File System (GFS)
‣ stores data files and logs
Google SSTable
‣ file format used to store BigTable data
Chubby
‣ highly available distributed lock service
57. COMPONENTS
GOOGLE BIGTABLE
library
‣ linked into every client
one master server
‣ assigning tablets to tablet servers
‣ detecting the addition and expiration of tablet servers
‣ balancing tablet-server load
‣ garbage collection of files in GFS
‣ handling schema changes
many tablet servers
‣ each manages a set of tablets (typically 10 to 1,000)
‣ handles read and write requests to its tablets
‣ splits tablets that have grown too large
58. COMPONENTS
GOOGLE BIGTABLE
[Diagram: client, master server and three tablet servers; arrows labeled metadata and read/write]
59. STARTUP AND GROWTH
GOOGLE BIGTABLE
[Diagram: Chubby file → Root tablet (1st Metadata tablet) → other metadata tablets → UserTable1 … UserTableN]
60. TABLET ASSIGNMENT
GOOGLE BIGTABLE
tablet server
‣ when started, creates and acquires a lock in Chubby
master
‣ grabs a unique master lock in Chubby
‣ scans Chubby to find live tablet servers
‣ asks each live tablet server which tablets it already serves
‣ scans the Metadata table to learn the full set of tablets
‣ builds the set of unassigned tablets, eligible for future tablet
assignment
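The last three master steps boil down to a set difference. The sketch below is only an illustration with plain Python sets: `unassigned_tablets` and its arguments are hypothetical stand-ins for the Chubby scan, the tablet-server replies and the Metadata table scan.

```python
def unassigned_tablets(live_servers, served_by, metadata_tablets):
    """Return the tablets that no live tablet server is serving.

    live_servers: names of tablet servers that hold a Chubby lock
    served_by: dict of server name -> set of tablets it reports serving
    metadata_tablets: full set of tablets listed in the Metadata table
    """
    assigned = set()
    for server in live_servers:  # only live servers are asked
        assigned |= served_by.get(server, set())
    return set(metadata_tablets) - assigned


live = {"ts1", "ts2"}                       # ts3 lost its lock
served = {"ts1": {"t1", "t2"}, "ts2": {"t3"}, "ts3": {"t4"}}
all_tablets = {"t1", "t2", "t3", "t4", "t5"}
pending = unassigned_tablets(live, served, all_tablets)
print(sorted(pending))
```

Tablet `t4` counts as unassigned because the server that held it is no longer live, and `t5` because no server ever reported it.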
61. WHY IS IT CP?
GOOGLE BIGTABLE
‣ if the master dies, its services stop functioning
‣ if a tablet server dies, its tablets become unavailable
‣ if Chubby dies, BigTable cannot execute
synchronization operations or serve client requests
‣ Google File System is a CP system
62. $ whoami
Andrea Giuliano
@bit_shark
www.andreagiuliano.it
64. REFERENCES
G. DeCandia et al., “Dynamo: Amazon’s Highly Available Key-value Store”
F. Chang et al., “Bigtable: A Distributed Storage System for Structured Data”
Assets:
https://farm1.staticflickr.com/41/86744006_0026864df8_b_d.jpg
https://farm9.staticflickr.com/8305/7883634326_4e51a1a320_b_d.jpg
https://farm5.staticflickr.com/4145/4958650244_65b2eddffc_b_d.jpg
https://farm4.staticflickr.com/3677/10023456065_e54212c52e_b_d.jpg
https://farm4.staticflickr.com/3076/2871264822_261dafa44c_o_d.jpg
https://farm1.staticflickr.com/7/6111406_30005bdae5_b_d.jpg
https://farm4.staticflickr.com/3928/15416585502_92d5e608c7_b_d.jpg
https://farm8.staticflickr.com/7046/6873109431_d3b5199f7d_b_d.jpg
https://farm4.staticflickr.com/3007/2835755867_c530b0e0c6_o_d.jpg
https://farm3.staticflickr.com/2788/4202444169_2079db9580_o_d.jpg
https://farm1.staticflickr.com/55/129619657_907b480c7c_b_d.jpg
https://farm5.staticflickr.com/4046/4368269562_b3e05e3f06_b_d.jpg
https://farm8.staticflickr.com/7344/12137775834_d0cecc5004_k_d.jpg
https://farm5.staticflickr.com/4073/4895191036_1cb9b58d75_b_d.jpg
https://farm4.staticflickr.com/3144/3025249284_b77dec2d29_o_d.jpg
https://www.flickr.com/photos/avardwoolaver/7137096221