NoSQL - how it works (@pavlobaron)
3. NoSQL is not about …
<140’000 things NoSQL
is not about>…
NoSQL is about choice
(Jan Lehnardt on NoSQL)
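22. Write back / snapshotting
[Diagram: reads and writes go to the cache; a read miss is served from the data store (products, users), and writes are written back from the cache to the data store.]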
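A minimal sketch of the write-back idea from this slide, assuming a plain dict stands in for the data store. The class name `WriteBackCache` and the explicit `flush()` call are illustrative choices, not taken from the deck; real systems usually flush on a timer or on eviction.

```python
class WriteBackCache:
    """Toy write-back cache in front of a dict-like data store."""

    def __init__(self, store):
        self.store = store   # backing data store (assumption: a plain dict)
        self.cache = {}      # cached values
        self.dirty = set()   # keys written to the cache but not yet to the store

    def read(self, key):
        if key not in self.cache:              # read miss: load from the store
            self.cache[key] = self.store[key]
        return self.cache[key]

    def write(self, key, value):
        self.cache[key] = value                # the write only touches the cache
        self.dirty.add(key)

    def flush(self):
        """Write all dirty entries back to the data store."""
        for key in self.dirty:
            self.store[key] = self.cache[key]
        self.dirty.clear()


store = {"users:1": "alice"}
cache = WriteBackCache(store)
cache.write("users:1", "bob")
stale = store["users:1"]   # the store still holds the old value
cache.flush()
fresh = store["users:1"]   # now the write has been written back
```

The window between `write` and `flush` is where data can be lost if the cache node dies, which is the price paid for fast writes.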
30. Gossip – RM
[Diagram labels: write, RM1, RM2, clock table, value clock, update log, replica clock, stable updates, value, executed operation table]
31. Gossip – node down/up
[Diagram: Nodes 1 to 4 exchange gossip messages carrying updates and reads together with membership news ("4 down", later "4 up").]
34. Timestamps
Node 1
10:00 10:10 10:20
Node 2
10:01 10:11 10:20
Node 3
9:59 10:09 10:18 10:19
35. Logical clocks
?
Node 1
1 4 5 6 7
Node 2
2 3 6 7
?
Node 3
2 4 5 6 7
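The slide does not say which rule produces its counters. As a hedged illustration, here is a sketch of a Lamport-style logical clock, where every local event increments the counter and every received message advances it past the sender's timestamp. The class and method names are hypothetical.

```python
class LamportClock:
    """Logical clock: a counter that only moves forward."""

    def __init__(self):
        self.time = 0

    def tick(self):
        """Local event: advance the counter by one."""
        self.time += 1
        return self.time

    def send(self):
        """Sending is an event; the new time travels with the message."""
        return self.tick()

    def receive(self, message_time):
        """Jump past both the local time and the sender's timestamp."""
        self.time = max(self.time, message_time) + 1
        return self.time


node1, node2 = LamportClock(), LamportClock()
node1.tick()                # node1 is at 1
stamp = node1.send()        # node1 is at 2, message carries 2
node2.receive(stamp)        # node2 jumps from 0 to 3
```

Equal or unrelated counters on different nodes say nothing about which event came first, which is the gap vector clocks close.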
36. Vector clocks
Node 1
1,0,0 2,2,0 3,2,0 4,3,3
Node 2
1,1,0 1,2,0 1,3,3 4,4,3
Node 3
1,0,1 1,2,2 1,2,3 4,3,4
37. Vector clocks
Node 1 Node 2 Node 3 Node 4
1,0,0,0
1,1,0,0 1,2,0,0 1,3,0,3
1,0,1,0 1,0,2,0
1,0,0,1 1,2,0,2 1,2,0,3
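The comparisons behind these vector clock diagrams can be sketched as follows. This is a minimal version assuming clocks are fixed-length tuples with one counter per node; the function names are illustrative.

```python
def increment(clock, node):
    """Return a copy of clock with the counter of `node` (an index) advanced."""
    updated = list(clock)
    updated[node] += 1
    return tuple(updated)


def merge(a, b):
    """Element-wise maximum, used when a node receives another node's clock."""
    return tuple(max(x, y) for x, y in zip(a, b))


def happened_before(a, b):
    """True if a is causally before b: a <= b in every slot and a != b."""
    return a != b and all(x <= y for x, y in zip(a, b))


def concurrent(a, b):
    """Neither clock dominates the other: the updates conflict."""
    return a != b and not happened_before(a, b) and not happened_before(b, a)


# Values taken from the three-node slide:
ordered = happened_before((1, 0, 0), (1, 1, 0))   # Node 2 saw Node 1's update
conflict = concurrent((1, 1, 0), (1, 0, 1))       # Nodes 2 and 3 diverged
merged = merge((1, 1, 0), (1, 0, 1))
```

A store that detects `concurrent` clocks has to keep both versions or resolve the conflict itself; it cannot pick a winner by ordering.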
39. Merkle Trees
N, M: nodes
HT(N), HT(M): hash trees
M needs update:
obtain HT(N)
calc delta(HT(M), HT(N))
pull keys(delta)
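The three steps above (obtain HT(N), calculate the delta, pull the keys) can be sketched as follows. This is a simplified hash tree, assuming keys are hashed into a fixed number of leaf buckets; `BUCKETS`, `bucket_of`, and `sync` are hypothetical names, and real stores build their trees over key ranges instead.

```python
import hashlib

BUCKETS = 8  # number of leaf buckets; assumed to be a power of two


def h(data):
    return hashlib.sha256(data.encode()).hexdigest()


def bucket_of(key):
    return int(h(key), 16) % BUCKETS


def hash_tree(data):
    """Return the tree as a list of levels: leaves first, root last."""
    buckets = [[] for _ in range(BUCKETS)]
    for key in sorted(data):
        buckets[bucket_of(key)].append(f"{key}={data[key]}")
    level = [h("|".join(b)) for b in buckets]
    levels = [level]
    while len(level) > 1:
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels


def delta(ht_m, ht_n):
    """Descend from the root, following only subtrees whose hashes differ."""
    top = len(ht_m) - 1
    frontier = [0] if ht_m[top][0] != ht_n[top][0] else []
    for depth in range(top - 1, -1, -1):
        frontier = [child for i in frontier for child in (2 * i, 2 * i + 1)
                    if ht_m[depth][child] != ht_n[depth][child]]
    return frontier  # indices of differing leaf buckets


def sync(m, n):
    """M needs update: replace only the buckets that differ from N."""
    stale = set(delta(hash_tree(m), hash_tree(n)))
    for key in [k for k in m if bucket_of(k) in stale]:
        del m[key]
    for key, value in n.items():
        if bucket_of(key) in stale:
            m[key] = value


n = {"a": 1, "b": 2, "c": 3}
m = {"a": 1, "b": 0}
sync(m, n)
```

When the root hashes match, `delta` returns an empty list and nothing is transferred, which is the whole point of comparing trees instead of data.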
40. Merkle Trees
Node a.1:
a
ab ac
abc abd acb acc
Node a.2:
a
ab ad
abe abd ada adb
41. Merkle Trees
Node a.1:
a
ab
abc abd
Node a.2:
a
ab ad
abd ada adb
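42. Vertical sharding
[Diagram: whole tables (users, addresses, contracts, orders, invoices, products, items) are distributed between Node 1 and Node 2; a "read contract" request for user=foo is routed to the node holding the tables it needs.]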
43. Range based sharding
Node 1: users id(1-N), addresses zip(1234-2345)
Node 2: users id(1-M), addresses zip(2346-9999)
[Diagram: reads and writes are routed to the node whose range covers the key; products also receive writes.]
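Routing by range can be sketched with a sorted list of range boundaries. The zip ranges below are the ones on the slide; the shard map layout and the function name `node_for_zip` are assumptions made for illustration.

```python
import bisect

# Inclusive upper bound of each zip range and the node that owns it.
UPPER_BOUNDS = [2345, 9999]
NODES = ["Node 1", "Node 2"]
LOWEST_ZIP = 1234


def node_for_zip(zip_code):
    """Return the node whose range covers zip_code."""
    if not LOWEST_ZIP <= zip_code <= UPPER_BOUNDS[-1]:
        raise KeyError(f"no shard covers zip {zip_code}")
    # bisect_left finds the first range whose upper bound is >= zip_code
    return NODES[bisect.bisect_left(UPPER_BOUNDS, zip_code)]


first = node_for_zip(1234)
edge = node_for_zip(2345)
second = node_for_zip(2346)
```

Range lookups keep neighbouring keys on the same node, so range scans stay local, but a skewed key distribution produces hot shards.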
46. Add 2 nodes
rehash
leave
rehash
leave
48. Remove node
rehash
leave
rehash
leave
49. The ring
X bit integer space
0 <= N < 2 ^ X
or: 2 x Pi
0 <= A < 2 x Pi
x(N) = cos(A)
y(N) = sin(A)
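A sketch of such a ring, assuming X = 32, an MD5-based position function, and the mapping A = 2 x Pi x N / 2 ^ X between the integer and the angle view. None of these choices come from the slide. Keys are owned by the first node found clockwise from the key's position.

```python
import bisect
import hashlib
import math

X = 32  # ring size in bits (assumption)


def position(name):
    """Map a node name or key to an integer N with 0 <= N < 2 ** X."""
    digest = hashlib.md5(name.encode()).digest()
    return int.from_bytes(digest, "big") % 2 ** X


def point(n):
    """The same position expressed on the unit circle."""
    a = 2 * math.pi * n / 2 ** X
    return (math.cos(a), math.sin(a))


class Ring:
    def __init__(self, nodes):
        self.points = sorted((position(node), node) for node in nodes)

    def owner(self, key):
        """First node at or after the key's position, wrapping around."""
        index = bisect.bisect_left(self.points, (position(key), ""))
        return self.points[index % len(self.points)][1]


keys = [f"key{i}" for i in range(200)]
before = Ring(["a", "b", "c"])
after = Ring(["a", "b", "c", "d"])
moved = [k for k in keys if before.owner(k) != after.owner(k)]
```

Every key in `moved` now belongs to the new node "d"; all other keys stay where they were, unlike the rehash-everything behaviour of modulo hashing.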
53. Add node
[Diagram: on the ring, the key ranges that now belong to the new node are copied to it; all other ranges are left where they are.]
54. Lookup key (sloppy quorum)
[Diagram: a lookup for Key = “foo” with # = N, R = 2 starts at node N on the ring and returns Value = “bar”.]
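One way to read the sloppy quorum lookup: start at the node the key hashes to, walk the ring, skip nodes that are unreachable, and stop once R replicas have answered. The sketch below assumes that reading; the ring layout, the `(version, value)` pairs, and the function name are all hypothetical.

```python
def sloppy_read(ring, start, key, r):
    """Collect r replies, walking clockwise from index `start`."""
    replies = []
    for step in range(len(ring)):
        node = ring[(start + step) % len(ring)]
        if not node["alive"]:
            continue                          # skip unreachable nodes
        if key in node["data"]:
            replies.append(node["data"][key])
        if len(replies) == r:
            # replies are (version, value) pairs; return the newest value
            return max(replies)[1]
    raise RuntimeError("fewer than R replicas answered")


ring = [
    {"name": "N",   "alive": False, "data": {"foo": (2, "bar")}},
    {"name": "N+1", "alive": True,  "data": {"foo": (2, "bar")}},
    {"name": "N+2", "alive": True,  "data": {"foo": (1, "baz")}},
    {"name": "N+3", "alive": True,  "data": {}},
]
value = sloppy_read(ring, start=0, key="foo", r=2)
```

Here node N is down, so the two replies come from its successors, and the newer version wins.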
59. In-database MapReduce
[Diagram: Node X receives query = "Alice" and sends map jobs to Node A, Node B and Node C; each node matches N = "Alice" against its local data, and Node X reduces the results into a hit list.]
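The flow on this slide can be sketched in a few lines, assuming each node's data is a list of records and the query is an exact match on a `name` field. The sample records and function names are made up for illustration.

```python
from functools import reduce

# Hypothetical data held by the three storage nodes.
nodes = {
    "Node A": [{"name": "Alice", "id": 1}, {"name": "Bob", "id": 2}],
    "Node B": [{"name": "Carol", "id": 3}],
    "Node C": [{"name": "Alice", "id": 4}],
}


def map_hits(records, query):
    """Runs on each storage node, next to the data."""
    return [record["id"] for record in records if record["name"] == query]


def reduce_hits(left, right):
    """Runs on the coordinating node: merge partial hit lists."""
    return left + right


def query_all(query):
    partials = [map_hits(records, query) for records in nodes.values()]
    return reduce(reduce_hits, partials, [])


hit_list = query_all("Alice")
```

Only the small partial hit lists cross the network; the records themselves never leave their nodes.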
62. Read your write consistency
FE1 FE2
write read write read
v2 v2 v1 v1
v1 v2 v3
Data store
63. Session consistency
FE
Session 1 Session 2
write read write read
v2 v2 v1 v1
v1 v2 v3
Data store
68. Replication – state transfer
[Diagram: the target node takes the state (addresses, products, users) from the source node.]
69. Replication – operational transfer
[Diagram: the target node takes the operations (deletes, inserts, updates) from the source node and runs them.]
70. Eager replication – 3PC
Participants: Coordinator, Cohort 1, Cohort 2
Message sequence: can commit? / yes / pre commit / ACK / commit / ok
71. Eager replication – 3PC (failure)
Participants: Coordinator, Cohort 1, Cohort 2
Message sequence: can commit? / yes / pre commit / ACK / abort / ok
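The two message sequences above can be sketched as a coordinator loop. This is a simplified, single-process model that ignores timeouts and recovery; the `Cohort` class and its `vote` and `acks` flags are hypothetical stand-ins for real network participants.

```python
class Cohort:
    def __init__(self, vote=True, acks=True):
        self.vote = vote      # answer to "can commit?"
        self.acks = acks      # whether the pre-commit is acknowledged
        self.state = "initial"

    def can_commit(self):
        self.state = "ready" if self.vote else "aborted"
        return self.vote

    def pre_commit(self):
        if self.acks:
            self.state = "pre-committed"
        return self.acks

    def commit(self):
        self.state = "committed"

    def abort(self):
        self.state = "aborted"


def three_phase_commit(cohorts):
    """Coordinator: commit only if every cohort passes phases 1 and 2."""
    for phase in (Cohort.can_commit, Cohort.pre_commit):
        # Build a list first so that every cohort is asked in each phase.
        if not all([phase(cohort) for cohort in cohorts]):
            for cohort in cohorts:
                cohort.abort()
            return "aborted"
    for cohort in cohorts:
        cohort.commit()
    return "committed"


happy = three_phase_commit([Cohort(), Cohort()])
failed = three_phase_commit([Cohort(), Cohort(acks=False)])
```

The extra pre-commit round is what distinguishes 3PC from 2PC: no cohort commits before all cohorts know that everyone voted yes.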
72. Eager replication – Paxos Commit
2F + 1 acceptors overall, F + 1 correct ones needed to achieve consensus
Properties: Stability, Consistency, Non-Triviality, Non-Blocking
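The acceptor arithmetic can be checked with a short sketch. The function names are illustrative; the point is that any two groups of F + 1 acceptors out of 2F + 1 share at least one member, so two conflicting decisions cannot both be accepted.

```python
def acceptors_needed(f):
    """Total acceptors required to tolerate f failed acceptors."""
    return 2 * f + 1


def quorum(f):
    """Correct acceptors required to reach a decision."""
    return f + 1


def min_overlap(f):
    """Smallest possible intersection of two quorums."""
    return 2 * quorum(f) - acceptors_needed(f)


sizes = [(f, acceptors_needed(f), quorum(f)) for f in range(1, 4)]
overlaps = [min_overlap(f) for f in range(1, 4)]
```

For F = 1 this gives 3 acceptors with a quorum of 2, and for F = 2 it gives 5 acceptors with a quorum of 3.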
74. Eager replication – Paxos Commit (failure)
[Diagram labels: RM 1, other RMs, leader, acceptors; messages: begin commit, prepare, 2a prepared, initial, abort; "timeout, no decision" appears twice.]
75. Lazy replication – master/slave
[Diagram: the master node (addresses, products, users) accepts writes and reads; the slave node(s) serve reads.]
76. Lazy replication – master/master
Master node(s): users id(1-N), items id(1-K), accepting writes and reads
Master node(s): users id(1-M), items id(1-L), accepting writes and reads
77. Hinted handoff
N: node, G: group including N
node(N) is unavailable
replicate to G or
store data(N) locally
hint handoff for later
node(N) is alive
handoff data to node(N)
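The steps above can be sketched as follows, assuming the "store data(N) locally" branch and a single fallback node. The `Node` class, the `hints` list, and the function names are hypothetical.

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.alive = True
        self.data = {}
        self.hints = []   # (target node, key, value) kept for later handoff


def write(key, value, target, fallback):
    """Write to the target, or park the data on the fallback with a hint."""
    if target.alive:
        target.data[key] = value
    else:
        fallback.data[key] = value                  # store data(N) locally
        fallback.hints.append((target, key, value)) # hint handoff for later


def handoff(node):
    """Deliver parked writes to every target that is alive again."""
    remaining = []
    for target, key, value in node.hints:
        if target.alive:
            target.data[key] = value
        else:
            remaining.append((target, key, value))
    node.hints = remaining


n, neighbour = Node("N"), Node("N+1")
n.alive = False
write("foo", "bar", target=n, fallback=neighbour)
parked = len(neighbour.hints)
n.alive = True
handoff(neighbour)
```

The write succeeds while node N is down, and N catches up as soon as it returns, at the cost of temporary inconsistency.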
78. [Diagram: a write for Key = “foo” hashes to node N (# = N); the direct replica fails, so the data is replicated to another node with handoff hint = true.]
81. [Diagram: all replicas recover; the hinted data is handed off and replicated.]
84. CAP – the variations
CA – irrelevant
CP – eventually unavailable, offering maximum consistency
AP – eventually inconsistent, offering maximum availability
86. CP
Replica 1
v1 read
v2 write
v2
v2
v1 read
Replica 2
87. CP (partition)
Replica 1
v1 read
v2 write
v2
v1 read
Replica 2
88. AP
Replica 1
v1 write
v2
v2 read
replicate
v2 v1 read
Replica 2
89. AP (partition)
Replica 1
v1 write
v2
v2 read
hint
handoff
v2
v1 read
Replica 2
92. Most queries are known up front
Ad-hoc queries are seldom necessary
Prepared queries can greatly speed up data retrieval
An index can help ad-hoc querying, and can be externalized
An index should be incremental
94. The graph case
Storing a graph in a table leads to:
Limited depth
Fixed relation types
Expensive nested subselects
A tendency towards full table scans
Graph data stores store graph data optimally
96. Many of the graphics I’ve created myself
Some images originate from istockphoto.com,
except for a few taken from Wikipedia and product pages