3. Versioned dictionaries
• put(k, ver, data)
• get(k_start, k_end, ver)
• clone(v): create a child of v that inherits the latest version of its keys
[Diagram: version tree v10 (Monday 12:00) -> v11 (Monday 16:00) -> v12 (now)]
This talk: a versioned dictionary with fast updates,
and optimal space/query/update tradeoffs
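As a sketch of this interface, here is a toy in-memory versioned dictionary in Python (hypothetical code, illustrative names only; this is not the stratified structure the talk builds). Lookups walk the version tree to the nearest ancestor that wrote the key, which is exactly the "inherits the latest version of its keys" semantics of clone:

```python
class VersionedDict:
    """Toy versioned dictionary: put/get/clone via version-tree ancestry.
    A sketch of the interface only, not an efficient on-disk structure."""

    def __init__(self):
        self.parent = {0: None}   # version tree: child -> parent; 0 is the root
        self.data = {}            # (key, version) -> value

    def put(self, k, ver, value):
        self.data[(k, ver)] = value

    def clone(self, ver):
        """Create a child of ver that inherits the latest version of its keys."""
        child = max(self.parent) + 1
        self.parent[child] = ver
        return child

    def lookup(self, k, ver):
        """Walk up the version tree to the nearest ancestor that wrote k."""
        v = ver
        while v is not None:
            if (k, v) in self.data:
                return self.data[(k, v)]
            v = self.parent[v]
        return None

    def get(self, k_start, k_end, ver):
        """Range query [k_start, k_end] at version ver (toy O(n) scan)."""
        keys = {k for (k, _) in self.data if k_start <= k <= k_end}
        return {k: self.lookup(k, ver) for k in sorted(keys)
                if self.lookup(k, ver) is not None}
```

For example, cloning version 0 and overwriting one key leaves the clone seeing its own value while version 0 still sees the original.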
4. Why?
• Powerful: cloning, time-travel, cache and space-efficiency, ...
• Give developers a recent branch of the live dataset
• Expose different views of the same base dataset
[Diagram: version tree v10 (Monday 12:00) -> v11 (Monday 16:00) -> v12 (now), plus a clone v13: run analytics/tests/etc. on this clone, without performance impact]
7. State of the art: copy-on-write
Used in ZFS, WAFL, Btrfs, ... Apply path-copying [DSST] to the B-tree.
Problems:
• Space blowup: each update may rewrite an entire path
• Slow updates: as above
• Needs random IO to scale
• Concurrency is tricky
A log file system makes updates sequential, but relies on garbage collection (its Achilles' heel!)
9. CoW B-tree vs. this talk

                  CoW B-tree [ZFS, WAFL, Btrfs, ..]   This talk
Update            O(log_B Nv) random IOs              O((log Nv)/B) cache-oblivious IOs
                  ~ log(2^30)/log(10000)              ~ log(2^30)/10000
                  = 3 IOs/update                      = 0.003 IOs/update
Range query       O(Z/B) random                       O(Z/B) sequential
(size Z)                                              (important for flash)
Space             O(N B log_B Nv)                     O(N)

Nv = #keys live (accessible) at version v
B = "block size", say 1MB at 100 bytes/entry = 10000 entries
Complication: B is asymmetric for flash.
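The back-of-envelope numbers in the table can be checked directly (a sketch, assuming N = 2^30 keys and B = 10000 entries per block, as above):

```python
import math

# Back-of-envelope from the table, assuming N = 2^30 keys and
# B = 10000 entries per block (1MB blocks at 100 bytes/entry).
N = 2 ** 30
B = 10_000

cow_ios = math.log(N) / math.log(B)  # O(log_B N): random IOs per update
sda_ios = math.log2(N) / B           # O((log N)/B): amortized IOs per update

print(round(cow_ios, 2))  # ~2.26, i.e. about 3 IOs/update
print(sda_ios)            # 0.003 IOs/update
```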
21. Doubling Array
[Diagram: inserts go into sorted arrays of doubling size, e.g. 2 8 9 11, etc...]
Similar to log-structured merge trees (LSM), cache-oblivious lookahead array (COLA), ...
O(log N) "levels"; each element is rewritten once per level
=> O((log N)/B) IOs per update, amortized
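A minimal unversioned sketch of this idea (hypothetical COLA-style code, not the paper's implementation): level i is either empty or a sorted array of 2^i keys, and an insert cascades merges like a binary-counter increment, so each element is rewritten at most once per level.

```python
from bisect import bisect_left
from heapq import merge

class DoublingArray:
    """Unversioned doubling array (COLA-style): level i is empty or a
    sorted list of 2^i keys; inserts cascade merges like binary addition."""

    def __init__(self):
        self.levels = []  # levels[i]: sorted list of 2^i keys, or None

    def insert(self, key):
        carry = [key]
        for i, level in enumerate(self.levels):
            if level is None:
                self.levels[i] = carry
                return
            carry = list(merge(level, carry))  # one sequential merge per level
            self.levels[i] = None
        self.levels.append(carry)

    def lookup(self, key):
        # Without fractional cascading: one binary search per level.
        for level in self.levels:
            if level:
                j = bisect_left(level, key)
                if j < len(level) and level[j] == key:
                    return True
        return False
```

After inserting the slide's example keys 2, 8, 9, 11, levels 0 and 1 are empty and level 2 holds the sorted array [2, 8, 9, 11].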
29. Fractional Cascading
• Use information from the search at level l to help the search at level l+1
• From each array, sample every 4th element and put a pointer to it in the previous level
• 'Forward pointers' give bounds for the search in the next array
[Diagram: found entry, with forward pointers into the next array]
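The sampling scheme can be sketched as follows (hypothetical code; for clarity the samples live in a side list per level rather than being merged into the level itself, as the real structure would do). After each level's search fails, the sampled forward pointers confine the next level's binary search to a window of O(1) sampled gaps:

```python
from bisect import bisect_left

def build_forward_pointers(levels):
    """For each level l, sample every 4th element of level l+1 and keep
    (sampled_key, index_in_next_level) pairs as forward pointers."""
    return [[(nxt[j], j) for j in range(0, len(nxt), 4)]
            for nxt in levels[1:]] + [[]]   # last level has no next level

def cascading_lookup(levels, pointers, key):
    """Search each level; forward pointers bound the next level's search."""
    lo, hi = 0, len(levels[0])
    for i, level in enumerate(levels):
        j = bisect_left(level, key, lo, hi)
        if j < len(level) and level[j] == key:
            return True
        if i + 1 < len(levels):
            ptrs = pointers[i]
            p = bisect_left([k for k, _ in ptrs], key)
            # Window between the last sample < key and the first sample >= key.
            lo = ptrs[p - 1][1] if p > 0 else 0
            hi = ptrs[p][1] + 1 if p < len(ptrs) else len(levels[i + 1])
    return False
```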
43. Adding versions
If the layout is good for v1 ...
  version 1: k1 k2 k3 k4 k5 k6 k7 k8 k9 k10 k11 k12 k13 | version 2: k6
... then it's bad for v2.
If you try to keep all versions of a key close ...
  k1 k2 k3 k4 k5 k6 k6 k6 k6 k6 ... k7 k8 k9 k10 k11 k12 k13   (versions 2, 3, 4, ...)
... then it's bad for all versions.
46. Density
[Diagram: a versioned array over keys k0..k3 and versions v0..v5, with its version tree and on-disk layout:
  k0,v0,x  k1,v0,x  k1,v2,x  k2,v1,x  k2,v2,x  k2,v3,x  k3,v1,x  k3,v2,x
  live(v1) = 4, live(v2) = 4, live(v3) = 4, density = 4/8]
• Arrays are tagged with a version set W; here W = {v1, v2, v3}
• f(A,v) = (#elements in A live at version v) / |A|
• density(A,W) = min_{w in W} f(A,w)
• We say the array (A,W) is dense if density >= 1/5
• Tradeoff: high density means good range queries, but many duplicates (imagine density 1 vs. density 1/N)
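The two definitions translate directly into code (a sketch; `live` is assumed to be a given predicate, e.g. derived from the version tree as in the diagram):

```python
def f(A, live, v):
    """f(A, v) = (#elements in A live at version v) / |A|."""
    return sum(1 for e in A if live(e, v)) / len(A)

def density(A, W, live):
    """density(A, W) = min over w in W of f(A, w)."""
    return min(f(A, live, w) for w in W)

def is_dense(A, W, live):
    """(A, W) is dense if its density is at least 1/5."""
    return density(A, W, live) >= 1 / 5
```

At density 1 every tagged version sees the whole array (fast range queries, maximal duplication); at density 1/N a version may see only a single live element per array.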
47. Range queries
Theorem 2. A range query at version v costs O(log Nv + Z/B) amortized I/Os.
• Imagine scanning over each accessible array
• Density => the bound is trivially true for large ('voluminous') range queries
• For point queries:
  • amortize over all keys k for a fixed version v: let l(k,v) be the cost of lookup(k,v); the amortized cost is Σ_k l(k,v)/Nv
  • each query examines disjoint regions of the array
  • density implies the total size examined is O(Nv log Nv)
[Paper excerpt: For large range queries, the worst case matches the optimal bound of O(log Nv + Z/B). For much smaller range queries, the worst-case performance may be the same as for a point query; the amortized bound applies to these smaller queries. For an array Ai, let l(k,v,Ai) be the number of I/Os used in examining elements in Ai for lookup(k,v).]
48. Don't worry, stay dense!
• Version sets are disjoint at each level -- lookups examine one array per level
• Merge arrays with intersecting version sets
• The result of a merge might not be dense
• Answer: density amplification!
[Diagram: promote -> merge -> density amplification -> demote, over arrays tagged with version sets such as {1,2}, {2,3}, {1,2,3}, {1,3}, {1}, {4}]
50. Density amplification
Example: the merged array (keys k0..k3, versions v0..v5) has live(v0) = 2 and density = 2/11 < 1/5, so it is not dense. A version split amplifies density:
• split 1: (A1, {v0, v5}) has size 4; live(v0) = 2, live(v5) = 4; density = 2/4
• split 2: (A2, {v4, v1, v2, v3}) has size 7; live(v4) = 2, live(v1) = 3, live(v2) = 3, live(v3) = 3; density = 2/7
Both splits have density >= 1/5, so they can remain at the current level.
[Paper excerpt: If (A, V) also satisfies (L-live) then every split of it does (since all live elements are included), and likewise for (L-edge). It follows that version splitting (A, V) -- which necessarily has no promotable versions -- results in a set of arrays all of which satisfy all of the L-* conditions necessary to stay at level l. The main result of this process is the following.
Lemma 3 (Promotion). The fraction of lead elements over all output arrays after a version split is >= 1/39.
Proof. First, we claim that under the same conditions as the version split lemma, if in addition |A| < 2M and live(v) >= M/3 for all v, then the number of output strata is at most 13. Consider the arrays which obey the lead fraction constraint. Each has size at least M/3, since at least one split version is live in it, and at least half of the array is lead, so at least M/6 lead keys. The total number of lead keys in array A is <= 2M, since the array itself is no larger than this; it follows that there can be no more than ...]
53. Update bound
Theorem 1. The stratified doubling array performs updates to a leaf version v in a cache-oblivious O((log Nv)/B) amortized I/Os.
• Not possible to use the basic amortized method (some elements exist in many arrays; some elements are merged many times)
• Idea: charge the cost of merges/splits to lead elements only
• (k,v) appears as lead in exactly 1 array -> at most N total lead elements
• Each lead element receives a charge of $c/B on promotion
• Total charge for version v is O((log Nv)/B)
[Paper excerpt: On snapshot or clone of version v to a new descendant version v', v' is registered for each array A which is currently registered to the parent of v. This does not require any I/Os.
Proof. Assume we have at our disposal a memory buffer of size at least B (recall that B is not known to the algorithm). Then each array that is involved in a disk merge has size at least B, so a merge of some number of arrays of total size k elements costs O(k/B) I/Os. In the COLA [5], each element exists in exactly one array and may participate in O(log N) merges, which immediately gives the desired amortized bound. In the scheme described here, elements may exist in many arrays, and elements may participate in many merges at the same level (e.g. when an array at level l is version split and some subarrays remain at level l after the version split). Nevertheless, we shall prove the theorem ...]
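The COLA part of the argument ("each element is rewritten at most once per level") can be sanity-checked numerically with an unversioned doubling array (a hypothetical sketch, not the stratified structure; it counts every element written each time a merged array is stored):

```python
from heapq import merge

def total_writes(n):
    """Insert keys 0..n-1 into an unversioned doubling array and count
    every element written when an array is stored. Each element is
    rewritten at most once per level, so the total is O(n log n)."""
    levels, writes = [], 0
    for key in range(n):
        carry = [key]
        for i, level in enumerate(levels):
            if level is None:
                levels[i] = carry
                break
            carry = list(merge(level, carry))  # cascade, like binary addition
            levels[i] = None
        else:
            levels.append(carry)
        writes += len(carry)  # elements written when this array is stored
    return writes
```

For n = 1024 this gives 6 element writes per insert on average, consistent with O(log n) amortized writes per insert; dividing by the B elements that fit in one block recovers the O((log n)/B) amortized I/O shape for the unversioned case.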
55. Insert rate, as a function of dictionary size
[Plot: inserts per second (log scale, 100 to 1e+06) vs. keys (millions, 1 to 10), Stratified B-tree vs. CoW B-tree: the Stratified B-tree is ~3 orders of magnitude faster]
56. Range rate, as a function of dictionary size
[Plot: reads per second (log scale, 10000 to 1e+09) vs. keys (millions, 1 to 10), Stratified B-tree vs. CoW B-tree: ~1 order of magnitude difference]
57. bitbucket.org/acunu
www.acunu.com/download
Apache, Apache Cassandra, Cassandra, Hadoop, and the eye and
elephant logos are trademarks of the Apache Software Foundation.
Speaker notes
LolCoW. If you want fast updates, the CoW technique cannot help: the CoW B-tree is built around the assumption that every update can do a lookup and update reference counts.
The crucial notion is density. A versioned array, a version tree and its layout on disk. Versions v1, v2, v3 are tagged, so dark entries are lead entries. The entry (k0,v0,x) is written in v0, so it is not a lead entry, but it is live at v1, v2 and v3. Similarly, (k1,v0,x) is live at v1 and v3 (since it was not overwritten at v1) but not at v2. The live counts are as follows: live(v1) = 4, live(v2) = 4, live(v3) = 4, density = 4/8. In practice, the on-disk layout can be compressed by writing the key once for all the versions, and other well-known techniques.
Example of density amplification. The merged array has density $\frac{2}{11} < \frac{1}{5}$, so it is not dense. We find a split into two parts: the first split $(A_{1},\{v_{0},v_{5}\})$ has size 4 and density $\frac{1}{2}$. The second split $(A_{2},\{v_{4},v_{1},v_{2},v_{3}\})$ has size 7 and density $\frac{2}{7}$. Both splits have size $<8$ and density $\ge \frac{1}{5}$, so they can remain at the current level.

We start at the root version and greedily search for a version $v$ and some subset of its children whose split arrays can be merged into one dense array at level $l$. More precisely, letting $\mathcal{U}=\bigcup_{i} \mathcal{W'}[v_{i}]$, we search for a subset of $v$'s children $\{v_{i}\}$ such that
$$|\mathrm{split}(\mathcal{A'},\mathcal{U})| < 2^{l+1}.$$
If no such set exists at $v$, we recurse into the child $v_{i}$ maximizing $|\mathrm{split}(\mathcal{A'}, \mathcal{W'}[v_{i}])|$. It is possible to show that this always finds a dense split. Once such a set $\mathcal{U}$ is identified, the corresponding array is written out, and we recurse on the remainder $\mathrm{split}(\mathcal{A'}, \mathcal{W'} \setminus \mathcal{U})$. Figure \ref{fig:split} gives an example of density amplification.
The plot shows range query performance (elements/s extracted using range queries of size 1000). The CoW B-tree is limited by random IO here ((100 IOs/s * 32KB) / (200 bytes/key) = 16,384 keys/s), but the Stratified B-tree is CPU-bound (OCaml is single-threaded). Preliminary performance results from a highly-concurrent in-kernel implementation suggest that well over 500k updates/s are possible with 16 cores.