11. Erasure Coding Saves Storage
§ Simplified Example: storing 2 bits
1 01 0Replication:
XOR Coding: 1 0⊕ 1=
2 extra bits
1 extra bit
12. Erasure Coding Saves Storage
§ Simplified Example: storing 2 bits
§ Same data durability
- can lose any 1 bit
1 01 0Replication:
XOR Coding: 1 0⊕ 1=
2 extra bits
1 extra bit
13. Erasure Coding Saves Storage
§ Simplified Example: storing 2 bits
§ Same data durability
- can lose any 1 bit
§ Half the storage overhead
1 01 0Replication:
XOR Coding: 1 0⊕ 1=
2 extra bits
1 extra bit
14. Erasure Coding Saves Storage
§ Simplified Example: storing 2 bits
§ Same data durability
- can lose any 1 bit
§ Half the storage overhead
§ Slower recovery
1 01 0Replication:
XOR Coding: 1 0⊕ 1=
2 extra bits
1 extra bit
17. Erasure Coding Saves Storage
§ Facebook
- f4 stores 65PB of BLOBs in EC
§ Windows Azure Storage (WAS)
- A PB of new data every 1~2 days
- All “sealed” data stored in EC
18. Erasure Coding Saves Storage
§ Facebook
- f4 stores 65PB of BLOBs in EC
§ Windows Azure Storage (WAS)
- A PB of new data every 1~2 days
- All “sealed” data stored in EC
§ Google File System
- Large portion of data stored in EC
21. Roadmap
§ Background of EC
- Redundancy Theory
- EC in Distributed Storage Systems
§ HDFS-EC architecture
- Choosing Block Layout
- NameNode — Generalizing the Block Concept
- Client — Parallel I/O
- DataNode — Background Reconstruction
22. Roadmap
§ Background of EC
- Redundancy Theory
- EC in Distributed Storage Systems
§ HDFS-EC architecture
- Choosing Block Layout
- NameNode — Generalizing the Block Concept
- Client — Parallel I/O
- DataNode — Background Reconstruction
§ Hardware-accelerated Codec Framework
23. Durability and Efficiency
Data Durability = How many simultaneous failures can be tolerated?
Storage Efficiency = How much portion of storage is for useful data?
24. Durability and Efficiency
Data Durability = How many simultaneous failures can be tolerated?
Storage Efficiency = How much portion of storage is for useful data?
Replica
DataNode0 DataNode1 DataNode2
Block
NameNode
Replica Replica
3-way Replication:
25. Durability and Efficiency
Data Durability = How many simultaneous failures can be tolerated?
Storage Efficiency = How much portion of storage is for useful data?
Replica
DataNode0 DataNode1 DataNode2
Block
NameNode
Replica Replica
3-way Replication:
26. Durability and Efficiency
Data Durability = How many simultaneous failures can be tolerated?
Storage Efficiency = How much portion of storage is for useful data?
Replica
DataNode0 DataNode1 DataNode2
Block
NameNode
Replica Replica
3-way Replication: Data Durability = 2
27. Durability and Efficiency
Data Durability = How many simultaneous failures can be tolerated?
Storage Efficiency = How much portion of storage is for useful data?
Replica
DataNode0 DataNode1 DataNode2
Block
NameNode
Replica Replica
3-way Replication: Data Durability = 2
28. Durability and Efficiency
Data Durability = How many simultaneous failures can be tolerated?
Storage Efficiency = How much portion of storage is for useful data?
Replica
DataNode0 DataNode1 DataNode2
Block
NameNode
Replica Replica
useful data
3-way Replication: Data Durability = 2
redundant data
29. Durability and Efficiency
Data Durability = How many simultaneous failures can be tolerated?
Storage Efficiency = How much portion of storage is for useful data?
Replica
DataNode0 DataNode1 DataNode2
Block
NameNode
Replica Replica
useful data
3-way Replication: Data Durability = 2
Storage Efficiency = 1/3 (33%)
redundant data
30. Durability and Efficiency
Data Durability = How many simultaneous failures can be tolerated?
Storage Efficiency = How much portion of storage is for useful data?
31. Durability and Efficiency
Data Durability = How many simultaneous failures can be tolerated?
Storage Efficiency = How much portion of storage is for useful data?
XOR:
X Y X ⊕ Y
0 0 0
0 1 1
1 0 1
1 1 0
32. Durability and Efficiency
Data Durability = How many simultaneous failures can be tolerated?
Storage Efficiency = How much portion of storage is for useful data?
XOR:
X Y X ⊕ Y
0 0 0
0 1 1
1 0 1
1 1 0
Y = 0 ⊕ 1 = 1
33. Durability and Efficiency
Data Durability = How many simultaneous failures can be tolerated?
Storage Efficiency = How much portion of storage is for useful data?
XOR:
Data Durability = 1
X Y X ⊕ Y
0 0 0
0 1 1
1 0 1
1 1 0
Y = 0 ⊕ 1 = 1
34. Durability and Efficiency
Data Durability = How many simultaneous failures can be tolerated?
Storage Efficiency = How much portion of storage is for useful data?
XOR:
Data Durability = 1
useful data redundant data
X Y X ⊕ Y
0 0 0
0 1 1
1 0 1
1 1 0
35. Durability and Efficiency
Data Durability = How many simultaneous failures can be tolerated?
Storage Efficiency = How much portion of storage is for useful data?
XOR:
Data Durability = 1
Storage Efficiency = 2/3 (67%)
useful data redundant data
X Y X ⊕ Y
0 0 0
0 1 1
1 0 1
1 1 0
36. Durability and Efficiency
Data Durability = How many simultaneous failures can be tolerated?
Storage Efficiency = How much portion of storage is for useful data?
Reed-Solomon (RS):
37. Durability and Efficiency
Data Durability = How many simultaneous failures can be tolerated?
Storage Efficiency = How much portion of storage is for useful data?
Reed-Solomon (RS):
38. Durability and Efficiency
Data Durability = How many simultaneous failures can be tolerated?
Storage Efficiency = How much portion of storage is for useful data?
Reed-Solomon (RS):
Data Durability = 2
Storage Efficiency = 4/6 (67%)
39. Durability and Efficiency
Data Durability = How many simultaneous failures can be tolerated?
Storage Efficiency = How much portion of storage is for useful data?
Reed-Solomon (RS):
Data Durability = 2
Storage Efficiency = 4/6 (67%)
Very flexible!
40. Durability and Efficiency
Data Durability = How many simultaneous failures can be tolerated?
Storage Efficiency = How much portion of storage is for useful data?
41. Durability and Efficiency
Data Durability = How many simultaneous failures can be tolerated?
Storage Efficiency = How much portion of storage is for useful data?
Data Durability Storage Efficiency
42. Durability and Efficiency
Data Durability = How many simultaneous failures can be tolerated?
Storage Efficiency = How much portion of storage is for useful data?
Data Durability Storage Efficiency
Single Replica
43. Durability and Efficiency
Data Durability = How many simultaneous failures can be tolerated?
Storage Efficiency = How much portion of storage is for useful data?
Data Durability Storage Efficiency
Single Replica 0
44. Durability and Efficiency
Data Durability = How many simultaneous failures can be tolerated?
Storage Efficiency = How much portion of storage is for useful data?
Data Durability Storage Efficiency
Single Replica 0 100%
45. Durability and Efficiency
Data Durability = How many simultaneous failures can be tolerated?
Storage Efficiency = How much portion of storage is for useful data?
Data Durability Storage Efficiency
Single Replica 0 100%
3-way Replication
46. Durability and Efficiency
Data Durability = How many simultaneous failures can be tolerated?
Storage Efficiency = How much portion of storage is for useful data?
Data Durability Storage Efficiency
Single Replica 0 100%
3-way Replication 2
47. Durability and Efficiency
Data Durability = How many simultaneous failures can be tolerated?
Storage Efficiency = How much portion of storage is for useful data?
Data Durability Storage Efficiency
Single Replica 0 100%
3-way Replication 2 33%
48. Durability and Efficiency
Data Durability = How many simultaneous failures can be tolerated?
Storage Efficiency = How much portion of storage is for useful data?
Data Durability Storage Efficiency
Single Replica 0 100%
3-way Replication 2 33%
XOR with 6 data cells
49. Durability and Efficiency
Data Durability = How many simultaneous failures can be tolerated?
Storage Efficiency = How much portion of storage is for useful data?
Data Durability Storage Efficiency
Single Replica 0 100%
3-way Replication 2 33%
XOR with 6 data cells 1
50. Durability and Efficiency
Data Durability = How many simultaneous failures can be tolerated?
Storage Efficiency = How much portion of storage is for useful data?
Data Durability Storage Efficiency
Single Replica 0 100%
3-way Replication 2 33%
XOR with 6 data cells 1 86%
51. Durability and Efficiency
Data Durability = How many simultaneous failures can be tolerated?
Storage Efficiency = How much portion of storage is for useful data?
Data Durability Storage Efficiency
Single Replica 0 100%
3-way Replication 2 33%
XOR with 6 data cells 1 86%
RS (6,3)
52. Durability and Efficiency
Data Durability = How many simultaneous failures can be tolerated?
Storage Efficiency = How much portion of storage is for useful data?
Data Durability Storage Efficiency
Single Replica 0 100%
3-way Replication 2 33%
XOR with 6 data cells 1 86%
RS (6,3) 3
53. Durability and Efficiency
Data Durability = How many simultaneous failures can be tolerated?
Storage Efficiency = How much portion of storage is for useful data?
Data Durability Storage Efficiency
Single Replica 0 100%
3-way Replication 2 33%
XOR with 6 data cells 1 86%
RS (6,3) 3 67%
54. Durability and Efficiency
Data Durability = How many simultaneous failures can be tolerated?
Storage Efficiency = How much portion of storage is for useful data?
Data Durability Storage Efficiency
Single Replica 0 100%
3-way Replication 2 33%
XOR with 6 data cells 1 86%
RS (6,3) 3 67%
RS (10,4)
55. Durability and Efficiency
Data Durability = How many simultaneous failures can be tolerated?
Storage Efficiency = How much portion of storage is for useful data?
Data Durability Storage Efficiency
Single Replica 0 100%
3-way Replication 2 33%
XOR with 6 data cells 1 86%
RS (6,3) 3 67%
RS (10,4) 4
56. Durability and Efficiency
Data Durability = How many simultaneous failures can be tolerated?
Storage Efficiency = How much portion of storage is for useful data?
Data Durability Storage Efficiency
Single Replica 0 100%
3-way Replication 2 33%
XOR with 6 data cells 1 86%
RS (6,3) 3 67%
RS (10,4) 4 71%
72. Choosing Block Layout
• Medium: 1~6 blocks• Small files: < 1 block• Assuming (6,3) coding • Large: > 6 blocks (1 group)
64.61%
9.33%
26.06%
1.85%1.86%
96.29%
small medium large
file count
space usage
Top 2% files occupy ~65% space
Cluster A Profile
73. Choosing Block Layout
• Medium: 1~6 blocks• Small files: < 1 block• Assuming (6,3) coding • Large: > 6 blocks (1 group)
64.61%
9.33%
26.06%
1.85%1.86%
96.29%
small medium large
file count
space usage
Top 2% files occupy ~65% space
Cluster A Profile
40.08%
36.03%
23.89%
2.03%
11.38%
86.59% file count
space
usage
Top 2% files occupy ~40% space
small medium large
Cluster B Profile
74. Choosing Block Layout
• Medium: 1~6 blocks• Small files: < 1 block• Assuming (6,3) coding • Large: > 6 blocks (1 group)
64.61%
9.33%
26.06%
1.85%1.86%
96.29%
small medium large
file count
space usage
Top 2% files occupy ~65% space
Cluster A Profile
40.08%
36.03%
23.89%
2.03%
11.38%
86.59% file count
space
usage
Top 2% files occupy ~40% space
small medium large
Cluster B Profile
3.20%
20.75%
76.05%
0.00%0.36%
99.64%
file count
space usage
Dominated by small files
small medium large
Cluster C Profile
92. Reconstruction on DataNode
§ Important to avoid delay on the critical path
- Especially if original data is lost
§ Integrated with Replication Monitor
- Under-protected EC blocks scheduled together with under-replicated blocks
- New priority algorithms
§ New ErasureCodingWorker component on DataNode
93. Roadmap
§ Background of EC
- Redundancy Theory
- EC in Distributed Storage Systems
§ HDFS-EC architecture
- Choosing Block Layout
- NameNode — Generalizing the Block Concept
- Client — Parallel I/O
- DataNode — Background Reconstruction
§ Hardware-accelerated Codec Framework
94. Acceleration with Intel ISA-L
§ 1 legacy coder
- From Facebook’s HDFS-RAID project
§ 2 new coders
- Pure Java — code improvement over HDFS-RAID
- Native coder with Intel’s Intelligent Storage Acceleration Library (ISA-L)
107. EC in Distributed Storage
0~128
M
128~256
M
DataNode0
block0
block1
…
DataNode1
640~768
M
DataNode5
block5
Contiguous
DataNode6 DataNode8
data parity
…
Block Layout:
128~256MFile 0~128M … 640~768M
108. EC in Distributed Storage
0~128
M
128~256
M
DataNode0
block0
block1
…
DataNode1
640~768
M
DataNode5
block5
Contiguous
DataNode6 DataNode8
data parity
…
Block Layout:
Data Locality !
128~256MFile 0~128M … 640~768M
109. EC in Distributed Storage
0~128
M
128~256
M
DataNode0
block0
block1
…
DataNode1
640~768
M
DataNode5
block5
Contiguous
DataNode6 DataNode8
data parity
…
Block Layout:
Data Locality !
Small Files "
128~256MFile 0~128M … 640~768M
110. EC in Distributed Storage
0~128
M
128~256
M
DataNode0
block0
block1
…
DataNode1
640~768
M
DataNode5
block5
Contiguous
DataNode6 DataNode8
data parity
…
Block Layout:
Data Locality !
Small Files "
128~256MFile … 640~768M