2016/9/1
Takuya Fukudome, Yahoo Japan Corporation
HDFS Erasure Coding
in Action
2
Yahoo! JAPAN
Providing over 100 services
on PC and mobile
64.9 billion PV/month
Hadoop clusters
6,000 nodes in total
120 PB of varied data
・
・
・
Search
Shopping
News
Auction
Mail
Agenda
3
1. About HDFS Erasure Coding
• Key points
• Implementation
• Compare to replication
2. HDFS Erasure Coding Tests
• System Tests
• Basic Operations
• Reconstruct Erasure Coding Blocks
• Other features
• Performance Tests
3. Usage in Yahoo! JAPAN
• Principal workloads in our production
• Future plan
Agenda
4
1. About HDFS Erasure Coding
• Key points
• Implementation
• Compare to replication
2. HDFS Erasure Coding Tests
• System Tests
• Basic Operations
• Reconstruct Erasure Coding Blocks
• Other features
• Performance Tests
3. Usage in Yahoo! JAPAN
• Principal workloads in our production
• Future plan
What is Erasure Coding?
A technique to provide data availability and durability
5
Original raw data
Data chunks Parity
striping encoding
Storing raw data and parity
Key points
• Missing or corrupted data can be reconstructed from the surviving
data and parity (as sketched below)
• The parity is typically smaller than the original data
6
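To make the reconstruction idea concrete, here is a minimal, self-contained sketch using a single XOR parity over equal-sized data chunks. This is deliberately simpler than the Reed-Solomon(6,3) codec HDFS actually uses, but it illustrates the same principle: a lost chunk can be rebuilt from the surviving chunks plus the parity.

```java
// Minimal XOR-parity illustration (NOT the RS(6,3) codec HDFS uses).
// One parity chunk lets us rebuild any single missing data chunk.
public class XorParityDemo {
    static byte[] xor(byte[] a, byte[] b) {
        byte[] out = new byte[a.length];
        for (int i = 0; i < a.length; i++) {
            out[i] = (byte) (a[i] ^ b[i]);
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] d0 = "chunk-0".getBytes();
        byte[] d1 = "chunk-1".getBytes();
        byte[] d2 = "chunk-2".getBytes();

        // Encode: parity = d0 XOR d1 XOR d2
        byte[] parity = xor(xor(d0, d1), d2);

        // Pretend d1 was lost; rebuild it from the survivors plus the parity.
        byte[] rebuilt = xor(xor(d0, d2), parity);
        System.out.println(new String(rebuilt));   // prints "chunk-1"
    }
}
```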
HDFS Erasure Coding
• New feature in Hadoop 3.0
• To reduce storage overhead
• Half the overhead of 3x replication
• Same durability as replication
7
Agenda
8
1. About HDFS Erasure Coding
• Key points
• Implementation
• Compare to replication
2. HDFS Erasure Coding Tests
• System Tests
• Basic Operations
• Reconstruct Erasure Coding Blocks
• Other features
• Performance Tests
3. Usage in Yahoo! JAPAN
• Principal workloads in our production
• Future plan
Implementation (Phase1)
• Striped Block Layout
• Stripes the original raw data
• Encodes the striped data into parity data
• Targets cold data
• Not modified and rarely accessed
• Reed-Solomon (6,3) codec
• System default codec
• 6 data and 3 parity blocks
9
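As a quick check of the overhead claim, the arithmetic below compares 3x replication with RS(6,3): replication stores three copies of every byte, while RS(6,3) stores 6 data blocks plus 3 parity blocks per block group. This is a sketch of the calculation, not HDFS code.

```java
// Storage overhead: extra bytes stored, relative to the original data.
public class StorageOverhead {
    public static void main(String[] args) {
        double replicationFactor = 3.0;                        // 3x replication
        double replicationOverhead = replicationFactor - 1.0;  // 2.0 -> 200%

        int dataBlocks = 6, parityBlocks = 3;                  // RS(6,3)
        double ecOverhead = (double) parityBlocks / dataBlocks; // 0.5 -> 50%

        System.out.printf("Replication overhead: %.0f%%%n", replicationOverhead * 100);
        System.out.printf("RS(6,3) overhead:     %.0f%%%n", ecOverhead * 100);
    }
}
```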
10
Striped Block Management
• Raw data is striped
• The basic unit of striped data
is called a “cell”
• A “cell” is 64kB
64kB
Original raw data
striping
11
Striped Block Management
• The cells are written in blocks
in order
• With striped layout
64kB
Original raw data
striping
12
Striped Block Management
• The cells are written in blocks
in order
• With striped layout
64kB
Original raw data
striping
13
Striped Block Management
• The cells are written in blocks
in order
• With striped layout
64kB
Original raw data
striping
14
Striped Block Management
• The cells are written in blocks
in order
• With striped layout
64kB
Original raw data
striping
15
Striped Block Management
• The cells are written in blocks
in order
• With striped layout
64kB
Original raw data
striping
16
Striped Block Management
• The cells are written in blocks
in order
• With striped layout
64kB
Original raw data
striping
17
Striped Block Management
• The cells are written in blocks
in order
• With striped layout
64kB
Original raw data
striping
18
Striped Block Management
• The cells are written in blocks
in order
• With striped layout
64kB
Original raw data
striping
Striped Block Management
Six cells of raw data
are used to calculate three parity cells
19
Raw data Parity
Striped Block Management
The six data cells and the three parity cells
together are called a “stripe”
20
Stripe
Raw data Parity
Striped Block Management
Each stripe is written to the blocks in order
21
Raw data Parity
Striped Block Management
Each stripe is written to the blocks in order
22
Raw data Parity
Striped Block Management
Block Group
• Combines the striped data blocks and parity blocks
23
Striped raw data Striped parity data
index0 index1 index2 index3 index4 index5 index6 index7 index8
Internal Blocks
• The striped blocks in a Block Group are called “internal blocks”
• Every internal block has an index
24
Striped raw data Striped parity data
index0 index1 index2 index3 index4 index5 index6 index7 index8
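The striped layout described above boils down to a simple offset calculation. The sketch below is my own illustration (not HDFS source code) of mapping a logical file offset to an internal block index and an offset inside that block, assuming the defaults from these slides: 64 kB cells and 6 data blocks per block group.

```java
// Maps a logical byte offset of an EC file to its location inside a block group,
// assuming RS(6,3) with a 64 kB cell size (the defaults described in the slides).
public class StripedLayout {
    static final int CELL_SIZE = 64 * 1024;
    static final int DATA_BLOCKS = 6;

    public static void main(String[] args) {
        long logicalOffset = 1_000_000L;                  // some offset in the file

        long cellIndex = logicalOffset / CELL_SIZE;       // which 64 kB cell
        long stripeIndex = cellIndex / DATA_BLOCKS;       // which stripe
        int internalBlockIndex = (int) (cellIndex % DATA_BLOCKS);  // index0..index5
        long offsetInBlock = stripeIndex * CELL_SIZE + (logicalOffset % CELL_SIZE);

        System.out.printf("offset %d -> internal block index%d, offset %d in block%n",
            logicalOffset, internalBlockIndex, offsetInBlock);
        // Parity cells for this stripe live in index6..index8 of the same block group.
    }
}
```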
Agenda
25
1. About HDFS Erasure Coding
• Key points
• Implementation
• Compare to replication
2. HDFS Erasure Coding Tests
• System Tests
• Basic Operations
• Reconstruct Erasure Coding Blocks
• Other features
• Performance Tests
3. Usage in Yahoo! JAPAN
• Principal workloads in our production
• Future plan
Replication vs EC
                  Replication   Erasure Coding (RS-6-3)
Target            HOT           COLD
Storage overhead  200%          50%
Data durability   ✔             ✔
Data locality     ✔             ✖
Write performance ✔             ✖
Read performance  ✔             △
26
Replication vs EC
                  Replication   Erasure Coding (RS-6-3)
Target            HOT           COLD
Storage overhead  200%          50%
Data durability   ✔             ✔
Data locality     ✔             ✖
Write performance ✔             ✖
Read performance  ✔             △
27
It will not be modified
and rarely accessed.
Replication vs EC
                  Replication   Erasure Coding (RS-6-3)
Target            HOT           COLD
Storage overhead  200%          50%
Data durability   ✔             ✔
Data locality     ✔             ✖
Write performance ✔             ✖
Read performance  ✔             △
28
Replication vs EC
                  Replication   Erasure Coding (RS-6-3)
Target            HOT           COLD
Storage overhead  200%          50%
Data durability   ✔             ✔
Data locality     ✔             ✖
Write performance ✔             ✖
Read performance  ✔             △
29
The tripled replication mechanism can tolerate losing 2 of the 3
replicas.
With Erasure Coding (RS-6-3), data can still be reconstructed even if
3 of the 9 storages fail.
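In other words, an RS(6,3) block group stays readable as long as at least 6 of its 9 internal blocks survive, so it tolerates any 3 failures, just as 3x replication tolerates losing 2 of 3 replicas. A trivial, purely illustrative sketch of that check:

```java
// Readability check for an RS(6,3) block group: any 6 of the 9 internal blocks
// (data or parity) are enough to reconstruct the original data.
public class DurabilityCheck {
    static boolean isReadable(boolean[] internalBlockAlive) {
        int alive = 0;
        for (boolean b : internalBlockAlive) {
            if (b) alive++;
        }
        return alive >= 6;   // RS(6,3): need at least 6 survivors
    }

    public static void main(String[] args) {
        boolean[] group = new boolean[9];
        java.util.Arrays.fill(group, true);
        group[1] = group[4] = group[7] = false;   // three internal blocks lost
        System.out.println(isReadable(group));    // true: still reconstructable
        group[0] = false;                         // a fourth failure
        System.out.println(isReadable(group));    // false: data is lost
    }
}
```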
Replication vs EC
                  Replication   Erasure Coding (RS-6-3)
Target            HOT           COLD
Storage overhead  200%          50%
Data durability   ✔             ✔
Data locality     ✔             ✖
Write performance ✔             ✖
Read performance  ✔             △
30
Data locality is lost when using Erasure Coding.
However, EC targets cold data, which is not accessed
frequently.
Replication vs EC
                  Replication   Erasure Coding (RS-6-3)
Target            HOT           COLD
Storage overhead  200%          50%
Data durability   ✔             ✔
Data locality     ✔             ✖
Write performance ✔             ✖
Read performance  ✔             △
31
With Erasure Coding, calculating the parity data lowers
write throughput.
Read performance does not decrease much,
but if some internal blocks are missing, read throughput
drops, because the client must reconstruct them.
Replication vs EC
                  Replication   Erasure Coding (RS-6-3)
Target            HOT           COLD
Storage overhead  200%          50%
Data durability   ✔             ✔
Data locality     ✔             ✖
Write performance ✔             ✖
Read performance  ✔             △
32
With Erasure Coding, recovering missing data requires a node to
read the surviving raw data and parity data from remote nodes
and then reconstruct the missing data.
This process consumes network bandwidth and CPU resources.
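This cost difference is easy to quantify. To recover one lost replicated block, a DataNode copies a single block; to rebuild one internal block of an RS(6,3) group, the reconstructing node reads six surviving internal blocks and runs the decoder. A rough illustration, assuming 128 MB internal blocks (a figure chosen for illustration, not taken from the slides):

```java
// Rough network-cost comparison for recovering ONE lost block,
// assuming 128 MB internal blocks (an illustrative figure, not from the slides).
public class RecoveryCost {
    public static void main(String[] args) {
        long blockMB = 128;

        long replicationReadMB = blockMB;   // copy one surviving replica
        long ecReadMB = 6 * blockMB;        // read 6 survivors, then decode

        System.out.println("Replication recovery reads: " + replicationReadMB + " MB");
        System.out.println("RS(6,3) recovery reads:     " + ecReadMB + " MB (+ CPU for decoding)");
    }
}
```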
Agenda
33
1. About HDFS Erasure Coding
• Key points
• Implementation
• Compare to replication
2. The results of the erasure coding testing
• System tests
• Performance tests
3. Usage in our production
• Principal workloads in our production
• Future plan
HDFS EC Tests
• System Tests
• Basic operations
• Reconstruct EC blocks
• Decommission DataNodes
• Other Features
34
HDFS EC Tests
• System Tests
• Basic operations
• Reconstruct EC blocks
• Decommission DataNodes
• Other Features
35
Basic Operations
“hdfs erasurecode -setPolicy”
• Target
• Directories only
• Must be empty
• Sub-directories and files inherit the policy
• Superuser privilege needed
• Default policy: Reed-Solomon (6,3)
36
Basic Operations
To confirm whether a directory has an Erasure Coding
policy:
“hdfs erasurecode -getPolicy”
• Shows the codec and cell size
37
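The same policy operations are also available programmatically. The sketch below assumes the Hadoop 3 DistributedFileSystem methods setErasureCodingPolicy and getErasureCodingPolicy, and a policy name of "RS-6-3-64k"; the exact method signatures and policy names changed across the 3.0 alpha releases, so treat these names as assumptions and verify them against your Hadoop version.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;

// Assumed Hadoop 3 client API; verify method names and signatures for your release.
public class EcPolicyExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        try (FileSystem fs = FileSystem.get(conf)) {
            DistributedFileSystem dfs = (DistributedFileSystem) fs;
            Path dir = new Path("/cold/weblogs");           // hypothetical empty directory

            // Equivalent of "hdfs erasurecode -setPolicy" (superuser privilege required).
            dfs.setErasureCodingPolicy(dir, "RS-6-3-64k");  // policy name is an assumption

            // Equivalent of "hdfs erasurecode -getPolicy".
            System.out.println(dfs.getErasureCodingPolicy(dir));
        }
    }
}
```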
File Operations
After setting an EC policy,
basic file operations work on EC files as usual
• The file format is transparent to HDFS clients
• Writes and reads tolerate DataNode failures
The move operation is a little different:
• The file format is not automatically converted by a move
38
HDFS EC Tests
• System Tests
• Basic operations
• Reconstruct EC blocks
• Decommission DataNodes
• Other Features
39
Reconstruct EC Blocks
A missing block can be reconstructed as long as at least 6 internal
blocks of the group survive.
40
Reading the 6 surviving blocks and reconstructing the missing block
Reconstruct EC Blocks
• The reconstruction cost does not depend on the number of missing internal
blocks
• Whether one or three internal blocks are missing, the reconstruction cost is
the same
• Block groups with more missing internal blocks have higher priority (illustrated below)
41
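The priority rule in the last bullet can be pictured as a simple sort: block groups that have lost more internal blocks are closer to becoming unrecoverable, so they are scheduled first. The sketch below is an illustration of that ordering, not the NameNode's actual scheduling code.

```java
import java.util.*;

// Illustrative prioritization: block groups with more missing internal blocks first.
public class ReconstructionPriority {
    static class BlockGroup {
        final String id;
        final int missing;
        BlockGroup(String id, int missing) { this.id = id; this.missing = missing; }
    }

    public static void main(String[] args) {
        List<BlockGroup> pending = new ArrayList<>(Arrays.asList(
            new BlockGroup("bg-001", 1),
            new BlockGroup("bg-002", 3),
            new BlockGroup("bg-003", 2)));

        // More missing internal blocks -> closer to unrecoverable -> schedule first.
        pending.sort(Comparator.comparingInt((BlockGroup bg) -> bg.missing).reversed());
        for (BlockGroup bg : pending) {
            System.out.println(bg.id + " missing=" + bg.missing);
        }
    }
}
```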
Rack Fault Tolerance
BlockPlacementPolicyRackFaultTolerant
• A new block placement policy
• Chooses storages so that blocks are distributed across as
many racks as possible (sketched below)
42
DN1 DN2 DN3 DN4 DN5 DN6
DN7 DN8 DN9
Rack1 Rack2 Rack3 Rack4 Rack5 Rack6
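The goal of BlockPlacementPolicyRackFaultTolerant can be illustrated with a simple round-robin assignment: spread the 9 internal blocks of a block group over as many racks as possible, so a single rack failure never removes more internal blocks than necessary. The sketch below illustrates that goal; it is not the actual HDFS policy implementation.

```java
import java.util.*;

// Illustrative round-robin placement: spread the 9 internal blocks of a block
// group over as many racks as possible (NOT the actual HDFS policy code).
public class RackSpreadDemo {
    public static void main(String[] args) {
        List<String> racks = Arrays.asList("Rack1", "Rack2", "Rack3", "Rack4", "Rack5", "Rack6");
        int internalBlocks = 9;   // RS(6,3): 6 data + 3 parity

        Map<String, List<Integer>> placement = new LinkedHashMap<>();
        for (String rack : racks) {
            placement.put(rack, new ArrayList<>());
        }
        for (int index = 0; index < internalBlocks; index++) {
            String rack = racks.get(index % racks.size());   // round-robin over racks
            placement.get(rack).add(index);
        }
        // With 6 racks, no rack holds more than 2 internal blocks, so losing one
        // rack never costs more than 2 of the 9 blocks of a group.
        placement.forEach((rack, blocks) -> System.out.println(rack + " -> " + blocks));
    }
}
```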
HDFS EC Tests
• System Tests
• Basic operations
• Reconstruct EC blocks
• Decommission DataNodes
• Other Features
43
Decommission DNs
Decommissioning is basically the same as block recovery.
However, in Erasure Coding, simply transferring the internal blocks to another
DataNode is enough; no reconstruction is needed.
44
Decommissioning node Other living nodes
Internal block
Simple transfer
Internal block
HDFS EC Tests
• System Tests
• Basic operations
• Reconstruct EC blocks
• Decommission DataNodes
• Other Features
45
Supported
• NameNode HA
• Quota configuration
• HDFS File System Check
46
Unsupported
• Flush and Synchronize(hflush/hsync)
• Append to EC files
• Truncate EC files
These features are not supported.
However, they are not critical, because HDFS erasure coding targets
cold data storage.
47
Agenda
48
1. About HDFS Erasure Coding
• Key points
• Implementation
• Compare to replication
2. The results of the erasure coding testing
• System tests
• Performance tests
3. Usage in our production
• Principal workloads in our production
• Future plan
HDFS EC Tests
• Performance Tests
• Writing/Reading Throughput
• TeraGen/TeraSort
• Distcp
49
50
Cluster Information
Alpha
• 37 Nodes
• 28 cores CPU * 2
• 256GB RAM
• SATA 4TB * 12 Disks
• Network 10Gbps
• One Rack
Beta
• 82 Nodes
• 28 cores CPU * 2
• 128GB RAM
• SATA 4TB * 15 Disks
• Network 10Gbps
• Five Racks
HDFS EC Tests
• Performance Tests
• Writing/Reading Throughput
• TeraGen/TeraSort
• Distcp
51
Read/Write Throughput
ErasureCodingBenchmarkThroughput
• Writes and reads files in both the replication and erasure coding formats with
multiple threads
• In Erasure Coding, writing throughput was about 65% of replication’s
• Reading throughput decreased slightly in Erasure Coding
52
                  Replication   Erasure Coding
Write             111.3 MB/s    73.94 MB/s
Read (stateful)   111.3 MB/s    111.3 MB/s
Read (positional) 111.3 MB/s    107.79 MB/s
Read With Missing Internal Blocks
Throughput decreased when an internal block was missing,
because the client needed to reconstruct the missing block
53
                  All blocks living   An internal block missing
Read (stateful)   111.3 MB/s          108.94 MB/s
Read (positional) 107.79 MB/s         92.25 MB/s
HDFS EC Tests
• Performance Tests
• Writing/Reading Throughput
• TeraGen/TeraSort
• Distcp
54
55
CPU Time of TeraGen
X-axis: number of rows of outputs
Y-axis: log scaled CPU times(ms)
CPU time increased with the erasure
coding format.
The overhead of calculating the parity
data affected write performance.
56
TeraSort Map/Reduce Time Cost
Total time spent by map(left) and reduce(right)
X-axis: number of rows of outputs
Y-axis: log scaled CPU times(ms)
The time spent in the reduce tasks
increased significantly with erasure
coding, because the main workload of
the reduce phase was writing.
HDFS EC Tests
• Performance Tests
• Writing/Reading Throughput
• TeraGen/TeraSort
• Distcp
57
Elapsed Time of Distcp
• Our real log data(2TB)
• Copying from replication to EC took roughly three times as long as copying from
replication to replication (see the calculation below)
• Currently, distcp is the best way to convert existing data to Erasure Coding
58
                             Real elapsed time   CPU time
Replication to EC            17 min 11 s         64,754,291 ms
Replication to Replication   5 min 5 s           20,187,156 ms
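For reference, the elapsed times in the table work out to a ratio of roughly 3.4, which is where the "roughly three times as long" figure comes from:

```java
// Ratio of the elapsed times from the distcp test above.
public class DistcpRatio {
    public static void main(String[] args) {
        int toEcSeconds = 17 * 60 + 11;           // Replication to EC: 17 min 11 s
        int toReplicationSeconds = 5 * 60 + 5;    // Replication to Replication: 5 min 5 s
        System.out.printf("The EC copy took %.1fx as long%n",
            (double) toEcSeconds / toReplicationSeconds);   // ~3.4x
    }
}
```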
Agenda
59
1. About HDFS Erasure Coding
• Key points
• Implementation
• Compare to replication
2. HDFS Erasure Coding Tests
• System Tests
• Basic Operations
• Reconstruct Erasure Coding Blocks
• Other features
• Performance Tests
3. Usage in Yahoo! JAPAN
• Principal workloads in our production
• Future plan
60
Target
Raw data of weblogs
Daily total 6.2TB
Up to 400 days
The storage capacity used in our
production HDFS.
We need to reduce storage space costs.
EC with Archive Disk
We are using Erasure Coding on archive storage.
An archive disk costs about 70% of a normal disk.
In total, the storage cost of EC on archive disks can be reduced to 35% of 3x
replication on normal disks (see the calculation below).
61
X3 replication with normal disk Erasure coding with archive disk
35% of x3 replica
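The 35% figure follows from multiplying the two savings together: RS(6,3) stores 1.5 bytes per raw byte instead of 3, and the archive disks cost about 70% of normal disks per byte. A small worked calculation:

```java
// Relative storage cost of EC-on-archive-disk vs 3x-replication-on-normal-disk.
public class ArchiveEcCost {
    public static void main(String[] args) {
        double replicationBytes = 3.0;      // 3x replication: 3 bytes stored per raw byte
        double ecBytes = 9.0 / 6.0;         // RS(6,3): 1.5 bytes stored per raw byte
        double archiveDiskCost = 0.7;       // archive disk ~70% of the cost of a normal disk

        double relativeCost = (ecBytes * archiveDiskCost) / replicationBytes;
        System.out.printf("EC on archive disk costs %.0f%% of 3x replication on normal disk%n",
            relativeCost * 100);            // prints 35%
    }
}
```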
62
EC Files as Hive Input
Erasure Coding ORC files
2TB, 17billion records
• The query execution time appeared to
increase when the input data was
erasure-coded files
• But this is acceptable, considering
the queries are rarely executed
Queries
Agenda
63
1. About HDFS Erasure Coding
• Key points
• Implementation
• Compare to replication
2. HDFS Erasure Coding Tests
• System Tests
• Basic Operations
• Reconstruct Erasure Coding Blocks
• Other features
• Performance Tests
3. Usage in Yahoo! JAPAN
• Principal workloads in our production
• Future plan
Future Phase of HDFS EC
• Codecs
• Intel ISA-L
• Hitchhiker algorithm
• Contiguous layout
• To provide data locality
• Implement hflush/hsync
If these were implemented, the erasure coding format could be
used in many more scenarios
64
Conclusion
• HDFS Erasure Coding
• The target is storing cold data
• It cuts storage costs roughly in half without sacrificing
data durability
• It’s ready for production
65
Thanks for Listening!
Editor's Notes (speaker notes)

  1. I’m Takuya Fukudome from Yahoo JAPAN. In this session, I’ll talk about HDFS Erasure Coding in Action
  2. At first, please let me introduce Yahoo! JAPAN. We at Yahoo Japan provide many web services on PC and mobile in Japan. Thanks to these services and our users, about 120PB of varied data has been accumulated in our hadoop clusters. In this session, I’ll share the new HDFS feature, erasure coding, which is being used in our cluster and helps us save about half of our storage costs. Improving overall ROI without notable loss of data durability, HDFS erasure coding is a feature you might want to adopt for your data platform.
  3. In this session. First of all, I’ll share some key points in detail about HDFS erasure coding. Then, the results of the erasure coding testing in our cluster will be shared and discussed. Finally, walking through usage and practical workloads of the HDFS erasure coding in our production cluster.
  4. First, I’ll introduce the HDFS erasure coding.
  5. What is Erasure Coding? Erasure coding is a technique to provide data availability and durability in which the original raw data is split into chunks and encoded into parity data. The striped raw data and the parity data are stored on different storages.
  6. If some data chunks are lost or corrupted, the missing chunks are reconstructed from the surviving raw data and the parity data. Because the parity data is typically smaller than the original raw data, erasure coding saves storage space.
  7. Erasure coding is a new feature in HDFS, planned to be released in hadoop 3.0. This feature aims to reduce the storage space overhead. The storage overhead of erasure-coded files will be half that of the tripled replication mechanism, while HDFS Erasure Coding provides the same level of data durability as replication.
  8. Next I’ll introduce the implementation of Erasure Coding in HDFS.
  9. In HDFS erasure coding, the original raw data and its parity data are written to storages with a striped layout. In phase 1, erasure coding targets cold data, which will not be modified and is rarely accessed. The codec used to calculate the parity data is Reed Solomon(6,3). In this codec, the original raw data is striped and written into 6 blocks, and 3 blocks of parity data are calculated and written as well.
  10. Let’s take a look at how data is striped and written in Erasure Coding.
  11. As described before, original raw data is striped. The basic unit of striped data is called “cell” and the ”cell” is 64KB typically. And in the Reed Solomon(6,3), the cells are written into 6 blocks in order.
  12. As described before, original raw data is striped. The basic unit of striped data is called “cell” and the ”cell” is 64KB typically. And in the Reed Solomon(6,3), the cells are written into 6 blocks in order.
  13. As described before, original raw data is striped. The basic unit of striped data is called “cell” and the ”cell” is 64KB typically. And in the Reed Solomon(6,3), the cells are written into 6 blocks in order.
  14. As described before, original raw data is striped. The basic unit of striped data is called “cell” and the ”cell” is 64KB typically. And in the Reed Solomon(6,3), the cells are written into 6 blocks in order.
  15. As described before, original raw data is striped. The basic unit of striped data is called “cell” and the ”cell” is 64KB typically. And in the Reed Solomon(6,3), the cells are written into 6 blocks in order.
  16. As described before, original raw data is striped. The basic unit of striped data is called “cell” and the ”cell” is 64KB typically. And in the Reed Solomon(6,3), the cells are written into 6 blocks in order.
  17. As described before, original raw data is striped. The basic unit of striped data is called “cell” and the ”cell” is 64KB typically. And in the Reed Solomon(6,3), the cells are written into 6 blocks in order.
  18. As described before, original raw data is striped. The basic unit of striped data is called “cell” and the ”cell” is 64KB typically. And in the Reed Solomon(6,3), the cells are written into 6 blocks in order.
  19. Then, six cells of raw data are used to calculate three cells of parity data. These nine cells are called a “stripe”.
  20. The parity data are calculated in every stripe of the original raw data and each parity data are written in parity blocks. There are always 3 parity data cells in every stripe and they are written in 3 blocks in order.
  23. Finally, the 6 raw data blocks and the 3 parity blocks are combined into an erasure coding block group.
  24. The striped blocks in a block group are called internal blocks, and each internal block has its own index. The index is determined by the order in which raw data and parity are written, and it is also used when reading and reconstructing data. Internal blocks with indexes 0 to 5 hold raw data, and those with indexes 6 to 8 hold parity data.
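  As an illustration of the layout just described (the cell labels dN and pN below are only names for this sketch, not HDFS terminology):

    Block group, RS(6,3), 64 KB cells
      index 0 (data)  : d0  d6  d12 ...
      index 1 (data)  : d1  d7  d13 ...
      ...
      index 5 (data)  : d5  d11 d17 ...
      index 6 (parity): p0  p3  p6  ...
      index 7 (parity): p1  p4  p7  ...
      index 8 (parity): p2  p5  p8  ...
    Each column (d6k..d6k+5 together with p3k..p3k+2) is one stripe.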
  25. To wrap up the introduction to HDFS erasure coding, let's compare erasure coding with replication along several dimensions.
  26. This should give a more concrete impression of HDFS erasure coding.
  27. First, unlike replication, erasure coding targets cold data that is not modified and rarely accessed.
  28. Next, the storage overhead of tripled replication is 200%, while the overhead of erasure coding is approximately 50%: with RS(6,3), 6 data blocks require 3 parity blocks, so a file occupies 1.5 times its raw size instead of 3 times. In other words, erasure coding cuts storage use roughly in half compared with tripled replication.
  29. As for data durability, tripled replication can tolerate losing two of the three replicas. With Reed-Solomon(6,3) erasure coding, HDFS can reconstruct the data even if any three of the nine internal blocks are lost. So the durability of erasure coding is at a similar level to replication.
  30. When it comes to data locality, unfortunately locality is lost with erasure coding, because the striped data is spread across different DataNodes. This can slow down jobs that run against erasure-coded files. However, cold data stored in erasure coding format is typically not accessed frequently.
  31. The next two categories are write and read performance. With erasure coding, calculating the parity data decreases write throughput. Read performance does not decrease much, but if some internal blocks are missing, read throughput drops, because the client has to reconstruct the missing data on the fly.
  32. The last category is the cost of data recovery. With replication, copying a replica to another DataNode is enough. With erasure coding, to recover missing data a node has to read the surviving raw data and parity blocks from remote nodes and reconstruct the data. This consumes network traffic and CPU, so the recovery cost is higher with erasure coding.
  33. We have walked through the key points of erasure coding; next let's check our test results. Tests were conducted in two major categories.
  34. First, system tests. In the system tests we verified basic operations, such as writing and reading erasure-coded files, and important features such as NameNode HA and automatic reconstruction of erasure-coded files.
  35. First, I'll talk about basic operations such as write and read.
  36. Walking through the basic operation tests also shows how to start using erasure coding, as in the sketch below. A new subcommand, "erasurecode", was added to the "hdfs" command, and "hdfs erasurecode -setPolicy" is the starting point: it sets an erasure coding policy on the specified directory, which must be empty. Sub-directories and files created under that directory inherit the policy, so files written or distcp'ed into it are stored as erasure-coded files. The policy defines the codec and cell size used for erasure coding, and superuser privilege is required to set it. In phase 1, the system default policy is Reed-Solomon(6,3).
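  A minimal sketch of setting up an erasure-coded directory. The paths are hypothetical, and the flag syntax shown is the one used by recent Hadoop 3 releases, where the subcommand is "hdfs ec" rather than the "hdfs erasurecode" form used in this talk.

    $ hdfs dfs -mkdir -p /ecdata/cold                         # directory must be empty when the policy is set
    $ hdfs ec -setPolicy -path /ecdata/cold -policy RS-6-3-1024k
    $ hdfs dfs -put weblog-20160901.gz /ecdata/cold/          # stored as an erasure-coded file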
  37. If you want to confirm whether a directory has an erasure coding policy, you can run the "hdfs erasurecode -getPolicy" command. It shows the codec and cell size set on the specified path.
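  A hedged example of checking the policy on a path, again using the newer "hdfs ec" form with a hypothetical path:

    $ hdfs ec -getPolicy -path /ecdata/cold
    # prints the erasure coding policy applied to the path, e.g. RS-6-3-1024k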
  38. After setting the erasure coding policy, we tested basic file operations against erasure-coded files, including write, read, copy and move. In general, you can write and read files without caring whether the target path uses replication or erasure coding, and writes and reads of erasure-coded files can tolerate a reasonable number of DataNode failures during the operation. Copy behaves the same as write and read, but move is a little different: a file's format is not automatically converted between replication and erasure coding by a move. If an erasure-coded file is moved to a normal directory without an erasure coding policy, it stays in erasure coding format, and vice versa, as illustrated below.
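  A small sketch of how copy and move interact with erasure coding. Paths are hypothetical; /ecdata/cold has an erasure coding policy and the destination directories do not.

    $ hdfs dfs -cp /ecdata/cold/weblog-20160901.gz /tmp/          # cp rewrites the data, so the copy follows /tmp's replication setting
    $ hdfs dfs -mv /ecdata/cold/weblog-20160901.gz /user/alice/   # mv only renames, so the file stays erasure coded outside the EC directory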
  39. Next is the reconstruction of erasure coding blocks.
  40. First, automatic reconstruction of erasure-coded files: with the system default codec, Reed-Solomon(6,3), missing blocks can be reconstructed as long as at least 6 internal blocks are still alive.
  41. The cost of reconstructing missing internal blocks does not depend on how many are missing: whether one or three internal blocks are lost, the reconstruction cost is the same. Block groups with more missing internal blocks are reconstructed with higher priority, because they are closer to being unrecoverable.
  42. Block group recovery under rack failures was tested too. Along with erasure coding support, a new block placement policy, "BlockPlacementPolicyRackFaultTolerant", was added to HDFS. With this policy the NameNode chooses storages so that internal blocks are distributed across as many racks as possible. You can write erasure-coded files with nodes in only a few racks, but it is better to have at least 6 racks: that is enough to be rack fault tolerant and avoids losing many internal blocks when a single rack fails.
  43. Next, I'll talk about decommission.
  44. Decommission is also very important, and from the perspective of data completeness it is basically the same as block recovery. The difference is that with erasure coding, if all internal blocks are alive, the blocks can simply be transferred between nodes without reconstruction. This optimization matters because reconstructing internal blocks is more expensive than just transferring them.
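  As a reminder, the usual decommission flow itself is unchanged. This is a sketch under the assumption that the cluster uses an exclude file referenced by dfs.hosts.exclude; the hostname and file path are hypothetical.

    $ echo datanode042.example.com >> /etc/hadoop/conf/dfs.exclude
    $ hdfs dfsadmin -refreshNodes    # NameNode starts decommissioning; live EC internal blocks are transferred, reconstruction runs only for missing ones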
  45. We also tested other HDFS features with erasure-coded files,
  46. for example NameNode HA, quotas, and HDFS fsck. These features work fine with erasure coding. In particular, fsck has been extended to print the necessary information about block groups.
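  For example, a typical fsck invocation against an erasure-coded directory (hypothetical path):

    $ hdfs fsck /ecdata/cold -files -blocks -locations
    # for erasure-coded files the report is per block group, including live internal blocks and their locations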
  47. However, in phase 1 some features are not yet supported for erasure coding. For example, hflush and hsync are not supported, and therefore appending to and truncating erasure-coded files are not supported either. This is not critical, because the target of HDFS erasure coding in the current phase is storing cold data.
  48. After the system tests, let's move on to the performance tests.
  49. First, I'll share the results of the write and read benchmarks. Then we'll check the performance of erasure-coded files with teragen and terasort, the typical MapReduce benchmarks. Finally, we'll look at the performance of distcp, a very practical and widely used tool.
  50. Let me introduce our test clusters. We used two clusters in these tests: the alpha cluster with 37 nodes and the beta cluster with 82 nodes. The read/write and teragen/terasort tests ran on the alpha cluster, and the tests using real data ran on the beta cluster.
  51. First is the result of the read and write benchmark.
  52. In this test we used the benchmark tool ErasureCodingBenchmarkThroughput, which has been added to the HDFS test package. The tool writes and reads files in both replication and erasure coding formats with multiple threads. This table shows the results on our cluster. Write throughput decreased with erasure coding as expected: writing erasure-coded files reached about 65% of the throughput of writing replicated files. Read throughput decreased slightly with erasure coding, but the throughput of stateful reads was the same as with replication.
  53. In addition, we tested reading with missing internal blocks. As this table shows, throughput dropped when an internal data block was missing, because the client had to reconstruct the missing data before it could read it.
  54. We also tested teragen and terasort, the typical MapReduce benchmarks.
  55. This chart shows the CPU time of the teragen jobs. The vertical axis shows CPU time on a log scale, and the horizontal axis shows the number of rows in the output files. As expected, CPU time increased more with erasure coding than with replication: the overhead of calculating parity data affects write performance.
  56. These charts show the total time spent in the map and reduce phases of the terasort jobs. The time spent by map tasks is approximately the same for erasure coding and replication, but the time spent by reduce tasks increases significantly with erasure coding. That is because the main workload of terasort's map phase is reading, while the main workload of its reduce phase is writing. These results are consistent with the earlier read/write benchmark of erasure-coded files.
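  For reference, the jobs were of the standard teragen/terasort form; a sketch with a hypothetical row count and output paths under an erasure-coded directory:

    $ hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar teragen 1000000000 /ecdata/teragen
    $ hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar terasort /ecdata/teragen /ecdata/terasort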
  57. Finally, I'll show the elapsed time of distcp as a method of converting replicated files into erasure-coded files.
  58. This is a typical workload for using erasure coding in HDFS. We used our real web log data, about 2TB in size, as the target. This table shows the result: copying 2TB of files into erasure-coded files took about 17 minutes, roughly three times the elapsed time of copying from replication to replication. Still, distcp is currently the best way to convert existing replicated files to erasure coding format.
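  A minimal sketch of this conversion workload, with hypothetical paths and assuming /ecdata/cold already has the RS(6,3) policy set:

    $ hadoop distcp /data/weblog/2016-08 /ecdata/cold/weblog/2016-08
    # distcp rewrites the data through normal writes, so the destination files come out erasure coded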
  59. So far I have talked about the features and performance of HDFS erasure coding; next I'll talk about how we at Yahoo! JAPAN use it in our data platform.
  60. The data we target for erasure coding amounts to 6.2TB every day, and we retain it for up to 400 days. We plan to convert this data to erasure coding format gradually, and we combine erasure coding with archive storage to reduce storage cost even further. We are currently using erasure coding in our production cluster on a trial basis, converting old data to erasure coding format steadily.
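  A sketch of combining the two cost savers, assuming the DataNodes expose ARCHIVE-tagged volumes in dfs.datanode.data.dir and using hypothetical paths; the EC policy is set while the directory is still empty, as described earlier.

    $ hdfs ec -setPolicy -path /ecdata/cold -policy RS-6-3-1024k               # recent syntax; this talk's build used "hdfs erasurecode -setPolicy"
    $ hdfs storagepolicies -setStoragePolicy -path /ecdata/cold -policy COLD   # place the internal blocks on ARCHIVE disks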
  61. The archive disks used in our cluster cost about 70% as much as normal disks. Combined with erasure coding, which cuts the stored volume by a further 50%, we can reduce storage costs to about 35% of tripled replication.
  62. We also use erasure-coded ORC files as Hive input data. This chart shows the execution time of Hive queries over erasure-coded ORC files of about 2TB holding 17 billion records. The vertical axis shows execution time in seconds and the horizontal axis shows the different queries; the red bars use erasure-coded files and the blue bars use replicated files. As you can see, execution time tends to increase when the input is erasure coded, but this is acceptable considering the queries are rarely executed.
  63. Finally, I'm going to talk about future phases of HDFS erasure coding and the new features and enhancements we want to bring into our clusters.
  64. First, we are going to try other codec implementations, such as the Intel ISA-L library and the Hitchhiker algorithm, to improve write performance. We are also interested in contiguous-layout erasure coding, which can provide data locality. And we will contribute to implementing hflush and hsync; once they are supported, we will be able to use erasure coding in many more scenarios.
  65. Finally, a conclusion. In this talk I introduced HDFS erasure coding and shared the results of the tests in our clusters. Thanks to erasure coding, we are really reducing storage costs in production. It's ready for production.
  66. Thank you very much for listening.