Accelerate Reed-Solomon
Erasure Codes on GPUs
Student: Shuai Yuan
Advisor: Prof. Jerry Chi-Yuan Chou
LSA Lab, NTHU
Outline
Introduction
Overview of Our Acceleration Targets
Accelerating Operations in Galois Field
Accelerating Matrix Multiplication
Reducing Data Transfer Overhead
Experiment
Conclusion
Introduction
Erasure Codes
A redundancy solution for fault-tolerance, widely used in:
signals from deep space satellites
reliable multimedia multicasting
reliable storage systems
Cloud Storage Systems: Why
Redundancy?
Cloud storage vendors claim to provide highly available and
reliable services in their SLAs (Service-Level Agreements) with
customers. Both availability and reliability imply strict
fault-tolerance requirements for cloud storage systems.
However, as the scale of a storage system grows larger and
larger, the probability of failure becomes significant
→ SLA violation.
Therefore, redundancy must be introduced into cloud systems.
Cloud Storage Systems: Why
Redundancy?
The simplest and most straightforward redundancy solution is
replication of the data on multiple storage nodes.
The triple-replication solution has been favored in cloud storage
systems like GFS (Google File System) and HDFS
(Hadoop Distributed File System).
Cloud Storage Systems: Why Erasure
Codes?
Problem of replication: large storage overhead.
Erasure codes can reduce the storage overhead significantly
while at the same time maintaining the same level of fault
tolerance as replication → a better redundancy solution.
Cloud Storage Systems: Why Erasure
Codes?
(n, k) MDS (Maximum Distance Separable) codes, where n and k
are integers and n > k:
File → k equal-size native data chunks.
k equal-size native chunks → (n − k) code chunks.
The n native and code chunks are distributed on different
storage nodes.
Tolerates the failure of any (n − k) storage nodes.
Example:
Reed-Solomon code (n = 4, k = 2) vs. triple replication
Cloud Storage Systems: Why Reed-
Solomon Codes?
Reed-Solomon codes RS(k, n − k) are among the most popular
MDS erasure codes.
RS(10, 4) is used in HDFS-RAID at Facebook and RS(6, 3) is
used in GFS II at Google.
Shortcomings of Reed-Solomon
Codes
Much higher computation cost than replication: both encoding
and decoding.
Our contributions: use GPU to accelerate the Reed-Solomon
encoding and decoding.
Overview of Our
Acceleration Targets
Reed-Solomon Code Overview
Reed-Solomon Encoding Reed-Solomon Decoding
Acceleration Targets - 1
Computation bottleneck: matrix multiplication.
C = A ⋅ B, i.e. c_j = Σ_{i=1..k} a_{i,j} × b_i

Addition and multiplication are defined as arithmetic over the
Galois Field GF(2^8).
Acceleration targets for computation:
Arithmetic over Galois Field.
Matrix multiplication.
GPU-Accelerated Reed-Solomon
Code Overview
Reed-Solomon
Encoding in a
GPU
Reed-Solomon
Decoding in a
GPU
Acceleration Targets - 2
Extra overhead in GPU implementation: data transfers between
CPU and GPU.
Another acceleration target: reducing data transfer overhead.
Accelerating Arithmetic
over Galois Field
Brief Introduction of Galois Field
GF(2^w) contains 2^w polynomials. Each polynomial has degree
at most w − 1, with coefficients in {0, 1}.
For GF(2^8), every element can be one-to-one mapped to a
byte, and polynomial operations in GF(2^8) are isomorphic to
operations on bytes:
Addition: isomorphic to bitwise XOR → inexpensive.
Multiplication: isomorphic to bitwise operations → still
time-consuming.
Therefore, how to accelerate multiplication over GF(2^8) is
our focus.
GPU Accelerating Options for
Multiplication
loop-based method
a set of table-based methods
Loop-based Method
compute directly: costs a loop of at most eight iterations to
multiply two elements in GF(2^8).
computation bound
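The loop-based method can be sketched in C as follows. The deck does not name the reduction polynomial; 0x11d is a common choice for GF(2^8) and is assumed here, as is the function name `gf256_mul_loop`:

```c
#include <stdint.h>

/* Loop-based multiplication in GF(2^8): carry-less "shift-and-add"
 * multiply with reduction by the primitive polynomial
 * x^8 + x^4 + x^3 + x^2 + 1 (0x11d, an assumption).
 * At most eight iterations per product. */
static uint8_t gf256_mul_loop(uint8_t a, uint8_t b)
{
    uint8_t result = 0;
    for (int i = 0; i < 8; i++) {
        if (b & 1)
            result ^= a;          /* GF addition is XOR */
        uint8_t carry = a & 0x80; /* would x^8 appear after shifting? */
        a <<= 1;
        if (carry)
            a ^= 0x1d;            /* reduce modulo the polynomial */
        b >>= 1;
    }
    return result;
}
```

Each iteration handles one bit of `b`, which is why the method is computation bound: eight data-dependent iterations per byte product.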
Table-based Methods
precompute and store the results in tables
e.g. Full Multiplication Table:

  ×  |  …  38   39   40  …
-----+---------------------
  20 |  …  194  214  26  …
  21 |  …  228  241  50  …
  22 |  …  142  152  74  …
  ⋮  |        ⋮
Table-based Methods

                     Full            "Double Table"/      Log&Exp
                     Multiplication  "Left-Right Table"   Table
                     Table
Space complexity     O(2^(2w))       O(2^(3w/2+1))        O(2^(w+1))
for GF(2^w)
Computation          1 table         2 table lookups,     3 table lookups,
complexity           lookup          2 AND, 1 XOR,        1 MOD, 1 ADD,
                                     1 SHIFT              2 branches
Memory space         64 KB           8 KB                 512 Bytes
for GF(2^8)
We use the log&exp table-based method in our GPU implementation.
GPU Implementation: Loop-based or Table-based?
The loop-based method reaches its maximum bandwidth even when the chunk size is
small.
The bandwidth of the table-based method can still scale up as the chunk size grows larger.
The maximum bandwidth of the table-based method exceeds that of the loop-based method.
[1] Plank J S, Xu L. Optimizing Cauchy Reed-Solomon codes for fault-tolerant network storage applications[C]//NCA 2006.
[2] Shojania H, Li B, Wang X. Nuclei: GPU-accelerated many-core network coding[C]//INFOCOM 2009.
[3] Kalcher S, Lindenstruth V. Accelerating Galois Field arithmetic for Reed-Solomon erasure codes in storage applications[C]//Cluster Computing (CLUSTER) 2011.
[4] Chu X, Zhao K. Practical random linear network coding on GPUs[M]//GPU Solutions to Multi-scale Problems in Science and Engineering. Springer Berlin Heidelberg, 2013: 115-130.
Further Improvement of the
Log&exp Table-based Method
baseline:
uint8_t gf256_mul(uint8_t a, uint8_t b)
{
    int result;
    /* log of 0 is undefined, so zero operands need a branch */
    if (a == 0 || b == 0) {
        return 0;
    }
    /* log(a·b) = (log a + log b) mod (NW − 1), with NW = 256 */
    result = (gf256_log_table[a] + gf256_log_table[b]) % (NW - 1);
    return gf256_exp_table[result];
}
Further Improvement of the
Log&exp Table-based Method
Improvement Approach 1: replace the slow modular operation
with cheaper ALU operations.

if (a == 0 || b == 0) {
    return 0;
}
sum = gf256_log_table[a] + gf256_log_table[b];
result = (sum & (NW - 1)) + (sum >> width);

Improvement Approach 2: remove the slow modular operation by
augmenting one table (the exp table is extended to cover the
unreduced sum of two logs).

if (a == 0 || b == 0) {
    return 0;
}
result = gf256_log_table[a] + gf256_log_table[b];

Improvement Approach 3: further eliminate the conditional
branches by augmenting both tables.

result = gf256_log_table[a] + gf256_log_table[b];
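Approach 3 can be fleshed out as the following C sketch. The table sizes, the log[0] sentinel value, and the reduction polynomial 0x11d are our assumptions; the deck only states that both tables are augmented:

```c
#include <stdint.h>

#define NW 256

/* Log table widened to 16 bits so log[0] can hold an out-of-range
 * sentinel; exp table extended to two periods plus a zero-filled tail. */
static uint16_t gf256_log_table[NW];
static uint8_t  gf256_exp_table[1024];   /* entries 510..1023 stay 0 */

static void gf256_init_tables(void)
{
    int x = 1;
    for (int i = 0; i < NW - 1; i++) {
        gf256_exp_table[i] = (uint8_t)x;
        gf256_exp_table[i + NW - 1] = (uint8_t)x; /* 2nd period: no MOD needed */
        gf256_log_table[x] = (uint16_t)i;
        x <<= 1;
        if (x & 0x100)
            x ^= 0x11d;                  /* assumed primitive polynomial */
    }
    /* Sentinel: any product involving 0 indexes into the zero tail. */
    gf256_log_table[0] = 2 * (NW - 1);
}

/* Approach 3: no modular reduction, no conditional branches. */
static uint8_t gf256_mul(uint8_t a, uint8_t b)
{
    return gf256_exp_table[gf256_log_table[a] + gf256_log_table[b]];
}
```

Valid log sums stay below 509 and hit the duplicated second period; any operand of 0 pushes the index past 509 into the zero region, so the zero check disappears.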
Further Improvement of the
Log&exp Table-based Method
In GPU implementation, where to store the tables and how to
initialize them will affect the performance.
Further Improvement of the
Log&exp Table-based Method
Appropriate GPU memory:
constant memory: off-chip memory whose accesses are
usually cached in the constant cache.
shared memory: on-chip memory with the lowest access latency
apart from the register file.
Further Improvement of the
Log&exp Table-based Method
What we have implemented:
Store two tables in the constant memory and initialize them
at compile time.
Store two tables in the shared memory and initialize them
serially at run time at the beginning of each kernel function.
Store two tables in the off-chip memory and then load them into
the shared memory in parallel at the beginning of each kernel
function.
Further Improvement of the Log&exp Table-based Method
Encoding a 1 GB file with k = 4 and n = 6.
The elimination of conditional branches improves the
performance.
Accessing the tables in the constant memory is more time-
consuming.
Accelerating Matrix
Multiplication
Naive Implementation
Problems of Naive Implementation
a lot of global memory (off-chip) transactions: each takes
400 - 800 clock cycles.
memory access in column-major order → poor temporal locality
and high cache miss rate.
Square-Tiling Algorithm
Problems of Square-Tiling Algorithm
not suitable for general input cases of Reed-Solomon codes:
a small matrix multiplied by a huge matrix
encoding: A is (n − k) × k, B is k × CS.
decoding: A is k × k, B is k × CS.
where
n: total chunk number (1-100)
k: native chunk number (1-100)
CS: chunk size (more than 10,000,000)
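The encoding shapes above amount to a plain GF(2^8) matrix multiplication, sketched here on the CPU side. The `rs_encode` name and the embedded loop-based multiply (polynomial 0x11d) are illustrative assumptions, not the deck's GPU kernel:

```c
#include <stdint.h>
#include <stddef.h>

/* Loop-based GF(2^8) multiply, assuming polynomial 0x11d. */
static uint8_t gf256_mul(uint8_t a, uint8_t b)
{
    uint8_t r = 0;
    while (b) {
        if (b & 1)
            r ^= a;                                     /* GF add = XOR */
        a = (uint8_t)((a << 1) ^ ((a & 0x80) ? 0x1d : 0)); /* GF double */
        b >>= 1;
    }
    return r;
}

/* Encoding: C = A · B over GF(2^8).
 * A is the (n-k) x k coding matrix; B holds the k native chunks
 * (one chunk per row, cs bytes each); C receives the n-k code chunks. */
static void rs_encode(const uint8_t *A, const uint8_t *B, uint8_t *C,
                      int nk, int k, size_t cs)
{
    for (int r = 0; r < nk; r++)             /* one code chunk per row */
        for (size_t j = 0; j < cs; j++) {    /* one byte column at a time */
            uint8_t acc = 0;
            for (int i = 0; i < k; i++)
                acc ^= gf256_mul(A[r * k + i], B[i * cs + j]);
            C[r * cs + j] = acc;
        }
}
```

The inner dimension is only k while cs is enormous, which is exactly why the square tiles of the standard algorithm fit poorly.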
Generalized Tiling Algorithm
How to Determine the Parameters of Tiles
tileDepth
tileWidthRow
tileWidthCol
How to Determine the Parameters of Tiles
Set tileDepth to k (the width of A) → removes the loop of
accessing matrices A and B tile by tile.
How to Determine the Parameters of Tiles
tileWidthRow × tileWidthCol is equal to the CUDA block size
(number of threads in a CUDA block) → find the best CUDA
block size by tuning occupancy.
Finally we need to further determine tileWidthRow and
tileWidthCol.
How to Determine the Parameters of Tiles
Define the aspect ratio AR of the tile in the result matrix as:

AR = tileWidthRow / tileWidthCol

three strategies:
Min AR: minimize tileWidthRow, maximize tileWidthCol
Max AR: maximize tileWidthRow, minimize tileWidthCol
AR = 1: tileWidthRow = tileWidthCol
How to Determine the Parameters of Tiles
use the following testcases for encoding:

TESTCASE  k  n    CHUNK SIZE (MB)  FILE SIZE  CODE CHUNK NUMBER
1         2  200  1                small      large
2         4  150  1
3         4  200  1
4         2  200  10               large      large
5         4  150  10
6         4  200  10
7         4  6    256              large      small
8         8  16   128
How to Determine the Parameters of Tiles
The first six testcases, especially the 4th-6th ones, represent
the worst cases for the "Max AR" strategy.
The last two testcases represent the worst cases for the
"AR = 1" strategy.
Overall, the "Min AR" strategy is the best.
How to Determine the Parameters of Tiles
Generally, the "Min AR" strategy has the fewest conditional
branches, while "Max AR" has the most. The overhead of
branches is significant when encoding a large file into a
large number of code chunks.
How to Determine the Parameters of Tiles
The "AR = 1" strategy shows significant performance
degradation when the number of code chunks is smaller than
tileWidthRow (wasted cache space, extra boundary checking).
Further Improvement of Tiling
Algorithm
When each chunk can be word-aligned, we can further improve
the tiling algorithm.
Observation 1: Byte-length Matrix Multiplication
Multiplication: table lookups → memory accesses
Addition: 8-bit XORs → ALU operations
Each ALU operation:
operand1 (8-bit) XOR operand2 (8-bit) = result (8-bit)
GPU execution units are 32-bit.
Improvement Technique 1: Byte-by-word Matrix
Multiplication
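The idea can be illustrated on the host side. `gf256_add_bytes` and `gf256_add_words` are hypothetical helper names showing how four 8-bit XORs collapse into one 32-bit operation when chunks are word-aligned:

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Byte-by-byte: one ALU operation per byte. */
static void gf256_add_bytes(uint8_t *dst, const uint8_t *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] ^= src[i];
}

/* Byte-by-word: GF(2^8) addition is XOR, and XOR has no carries
 * between bytes, so four byte-wide additions fold into one 32-bit XOR.
 * n is assumed to be a multiple of 4 (word-aligned chunks). */
static void gf256_add_words(uint8_t *dst, const uint8_t *src, size_t n)
{
    for (size_t i = 0; i < n; i += 4) {
        uint32_t d, s;
        memcpy(&d, dst + i, 4);  /* memcpy avoids unaligned-access pitfalls */
        memcpy(&s, src + i, 4);
        d ^= s;
        memcpy(dst + i, &d, 4);
    }
}
```

On the GPU the same trick lets one 32-bit ALU operation do the work of four 8-bit ones.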
Observation 2: Global Memory Transactions
a lot of 8-bit global memory transactions!
Improvement Technique 2: Packing Memory Transactions
pack 8-bit global memory transactions into 32-bit transactions
→ reduce global memory load and store.
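Packing can be sketched as follows: load one 32-bit word, multiply each packed byte, and repack, so four 8-bit loads and stores become one 32-bit load and store. The `gf256_mul_packed` name and the embedded loop-based multiply (polynomial 0x11d) are illustrative assumptions:

```c
#include <stdint.h>

/* Loop-based GF(2^8) multiply (assumed polynomial 0x11d). */
static uint8_t gf256_mul(uint8_t a, uint8_t b)
{
    uint8_t r = 0;
    while (b) {
        if (b & 1)
            r ^= a;
        a = (uint8_t)((a << 1) ^ ((a & 0x80) ? 0x1d : 0));
        b >>= 1;
    }
    return r;
}

/* Multiply all four bytes packed in one 32-bit word by the same
 * coefficient: the word is read and written once instead of issuing
 * four separate 8-bit memory transactions. */
static uint32_t gf256_mul_packed(uint32_t word, uint8_t coeff)
{
    uint32_t out = 0;
    for (int i = 0; i < 4; i++) {
        uint8_t b = (uint8_t)(word >> (8 * i));           /* unpack byte i */
        out |= (uint32_t)gf256_mul(b, coeff) << (8 * i);  /* repack result */
    }
    return out;
}
```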
Further Improvement of Tiling
Algorithm
e.g. k = 4, n = 200, chunk size = 200 MB

                                  Original     Word-alignment  Improvement
Effective bandwidth (MB/s)        1119.419693  1591.051924     42.13%
ALU operations                    12331000000  3116531712      74.73%
Global memory load transactions   516963328    131611648       74.54%
Global memory store transactions  64225280     16056320        75%
Number of branches                6685685440   2703981504      59.56%
Reducing Data Transfer
Overhead
Using CUDA Streams
CUDA streams are used for further overlapping data transfers
with computation.
Sequential vs. Concurrent copy and execute
Using CUDA Streams
Encoding under k = 4, n = 6 settings. The input file size is scaled from 1000 MB to 1800 MB.
Using CUDA streaming can improve the performance by
more than 29%.
As the file size increases, the data transfer overhead
increases, and the improvement of CUDA streaming is more
significant.
Using CUDA Streams
Encoding a 2000 MB file under k = 4, n = 6 settings.
The kernel execution time increases with the number of CUDA
streams. → The overhead of kernel execution will exceed the
time saved on data transfer at some point.
Experiment
Experiment Setup
CentOS-6 with 2.6.32 Linux kernel.
Intel Xeon Processor E5-2670 v2 x 2
10 cores
2.5 GHz
NVIDIA Tesla K20X GPU x 2
2688 CUDA cores
peak performance: 1.31 Tflops (double precision floating
point calculation) and 3.95 Tflops (single precision floating
point)
maximum size of GPU GDDR5 memory: 6 GB
theoretical memory bandwidth 243 GB/s
two copy engines → supports concurrent data copy and
kernel execution
maximum bidirectional bandwidth of the PCI-Express bus: 8
GB/s
Experiment Setup
Input files are randomly generated.
Most of our experimental results reflect the average of 100
runs.
Because encoding and decoding perform similarly in most
experiments, the decoding results are omitted.
Overall Performance Evaluation
We evaluate the overall performance by encoding a 1600 MB
file with k = 4, n = 6.
Overall Performance Evaluation
Step-by-step Improvement

                                     Speedup                     Cumulative Speedup
Step                                 Kernel     Non-overlapped   Kernel     Non-overlapped
                                     Execution  Data Transfer    Execution  Data Transfer
optimize tiling algorithm            1.469 x    0                1.469 x    0
optimize log&exp table-based method  1.502 x    0                2.206 x    0
use CUDA streams                     0.862 x    18.452 x         1.902 x    18.452 x
parallel in 2 GPUs                   2.030 x    2.857 x          3.861 x    52.718 x

Total Cumulative Speedup: 7.145 x
Overall Performance Evaluation
GPU vs. CPU
best CPU implementation (Jerasure library, compiled by
clang with the -O3 compiler optimization flag): 4309.08 ms.
optimized GPU implementation: 292.977 ms (14.71× speedup).
Conclusion
We have studied several techniques to improve the performance
of Reed-Solomon codes according to their coding mechanism,
and identified the best choices for the GPU architecture.
We have illustrated methods to reduce the data transfer
overhead introduced by the GPU implementation.
We present an optimized GPU implementation of Reed-Solomon
codes, which achieves a 14.71× speedup over the current best
CPU implementation.
Future Work
Better cooperation between CPU and GPU.
Heuristic strategies for deciding the best number of CUDA
streams.
THE END
Q & A
Technical details:
Repository available under the GPL v3 license:
http://galoisplusplus.gitcafe.com/blog/2014/06/23/optimizing-special-matrix-multiplication-in-cuda/
https://github.com/yszheda/GPU-RSCode
On the feasibility of 40 Gbps network data capture and retention with general...
 
Oow2007 performance
Oow2007 performanceOow2007 performance
Oow2007 performance
 
A PCIe Congestion-Aware Performance Model for Densely Populated Accelerator S...
A PCIe Congestion-Aware Performance Model for Densely Populated Accelerator S...A PCIe Congestion-Aware Performance Model for Densely Populated Accelerator S...
A PCIe Congestion-Aware Performance Model for Densely Populated Accelerator S...
 
24-02-18 Rejender pratap.pdf
24-02-18 Rejender pratap.pdf24-02-18 Rejender pratap.pdf
24-02-18 Rejender pratap.pdf
 
ceph optimization on ssd ilsoo byun-short
ceph optimization on ssd ilsoo byun-shortceph optimization on ssd ilsoo byun-short
ceph optimization on ssd ilsoo byun-short
 
Getting Started with Amazon Redshift
Getting Started with Amazon RedshiftGetting Started with Amazon Redshift
Getting Started with Amazon Redshift
 
Getting Started with Amazon Redshift
Getting Started with Amazon RedshiftGetting Started with Amazon Redshift
Getting Started with Amazon Redshift
 
Distribute Storage System May-2014
Distribute Storage System May-2014Distribute Storage System May-2014
Distribute Storage System May-2014
 
Fast Insights to Optimized Vectorization and Memory Using Cache-aware Rooflin...
Fast Insights to Optimized Vectorization and Memory Using Cache-aware Rooflin...Fast Insights to Optimized Vectorization and Memory Using Cache-aware Rooflin...
Fast Insights to Optimized Vectorization and Memory Using Cache-aware Rooflin...
 
OMI - The Missing Piece of a Modular, Flexible and Composable Computing World
OMI - The Missing Piece of a Modular, Flexible and Composable Computing WorldOMI - The Missing Piece of a Modular, Flexible and Composable Computing World
OMI - The Missing Piece of a Modular, Flexible and Composable Computing World
 
40 Powers of 10 - Simulating the Universe with the DiRAC HPC Facility
40 Powers of 10 - Simulating the Universe with the DiRAC HPC Facility40 Powers of 10 - Simulating the Universe with the DiRAC HPC Facility
40 Powers of 10 - Simulating the Universe with the DiRAC HPC Facility
 
Benchmark Analysis of Multi-core Processor Memory Contention April 2009
Benchmark Analysis of Multi-core Processor Memory Contention April 2009Benchmark Analysis of Multi-core Processor Memory Contention April 2009
Benchmark Analysis of Multi-core Processor Memory Contention April 2009
 
Data Grids with Oracle Coherence
Data Grids with Oracle CoherenceData Grids with Oracle Coherence
Data Grids with Oracle Coherence
 
Scalable analytics for iaas cloud availability
Scalable analytics for iaas cloud availabilityScalable analytics for iaas cloud availability
Scalable analytics for iaas cloud availability
 
Exascale Capabl
Exascale CapablExascale Capabl
Exascale Capabl
 
OpenCL caffe IWOCL 2016 presentation final
OpenCL caffe IWOCL 2016 presentation finalOpenCL caffe IWOCL 2016 presentation final
OpenCL caffe IWOCL 2016 presentation final
 

Kürzlich hochgeladen

Anypoint Code Builder , Google Pub sub connector and MuleSoft RPA
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPAAnypoint Code Builder , Google Pub sub connector and MuleSoft RPA
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPAshyamraj55
 
IEEE Computer Society’s Strategic Activities and Products including SWEBOK Guide
IEEE Computer Society’s Strategic Activities and Products including SWEBOK GuideIEEE Computer Society’s Strategic Activities and Products including SWEBOK Guide
IEEE Computer Society’s Strategic Activities and Products including SWEBOK GuideHironori Washizaki
 
UiPath Platform: The Backend Engine Powering Your Automation - Session 1
UiPath Platform: The Backend Engine Powering Your Automation - Session 1UiPath Platform: The Backend Engine Powering Your Automation - Session 1
UiPath Platform: The Backend Engine Powering Your Automation - Session 1DianaGray10
 
Videogame localization & technology_ how to enhance the power of translation.pdf
Videogame localization & technology_ how to enhance the power of translation.pdfVideogame localization & technology_ how to enhance the power of translation.pdf
Videogame localization & technology_ how to enhance the power of translation.pdfinfogdgmi
 
Meet the new FSP 3000 M-Flex800™
Meet the new FSP 3000 M-Flex800™Meet the new FSP 3000 M-Flex800™
Meet the new FSP 3000 M-Flex800™Adtran
 
ADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDE
ADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDEADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDE
ADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDELiveplex
 
AI Fame Rush Review – Virtual Influencer Creation In Just Minutes
AI Fame Rush Review – Virtual Influencer Creation In Just MinutesAI Fame Rush Review – Virtual Influencer Creation In Just Minutes
AI Fame Rush Review – Virtual Influencer Creation In Just MinutesMd Hossain Ali
 
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve Decarbonization
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve DecarbonizationUsing IESVE for Loads, Sizing and Heat Pump Modeling to Achieve Decarbonization
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve DecarbonizationIES VE
 
Igniting Next Level Productivity with AI-Infused Data Integration Workflows
Igniting Next Level Productivity with AI-Infused Data Integration WorkflowsIgniting Next Level Productivity with AI-Infused Data Integration Workflows
Igniting Next Level Productivity with AI-Infused Data Integration WorkflowsSafe Software
 
Crea il tuo assistente AI con lo Stregatto (open source python framework)
Crea il tuo assistente AI con lo Stregatto (open source python framework)Crea il tuo assistente AI con lo Stregatto (open source python framework)
Crea il tuo assistente AI con lo Stregatto (open source python framework)Commit University
 
UiPath Clipboard AI: "A TIME Magazine Best Invention of 2023 Unveiled"
UiPath Clipboard AI: "A TIME Magazine Best Invention of 2023 Unveiled"UiPath Clipboard AI: "A TIME Magazine Best Invention of 2023 Unveiled"
UiPath Clipboard AI: "A TIME Magazine Best Invention of 2023 Unveiled"DianaGray10
 
Valere | Digital Solutions & AI Transformation Portfolio | 2024
Valere | Digital Solutions & AI Transformation Portfolio | 2024Valere | Digital Solutions & AI Transformation Portfolio | 2024
Valere | Digital Solutions & AI Transformation Portfolio | 2024Alexander Turgeon
 
KubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCost
KubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCostKubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCost
KubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCostMatt Ray
 
Machine Learning Model Validation (Aijun Zhang 2024).pdf
Machine Learning Model Validation (Aijun Zhang 2024).pdfMachine Learning Model Validation (Aijun Zhang 2024).pdf
Machine Learning Model Validation (Aijun Zhang 2024).pdfAijun Zhang
 
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdf
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdfUiPath Solutions Management Preview - Northern CA Chapter - March 22.pdf
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdfDianaGray10
 
100+ ChatGPT Prompts for SEO Optimization
100+ ChatGPT Prompts for SEO Optimization100+ ChatGPT Prompts for SEO Optimization
100+ ChatGPT Prompts for SEO Optimizationarrow10202532yuvraj
 
COMPUTER 10 Lesson 8 - Building a Website
COMPUTER 10 Lesson 8 - Building a WebsiteCOMPUTER 10 Lesson 8 - Building a Website
COMPUTER 10 Lesson 8 - Building a Websitedgelyza
 
Cybersecurity Workshop #1.pptx
Cybersecurity Workshop #1.pptxCybersecurity Workshop #1.pptx
Cybersecurity Workshop #1.pptxGDSC PJATK
 
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019IES VE
 

Kürzlich hochgeladen (20)

Anypoint Code Builder , Google Pub sub connector and MuleSoft RPA
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPAAnypoint Code Builder , Google Pub sub connector and MuleSoft RPA
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPA
 
IEEE Computer Society’s Strategic Activities and Products including SWEBOK Guide
IEEE Computer Society’s Strategic Activities and Products including SWEBOK GuideIEEE Computer Society’s Strategic Activities and Products including SWEBOK Guide
IEEE Computer Society’s Strategic Activities and Products including SWEBOK Guide
 
UiPath Platform: The Backend Engine Powering Your Automation - Session 1
UiPath Platform: The Backend Engine Powering Your Automation - Session 1UiPath Platform: The Backend Engine Powering Your Automation - Session 1
UiPath Platform: The Backend Engine Powering Your Automation - Session 1
 
Videogame localization & technology_ how to enhance the power of translation.pdf
Videogame localization & technology_ how to enhance the power of translation.pdfVideogame localization & technology_ how to enhance the power of translation.pdf
Videogame localization & technology_ how to enhance the power of translation.pdf
 
Meet the new FSP 3000 M-Flex800™
Meet the new FSP 3000 M-Flex800™Meet the new FSP 3000 M-Flex800™
Meet the new FSP 3000 M-Flex800™
 
ADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDE
ADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDEADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDE
ADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDE
 
AI Fame Rush Review – Virtual Influencer Creation In Just Minutes
AI Fame Rush Review – Virtual Influencer Creation In Just MinutesAI Fame Rush Review – Virtual Influencer Creation In Just Minutes
AI Fame Rush Review – Virtual Influencer Creation In Just Minutes
 
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve Decarbonization
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve DecarbonizationUsing IESVE for Loads, Sizing and Heat Pump Modeling to Achieve Decarbonization
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve Decarbonization
 
Igniting Next Level Productivity with AI-Infused Data Integration Workflows
Igniting Next Level Productivity with AI-Infused Data Integration WorkflowsIgniting Next Level Productivity with AI-Infused Data Integration Workflows
Igniting Next Level Productivity with AI-Infused Data Integration Workflows
 
Crea il tuo assistente AI con lo Stregatto (open source python framework)
Crea il tuo assistente AI con lo Stregatto (open source python framework)Crea il tuo assistente AI con lo Stregatto (open source python framework)
Crea il tuo assistente AI con lo Stregatto (open source python framework)
 
UiPath Clipboard AI: "A TIME Magazine Best Invention of 2023 Unveiled"
UiPath Clipboard AI: "A TIME Magazine Best Invention of 2023 Unveiled"UiPath Clipboard AI: "A TIME Magazine Best Invention of 2023 Unveiled"
UiPath Clipboard AI: "A TIME Magazine Best Invention of 2023 Unveiled"
 
Valere | Digital Solutions & AI Transformation Portfolio | 2024
Valere | Digital Solutions & AI Transformation Portfolio | 2024Valere | Digital Solutions & AI Transformation Portfolio | 2024
Valere | Digital Solutions & AI Transformation Portfolio | 2024
 
20230104 - machine vision
20230104 - machine vision20230104 - machine vision
20230104 - machine vision
 
KubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCost
KubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCostKubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCost
KubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCost
 
Machine Learning Model Validation (Aijun Zhang 2024).pdf
Machine Learning Model Validation (Aijun Zhang 2024).pdfMachine Learning Model Validation (Aijun Zhang 2024).pdf
Machine Learning Model Validation (Aijun Zhang 2024).pdf
 
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdf
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdfUiPath Solutions Management Preview - Northern CA Chapter - March 22.pdf
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdf
 
100+ ChatGPT Prompts for SEO Optimization
100+ ChatGPT Prompts for SEO Optimization100+ ChatGPT Prompts for SEO Optimization
100+ ChatGPT Prompts for SEO Optimization
 
COMPUTER 10 Lesson 8 - Building a Website
COMPUTER 10 Lesson 8 - Building a WebsiteCOMPUTER 10 Lesson 8 - Building a Website
COMPUTER 10 Lesson 8 - Building a Website
 
Cybersecurity Workshop #1.pptx
Cybersecurity Workshop #1.pptxCybersecurity Workshop #1.pptx
Cybersecurity Workshop #1.pptx
 
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019
 

Accelerate Reed-Solomon coding for Fault-Tolerance in RAID-like system

Cloud Storage Systems: Why Reed-Solomon Codes?
Reed-Solomon codes RS(k, n − k) are among the most popular MDS erasure codes.
RS(10, 4) is used in HDFS-RAID at Facebook, and RS(6, 3) is used in GFS II at Google.
Shortcomings of Reed-Solomon Codes
Much higher computation cost than replication, for both encoding and decoding.
Our contribution: use GPUs to accelerate Reed-Solomon encoding and decoding.
Reed-Solomon Code Overview
Reed-Solomon Encoding
Reed-Solomon Decoding
Acceleration Targets - 1
Computation bottleneck: matrix multiplication.
C = A ⋅ B, i.e. c_j = Σ_{i=1..k} a_{i,j} × b_i
Addition and multiplication are defined as arithmetic over the Galois Field GF(2^8).
Acceleration targets for computation:
arithmetic over the Galois Field.
matrix multiplication.
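The matrix product above can be sketched in plain C, applied independently to every byte offset of the chunks. The names (`rs_matmul`, `gf256_mul_loop`) and the primitive polynomial 0x11D are illustrative assumptions, not taken from the slides; the multiply reuses the loop-based method that the deck describes later.

```c
#include <stdint.h>
#include <stddef.h>

/* Illustrative loop-based GF(2^8) multiply; the primitive polynomial
 * 0x11D (x^8+x^4+x^3+x^2+1) is an assumption. */
static uint8_t gf256_mul_loop(uint8_t a, uint8_t b)
{
    uint8_t p = 0;
    for (int i = 0; i < 8; i++) {
        if (b & 1)
            p ^= a;                 /* GF addition is XOR */
        uint8_t carry = a & 0x80;
        a <<= 1;
        if (carry)
            a ^= 0x1D;              /* reduce modulo the polynomial */
        b >>= 1;
    }
    return p;
}

/* C = A . B over GF(2^8): an m x k coding matrix A times k native
 * chunks (each cs bytes) yields m code chunks. */
static void rs_matmul(const uint8_t *A, int m, int k,
                      uint8_t *const *native, uint8_t *const *code,
                      size_t cs)
{
    for (int j = 0; j < m; j++)
        for (size_t s = 0; s < cs; s++) {
            uint8_t acc = 0;
            for (int i = 0; i < k; i++)
                acc ^= gf256_mul_loop(A[j * k + i], native[i][s]);
            code[j][s] = acc;
        }
}
```

Decoding has the same shape: the surviving chunks are multiplied by the inverse of the corresponding rows of the coding matrix.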
GPU-Accelerated Reed-Solomon Code Overview
Reed-Solomon Encoding in a GPU
Reed-Solomon Decoding in a GPU
Acceleration Targets - 2
Extra overhead in the GPU implementation: data transfers between CPU and GPU.
Another acceleration target: reducing the data transfer overhead.
Brief Introduction of Galois Field
GF(2^w) contains 2^w polynomials. Each polynomial has degree at most w − 1 and coefficients in {0, 1}.
For GF(2^8), every element can be one-to-one mapped to a byte, and polynomial operations in GF(2^8) are isomorphic to operations on bytes:
Addition: isomorphic to bitwise XOR → inexpensive.
Multiplication: isomorphic to a sequence of bitwise operations → still time-consuming.
Therefore, accelerating multiplication over GF(2^8) is our focus.
GPU Acceleration Options for Multiplication
the loop-based method
a set of table-based methods
Loop-based Method
Compute directly: a loop of at most eight iterations multiplies two elements in GF(2^8).
Computation-bound.
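A minimal sketch of the loop-based multiply in C. The primitive polynomial 0x11D (x^8+x^4+x^3+x^2+1), common in Reed-Solomon implementations, is an assumption here; the slides do not state which polynomial the thesis uses.

```c
#include <stdint.h>

/* Shift-and-XOR multiply over GF(2^8): at most eight iterations. */
static uint8_t gf256_mul_loop(uint8_t a, uint8_t b)
{
    uint8_t p = 0;
    for (int i = 0; i < 8; i++) {
        if (b & 1)
            p ^= a;                 /* GF addition is XOR */
        uint8_t carry = a & 0x80;
        a <<= 1;                    /* multiply a by x */
        if (carry)
            a ^= 0x1D;              /* reduce modulo the assumed polynomial */
        b >>= 1;
    }
    return p;
}
```

The loop needs no tables at all, which is why its bandwidth does not depend on warming any cache, but every product costs up to eight iterations of ALU work.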
Table-based Methods
Precompute and store the results in tables.
e.g. Full Multiplication Table: a 256 × 256 table indexed by both operands (the table excerpt shown on the slide is omitted here).
Table-based Methods
Full Multiplication Table: space complexity O(2^{2w}); 64 KB for GF(2^8); one table lookup per multiplication.
"Double Table"/"Left-Right Table": space complexity O(2^{3w/2+1}); 8 KB for GF(2^8); 2 table lookups, 2 ANDs, 1 XOR, and 1 SHIFT.
Log&Exp Table: space complexity O(2^{w+1}); 512 bytes for GF(2^8); 3 table lookups, 1 MOD, 1 ADD, and 2 branches.
We use the log&exp table-based method in our GPU implementation.
GPU Implementation: Loop-based or Table-based?
The loop-based method achieves its maximum bandwidth even when the chunk size is small.
The bandwidth of the table-based method still scales up as the chunk size grows.
The maximum bandwidth of the table-based method exceeds that of the loop-based method.
[1] Plank J S, Xu L. Optimizing Cauchy Reed-Solomon codes for fault-tolerant network storage applications. NCA 2006.
[2] Shojania H, Li B, Wang X. Nuclei: GPU-accelerated many-core network coding. INFOCOM 2009.
[3] Kalcher S, Lindenstruth V. Accelerating Galois Field arithmetic for Reed-Solomon erasure codes in storage applications. Cluster Computing (CLUSTER) 2011.
[4] Chu X, Zhao K. Practical random linear network coding on GPUs. GPU Solutions to Multi-scale Problems in Science and Engineering. Springer Berlin Heidelberg, 2013: 115-130.
Further Improvement of the Log&exp Table-based Method
Baseline:

uint8_t gf256_mul(uint8_t a, uint8_t b)
{
    int result;
    if (a == 0 || b == 0) {
        return 0;
    }
    result = (gf256_log_table[a] + gf256_log_table[b]) % (NW - 1);
    return gf256_exp_table[result];
}
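The baseline above assumes the log and exp tables already exist. A sketch of how they could be initialized, using generator 2 and the assumed primitive polynomial 0x11D (the thesis may use a different polynomial, but the construction is the same):

```c
#include <stdint.h>

#define NW 256  /* field size for GF(2^8) */

static uint8_t gf256_log_table[NW];
static uint8_t gf256_exp_table[NW];

/* Repeatedly multiply by the generator 2; after 255 steps every
 * nonzero element has been visited exactly once. */
static void gf256_init_tables(void)
{
    uint16_t x = 1;
    for (int i = 0; i < NW - 1; i++) {
        gf256_exp_table[i] = (uint8_t)x;
        gf256_log_table[x] = (uint8_t)i;
        x <<= 1;
        if (x & 0x100)
            x ^= 0x11D;   /* assumed primitive polynomial */
    }
}

/* Baseline multiply from the slide, using the tables above. */
static uint8_t gf256_mul(uint8_t a, uint8_t b)
{
    int result;
    if (a == 0 || b == 0)
        return 0;
    result = (gf256_log_table[a] + gf256_log_table[b]) % (NW - 1);
    return gf256_exp_table[result];
}
```

The modulo folds the log sum (at most 508) back into the 0..254 range of exponents; the two branches guard against log of zero, which is undefined.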
Further Improvement of the Log&exp Table-based Method
Improvement Approach 1: replace the slow modulo operation with cheaper operations.

if (a == 0 || b == 0) {
    return 0;
}
result = ((gf256_log_table[a] + gf256_log_table[b]) & (NW - 1))
       + ((gf256_log_table[a] + gf256_log_table[b]) >> width);

Improvement Approach 2: remove the slow modulo operation by augmenting one table.

if (a == 0 || b == 0) {
    return 0;
}
result = gf256_log_table[a] + gf256_log_table[b];

Improvement Approach 3: further eliminate the conditional branches by augmenting both tables.

result = gf256_log_table[a] + gf256_log_table[b];
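One possible realization of Approach 3 (the exact table layout in the thesis may differ; the 16-bit log entries, the sentinel value 510, and the polynomial 0x11D are assumptions): log[0] is set to a value so large that any sum involving it lands in a zero-filled tail of an over-sized exp table, so neither the zero-check branches nor the modulo are needed.

```c
#include <stdint.h>

#define NW 256

static uint16_t gf256_log_aug[NW];    /* 16-bit so log[0] can be 510 */
static uint8_t  gf256_exp_aug[1021];  /* covers every possible log sum */

static void gf256_init_aug_tables(void)
{
    uint8_t exp_base[255];
    uint16_t x = 1;
    for (int i = 0; i < 255; i++) {
        exp_base[i] = (uint8_t)x;
        gf256_log_aug[x] = (uint16_t)i;
        x <<= 1;
        if (x & 0x100)
            x ^= 0x11D;               /* assumed primitive polynomial */
    }
    /* Sentinel: 510 > 254 + 254, so any sum involving log[0] is >= 510. */
    gf256_log_aug[0] = 510;
    for (int i = 0; i < 510; i++)
        gf256_exp_aug[i] = exp_base[i % 255];  /* folds the mod into the table */
    for (int i = 510; i < 1021; i++)
        gf256_exp_aug[i] = 0;         /* a*0 = 0*b = 0 without a branch */
}

/* No modulo, no branches: one add and one lookup. */
static uint8_t gf256_mul_branchless(uint8_t a, uint8_t b)
{
    return gf256_exp_aug[gf256_log_aug[a] + gf256_log_aug[b]];
}
```

On a GPU the branch elimination matters because divergent branches serialize a warp; the cost is a larger exp table, which is why where the tables live (constant vs. shared memory, next slides) becomes important.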
Further Improvement of the Log&exp Table-based Method
In the GPU implementation, where the tables are stored and how they are initialized affects performance.
Further Improvement of the Log&exp Table-based Method
Appropriate GPU memories:
constant memory: off-chip memory whose accesses are usually cached in the constant cache.
shared memory: on-chip memory with the smallest access latency except for the register file.
Further Improvement of the Log&exp Table-based Method
What we have implemented:
Store the two tables in constant memory and initialize them at compile time.
Store the two tables in shared memory and initialize them serially at run time at the beginning of each kernel function.
Store the two tables in off-chip memory and load them into shared memory in parallel at the beginning of each kernel function.
Further Improvement of the Log&exp Table-based Method
Encoding a 1 GB file with k = 4 and n = 6:
Eliminating the conditional branches improves performance.
Accessing the tables in constant memory is more time-consuming.
Problems of the Naive Implementation
Many global memory (off-chip) transactions, each taking 400-800 clock cycles.
Memory access in column-major order → poor temporal locality and a high cache miss rate.
Problems of the Square-Tiling Algorithm
Not suitable for the general input cases of Reed-Solomon codes, where a small matrix multiplies a huge matrix:
encoding — A: (n − k) × k, B: k × CS.
decoding — A: k × k, B: k × CS.
where
n: total chunk number (1-100)
k: native chunk number (1-100)
CS: chunk size (more than 10,000,000)
Generalized Tiling Algorithm
How to determine the parameters of the tiles:
tileDepth
tileWidthRow
tileWidthCol
How to Determine the Parameters of the Tiles
Set tileDepth to k (the width of A) → removes the loop that accesses matrices A and B tile by tile.
How to Determine the Parameters of the Tiles
tileWidthRow × tileWidthCol equals the CUDA block size (the number of threads in a CUDA block) → find the best CUDA block size by tuning occupancy.
Finally, we need to further determine tileWidthRow and tileWidthCol themselves.
How to Determine the Parameters of the Tiles
Define the aspect ratio AR of the tile in the result matrix as:
AR = tileWidthRow / tileWidthCol
Three strategies:
Min AR: minimize tileWidthRow, maximize tileWidthCol.
Max AR: maximize tileWidthRow, minimize tileWidthCol.
AR = 1: tileWidthRow = tileWidthCol.
How to Determine the Parameters of the Tiles
We use the following testcases for encoding:

TESTCASE  k  n    CHUNK SIZE (MB)  DESCRIPTION (FILE SIZE / CODE CHUNK NUMBER)
1         2  200  1                small / large
2         4  150  1                small / large
3         4  200  1                small / large
4         2  200  10               large / large
5         4  150  10               large / large
6         4  200  10               large / large
7         4  6    256              large / small
8         8  16   128              large / small
How to Determine the Parameters of the Tiles
The first six testcases, especially the 4th-6th, represent the worst cases for the "Max AR" strategy.
The last two testcases represent the worst cases for the "AR = 1" strategy.
Overall, the "Min AR" strategy is the best.
How to Determine the Parameters of the Tiles
Generally, the "Min AR" strategy has the fewest conditional branches, while "Max AR" has the most.
The overhead of branches is significant when encoding a large file into a large number of code chunks.
How to Determine the Parameters of the Tiles
The "AR = 1" strategy suffers significant performance degradation when the number of code chunks is smaller than tileWidthRow (wasted cache space, extra boundary checking).
Further Improvement of the Tiling Algorithm
When each chunk can be word-aligned, we can further improve the tiling algorithm.
Observation 1: Byte-length Matrix Multiplication
Multiplication: table lookups → memory accesses.
Addition: 8-bit XORs → ALU operations.
Each ALU operation: operand1 (8-bit) XOR operand2 (8-bit) = result (8-bit).
GPU execution units are 32-bit.
Improvement Technique 1: Byte-by-word Matrix Multiplication
(A sequence of figures on the original slides walks through this technique; the figures are not recoverable here.)
Observation 2: Global Memory Transactions
A lot of 8-bit global memory transactions!
Improvement Technique 2: Packing Memory Transactions
Pack 8-bit global memory transactions into 32-bit transactions → reduce global memory loads and stores.
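The packing idea can be sketched in plain C (the actual GPU kernel does the same with 32-bit loads and stores; `gf256_mul_loop` and the polynomial 0x11D are illustrative assumptions): one 32-bit word carries four GF(2^8) bytes, the four per-byte products are repacked into a word, and the accumulation is a single 32-bit XOR because GF addition never carries between byte lanes.

```c
#include <stdint.h>

/* Assumed loop-based GF(2^8) multiply, polynomial 0x11D. */
static uint8_t gf256_mul_loop(uint8_t a, uint8_t b)
{
    uint8_t p = 0;
    for (int i = 0; i < 8; i++) {
        if (b & 1)
            p ^= a;
        uint8_t carry = a & 0x80;
        a <<= 1;
        if (carry)
            a ^= 0x1D;
        b >>= 1;
    }
    return p;
}

/* Multiply each of the four packed bytes of w by the scalar a. */
static uint32_t gf256_mul_scalar_word(uint8_t a, uint32_t w)
{
    uint32_t r = 0;
    for (int s = 0; s < 32; s += 8)
        r |= (uint32_t)gf256_mul_loop(a, (uint8_t)(w >> s)) << s;
    return r;
}

/* Four GF(2^8) additions in one ALU op: XOR stays within byte lanes. */
static uint32_t gf256_add_word(uint32_t x, uint32_t y)
{
    return x ^ y;
}
```

The multiplications still happen per byte, but every load, store, and accumulating XOR now moves a full word, which is what cuts the transaction and ALU counts in the table below.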
Further Improvement of the Tiling Algorithm
e.g. k = 4, n = 200, chunk size = 200 MB:

                                  Original       Word-alignment  Improvement
Effective bandwidth (MB/s)        1119.419693    1591.051924     42.13%
ALU operations                    12331000000    3116531712      74.73%
Global memory load transactions   516963328      131611648       74.54%
Global memory store transactions  64225280       16056320        75%
Number of branches                6685685440     2703981504      59.56%
Using CUDA Streams
CUDA streams are used to further overlap data transfers with computation.
Sequential vs. concurrent copy and execute.
Using CUDA Streams
Encoding under k = 4, n = 6 settings; the input file size is scaled from 1000 MB to 1800 MB.
Using CUDA streams improves performance by more than 29%.
As the file size increases, the data transfer overhead increases, and the improvement from CUDA streams becomes more significant.
Using CUDA Streams
Encoding a 2000 MB file under k = 4, n = 6 settings.
The kernel execution time increases with the number of CUDA streams → at some point, the overhead of kernel execution will exceed the time saved on data transfers.
Experiment Setup
CentOS 6 with Linux kernel 2.6.32.
2 × Intel Xeon E5-2670 v2 processors: 10 cores each, 2.5 GHz.
2 × NVIDIA Tesla K20X GPUs:
2688 CUDA cores
peak performance: 1.31 Tflops (double-precision floating point) and 3.95 Tflops (single-precision floating point)
6 GB of GDDR5 memory
theoretical memory bandwidth: 243 GB/s
two copy engines → supports concurrent data copy and kernel execution
maximum bidirectional bandwidth of the PCI-Express bus: 8 GB/s
Experiment Setup
Input files are randomly generated.
Most of the experimental results reflect the average of 100 runs.
Since encoding and decoding perform similarly in most experiments, decoding results are omitted.
Overall Performance Evaluation
We evaluate the overall performance by encoding a 1600 MB file with k = 4, n = 6.
Overall Performance Evaluation
Step-by-step improvement (kernel execution / non-overlapped data transfer):

Step                                 Speedup            Cumulative Speedup
                                     Kernel   Transfer  Kernel   Transfer
optimize tiling algorithm            1.469x   0         1.469x   0
optimize log&exp table-based method  1.502x   0         2.206x   0
use CUDA streams                     0.862x   18.452x   1.902x   18.452x
parallelize across 2 GPUs            2.030x   2.857x    3.861x   52.718x

Total cumulative speedup: 7.145x
Overall Performance Evaluation
GPU vs. CPU:
best CPU implementation (Jerasure library, compiled by clang with the -O3 compiler optimization flag): 4309.08 ms.
our optimized GPU implementation: 292.977 ms (14.71× speedup).
Conclusion
We have studied several techniques to improve the performance of Reed-Solomon codes according to their coding mechanism, and identified the best choices given the GPU architecture.
We have illustrated methods to reduce the data transfer overhead introduced by the GPU implementation.
We present an optimized GPU implementation of Reed-Solomon codes that achieves a 14.71× speedup over the current best CPU implementation.
Future work:
Better cooperation between CPU and GPU.
Heuristic strategies for choosing the best number of CUDA streams.
THE END
Q & A
Technical details:
http://galoisplusplus.gitcafe.com/blog/2014/06/23/optimizing-special-matrix-multiplication-in-cuda/
Repo open under the GPL v3 license:
https://github.com/yszheda/GPU-RSCode