This document discusses parallelizing the Berlekamp-Massey algorithm (BMA) using SIMD instructions and GPU streams. It first provides background on the BMA and its relationship to linear feedback shift registers (LFSRs). It then describes implementing the BMA with SSE and AVX instructions to operate on multiple data elements simultaneously. Finally, it discusses parallelizing the BMA on a GPU using multiple streams to execute kernels concurrently, achieving speedups over the serial CPU implementation for long inputs. Evaluation results show that the SIMD CPU implementation outperforms the GPU with one stream for inputs under 2^23 bits, while the GPU with 32 streams is fastest for longer inputs.
2. Topics
• BMA Algorithm
• BMA Implementation using SIMD Instructions
• GPU Parallelization using Streams
• Conclusion
3. BMA Algorithm
• Linear Feedback Shift Register (LFSR)
• Due to its ease of implementation, the LFSR is widely used
  – Cryptography
    • GSM cell phones
    • Bluetooth
  – Scrambling
    • PCIe
    • SATA
    • USB
    • GbE
[Figure: LFSR diagram with register contents 1 1 1]
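To make the LFSR concrete, here is a minimal C++ sketch of a Fibonacci LFSR. The 3-bit register width and tap positions are hypothetical, chosen only to illustrate the structure, and are not taken from the slides.

```cpp
#include <cstdint>
#include <vector>

// Minimal Fibonacci LFSR sketch (hypothetical 3-bit example, taps for
// x^3 + x + 1).  Each step outputs the low bit and shifts in the XOR of
// the tap bits as feedback.
std::vector<int> lfsr_bits(uint8_t state, int n) {
    std::vector<int> out;
    for (int i = 0; i < n; ++i) {
        out.push_back(state & 1);                      // output bit
        int fb = ((state >> 0) ^ (state >> 2)) & 1;    // feedback from taps
        state = (state >> 1) | (uint8_t)(fb << 2);     // shift, insert feedback
    }
    return out;
}
```

Starting from the all-ones state, this 3-bit example cycles through all 7 nonzero states (a maximum-length sequence), which is what makes LFSRs attractive for scrambling.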
4. BMA Algorithm
• Given a finite binary sequence, find a shortest LFSR that generates the sequence
• Elwyn Berlekamp
  – A paper/presentation at the International Symposium on Information Theory, Italy, 1967 [1]
  – Algebraic Coding Theory, McGraw-Hill, 1968 [2]
• James Massey
  – "Shift-register synthesis and BCH decoding", IEEE Transactions on Information Theory, 1969 [3]
• Also known as the "LFSR synthesis algorithm" or the "Berlekamp iterative algorithm"
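The synthesis problem stated above can be sketched in C++. This is the textbook Berlekamp-Massey iteration over GF(2), returning the length L of a shortest LFSR; it is a plain reference sketch, not the optimized implementation evaluated in these slides.

```cpp
#include <vector>

// Textbook Berlekamp-Massey over GF(2) (reference sketch).  Returns the
// length L of a shortest LFSR generating s; the connection polynomial is
// maintained in C, the previous candidate in B.
int berlekamp_massey(const std::vector<int>& s) {
    int n = (int)s.size();
    std::vector<int> C(n + 1, 0), B(n + 1, 0);
    C[0] = B[0] = 1;
    int L = 0, m = 1;                       // m: steps since B was saved
    for (int i = 0; i < n; ++i) {
        int d = s[i];                       // discrepancy at position i
        for (int j = 1; j <= L; ++j) d ^= C[j] & s[i - j];
        if (d == 0) {
            ++m;                            // current C(x) still works
        } else if (2 * L <= i) {
            std::vector<int> T = C;         // LFSR length must grow
            for (int j = 0; j + m <= n; ++j) C[j + m] ^= B[j];
            L = i + 1 - L;
            B = T;
            m = 1;
        } else {
            for (int j = 0; j + m <= n; ++j) C[j + m] ^= B[j];
            ++m;                            // fix C(x) without growing L
        }
    }
    return L;
}
```

For example, the alternating sequence 1,0,1,0,1,0 satisfies s[i] = s[i-2], so the shortest LFSR has length 2.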
6. BMA Algorithm
• Prof. Ouyang's previous work [4]:
  – Reverse S
  – Pack 32 bits into one word
  – Compute the inner product
  – Count the number of 1-bits in a 32-bit word
  – Update C(x)
• The GPU is faster than the bit-reversed CPU version for long inputs (more than 2^22 bits)
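The packing and inner-product steps listed above can be sketched as follows. The helper names are mine, and this only illustrates the popcount trick (the GF(2) inner product of two packed words is the parity of the 1-bits in their AND), not the paper's full implementation.

```cpp
#include <cstdint>

// GF(2) inner product of two 32-bit packed slices: AND the words, count
// the 1-bits, and reduce mod 2.  __builtin_popcount is the GCC/Clang
// bit-count intrinsic.
int inner_product_gf2(uint32_t a, uint32_t b) {
    return __builtin_popcount(a & b) & 1;
}

// Pack 32 bits (stored one per array element) into a single word,
// least-significant bit first.
uint32_t pack32(const int bits[32]) {
    uint32_t w = 0;
    for (int i = 0; i < 32; ++i) w |= (uint32_t)(bits[i] & 1) << i;
    return w;
}
```

Packing 32 sequence bits per word lets one popcount replace 32 single-bit multiply-accumulate steps, which is the main source of the speedup reported in [4].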
7. BMA Implementation using SIMD Instructions
• A data-parallel architecture
• Applies the same instruction to many data elements
  – Saves control logic
  – A related architecture is the vector architecture
  – SIMD and vector architectures offer high performance for vector operations
• SSE operates on 128-bit registers (vectors of four 32-bit elements)
8. BMA Implementation using SIMD Instructions
• BMA-SSE
  – Uses SSE instructions for computation and for copying data
  – Processes four consecutive elements at a time
  – The number of iterations is less than the input length and depends on the input string; in the best case it drops to 1/4 of the serial count
• BMA-AVX
  – Uses AVX instructions for computation and for copying data
  – Processes eight consecutive elements at a time
  – The number of iterations is less than the input length and depends on the input string; in the best case it drops to 1/8 of the serial count
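As a sketch of the four-at-a-time processing, the following uses SSE2 intrinsics to XOR four packed 32-bit words in one 128-bit operation, the style of update used when adding B(x) into C(x). This is an illustration of the data-parallel step (x86 only), not the authors' exact BMA-SSE code; the function name is mine.

```cpp
#include <cstdint>
#include <emmintrin.h>  // SSE2 intrinsics

// XOR four 32-bit words per step with one 128-bit SSE operation.
// n is assumed to be a multiple of 4; unaligned loads/stores are used
// so the buffers need no special alignment.
void xor_blocks_sse(const uint32_t* a, const uint32_t* b,
                    uint32_t* out, int n) {
    for (int i = 0; i < n; i += 4) {
        __m128i va = _mm_loadu_si128((const __m128i*)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i*)(b + i));
        _mm_storeu_si128((__m128i*)(out + i), _mm_xor_si128(va, vb));
    }
}
```

The AVX variant is identical in shape but uses 256-bit `__m256i` registers and `_mm256_*` intrinsics, covering eight 32-bit elements per operation.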
11. GPU Parallelization using Streams
• Original BMA using GPU kernels:
  – K1, K2, K3, and K4 are kernel functions
  – xi and ki are the execution times of the serial parts and the kernels, respectively
• Timeline: Serial(x1) → K1(k1) → Serial(x2) → K2(k2) → Serial(x3) → K3(k3) → Serial(x4) → K4(k4) → Serial(x5)
12. GPU Parallelization using Streams
• Ideal BMA using concurrent kernel execution: n inputs are processed in n streams, each running Serial → K1 → Serial → K2 → Serial → K3 → Serial → K4 → Serial
• The serial parts cannot overlap, so together they take n·x1, n·x2, n·x3, n·x4, and n·x5, while the kernels K1–K4 from different streams execute concurrently and overlap with one another
13. GPU Parallelization using Streams
• The total time of running n inputs serially is:
  T_serial = n (x1 + x2 + x3 + x4 + x5 + k1 + k2 + k3 + k4)
• The time of the ideal concurrent program is:
  T_ideal = n (x1 + x2 + x3 + x4 + x5) + (k1 + k2 + k3 + k4)
• The ideal speedup is:
  Speedup_ideal = T_serial / T_ideal
14. GPU Parallelization using Streams
• In practice, the hardware supports S concurrent streams, so the n kernel launches execute in ⌈n/S⌉ batches and the concurrent time is:
  T_S = n (x1 + x2 + x3 + x4 + x5) + ⌈n/S⌉ (k1 + k2 + k3 + k4)
• The real speedup is:
  Speedup_S = T_serial / T_S
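The timing model above can be checked numerically. The following C++ sketch encodes the serial and S-stream formulas; the xi and ki values used in the usage note below are hypothetical, not measured data from the slides.

```cpp
#include <numeric>
#include <vector>

// Timing model from the slides: serial parts never overlap, kernels from
// different streams do, and S streams process n kernel launches in
// ceil(n/S) batches.
double serial_time(int n, const std::vector<double>& x,
                   const std::vector<double>& k) {
    double sx = std::accumulate(x.begin(), x.end(), 0.0);
    double sk = std::accumulate(k.begin(), k.end(), 0.0);
    return n * (sx + sk);                    // n * (sum xi + sum ki)
}

double stream_time(int n, int S, const std::vector<double>& x,
                   const std::vector<double>& k) {
    double sx = std::accumulate(x.begin(), x.end(), 0.0);
    double sk = std::accumulate(k.begin(), k.end(), 0.0);
    int batches = (n + S - 1) / S;           // ceil(n/S) kernel batches
    return n * sx + batches * sk;            // serial parts + batched kernels
}
```

For example, with hypothetical totals Σxi = 1 and Σki = 9 and n = 32 inputs, the serial time is 320 while 32 streams give 32 + 9 = 41, so kernel-dominated workloads benefit most from adding streams.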
19. Conclusion
• For input lengths below 2^23 bits, the SSE implementation on the CPU performs best.
• Using more than 2 streams reduces the execution time; at input length 2^23, the multi-stream GPU is faster than both the SSE implementation and the bit-packed implementation with one stream.
• For input lengths above 2^23 bits, the best performer among the GPU and SIMD implementations is the GPU with 32 concurrent streams.
20. References
• [1] E. R. Berlekamp, "Nonbinary BCH decoding", International Symposium on Information Theory, San Remo, Italy, 1967.
• [2] E. R. Berlekamp, Algebraic Coding Theory, McGraw-Hill, New York, NY, 1968; reprinted by Aegean Park Press, Laguna Hills, CA, ISBN 0-89412-063-8.
• [3] J. L. Massey, "Shift-register synthesis and BCH decoding", IEEE Transactions on Information Theory, IT-15 (1): 122–127, 1969.
• [4] H. Ali, M. Ouyang, W. Sheta, and A. Soliman, "Parallelizing the Berlekamp-Massey Algorithm", Proceedings of the Second International Conference on Computing, Measurement, Control and Sensor Network (CMCSN), 2014.