Parallelization of the Berlekamp-Massey Algorithm Using SIMD Instructions and GPU Streams
Presented by: Hamidreza Mohebbi
Advisor: Ming Ouyang
Topics
• BMA Algorithm
• BMA Implementation using SIMD Instructions
• GPU Parallelization using Streams
• Conclusion
BMA Algorithm
• Linear feedback shift register (LFSR)
• Due to its ease of implementation, the LFSR is widely used:
  – Cryptography: GSM cell phones, Bluetooth
  – Scrambling: PCIe, SATA, USB, GbE
[Figure: block diagram of a linear feedback shift register with feedback taps]
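To make the LFSR concrete, here is a minimal sketch of one LFSR step in C; the 8-bit width and the tap positions are arbitrary illustrations, not the configuration drawn in the figure above.

```c
#include <stdint.h>

/* One step of a Fibonacci-style LFSR over GF(2).  The 8-bit register and
 * the tap positions (bits 0 and 2) are chosen only for illustration; they
 * are not the taps shown in the slide figure. */
static uint8_t lfsr_step(uint8_t *state)
{
    uint8_t out = *state & 1u;                           /* output bit          */
    uint8_t fb  = ((*state >> 0) ^ (*state >> 2)) & 1u;  /* feedback = s0 ^ s2  */
    *state = (uint8_t)((*state >> 1) | (fb << 7));       /* shift, insert fb    */
    return out;
}
```

Repeatedly calling lfsr_step on a nonzero seed produces the kind of binary sequence whose shortest generating LFSR the BMA recovers.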
BMA Algorithm
• Given a finite binary sequence, find a shortest LFSR that generates the sequence
• Elwyn Berlekamp
  – A paper/presentation at the International Symposium on Information Theory, Italy, 1967 [1]
  – Algebraic Coding Theory, McGraw-Hill, 1968 [2]
• James Massey
  – "Shift-register synthesis and BCH decoding", IEEE Transactions on Information Theory, 1969 [3]
• Also known as the "LFSR synthesis algorithm" or the "Berlekamp iterative algorithm"
BMA Algorithm
BMA Algorithm
• Prof. Ouyang's previous work [4]:
  – Reverse S
  – Pack 32 bits into one word
  – Compute the inner product
  – Count the number of 1-bits in a 32-bit word
  – Update C(x)
• The GPU is faster than the bit-reversal CPU version for long inputs (length greater than 2^22); a sketch of the bit-packed inner-product step follows.
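As an illustration of the bit-packed steps listed above (pack 32 bits per word, inner product, 1-bit count), here is a minimal sketch of the discrepancy computation of one BMA iteration over GF(2); the function and buffer names are ours, not those of [4], and the popcount uses the GCC/Clang builtin.

```c
#include <stdint.h>

/* Discrepancy of one Berlekamp-Massey iteration over GF(2), computed on a
 * bit-packed representation: s and c hold 32 sequence/coefficient bits per
 * word, so the inner product reduces to AND, XOR-accumulate, and a final
 * popcount parity.  Illustrative sketch only, not the code of [4]. */
static int bma_discrepancy(const uint32_t *s, const uint32_t *c, int nwords)
{
    uint32_t acc = 0;
    for (int i = 0; i < nwords; ++i)
        acc ^= s[i] & c[i];              /* bitwise products over GF(2) */
    return __builtin_popcount(acc) & 1;  /* parity of the 1-bit count   */
}
```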
BMA Implementation using SIMD Instructions
• A data-parallel architecture
• Applies the same instruction to many data elements
  – Saves control logic
  – A related architecture is the vector architecture
  – SIMD and vector architectures offer high performance for vector operations
• SSE (vectors of four 32-bit elements)
BMA Implementation using SIMD Instructions
• BMA-SSE
  – Uses SSE instructions for computation and for copying data
  – Processes four consecutive elements at a time
  – The number of iterations is less than the input length (1/4 of it in the best case) and depends on the input string
• BMA-AVX
  – Uses AVX instructions for computation and for copying data
  – Processes eight consecutive elements at a time
  – The number of iterations is less than the input length (1/8 of it in the best case) and depends on the input string
• A sketch of the vectorized word-wise update is shown below.
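A minimal sketch of how the word-wise XOR part of the update can process four 32-bit words per SSE instruction, or eight per AVX2 instruction; the function names and the unaligned-load assumption are ours, and the wide path assumes AVX2 integer instructions.

```c
#include <stdint.h>
#include <immintrin.h>

/* XOR one packed bit-array into another, four 32-bit words per SSE2
 * instruction; mirrors the kind of word-wise C(x) update used in BMA-SSE.
 * Buffer layout and names are illustrative only. */
void xor_words_sse(uint32_t *dst, const uint32_t *src, int nwords)
{
    int i = 0;
    for (; i + 4 <= nwords; i += 4) {
        __m128i a = _mm_loadu_si128((const __m128i *)(dst + i));
        __m128i b = _mm_loadu_si128((const __m128i *)(src + i));
        _mm_storeu_si128((__m128i *)(dst + i), _mm_xor_si128(a, b));
    }
    for (; i < nwords; ++i)        /* scalar tail for the remaining words */
        dst[i] ^= src[i];
}

#ifdef __AVX2__
/* Same update, eight 32-bit words per AVX2 instruction. */
void xor_words_avx(uint32_t *dst, const uint32_t *src, int nwords)
{
    int i = 0;
    for (; i + 8 <= nwords; i += 8) {
        __m256i a = _mm256_loadu_si256((const __m256i *)(dst + i));
        __m256i b = _mm256_loadu_si256((const __m256i *)(src + i));
        _mm256_storeu_si256((__m256i *)(dst + i), _mm256_xor_si256(a, b));
    }
    for (; i < nwords; ++i)
        dst[i] ^= src[i];
}
#endif
```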
BMA Implementation using SIMD Instructions
[Chart: execution time (sec) vs. input size (2^10 to 2^24) for cpuBit and BMASSE; random input, compiled with -O3]
BMA Implementation using SIMD Instructions
[Chart: execution time (sec) vs. input size (2^10 to 2^24) for cpuBit, BMASSE, and BMAAVX; random input, compiled with -O1]
GPU Parallelization using Streams
• Original BMA using GPU kernels:
  – K1, K2, K3, and K4 are kernel functions.
  – x_i and k_i are the execution times of the serial parts and of the kernels, respectively.
• Timeline: Serial (x1) → K1 (k1) → Serial (x2) → K2 (k2) → Serial (x3) → K3 (k3) → Serial (x4) → K4 (k4) → Serial (x5)
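The structure above can be pictured with the following host-side skeleton in CUDA; the kernel bodies, arguments, and launch configuration are placeholders, not the implementation of [4].

```cuda
#include <cuda_runtime.h>

/* Placeholder kernels standing in for the four kernel functions K1..K4. */
__global__ void K1() {}
__global__ void K2() {}
__global__ void K3() {}
__global__ void K4() {}

/* One pass of the original single-stream GPU BMA: serial host phases
 * (times x1..x5) alternate with kernel launches (times k1..k4). */
void bma_gpu_single(void)
{
    /* serial host work (x1) */
    K1<<<64, 256>>>(); cudaDeviceSynchronize();   /* k1 */
    /* serial host work (x2) */
    K2<<<64, 256>>>(); cudaDeviceSynchronize();   /* k2 */
    /* serial host work (x3) */
    K3<<<64, 256>>>(); cudaDeviceSynchronize();   /* k3 */
    /* serial host work (x4) */
    K4<<<64, 256>>>(); cudaDeviceSynchronize();   /* k4 */
    /* serial host work (x5) */
}
```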
GPU Parallelization using Streams
• Ideal BMA using concurrent kernel execution: n copies of the pipeline Serial → K1 → Serial → K2 → Serial → K3 → Serial → K4 → Serial run together.
• The serial phases still accumulate to n·x1, n·x2, n·x3, n·x4, and n·x5, while the kernels of different inputs overlap and contribute k1, k2, k3, and k4 only once.
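A minimal CUDA sketch of how n independent inputs can be issued across a pool of streams so that their kernels may overlap, as in the ideal timeline above; the kernel, the device buffers, the round-robin assignment, and the launch shape are all placeholders of ours, not the code of [4].

```cuda
#include <cuda_runtime.h>

/* Placeholder kernel standing in for the per-input BMA work. */
__global__ void bmaKernel(const unsigned *in, unsigned *out, int nwords) {}

/* Issue n inputs round-robin over S CUDA streams; kernels queued on
 * different streams may run concurrently on the device. */
void bma_streamed(const unsigned *const d_in[], unsigned *const d_out[],
                  int n, int nwords, int S)
{
    cudaStream_t streams[64];                     /* assumes S <= 64 */
    for (int s = 0; s < S; ++s)
        cudaStreamCreate(&streams[s]);

    for (int i = 0; i < n; ++i) {
        cudaStream_t st = streams[i % S];         /* round-robin stream   */
        bmaKernel<<<64, 256, 0, st>>>(d_in[i], d_out[i], nwords);
    }
    cudaDeviceSynchronize();                      /* wait for all streams */

    for (int s = 0; s < S; ++s)
        cudaStreamDestroy(streams[s]);
}
```

In practice the serial host phases, and any host-device copies issued with cudaMemcpyAsync on the same streams, limit how much of the kernel time can actually be hidden, which is what the stream model on the next slides captures.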
GPU Parallelization using Streams
• The total time of running n inputs serially is:
• The execution time of the ideal concurrent program is:
• The ideal speedup is:
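The formulas themselves are not in the extracted text; a reconstruction consistent with the two timelines above, assuming the serial host phases of the n inputs cannot overlap while the kernels overlap completely, is:

$$T_{\text{serial}} = n\Big(\sum_{i=1}^{5} x_i + \sum_{j=1}^{4} k_j\Big), \qquad T_{\text{ideal}} = n\sum_{i=1}^{5} x_i + \sum_{j=1}^{4} k_j, \qquad S_{\text{ideal}} = \frac{T_{\text{serial}}}{T_{\text{ideal}}}.$$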
GPU Parallelization using Streams
• In practice, the hardware supports S concurrent streams, so the concurrent execution time becomes:
• The real speedup is:
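Again the formulas are missing from the extracted text; a plausible reconstruction, assuming kernels overlap only within each batch of S streams, is:

$$T_S = n\sum_{i=1}^{5} x_i + \Big\lceil \frac{n}{S} \Big\rceil \sum_{j=1}^{4} k_j, \qquad S_{\text{real}} = \frac{T_{\text{serial}}}{T_S}.$$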
GPU Parallelization using Streams
[Chart: execution time (sec) vs. number of inputs (2, 4, 8, 16, 32, 64) for BMASSE-Serial and BMAStream-1/2/4/8/16/32/64; input length: 2^16]
GPU Parallelization using Streams
[Chart: execution time (sec) vs. number of inputs (2, 4, 8, 16, 32, 64) for BMASSE-Serial and BMAStream-1/2/4/8/16/32/64; input length: 2^20]
GPU Parallelization using Streams
[Chart: execution time (sec) vs. number of inputs (2, 4, 8, 16, 32, 64) for BMASSE-Serial and BMAStream-1/2/4/8/16/32/64; input length: 2^22]
GPU Parallelization using Streams
[Chart: execution time (sec) vs. number of inputs (2, 4, 8, 16, 32, 64) for BMASSE-Serial and BMAStream-1/2/4/8/16/32/64; input length: 2^23]
Conclusion
• For input lengths below 2^23, the best performance belongs to the SSE implementation on the CPU.
• Using more than two streams reduces the execution time; for input length 2^23, the multi-stream GPU version is faster than the SSE, bit-packed, and single-stream implementations.
• For input lengths above 2^23, the best performance among the GPU and SIMD implementations belongs to the GPU version with 32 concurrent streams.
References
• [1] Berlekamp, Elwyn R., "Nonbinary BCH decoding", International Symposium on Information Theory, San Remo, Italy, 1967.
• [2] Berlekamp, Elwyn R., Algebraic Coding Theory, Laguna Hills, CA: Aegean Park Press, ISBN 0-89412-063-8 (previously published by McGraw-Hill, New York, NY, 1968).
• [3] Massey, J. L., "Shift-register synthesis and BCH decoding", IEEE Transactions on Information Theory, IT-15 (1): 122–127, 1969.
• [4] Ali, H., Ouyang, M., Sheta, W., and Soliman, A., "Parallelizing the Berlekamp-Massey Algorithm", Proceedings of the Second International Conference on Computing, Measurement, Control and Sensor Network (CMCSN), 2014.