Parallel and Distributed Computing on Low Latency Clusters

                Vittorio Giovara
    M. S. Electrical Engineering and Computer Science
              University of Illinois at Chicago
                          May 2009
Contents
• Motivation
• Strategy
• Technologies
   - OpenMP
   - MPI
   - Infiniband
• Application
• Compiler Optimizations
• OpenMP and MPI over Infiniband
• Results
• Conclusions
Motivation

• Scaling trend has to stop for CMOS
  technology:
    ✓ Direct-tunneling limit in SiO2 ~3 nm
    ✓ Distance between Si atoms ~0.3 nm
    ✓ Variability

• Fundamental reason: rising fab cost
Motivation

• Multi-core processors are easy to build
• Adapting software for concurrency requires
  manual work
• New classification for computer
  architectures
Classification
[Figure: the four architecture classes SISD, SIMD, MISD and MIMD, each drawn as a data pool and an instruction pool feeding one or more CPUs]
[Figure: abstraction levels (algorithm, loop level, process management) plotted against how easy they are to parallelize]
Levels
• algorithm: recursion, memory management, profiling
• loop level: data dependency, branching overhead, control flow
• process management: SMP multiprogramming, multithreading and scheduling
Backfire

• Difficult to fully exploit the parallelism
  offered
• Automatic tools required to adapt software
  to parallelism
• Compiler support for manual or semi-
  automatic enhancement
Applications
• OpenMP and MPI are two popular tools
  used to simplify the parallelizing process of
  both new and old software
• Mathematics and Physics
• Computer Science
• Biomedics
Specific Problem and Background
• Sally3D is a micromagnetism program suite
  for field analysis and modeling developed at
  Politecnico di Torino (Department of
  Electrical Engineering)
• Computationally intensive (up to days of
  CPU time); speedup required
• Previous work does not fully cover
  the problem (no Infiniband or OpenMP+MPI
  solutions)
Strategy
• Install a Linux kernel with an ad-hoc
  configuration for scientific computation
• Compile an OpenMP-enabled GCC
  (supported from 4.3.1 onwards)
• Add the Infiniband link among clusters with
  proper drivers in kernel and user space
• Select an MPI implementation library
Strategy
• Verify Infiniband network through some
  MPI test examples
• Install the target software
• Proceed to include OpenMP and MPI
  directives in the code
• Run test cases
OpenMP

• standard
• supported by most modern compilers
• requires little knowledge of the software
• very simple construction methods
OpenMP - example
[Figure: Parallel Tasks 1 to 4 are forked from the Master Thread onto Thread A and Thread B, then merged again at a Join]
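A minimal sketch of the fork-join pattern pictured above, assuming a C code base (in FORTRAN, the language of Sally3D, the equivalent directive is `!$omp parallel do`). The function name `dot` and its arguments are hypothetical and are not taken from Sally3D.

```c
#include <assert.h>
#include <stddef.h>

/* Dot product of two vectors; hypothetical example, not Sally3D code.
 * The master thread forks a team at the pragma, each thread sums a share
 * of the iterations into a private copy of `sum`, and the copies are
 * combined when the threads join at the end of the loop. */
double dot(const double *a, const double *b, int n)
{
    assert(a != NULL && b != NULL && n >= 0);

    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; i++) {
        sum += a[i] * b[i];
    }
    return sum;
}
```

Built with `gcc -fopenmp`, the loop is shared among several threads; without the flag the pragma is ignored and the function runs sequentially, giving the same result up to floating-point rounding.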
OpenMP Scheduler

• Which scheduler to use on the available hardware?
   - Static
   - Dynamic
   - Guided
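For orientation: the static scheduler assigns chunks of iterations to threads up front, the dynamic scheduler hands out chunks on demand, and the guided scheduler hands out progressively smaller chunks. One way such policies might be compared is sketched below, assuming the choice is deferred with `schedule(runtime)` so that the same binary can be timed under each scheduler and chunk size through the standard `OMP_SCHEDULE` environment variable. The function `scale_fill` and its workload are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Fills out[i] with a value computed from i; hypothetical workload.
 * schedule(runtime) makes the OpenMP runtime read the policy from the
 * OMP_SCHEDULE environment variable, e.g. "static,100", "dynamic,10"
 * or "guided,1000" (policy name, then chunk size). */
void scale_fill(double *out, int n)
{
    assert(out != NULL && n >= 0);

    #pragma omp parallel for schedule(runtime)
    for (int i = 0; i < n; i++) {
        out[i] = 0.5 * (double)i;
    }
}
```

For example, `OMP_SCHEDULE="dynamic,100" OMP_NUM_THREADS=8 ./a.out` would select the dynamic scheduler with chunks of 100 iterations on 8 threads, where `./a.out` stands for whatever program calls the function.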
OpenMP Scheduler
[Chart: OpenMP Static Scheduler; time in microseconds (axis 0 to 80000) vs. number of threads (1 to 16); series: chunk 1, chunk 10, chunk 100, chunk 1000, chunk 10000]
OpenMP Scheduler
[Chart: OpenMP Dynamic Scheduler; time in microseconds (axis 0 to 117000) vs. number of threads (1 to 16); series: chunk 1, chunk 10, chunk 100, chunk 1000, chunk 10000]
OpenMP Scheduler
[Chart: OpenMP Guided Scheduler; time in microseconds (axis 0 to 80000) vs. number of threads (1 to 16); series: chunk 1, chunk 10, chunk 100, chunk 1000, chunk 10000]
OpenMP Scheduler
[Chart: comparison of static scheduler, dynamic scheduler and guided scheduler]
MPI
• standard
• widely used in cluster environments
• many transport links supported
• different implementations available
      - OpenMPI
      - MVAPICH
Infiniband

• standard
• widely used in cluster environments
• very low latency for small packets
• up to 16 Gb/s transfer speed
MPI over Infiniband
[Chart: time in µs on a logarithmic axis (1 µs to 10000000 µs) vs. message size in powers of two, from 1 kB up to the GB range; series: OpenMPI, Mvapich2]
MPI over Infiniband
[Chart: time in µs on a logarithmic axis (1 µs to 10000000 µs) vs. message size in powers of two, from 1 kB to 8 MB; series: OpenMPI, Mvapich2]
Optimizations

• Active at compile time
• Available only after porting the software to
  standard FORTRAN
• Consistent documentation available
• Unexpected positive results
Optimizations

• -march=native
• -O3
• -ffast-math
• -Wl,-O1
Target Software

• Sally3D
• micromagnetic equation solver
• written in FORTRAN with some C libraries
• program uses linear formulation of
  mathematical models
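Implementation Scheme
[Figure: in the standard programming model a sequential loop runs on one thread; a parallel loop is split among OpenMP threads; a distributed loop is split between Host 1 and Host 2, each running its own OpenMP threads, with MPI connecting the hosts]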
Implementation Scheme
• Data structure: not embarrassingly parallel
• Three-dimensional matrix
• Several temporary arrays, so synchronization
  objects are required
  ➡ send() and recv() mechanism
  ➡ critical regions using OpenMP directives
  ➡ function merging
  ➡ matrix conversion
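The distributed-loop idea can be sketched as follows, under the assumption that each MPI process takes one contiguous block of iterations and shares it among its own OpenMP threads. `range_t`, `block_range` and `local_sum` are hypothetical names for illustration and are not Sally3D code; the MPI calls themselves are only mentioned in comments so that the sketch stays compilable without an MPI installation.

```c
#include <assert.h>
#include <stddef.h>

/* Half-open iteration range [lo, hi) owned by one process. */
typedef struct {
    int lo;
    int hi;
} range_t;

/* Splits n iterations into `size` contiguous blocks and returns the block
 * of process `rank`; the first n % size ranks get one extra iteration.
 * In an MPI program, rank and size would come from MPI_Comm_rank() and
 * MPI_Comm_size(). */
range_t block_range(int n, int rank, int size)
{
    assert(n >= 0 && size > 0 && rank >= 0 && rank < size);

    int base = n / size;
    int extra = n % size;
    range_t r;
    r.lo = rank * base + (rank < extra ? rank : extra);
    r.hi = r.lo + base + (rank < extra ? 1 : 0);
    return r;
}

/* Work done by one host on its own block: the outer split is across MPI
 * processes, the inner loop is shared among that host's OpenMP threads. */
double local_sum(const double *data, range_t r)
{
    assert(data != NULL && r.lo <= r.hi);

    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (int i = r.lo; i < r.hi; i++) {
        sum += data[i];
    }
    return sum;
}
```

Each process would then exchange its partial result with the others, for instance with `MPI_Send`/`MPI_Recv` pairs or a single `MPI_Reduce`.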
Results
  OMP    MPI    OPT   seconds
   *      *      *       133
   *      *      -       400
   *      -      *       186
   *      -      -       487
   -      *      *       200
   -      *      -       792
   -      -      *       246
   -      -      -      1062




Total Speed Increase: 87.52%
Actual Results
                  OMP       MPI      seconds
                   *         *          59
                   *         -         129
                   -         *         174
                   -         -         249




Function Name      Normal    OpenMP    MPI       OpenMP+MPI
calc_intmudua      24.5 s    4.7 s     14.4 s    2.8 s
calc_hdmg_tet      16.9 s    3.0 s     10.8 s    1.7 s
calc_mudua         12.1 s    1.9 s      7.0 s    1.1 s
campo_effettivo    17.7 s    4.5 s      9.9 s    2.3 s
Actual Results

• OpenMP – 6-8x
• MPI – 2x
• OpenMP + MPI – 14 - 16x

     Total Raw Speed Increment: 76%
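The two total percentages can be read as the fraction of run time saved relative to the unmodified program. A small sketch of that arithmetic, assuming this interpretation (the function names are hypothetical):

```c
#include <assert.h>

/* Percentage of run time saved when going from t_base to t_new seconds. */
double percent_saved(double t_base, double t_new)
{
    assert(t_base > 0.0 && t_new >= 0.0);
    return 100.0 * (1.0 - t_new / t_base);
}

/* Speedup factor: how many times faster the new run is. */
double speedup(double t_base, double t_new)
{
    assert(t_base > 0.0 && t_new > 0.0);
    return t_base / t_new;
}
```

With the whole-second timings from the tables, `percent_saved(1062, 133)` gives about 87.5 and `percent_saved(249, 59)` about 76.3, in line with the totals quoted; `speedup(249, 59)` is roughly 4.2.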
Conclusions and Future Work
• Computational time has been significantly
  decreased
• Speedup is consistent with expected results
• Submitted to COMPUMAG '09
• Continue inserting OpenMP and MPI directives
• Perform algorithm optimizations
• Increase cluster size

More Related Content

What's hot

Lightweight DNN Processor Design (based on NVDLA)
Lightweight DNN Processor Design (based on NVDLA)Lightweight DNN Processor Design (based on NVDLA)
Lightweight DNN Processor Design (based on NVDLA)Shien-Chun Luo
 
Cots moves to multicore: AMD
Cots moves to multicore: AMDCots moves to multicore: AMD
Cots moves to multicore: AMDKonrad Witte
 
Dimemas and Multi-Level Cache Simulations
Dimemas and Multi-Level Cache SimulationsDimemas and Multi-Level Cache Simulations
Dimemas and Multi-Level Cache SimulationsMário Almeida
 
Presentation of the open source CFD code Code_Saturne
Presentation of the open source CFD code Code_SaturnePresentation of the open source CFD code Code_Saturne
Presentation of the open source CFD code Code_SaturneRenuda SARL
 
Introduction to Chainer 11 may,2018
Introduction to Chainer 11 may,2018Introduction to Chainer 11 may,2018
Introduction to Chainer 11 may,2018Preferred Networks
 
ScalableCore System: A Scalable Many-core Simulator by Employing Over 100 FPGAs
ScalableCore System: A Scalable Many-core Simulator by Employing Over 100 FPGAsScalableCore System: A Scalable Many-core Simulator by Employing Over 100 FPGAs
ScalableCore System: A Scalable Many-core Simulator by Employing Over 100 FPGAsShinya Takamaeda-Y
 
Gfarm Fs Tatebe Tip2004
Gfarm Fs Tatebe Tip2004Gfarm Fs Tatebe Tip2004
Gfarm Fs Tatebe Tip2004xlight
 
Thesis Report - Gaurav Raina MSc ES - v2
Thesis Report - Gaurav Raina MSc ES - v2Thesis Report - Gaurav Raina MSc ES - v2
Thesis Report - Gaurav Raina MSc ES - v2Gaurav Raina
 
Data flow super computing valentina balas
Data flow super computing   valentina balasData flow super computing   valentina balas
Data flow super computing valentina balasValentina Emilia Balas
 
HPCMPUG2011 cray tutorial
HPCMPUG2011 cray tutorialHPCMPUG2011 cray tutorial
HPCMPUG2011 cray tutorialJeff Larkin
 
Lect.10.arm soc.4 neon
Lect.10.arm soc.4 neonLect.10.arm soc.4 neon
Lect.10.arm soc.4 neonsean chen
 
stream processing engine
stream processing enginestream processing engine
stream processing enginetiana528
 

What's hot (15)

ISBI MPI Tutorial
ISBI MPI TutorialISBI MPI Tutorial
ISBI MPI Tutorial
 
2011.jtr.pbasanta.
2011.jtr.pbasanta.2011.jtr.pbasanta.
2011.jtr.pbasanta.
 
Lightweight DNN Processor Design (based on NVDLA)
Lightweight DNN Processor Design (based on NVDLA)Lightweight DNN Processor Design (based on NVDLA)
Lightweight DNN Processor Design (based on NVDLA)
 
Cots moves to multicore: AMD
Cots moves to multicore: AMDCots moves to multicore: AMD
Cots moves to multicore: AMD
 
Dimemas and Multi-Level Cache Simulations
Dimemas and Multi-Level Cache SimulationsDimemas and Multi-Level Cache Simulations
Dimemas and Multi-Level Cache Simulations
 
Presentation of the open source CFD code Code_Saturne
Presentation of the open source CFD code Code_SaturnePresentation of the open source CFD code Code_Saturne
Presentation of the open source CFD code Code_Saturne
 
Introduction to Chainer 11 may,2018
Introduction to Chainer 11 may,2018Introduction to Chainer 11 may,2018
Introduction to Chainer 11 may,2018
 
ScalableCore System: A Scalable Many-core Simulator by Employing Over 100 FPGAs
ScalableCore System: A Scalable Many-core Simulator by Employing Over 100 FPGAsScalableCore System: A Scalable Many-core Simulator by Employing Over 100 FPGAs
ScalableCore System: A Scalable Many-core Simulator by Employing Over 100 FPGAs
 
Gfarm Fs Tatebe Tip2004
Gfarm Fs Tatebe Tip2004Gfarm Fs Tatebe Tip2004
Gfarm Fs Tatebe Tip2004
 
Thesis Report - Gaurav Raina MSc ES - v2
Thesis Report - Gaurav Raina MSc ES - v2Thesis Report - Gaurav Raina MSc ES - v2
Thesis Report - Gaurav Raina MSc ES - v2
 
Data flow super computing valentina balas
Data flow super computing   valentina balasData flow super computing   valentina balas
Data flow super computing valentina balas
 
HPCMPUG2011 cray tutorial
HPCMPUG2011 cray tutorialHPCMPUG2011 cray tutorial
HPCMPUG2011 cray tutorial
 
Lect.10.arm soc.4 neon
Lect.10.arm soc.4 neonLect.10.arm soc.4 neon
Lect.10.arm soc.4 neon
 
stream processing engine
stream processing enginestream processing engine
stream processing engine
 
L05 parallel
L05 parallelL05 parallel
L05 parallel
 

Viewers also liked

Google Cluster Innards
Google Cluster InnardsGoogle Cluster Innards
Google Cluster InnardsMartin Dvorak
 
Evaluation of Virtual Clusters Performance on a Cloud Computing Infrastructure
Evaluation of Virtual Clusters Performance on a Cloud Computing InfrastructureEvaluation of Virtual Clusters Performance on a Cloud Computing Infrastructure
Evaluation of Virtual Clusters Performance on a Cloud Computing InfrastructureEuroCloud
 
Cloud Computing Clusters for Dummies
Cloud Computing Clusters for DummiesCloud Computing Clusters for Dummies
Cloud Computing Clusters for DummiesLiberteks
 
Clusters (Distributed computing)
Clusters (Distributed computing)Clusters (Distributed computing)
Clusters (Distributed computing)Sri Prasanna
 
Clusters, Grids & Clouds for Engineering Design, Simulation, and Collaboration
Clusters, Grids & Clouds for Engineering Design, Simulation, and CollaborationClusters, Grids & Clouds for Engineering Design, Simulation, and Collaboration
Clusters, Grids & Clouds for Engineering Design, Simulation, and CollaborationWolfgang Gentzsch
 
SLE12 SP2 : High Availability et Geo Cluster
SLE12 SP2 : High Availability et Geo ClusterSLE12 SP2 : High Availability et Geo Cluster
SLE12 SP2 : High Availability et Geo ClusterSUSE
 
Grid computing ppt 2003(done)
Grid computing ppt 2003(done)Grid computing ppt 2003(done)
Grid computing ppt 2003(done)TASNEEM88
 
Chap8 basic cluster_analysis
Chap8 basic cluster_analysisChap8 basic cluster_analysis
Chap8 basic cluster_analysisguru_prasadg
 
Grid computing notes
Grid computing notesGrid computing notes
Grid computing notesSyed Mustafa
 
Data mining tools (R , WEKA, RAPID MINER, ORANGE)
Data mining tools (R , WEKA, RAPID MINER, ORANGE)Data mining tools (R , WEKA, RAPID MINER, ORANGE)
Data mining tools (R , WEKA, RAPID MINER, ORANGE)Krishna Petrochemicals
 

Viewers also liked (16)

Cloud Computing
Cloud Computing Cloud Computing
Cloud Computing
 
Google Cluster Innards
Google Cluster InnardsGoogle Cluster Innards
Google Cluster Innards
 
Grid
GridGrid
Grid
 
Evaluation of Virtual Clusters Performance on a Cloud Computing Infrastructure
Evaluation of Virtual Clusters Performance on a Cloud Computing InfrastructureEvaluation of Virtual Clusters Performance on a Cloud Computing Infrastructure
Evaluation of Virtual Clusters Performance on a Cloud Computing Infrastructure
 
Cloud Computing Clusters for Dummies
Cloud Computing Clusters for DummiesCloud Computing Clusters for Dummies
Cloud Computing Clusters for Dummies
 
Clusters (Distributed computing)
Clusters (Distributed computing)Clusters (Distributed computing)
Clusters (Distributed computing)
 
Chapter16 new
Chapter16 newChapter16 new
Chapter16 new
 
Clusters, Grids & Clouds for Engineering Design, Simulation, and Collaboration
Clusters, Grids & Clouds for Engineering Design, Simulation, and CollaborationClusters, Grids & Clouds for Engineering Design, Simulation, and Collaboration
Clusters, Grids & Clouds for Engineering Design, Simulation, and Collaboration
 
SLE12 SP2 : High Availability et Geo Cluster
SLE12 SP2 : High Availability et Geo ClusterSLE12 SP2 : High Availability et Geo Cluster
SLE12 SP2 : High Availability et Geo Cluster
 
Grid computing ppt 2003(done)
Grid computing ppt 2003(done)Grid computing ppt 2003(done)
Grid computing ppt 2003(done)
 
Cluster computing
Cluster computingCluster computing
Cluster computing
 
Chap8 basic cluster_analysis
Chap8 basic cluster_analysisChap8 basic cluster_analysis
Chap8 basic cluster_analysis
 
Grid computing notes
Grid computing notesGrid computing notes
Grid computing notes
 
Data mining tools (R , WEKA, RAPID MINER, ORANGE)
Data mining tools (R , WEKA, RAPID MINER, ORANGE)Data mining tools (R , WEKA, RAPID MINER, ORANGE)
Data mining tools (R , WEKA, RAPID MINER, ORANGE)
 
Cluster analysis
Cluster analysisCluster analysis
Cluster analysis
 
Distributed Computing
Distributed ComputingDistributed Computing
Distributed Computing
 

Similar to Parallel and Distributed Computing on Low Latency Clusters

Data-Intensive Computing for Competent Genetic Algorithms: A Pilot Study us...
Data-Intensive Computing for  Competent Genetic Algorithms:  A Pilot Study us...Data-Intensive Computing for  Competent Genetic Algorithms:  A Pilot Study us...
Data-Intensive Computing for Competent Genetic Algorithms: A Pilot Study us...Xavier Llorà
 
Evaluating Data Freshness in Large Scale Replicated Databases
Evaluating Data Freshness in Large Scale Replicated DatabasesEvaluating Data Freshness in Large Scale Replicated Databases
Evaluating Data Freshness in Large Scale Replicated DatabasesMiguel Araújo
 
Acceleration for big data, hadoop and memcached it168文库
Acceleration for big data, hadoop and memcached it168文库Acceleration for big data, hadoop and memcached it168文库
Acceleration for big data, hadoop and memcached it168文库Accenture
 
Acceleration for big data, hadoop and memcached it168文库
Acceleration for big data, hadoop and memcached it168文库Acceleration for big data, hadoop and memcached it168文库
Acceleration for big data, hadoop and memcached it168文库Accenture
 
Performance Oriented Design
Performance Oriented DesignPerformance Oriented Design
Performance Oriented DesignRodrigo Campos
 
The 5 Stages of Scale
The 5 Stages of ScaleThe 5 Stages of Scale
The 5 Stages of Scalexcbsmith
 
A Dataflow Processing Chip for Training Deep Neural Networks
A Dataflow Processing Chip for Training Deep Neural NetworksA Dataflow Processing Chip for Training Deep Neural Networks
A Dataflow Processing Chip for Training Deep Neural Networksinside-BigData.com
 
Crash course on data streaming (with examples using Apache Flink)
Crash course on data streaming (with examples using Apache Flink)Crash course on data streaming (with examples using Apache Flink)
Crash course on data streaming (with examples using Apache Flink)Vincenzo Gulisano
 
Optimizing Terascale Machine Learning Pipelines with Keystone ML
Optimizing Terascale Machine Learning Pipelines with Keystone MLOptimizing Terascale Machine Learning Pipelines with Keystone ML
Optimizing Terascale Machine Learning Pipelines with Keystone MLSpark Summit
 
Betting On Data Grids
Betting On Data GridsBetting On Data Grids
Betting On Data Gridsgojkoadzic
 
Fast datastacks - fast and flexible nfv solution stacks leveraging fd.io
Fast datastacks - fast and flexible nfv solution stacks leveraging fd.ioFast datastacks - fast and flexible nfv solution stacks leveraging fd.io
Fast datastacks - fast and flexible nfv solution stacks leveraging fd.ioOPNFV
 
AIST Super Green Cloud: lessons learned from the operation and the performanc...
AIST Super Green Cloud: lessons learned from the operation and the performanc...AIST Super Green Cloud: lessons learned from the operation and the performanc...
AIST Super Green Cloud: lessons learned from the operation and the performanc...Ryousei Takano
 
Kersten eder q4_2008_bristol_2
Kersten eder q4_2008_bristol_2Kersten eder q4_2008_bristol_2
Kersten eder q4_2008_bristol_2Obsidian Software
 
PerfUG 3 - perfs système
PerfUG 3 - perfs systèmePerfUG 3 - perfs système
PerfUG 3 - perfs systèmeLudovic Piot
 
Presentation Thesis - Convolutional net on the Xeon Phi using SIMD - Gaurav R...
Presentation Thesis - Convolutional net on the Xeon Phi using SIMD - Gaurav R...Presentation Thesis - Convolutional net on the Xeon Phi using SIMD - Gaurav R...
Presentation Thesis - Convolutional net on the Xeon Phi using SIMD - Gaurav R...Gaurav Raina
 
2018 03 25 system ml ai and openpower meetup
2018 03 25 system ml ai and openpower meetup2018 03 25 system ml ai and openpower meetup
2018 03 25 system ml ai and openpower meetupGanesan Narayanasamy
 
Performance Analysis of Lattice QCD with APGAS Programming Model
Performance Analysis of Lattice QCD with APGAS Programming ModelPerformance Analysis of Lattice QCD with APGAS Programming Model
Performance Analysis of Lattice QCD with APGAS Programming ModelKoichi Shirahata
 
OpenStack and OpenFlow Demos
OpenStack and OpenFlow DemosOpenStack and OpenFlow Demos
OpenStack and OpenFlow DemosBrent Salisbury
 
Scallable Distributed Deep Learning on OpenPOWER systems
Scallable Distributed Deep Learning on OpenPOWER systemsScallable Distributed Deep Learning on OpenPOWER systems
Scallable Distributed Deep Learning on OpenPOWER systemsGanesan Narayanasamy
 

Similar to Parallel and Distributed Computing on Low Latency Clusters (20)

Data-Intensive Computing for Competent Genetic Algorithms: A Pilot Study us...
Data-Intensive Computing for  Competent Genetic Algorithms:  A Pilot Study us...Data-Intensive Computing for  Competent Genetic Algorithms:  A Pilot Study us...
Data-Intensive Computing for Competent Genetic Algorithms: A Pilot Study us...
 
Evaluating Data Freshness in Large Scale Replicated Databases
Evaluating Data Freshness in Large Scale Replicated DatabasesEvaluating Data Freshness in Large Scale Replicated Databases
Evaluating Data Freshness in Large Scale Replicated Databases
 
Acceleration for big data, hadoop and memcached it168文库
Acceleration for big data, hadoop and memcached it168文库Acceleration for big data, hadoop and memcached it168文库
Acceleration for big data, hadoop and memcached it168文库
 
Acceleration for big data, hadoop and memcached it168文库
Acceleration for big data, hadoop and memcached it168文库Acceleration for big data, hadoop and memcached it168文库
Acceleration for big data, hadoop and memcached it168文库
 
Performance Oriented Design
Performance Oriented DesignPerformance Oriented Design
Performance Oriented Design
 
The 5 Stages of Scale
The 5 Stages of ScaleThe 5 Stages of Scale
The 5 Stages of Scale
 
A Dataflow Processing Chip for Training Deep Neural Networks
A Dataflow Processing Chip for Training Deep Neural NetworksA Dataflow Processing Chip for Training Deep Neural Networks
A Dataflow Processing Chip for Training Deep Neural Networks
 
Crash course on data streaming (with examples using Apache Flink)
Crash course on data streaming (with examples using Apache Flink)Crash course on data streaming (with examples using Apache Flink)
Crash course on data streaming (with examples using Apache Flink)
 
Optimizing Terascale Machine Learning Pipelines with Keystone ML
Optimizing Terascale Machine Learning Pipelines with Keystone MLOptimizing Terascale Machine Learning Pipelines with Keystone ML
Optimizing Terascale Machine Learning Pipelines with Keystone ML
 
Betting On Data Grids
Betting On Data GridsBetting On Data Grids
Betting On Data Grids
 
Fast datastacks - fast and flexible nfv solution stacks leveraging fd.io
Fast datastacks - fast and flexible nfv solution stacks leveraging fd.ioFast datastacks - fast and flexible nfv solution stacks leveraging fd.io
Fast datastacks - fast and flexible nfv solution stacks leveraging fd.io
 
AIST Super Green Cloud: lessons learned from the operation and the performanc...
AIST Super Green Cloud: lessons learned from the operation and the performanc...AIST Super Green Cloud: lessons learned from the operation and the performanc...
AIST Super Green Cloud: lessons learned from the operation and the performanc...
 
Kersten eder q4_2008_bristol_2
Kersten eder q4_2008_bristol_2Kersten eder q4_2008_bristol_2
Kersten eder q4_2008_bristol_2
 
Dasia 2022
Dasia 2022Dasia 2022
Dasia 2022
 
PerfUG 3 - perfs système
PerfUG 3 - perfs systèmePerfUG 3 - perfs système
PerfUG 3 - perfs système
 
Presentation Thesis - Convolutional net on the Xeon Phi using SIMD - Gaurav R...
Presentation Thesis - Convolutional net on the Xeon Phi using SIMD - Gaurav R...Presentation Thesis - Convolutional net on the Xeon Phi using SIMD - Gaurav R...
Presentation Thesis - Convolutional net on the Xeon Phi using SIMD - Gaurav R...
 
2018 03 25 system ml ai and openpower meetup
2018 03 25 system ml ai and openpower meetup2018 03 25 system ml ai and openpower meetup
2018 03 25 system ml ai and openpower meetup
 
Performance Analysis of Lattice QCD with APGAS Programming Model
Performance Analysis of Lattice QCD with APGAS Programming ModelPerformance Analysis of Lattice QCD with APGAS Programming Model
Performance Analysis of Lattice QCD with APGAS Programming Model
 
OpenStack and OpenFlow Demos
OpenStack and OpenFlow DemosOpenStack and OpenFlow Demos
OpenStack and OpenFlow Demos
 
Scallable Distributed Deep Learning on OpenPOWER systems
Scallable Distributed Deep Learning on OpenPOWER systemsScallable Distributed Deep Learning on OpenPOWER systems
Scallable Distributed Deep Learning on OpenPOWER systems
 





Parallel and Distributed Computing on Low Latency Clusters

  • 1. Parallel and Distributed Computing on Low Latency Clusters. Vittorio Giovara, M. S. Electrical Engineering and Computer Science, University of Illinois at Chicago, May 2009
  • 2. Contents • Motivation • Strategy • Technologies: OpenMP, MPI, Infiniband • Application: Compiler Optimizations, OpenMP and MPI over Infiniband • Results • Conclusions
  • 4. Motivation • Scaling trend has to stop for CMOS technology: ✓ Direct-tunneling limit in SiO2 ~3 nm ✓ Distance between Si atoms ~0.3 nm ✓ Variability • Fundamental reason: rising fab cost
  • 5. Motivation • Easy to build multi-core processors • Requires human action to modify and adapt software for concurrency • New classification for computer architectures
  • 6. Classification (diagram): SISD, SIMD, MISD and MIMD, each drawn as an instruction pool and a data pool feeding one or more CPUs
  • 7. (diagram) Abstraction level: algorithm, loop level, process management; arrow labeled "easier to parallelize"
  • 8. Levels (diagram): algorithm, loop level, process management. Labels: recursion, memory management, profiling, data dependency, branching overhead, control flow, SMP, multiprogramming, multithreading and scheduling
  • 9. Backfire • Difficult to fully exploit the parallelism offered • Automatic tools required to adapt software to parallelism • Compiler support for manual or semi-automatic enhancement
  • 10. Applications • OpenMP and MPI are two popular tools used to simplify the parallelization of both new and existing software • Mathematics and Physics • Computer Science • Biomedicine
  • 11. Specific Problem and Background • Sally3D is a micromagnetism program suite for field analysis and modeling developed at Politecnico di Torino (Department of Electrical Engineering) • Computationally intensive (up to days of CPU time); speedup required • Previous work does not yet fully cover the problem (no Infiniband or OpenMP+MPI solutions)
  • 13. Strategy • Install a Linux kernel with an ad-hoc configuration for scientific computation • Compile an OpenMP-enabled GCC (supported from 4.3.1 onwards) • Add the Infiniband link among clusters with proper drivers in kernel and user space • Select an MPI implementation library
  • 14. Strategy • Verify Infiniband network through some MPI test examples • Install the target software • Proceed to include OpenMP and MPI directives in the code • Run test cases
  • 15. OpenMP • standard • supported by most modern compilers • requires little knowledge of the software • very simple constructs
  • 17. OpenMP - example (diagram): Parallel Task 1, Parallel Task 2, Parallel Task 3, Parallel Task 4
  • 18. (diagram) Parallel Tasks 1 to 4 are distributed over Thread A and Thread B, which then join the master thread
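The fork-join pattern on slides 17 and 18 can be modeled with a small thread pool. This is only a sketch of the control flow, written in Python for brevity: the deck applies OpenMP directives to FORTRAN and C code, and `parallel_task` and `fork_join` are hypothetical names chosen for illustration. Note also that CPython threads do not run CPU-bound work in parallel, so the example shows the structure rather than the speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def parallel_task(task_id):
    """Hypothetical stand-in for one independent unit of work."""
    return task_id * task_id


def fork_join(task_ids, n_threads=2):
    """Fork worker threads, run every task, then join back to the caller."""
    # Fork: the pool starts up to n_threads workers (Thread A, Thread B).
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        # map() hands tasks to whichever worker is free and keeps input order.
        results = list(pool.map(parallel_task, task_ids))
    # Join: leaving the with-block waits for all workers to finish,
    # so only the calling (master) thread continues from here.
    return results


if __name__ == "__main__":
    print(fork_join([1, 2, 3, 4]))
```

The four tasks are independent, which is what makes them safe to hand to separate threads without synchronization.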
  • 19. OpenMP Scheduler • Which schedulers are available for the hardware? - Static - Dynamic - Guided
  • 20. OpenMP Scheduler: OpenMP Static Scheduler Chart (time in microseconds against number of threads, 1 to 16, for chunk sizes 1, 10, 100, 1000 and 10000)
  • 21. OpenMP Scheduler: OpenMP Dynamic Scheduler Chart (time in microseconds against number of threads, 1 to 16, for chunk sizes 1, 10, 100, 1000 and 10000)
  • 22. OpenMP Scheduler: OpenMP Guided Scheduler Chart (time in microseconds against number of threads, 1 to 16, for chunk sizes 1, 10, 100, 1000 and 10000)
  • 24. OpenMP Scheduler (charts): static scheduler, dynamic scheduler, guided scheduler
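To make the chunk sizes in the scheduler charts more concrete, the sketch below models how loop iterations could be split into chunks. It is a simplified Python model, not OpenMP itself: `static_chunks` mirrors the round-robin assignment of a `schedule(static, chunk)` clause, while `guided_chunk_sizes` uses one plausible shrinking rule (remaining iterations divided by the thread count, never below the minimum chunk), which is an assumption because real runtimes are free to pick their own formula. A dynamic schedule uses the same fixed chunks as the static one but hands them to whichever thread is idle, so its assignment is not deterministic and is not modeled here.

```python
def static_chunks(n_iters, n_threads, chunk):
    """Assign fixed-size chunks to threads in round-robin order."""
    plan = [[] for _ in range(n_threads)]
    for k, start in enumerate(range(0, n_iters, chunk)):
        stop = min(start + chunk, n_iters)
        plan[k % n_threads].append((start, stop))
    return plan


def guided_chunk_sizes(n_iters, n_threads, min_chunk):
    """Return shrinking chunk sizes (assumed rule, see the text above)."""
    sizes = []
    remaining = n_iters
    while remaining > 0:
        proportional = -(-remaining // n_threads)  # ceiling division
        size = min(max(min_chunk, proportional), remaining)
        sizes.append(size)
        remaining -= size
    return sizes


if __name__ == "__main__":
    for thread, chunks in enumerate(static_chunks(10, 2, 3)):
        print("thread", thread, "gets", chunks)
    print("guided sizes:", guided_chunk_sizes(100, 4, 5))
```

Small chunks balance the load better but cost more scheduling work per iteration, which is the trade-off the charts explore across chunk sizes.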
  • 25. MPI • standard • widely used in cluster environments • many transport links supported • different implementations available - OpenMPI - MVAPICH
  • 26. Infiniband • standard • widely used in cluster environment • very low latency for small packets • up to 16 Gb/s transfer speed
  • 27. MPI over Infiniband (chart): time on a logarithmic scale from 1 µs to 10,000,000 µs against message size from bytes up to gigabytes, for OpenMPI and Mvapich2
  • 28. MPI over Infiniband (chart): time on a logarithmic scale from 1 µs to 10,000,000 µs against message size from bytes up to megabytes, for OpenMPI and Mvapich2
  • 29. Optimizations • Active at compile time • Available only after porting the software to standard FORTRAN • Consistent documentation available • Unexpected positive results
  • 32. Target Software • Sally3D • micromagnetic equation solver • written in FORTRAN with some C libraries • program uses linear formulation of mathematical models
  • 33. Implementation Scheme (diagram): sequential loop (standard programming model); parallel loop (OpenMP threads); distributed loop (OpenMP threads on Host 1 and Host 2, connected through MPI)
  • 34. Implementation Scheme • Data structure: not embarrassingly parallel • Three-dimensional matrix • Several temporary arrays: synchronization objects required ➡ send() and recv() mechanism ➡ critical regions using OpenMP directives ➡ function merging ➡ matrix conversion
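The distributed loop of the implementation scheme (iterations split across hosts through MPI, then across OpenMP threads inside each host) can be sketched as a two-level block decomposition. This is a plain Python model under stated assumptions: `block_range` and `hybrid_plan` are hypothetical helpers, the host and thread indices are ordinary integers rather than values obtained from an MPI library or an OpenMP runtime, and the even block split is one common choice rather than the one necessarily used in Sally3D.

```python
def block_range(n, parts, index):
    """Return the half-open range [start, stop) owned by part `index`."""
    base, extra = divmod(n, parts)
    # The first `extra` parts each take one additional iteration.
    start = index * base + min(index, extra)
    stop = start + base + (1 if index < extra else 0)
    return start, stop


def hybrid_plan(n_iters, n_hosts, n_threads):
    """Map (host, thread) to the loop iterations that pair would execute."""
    plan = {}
    for host in range(n_hosts):
        # First level: split the loop across hosts (the MPI level).
        h_lo, h_hi = block_range(n_iters, n_hosts, host)
        for thread in range(n_threads):
            # Second level: split the host's share across its threads.
            t_lo, t_hi = block_range(h_hi - h_lo, n_threads, thread)
            plan[(host, thread)] = (h_lo + t_lo, h_lo + t_hi)
    return plan


if __name__ == "__main__":
    for owner, span in sorted(hybrid_plan(10, 2, 2).items()):
        print("host %d, thread %d ->" % owner, span)
```

Because the data are not embarrassingly parallel, a real run still needs the message exchange and critical regions listed on the slide wherever one range reads values owned by another.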
  • 36. Results (* = enabled, - = disabled)
        OMP  MPI  OPT  seconds
         *    *    *       133
         *    *    -       400
         *    -    *       186
         *    -    -       487
         -    *    *       200
         -    *    -       792
         -    -    *       246
         -    -    -      1062
        Total Speed Increase: 87.52%
  • 37. Actual Results (* = enabled, - = disabled)
        OMP  MPI  seconds
         *    *        59
         *    -       129
         -    *       174
         -    -       249

        Function Name     Normal   OpenMP   MPI      OpenMP+MPI
        calc_intmudua     24.5 s   4.7 s    14.4 s   2.8 s
        calc_hdmg_tet     16.9 s   3.0 s    10.8 s   1.7 s
        calc_mudua        12.1 s   1.9 s     7.0 s   1.1 s
        campo_effettivo   17.7 s   4.5 s     9.9 s   2.3 s
  • 38. Actual Results • OpenMP: 6-8x • MPI: 2x • OpenMP + MPI: 14-16x • Total Raw Speed Increment: 76%
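The 76% total on slide 38 appears to correspond to the reduction in total run time between the plain run and the OpenMP+MPI run on slide 37. The short Python sketch below recomputes it from the four totals on that slide; the function names are hypothetical, and reading the figure as a fractional time reduction is an interpretation rather than something the slides state.

```python
# Total run times in seconds from the "Actual Results" slide,
# keyed by (OpenMP enabled, MPI enabled).
TOTAL_SECONDS = {
    (True, True): 59,
    (True, False): 129,
    (False, True): 174,
    (False, False): 249,
}


def speedup(baseline, improved):
    """How many times faster the improved run is than the baseline."""
    return baseline / improved


def time_reduction(baseline, improved):
    """Fraction of the baseline run time that was saved."""
    return (baseline - improved) / baseline


if __name__ == "__main__":
    baseline = TOTAL_SECONDS[(False, False)]
    for (omp, mpi), seconds in sorted(TOTAL_SECONDS.items(), reverse=True):
        print(
            "OpenMP=%s MPI=%s: %.2fx faster, %.1f%% less time"
            % (omp, mpi, speedup(baseline, seconds),
               100 * time_reduction(baseline, seconds))
        )
```

Speedup and time reduction describe the same measurement in two ways: a run that takes 59 s instead of 249 s is a little over four times faster, which is the same as saving roughly three quarters of the time.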
  • 40. Conclusions and Future Work • Computational time has been significantly decreased • Speedup is consistent with expected results • Submitted to COMPUMAG '09 • Continue inserting OpenMP and MPI directives • Perform algorithm optimizations • Increase cluster size