V. Milutinović, G. Rakocevic, S. Stojanović, and Z. Sustran
                       University of Belgrade
                        Oskar Mencer
                      Imperial College, London
                          Oliver Pell
             Maxeler Technologies, London and Palo Alto
                        Michael Flynn
                   Stanford University, Palo Alto
                       Valentina E. Balas
                   Aurel Vlaicu University of Arad
                                                              1/52
For Big Data algorithms
  and for the same hardware price as before,
  achieving:
   a) speed-up, 20-200
   b) monthly electricity bills, reduced 20 times
   c) size, 20 times smaller


                                               2/52
Absolutely all results achieved with:

  a) all hardware produced in Europe,
     specifically UK
  b) all software generated by programmers
     of EU and WB


                                             3/52
ControlFlow (MultiFlow and ManyFlow):
  Top500 ranks using Linpack (Japanese K, ...)


DataFlow:
  Coarse Grain (HEP) vs. Fine Grain (Maxeler)



                                               4/52
Compiling below the machine code level brings speedups;
also lower power consumption, smaller size, and lower cost.

The price to pay:
The machine is more difficult to program.

Consequently:
Ideal for WORM applications :)

Examples using Maxeler:
GeoPhysics (20-40), Banking (200-1000, with JP Morgan 20%),
  M&C (New York City), Datamining (Google), 

                                                              5/52
6
7/52
8/52
9
10
$t_{CPU} = N \cdot N_{OPS} \cdot C_{CPU} \cdot T_{clkCPU} / N_{coresCPU}$
$t_{GPU} = N \cdot N_{OPS} \cdot C_{GPU} \cdot T_{clkGPU} / N_{coresGPU}$
$t_{DF} = N_{OPS} \cdot C_{DF} \cdot T_{clkDF} + (N - 1) \cdot T_{clkDF} / N_{DF}$




  Assumptions:
   1. Software includes enough parallelism to keep all cores busy
   2. The only limiting factor is the number of cores.                       11/52
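The three execution-time formulas on this slide can be tried out numerically. The following is a minimal sketch under the slide's two assumptions; the function names and every numeric value (clock rates, core counts, cycles per operation, problem size) are hypothetical and chosen only for illustration, not taken from any measurement.

```python
# Sketch of the slide's execution-time model.
# All names and numbers below are illustrative assumptions.

def t_control_flow(n, n_ops, cycles_per_op, t_clk, n_cores):
    """tCPU or tGPU: N * NOPS * C * Tclk / Ncores."""
    return n * n_ops * cycles_per_op * t_clk / n_cores


def t_data_flow(n, n_ops, c_df, t_clk_df, n_df):
    """tDF: NOPS * CDF * TclkDF + (N - 1) * TclkDF / NDF."""
    return n_ops * c_df * t_clk_df + (n - 1) * t_clk_df / n_df


# Hypothetical workload: N data items, NOPS operations per item.
n, n_ops = 10**9, 100

t_cpu = t_control_flow(n, n_ops, cycles_per_op=1, t_clk=1 / 3.0e9, n_cores=8)
t_gpu = t_control_flow(n, n_ops, cycles_per_op=1, t_clk=1 / 1.0e9, n_cores=512)
t_df = t_data_flow(n, n_ops, c_df=1, t_clk_df=1 / 200.0e6, n_df=8)

print(f"tCPU = {t_cpu:.3f} s, tGPU = {t_gpu:.3f} s, tDF = {t_df:.3f} s")
```

In this model the dataflow time has a fixed pipeline-fill term (NOPS * CDF * TclkDF) plus a streaming term that no longer depends on NOPS, which is why the dataflow advantage grows with the number of operations applied to each data item.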
DualCore? Which way are the horses going?




                                       12/52
Is it possible
 to use 2000 chickens instead of two horses?




                            ?
                           ==




Which is better, in reality and anecdotally?

                                              13/52
2 x 1000 chickens (CUDA and rCUDA)
                                     14/52
Data
How about 2 000 000 ants?
                           15/52
Big Data Input   Results



                           Marmalade




                                       16/52
Factor: 20 to 200


      MultiCore/ManyCore        Dataflow




       Machine Level Code




                            Gate Transfer Level
                                                  17/52
Factor: 20

      MultiCore/ManyCore   Dataflow




                                      18/52
Factor: 20


      MultiCore/ManyCore      DataFlow


          Data Processing



                            Data Processing
          Process Control



                            Process Control


                                              19/52
 MultiCore:
   Explain what to do, to the driver
   Caches, instruction buffers, and predictors needed


 ManyCore:
   Explain what to do, to many sub-drivers
   Reduced caches and instruction buffers needed


 DataFlow:
   Make a field of processing gates: 1C+2nJava+3Java
   No caches, etc. (300 students/year: BGD, BCN, LjU, ICL, ...)


                                                                20/52
MultiCore:
   Business as usual


ManyCore:
   More difficult


DataFlow:
   Much more difficult
   Debugging both application and configuration code



                                                         21/52
 MultiCore/ManyCore:
   Several minutes


 DataFlow:
   Several hours for the real hardware
   Fortunately, only several minutes for the simulator
   The simulator supports
    both the large JPMorgan machine
    as well as the smallest “University Support” machine

 Good news:
   Tabula@2GHz


                                                           22/52
23/52
MultiCore:
   Horse stable


ManyCore:
   Chicken house


DataFlow:
   Ant hole




                    24/52
MultiCore:
   Haystack


ManyCore:
   Cornbits


DataFlow:
   Crumbs




               25/52
Small Data: Toy Benchmarks (e.g., Linpack)

                                             26/52
Medium Data
(benchmarks favoring NVidia
compared to Intel, ...)




                       27/52
Big Data




           28/52
 Revisiting the Top 500 SuperComputer Benchmarks
   Our paper in Communications of the ACM
 Revisiting all major Big Data DM algorithms
   Massive static parallelism at low clock frequencies
 Concurrency and communication
   Concurrency between millions of tiny cores difficult,
     “jitter” between cores will harm performance
     at synchronization points
 Reliability and fault tolerance
   10-100x fewer nodes, failures much less often
 Memory bandwidth and FLOP/byte ratio
   Optimize data choreography, data movement,
     and the algorithmic computation

                                                            29/52
Maxeler Hardware


  CPUs plus DFEs:
    Intel Xeon CPU cores and up to 4 DFEs with 192GB of RAM
  DFEs shared over Infiniband:
    Up to 8 DFEs with 384GB of RAM and dynamic allocation of DFEs to CPU servers
  Low latency connectivity:
    Intel Xeon CPUs and 1-2 DFEs with up to six 10Gbit Ethernet connections
  MaxWorkstation:
    Desktop development system
  MaxCloud:
    On-demand scalable accelerated compute resource, hosted in London

30/52
Major Classes of Algorithms,
        from the Computational Perspective

        1. Coarse grained, stateful: Business
          – CPU requires DFE for minutes or hours
        2. Fine grained, transactional with shared database: DM
          – CPU utilizes DFE for ms to s
          – Many short computations, accessing common database data
        3. Fine grained, stateless transactional: Science (FF)
          – CPU requires DFE for ms to s
          – Many short computations



31/52
Coarse Grained: Modeling
 ‱ Long runtime, but:
 ‱ Memory requirements change dramatically based on modelled frequency
 ‱ Number of DFEs allocated to a CPU process can be easily varied to increase available memory
 ‱ Streaming compression
 ‱ Boundary data exchanged over chassis MaxRing

 [Figure: timesteps (thousand), domain points (billion), and total computed points (trillion) vs. peak frequency (Hz), 0 to 80]
 [Figure: equivalent CPU cores (0 to 2,000) vs. number of MAX2 cards (1, 4, 8), for 15Hz, 30Hz, 45Hz, and 70Hz peak frequency]

32/52
Fine Grained, Shared Data: Monitoring
        ‱ DFE DRAM contains the database to be searched
        ‱ CPUs issue transactions find(x, db)
        ‱ Complex search function
           – Text search against documents
           – Shortest distance to coordinate (multi-dimensional)
           – Smith Waterman sequence alignment for genomes
        ‱ Any CPU runs on any DFE
          that has been loaded with the database
           – MaxelerOS may add or remove DFEs
             from the processing group to balance system demands
           – New DFEs must be loaded with the search DB before use
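
The transaction model on this slide can be illustrated with a small software stand-in. This is only a sketch: the `Dfe` class, `run_transactions`, and the round-robin dispatch are hypothetical names and choices, not MaxelerOS APIs, and the search function shown is the "shortest distance to coordinate" case in plain Python.

```python
import math
from itertools import cycle


class Dfe:
    """Hypothetical stand-in for one DFE whose DRAM holds the search database."""

    def __init__(self, name):
        self.name = name
        self.db = None  # None until the search DB has been loaded

    def load(self, db):
        self.db = list(db)

    def find(self, x):
        """find(x, db): return the stored point closest to x."""
        if self.db is None:
            raise RuntimeError(f"{self.name}: search DB must be loaded before use")
        return min(self.db, key=lambda p: math.dist(p, x))


def run_transactions(queries, group):
    """Dispatch each query to any DFE in the group that has the DB loaded."""
    ready = [dfe for dfe in group if dfe.db is not None]
    if not ready:
        raise RuntimeError("no DFE in the group has the database loaded")
    # Round robin is an assumption; any loaded DFE gives the same answer.
    return [dfe.find(x) for x, dfe in zip(queries, cycle(ready))]


db = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
group = [Dfe("dfe0"), Dfe("dfe1")]
group[0].load(db)  # dfe1 was just added to the group and is not loaded yet
results = run_transactions([(0.9, 1.2), (4.0, 4.0)], group)
print(results)
```

Because every loaded DFE holds the same database, the result of a transaction does not depend on which DFE serves it, which is what allows DFEs to be added to or removed from the group freely.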

33/52
Fine Grained, Stateless: The BSOP Control
        ‱ Analyse > 1,000,000 scenarios
        ‱ Many CPU processes run on many DFEs
           – Each transaction executes on any DFE in the assigned group atomically
        ‱ ~50x MPC-X vs. multi-core x86 node
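
The stages named on this slide (random number generation and sampling of underliers, a loop over instruments priced with Black Scholes, then tail analysis on the CPU) can be sketched in plain Python. This is an illustrative sketch only: the single-underlier lognormal scenario model, the European call instruments, and all parameter values are assumptions made for the example, not the production BSOP implementation.

```python
import math
import random


def norm_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def bs_call(spot, strike, rate, vol, t):
    """Black-Scholes price of a European call option."""
    d1 = (math.log(spot / strike) + (rate + 0.5 * vol * vol) * t) / (vol * math.sqrt(t))
    d2 = d1 - vol * math.sqrt(t)
    return spot * norm_cdf(d1) - strike * math.exp(-rate * t) * norm_cdf(d2)


def scenario_values(instruments, spot, rate, vol, horizon, n_scenarios, seed=0):
    """Portfolio value per scenario; this part would run on the DFEs."""
    rng = random.Random(seed)
    values = []
    for _ in range(n_scenarios):
        # Random number generator and sampling of the underlier (assumed lognormal).
        z = rng.gauss(0.0, 1.0)
        s = spot * math.exp((rate - 0.5 * vol * vol) * horizon + vol * math.sqrt(horizon) * z)
        # Loop over instruments, each given as (strike, time to expiry).
        values.append(sum(bs_call(s, k, rate, vol, t) for k, t in instruments))
    return values


def tail_quantile(values, q):
    """Tail analysis on the CPU: the q-quantile of the scenario values."""
    ordered = sorted(values)
    return ordered[int(q * (len(ordered) - 1))]


portfolio = [(95.0, 0.5), (100.0, 1.0), (110.0, 2.0)]  # hypothetical instruments
values = scenario_values(portfolio, spot=100.0, rate=0.05, vol=0.2,
                         horizon=10 / 252, n_scenarios=10_000)
print("1% tail value:", tail_quantile(values, 0.01))
```

Each scenario is independent of the others, so scenarios can be distributed over any DFE in the assigned group; only the per-scenario values need to return to the CPU for the tail analysis.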


 [Figure: several CPU processes send market and instruments data to several DFEs; each DFE loops over instruments, runs a random number generator and sampling of underliers, and prices instruments using Black Scholes; instrument values return to the CPUs for tail analysis on CPU]



34/52
Selected Examples



35/52
36/52
The CRS Results
        ‱ Performance of one MAX2 card vs. 1 CPU core
           – Land case (8 params), speedup of 230x
           – Marine case (6 params), speedup of 190x
        [Figure: CPU Coherency vs. MAX2 Coherency]




37/52
Seismic Imaging




  ‱ Running on MaxNode servers
    - 8 parallel compute pipelines per chip
        - 150MHz => low power consumption!
        - 30x faster than microprocessors
  An Implementation of the Acoustic Wave Equation on FPGAs
  T. Nemeth†, J. Stefani†, W. Liu†, R. Dimond‡, O. Pell‡, R. Ergas§
  †Chevron, ‡Maxeler, §Formerly Chevron, SEG 2008
38/52
39
P. Marchetti et al, 2010


        Trace Stacking: Speed-up 217
        ‱ DM for Monitoring and Control in Seismic processing
        ‱ Velocity independent / data driven method
          to obtain a stack of traces, based on 8 parameters
            – Search for every sample of each output trace
            $$t_{hyp}^2 = \left( t_0 + \frac{2}{v_0} w^T m \right)^2 + \frac{2 t_0}{v_0} \left( m^T H_{zy} K_N H_{zy}^T m + h^T H_{zy} K_{NIP} H_{zy}^T h \right)$$

               2 parameters ( emergence angle & azimuth )
               3 Normal Wave front parameters ( KN,11; KN,12 ; KN22 )

               3 NIP Wave front parameters ( KNip,11; KNip,12 ; KNip22 )
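
The hyperbolic traveltime searched for each output sample can be evaluated as in the sketch below. It reads the slide's formula as t_hyp^2 = (t0 + (2/v0) w^T m)^2 + (2 t0/v0) (m^T H K_N H^T m + h^T H K_NIP H^T h), with H standing for H_zy. The reading of m as midpoint displacement and h as half-offset, the 2x2 sizes, and all sample values are assumptions made for illustration.

```python
import math


def mat_vec(a, v):
    """Product of a 2x2 matrix (tuple of rows) and a 2-vector."""
    return (a[0][0] * v[0] + a[0][1] * v[1],
            a[1][0] * v[0] + a[1][1] * v[1])


def dot(u, v):
    return u[0] * v[0] + u[1] * v[1]


def quad_form(x, h_zy, k):
    """x^T H K H^T x, computed as y^T K y with y = H^T x."""
    h_t = ((h_zy[0][0], h_zy[1][0]), (h_zy[0][1], h_zy[1][1]))
    y = mat_vec(h_t, x)
    return dot(y, mat_vec(k, y))


def t_hyp(t0, v0, w, m, h, h_zy, k_n, k_nip):
    """Hyperbolic traveltime for zero-offset time t0 and near-surface velocity v0."""
    linear = t0 + 2.0 / v0 * dot(w, m)
    curvature = quad_form(m, h_zy, k_n) + quad_form(h, h_zy, k_nip)
    return math.sqrt(linear * linear + 2.0 * t0 / v0 * curvature)


IDENTITY = ((1.0, 0.0), (0.0, 1.0))
ZERO = ((0.0, 0.0), (0.0, 0.0))

# Hypothetical parameter values, only to show the call.
t = t_hyp(t0=1.0, v0=2000.0, w=(0.1, 0.0), m=(50.0, 0.0), h=(200.0, 0.0),
          h_zy=IDENTITY, k_n=((1e-4, 0.0), (0.0, 1e-4)),
          k_nip=((5e-4, 0.0), (0.0, 5e-4)))
print(f"t_hyp = {t:.4f} s")
```

The three independent entries of each symmetric curvature matrix plus the two angles that determine w give the 8 parameters over which the search runs.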

41/52
42
43
44
Conclusion: Nota Bene
        This is about algorithmic changes,
                     to maximize
        the algorithm to architecture match:
                Data choreography,
                process modifications,
                        and
                 decision precision.

             The winning paradigm
               of Big Data ExaScale?


45/52
The TriPeak



                Siena
              + BSC
              + Imperial College
              + Maxeler
              + Belgrade




                                   46/52
The TriPeak
MontBlanc = A ManyCore (NVidia) + a MultiCore (ARM)
Maxeler = A FineGrain DataFlow (FPGA)

How about a happy marriage?
MontBlanc (ompSS) and Maxeler (an accelerator)

In each happy marriage,
it is known who does what :)

The Big Data DM algorithms:
What part goes to MontBlanc and what to Maxeler?


                                                   47/52
Core of the Symbiotic Success
An intelligent DM algorithmic scheduler,
partially implemented for compile time,
and partially for run time.

At compile time:
Checking what part of code fits where
(MontBlanc or Maxeler): LoC 1M vs 2K vs 20K

At run time:
Rechecking the compile time decision,
based on the current data values.
                                           48/52
Maxeler: Teaching (Google: prof vm)
VLSI, PowerPoints, Maxeler: TEACHING,

Maxeler Veljko Explanations, August 2012
Maxeler Veljko Anegdotic,
Maxeler Oskar Talk, August 2012
Maxeler Forbes Article
Flyer by JP Morgan
Flyer by Maxeler HPC
Tutorial Slides by Sasha and Veljko: Practice (Current Update)
Paper, unconditionally accepted for Advances in Computers by Elsevier
Paper, unconditionally accepted for Communications of the ACM
Tutorial Slides by Oskar: Theory (7 parts)
Slides by Jacob, New York
Slides by Jacob, Alabama
Slides by Sasha: Practice (Current Update)
Maxeler in Meteorology
Maxeler in Mathematics
Examples generated in Belgrade and Worldwide

THE COURSE ALSO INCLUDES DARPA METHODOLOGY FOR MICROPROCESSOR DESIGN,
   with an example

                                                                        49/52
Maxeler: Research (Google: good method)
Structure of a Typical Research Paper: Scenario #1
[Comparison of Platforms for One Algorithm]
Curve A: MultiCore of approximately the same PurchasePrice
Curve B: ManyCore of approximately the same PurchasePrice
Curve C: Maxeler after a direct algorithm migration
Curve D: Maxeler after algorithmic improvements
Curve E: Maxeler after data choreography
Curve F: Maxeler after precision modifications

Structure of a Typical Research Paper: Scenario #2
[Ranking of Algorithms for One Application]
CurveSet A: Comparison of Algorithms on a MultiCore
CurveSet B: Comparison of Algorithms on a ManyCore
CurveSet C: Comparison on Maxeler, after a direct algorithm migration
CurveSet D: Comparison on Maxeler, after algorithmic improvements
CurveSet E: Comparison on Maxeler, after data choreography
CurveSet F: Comparison on Maxeler, after precision modifications
                                                                        50/52
Maxeler: Topics (Google: HiPeac Berlin)

 SRB (TR):
 KG: Blood Flow
 NS: Combinatorial Math
 BG1: MiSANU Math
 BG2: Meteos Meteorology
 BG3: Physics (Gross Pitaevskii 3D real)
 BG4: Physics (Gross Pitaevskii 3D imaginary)
      (reusability with MPI/OpenMP vs effort to accelerate)

 FP7 (Call 11):
 University of Siena, Italy,
 ICL, UK,
 BSC, Spain,
 QPLAN, Greece,
 ETF, Serbia,
 IJS, Slovenia, 

                                                              51/52
Q&A




balas@drbalas.ro    52/52

 
40 Powers of 10 - Simulating the Universe with the DiRAC HPC Facility
40 Powers of 10 - Simulating the Universe with the DiRAC HPC Facility40 Powers of 10 - Simulating the Universe with the DiRAC HPC Facility
40 Powers of 10 - Simulating the Universe with the DiRAC HPC Facility
 
Tao zhang
Tao zhangTao zhang
Tao zhang
 
M series technical presentation-part 1
M series technical presentation-part 1M series technical presentation-part 1
M series technical presentation-part 1
 
XMC4000 Brochure | Infineon Technologies
XMC4000 Brochure | Infineon TechnologiesXMC4000 Brochure | Infineon Technologies
XMC4000 Brochure | Infineon Technologies
 
HPC Infrastructure To Solve The CFD Grand Challenge
HPC Infrastructure To Solve The CFD Grand ChallengeHPC Infrastructure To Solve The CFD Grand Challenge
HPC Infrastructure To Solve The CFD Grand Challenge
 
Designing and deploying converged storage area networks final
Designing and deploying converged storage area networks finalDesigning and deploying converged storage area networks final
Designing and deploying converged storage area networks final
 

KĂŒrzlich hochgeladen

Z Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot GraphZ Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot GraphThiyagu K
 
Measures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SDMeasures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SDThiyagu K
 
Interactive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communicationInteractive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communicationnomboosow
 
Web & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdfWeb & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdfJayanti Pande
 
Beyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactBeyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactPECB
 
social pharmacy d-pharm 1st year by Pragati K. Mahajan
social pharmacy d-pharm 1st year by Pragati K. Mahajansocial pharmacy d-pharm 1st year by Pragati K. Mahajan
social pharmacy d-pharm 1st year by Pragati K. Mahajanpragatimahajan3
 
microwave assisted reaction. General introduction
microwave assisted reaction. General introductionmicrowave assisted reaction. General introduction
microwave assisted reaction. General introductionMaksud Ahmed
 
Disha NEET Physics Guide for classes 11 and 12.pdf
Disha NEET Physics Guide for classes 11 and 12.pdfDisha NEET Physics Guide for classes 11 and 12.pdf
Disha NEET Physics Guide for classes 11 and 12.pdfchloefrazer622
 
JAPAN: ORGANISATION OF PMDA, PHARMACEUTICAL LAWS & REGULATIONS, TYPES OF REGI...
JAPAN: ORGANISATION OF PMDA, PHARMACEUTICAL LAWS & REGULATIONS, TYPES OF REGI...JAPAN: ORGANISATION OF PMDA, PHARMACEUTICAL LAWS & REGULATIONS, TYPES OF REGI...
JAPAN: ORGANISATION OF PMDA, PHARMACEUTICAL LAWS & REGULATIONS, TYPES OF REGI...anjaliyadav012327
 
Advance Mobile Application Development class 07
Advance Mobile Application Development class 07Advance Mobile Application Development class 07
Advance Mobile Application Development class 07Dr. Mazin Mohamed alkathiri
 
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptxPOINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptxSayali Powar
 
Activity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfActivity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfciinovamais
 
Arihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdfArihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdfchloefrazer622
 
Russian Call Girls in Andheri Airport Mumbai WhatsApp 9167673311 💞 Full Nigh...
Russian Call Girls in Andheri Airport Mumbai WhatsApp  9167673311 💞 Full Nigh...Russian Call Girls in Andheri Airport Mumbai WhatsApp  9167673311 💞 Full Nigh...
Russian Call Girls in Andheri Airport Mumbai WhatsApp 9167673311 💞 Full Nigh...Pooja Nehwal
 
Sanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdfSanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdfsanyamsingh5019
 
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...fonyou31
 
Organic Name Reactions for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions  for the students and aspirants of Chemistry12th.pptxOrganic Name Reactions  for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions for the students and aspirants of Chemistry12th.pptxVS Mahajan Coaching Centre
 
A Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformA Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformChameera Dedduwage
 
CARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptxCARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptxGaneshChakor2
 
Grant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingGrant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingTechSoup
 

KĂŒrzlich hochgeladen (20)

Z Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot GraphZ Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot Graph
 
Measures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SDMeasures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SD
 
Interactive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communicationInteractive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communication
 
Web & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdfWeb & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdf
 
Beyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactBeyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global Impact
 
social pharmacy d-pharm 1st year by Pragati K. Mahajan
social pharmacy d-pharm 1st year by Pragati K. Mahajansocial pharmacy d-pharm 1st year by Pragati K. Mahajan
social pharmacy d-pharm 1st year by Pragati K. Mahajan
 
microwave assisted reaction. General introduction
microwave assisted reaction. General introductionmicrowave assisted reaction. General introduction
microwave assisted reaction. General introduction
 
Disha NEET Physics Guide for classes 11 and 12.pdf
Disha NEET Physics Guide for classes 11 and 12.pdfDisha NEET Physics Guide for classes 11 and 12.pdf
Disha NEET Physics Guide for classes 11 and 12.pdf
 
JAPAN: ORGANISATION OF PMDA, PHARMACEUTICAL LAWS & REGULATIONS, TYPES OF REGI...
JAPAN: ORGANISATION OF PMDA, PHARMACEUTICAL LAWS & REGULATIONS, TYPES OF REGI...JAPAN: ORGANISATION OF PMDA, PHARMACEUTICAL LAWS & REGULATIONS, TYPES OF REGI...
JAPAN: ORGANISATION OF PMDA, PHARMACEUTICAL LAWS & REGULATIONS, TYPES OF REGI...
 
Advance Mobile Application Development class 07
Advance Mobile Application Development class 07Advance Mobile Application Development class 07
Advance Mobile Application Development class 07
 
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptxPOINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
 
Activity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfActivity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdf
 
Arihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdfArihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdf
 
Russian Call Girls in Andheri Airport Mumbai WhatsApp 9167673311 💞 Full Nigh...
Russian Call Girls in Andheri Airport Mumbai WhatsApp  9167673311 💞 Full Nigh...Russian Call Girls in Andheri Airport Mumbai WhatsApp  9167673311 💞 Full Nigh...
Russian Call Girls in Andheri Airport Mumbai WhatsApp 9167673311 💞 Full Nigh...
 
Sanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdfSanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdf
 
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...
 
Organic Name Reactions for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions  for the students and aspirants of Chemistry12th.pptxOrganic Name Reactions  for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions for the students and aspirants of Chemistry12th.pptx
 
A Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformA Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy Reform
 
CARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptxCARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptx
 
Grant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingGrant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy Consulting
 

Data flow super computing valentina balas

  • 1. V. Milutinović, G. Rakocevic, S. Stojanović, and Z. Sustran (University of Belgrade); Oskar Mencer (Imperial College, London); Oliver Pell (Maxeler Technologies, London and Palo Alto); Michael Flynn (Stanford University, Palo Alto); Valentina E. Balas (Aurel Vlaicu University of Arad) 1/52
  • 2. For Big Data algorithms and for the same hardware price as before, achieving: a) speed-up, 20-200 b) monthly electricity bills, reduced 20 times c) size, 20 times smaller 2/52
  • 3. Absolutely all results achieved with: a) all hardware produced in Europe, specifically UK b) all software generated by programmers of EU and WB 3/52
  • 4. ControlFlow (MultiFlow and ManyFlow): Top500 ranks using Linpack (Japanese K, 
). DataFlow: Coarse Grain (HEP) vs. Fine Grain (Maxeler) 4/52
  • 5. Compiling below the machine code level brings speedups, as well as lower power, size, and cost. The price to pay: The machine is more difficult to program. Consequently: Ideal for WORM applications :) Examples using Maxeler: GeoPhysics (20-40), Banking (200-1000, with JP Morgan 20%), M&C (New York City), Datamining (Google), 
 5/52
  • 6. 6
  • 9. 9
  • 10. 10
  • 11. $t_{CPU} = N \cdot N_{OPS} \cdot C_{CPU} \cdot T_{clkCPU} / N_{coresCPU}$; $t_{GPU} = N \cdot N_{OPS} \cdot C_{GPU} \cdot T_{clkGPU} / N_{coresGPU}$; $t_{DF} = N_{OPS} \cdot C_{DF} \cdot T_{clkDF} + (N - 1) \cdot T_{clkDF} / N_{DF}$. Assumptions: 1. Software includes enough parallelism to keep all cores busy. 2. The only limiting factor is the number of cores. 11/52
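
The three timing expressions on slide 11 can be evaluated directly. The sketch below is only an illustration: the slide does not define its symbols, so reading N as the number of data items, NOPS as operations per item, C as cycles per operation, Tclk as the clock period, and NDF as the number of parallel dataflow pipelines is an assumption, and the function names are hypothetical.

```python
def t_controlflow(n, n_ops, cycles_per_op, t_clk, n_cores):
    """Slide 11 form shared by tCPU and tGPU: N * NOPS * C * Tclk / Ncores."""
    return n * n_ops * cycles_per_op * t_clk / n_cores


def t_dataflow(n, n_ops, c_df, t_clk_df, n_df):
    """Slide 11 form of tDF: NOPS * CDF * TclkDF + (N - 1) * TclkDF / NDF.

    The first term is paid once (assumed here to be the pipeline fill time);
    the second term grows with the number of data items.
    """
    return n_ops * c_df * t_clk_df + (n - 1) * t_clk_df / n_df


if __name__ == "__main__":
    # Purely hypothetical parameters, chosen only to exercise the formulas.
    n, n_ops = 1_000_000, 100
    cpu = t_controlflow(n, n_ops, cycles_per_op=1, t_clk=1 / 3e9, n_cores=8)
    df = t_dataflow(n, n_ops, c_df=1, t_clk_df=1 / 150e6, n_df=8)
    print(f"control flow: {cpu:.6f} s, dataflow: {df:.6f} s")
```

In the control-flow expressions the NOPS factor multiplies N, while in the dataflow expression it appears only in the one-time first term, which is why the dataflow time is dominated by (N - 1) * TclkDF / NDF for large N.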
  • 12. DualCore? Which way are the horses going? 12/52
  • 13. Is it possible to use 2000 chickens instead of two horses? What is better, real and anecdotic? 13/52
  • 14. 2 x 1000 chickens (CUDA and rCUDA) 14/52
  • 15. Data: How about 2 000 000 ants? 15/52
  • 16. Big Data Input Results Marmalade 16/52
  • 17. Factor: 20 to 200 MultiCore/ManyCore Dataflow Machine Level Code Gate Transfer Level 17/52
  • 18. Factor: 20 MultiCore/ManyCore Dataflow 18/52
  • 19. Factor: 20 MultiCore/ManyCore DataFlow Data Processing Data Processing Process Control Process Control 19/52
  • 20. MultiCore: Explain what to do, to the driver; caches, instruction buffers, and predictors needed. ManyCore: Explain what to do, to many sub-drivers; reduced caches and instruction buffers needed. DataFlow: Make a field of processing gates: 1C+2nJava+3Java; no caches, etc. (300 students/year: BGD, BCN, LjU, ICL, 
) 20/52
  • 21. MultiCore: Business as usual. ManyCore: More difficult. DataFlow: Much more difficult; debugging both application and configuration code. 21/52
  • 22. MultiCore/ManyCore: Several minutes. DataFlow: Several hours for the real hardware; fortunately, only several minutes for the simulator. The simulator supports both the large JPMorgan machine as well as the smallest “University Support” machine. Good news: Tabula@2GHz 22/52
  • 23. 23/52
  • 24. MultiCore: Horse stable. ManyCore: Chicken house. DataFlow: Ant hole. 24/52
  • 25. MultiCore: Haystack. ManyCore: Cornbits. DataFlow: Crumbs. 25/52
  • 26. Small Data: Toy Benchmarks (e.g., Linpack) 26/52
  • 28. Big Data 28/52
  • 29. Revisiting the Top 500 SuperComputer Benchmarks: Our paper in Communications of the ACM. Revisiting all major Big Data DM algorithms. Massive static parallelism at low clock frequencies. Concurrency and communication: Concurrency between millions of tiny cores difficult, “jitter” between cores will harm performance at synchronization points. Reliability and fault tolerance: 10-100x fewer nodes, failures much less often. Memory bandwidth and FLOP/byte ratio: Optimize data choreography, data movement, and the algorithmic computation. 29/52
  • 30. Maxeler Hardware. CPUs plus DFEs: Intel Xeon CPU cores and up to 4 DFEs with 192GB of RAM. DFEs shared over Infiniband: Up to 8 DFEs with 384GB of RAM and dynamic allocation of DFEs to CPU servers. Low latency connectivity: Intel Xeon CPUs and 1-2 DFEs with up to six 10Gbit Ethernet connections. MaxWorkstation: Desktop development system. MaxCloud: On-demand scalable accelerated compute resource, hosted in London. 30/52
  • 31. Major Classes of Algorithms, from the Computational Perspective 1. Coarse grained, stateful: Business – CPU requires DFE for minutes or hours 2. Fine grained, transactional with shared database: DM – CPU utilizes DFE for ms to s – Many short computations, accessing common database data 3. Fine grained, stateless transactional: Science (FF) – CPU requires DFE for ms to s – Many short computations 31/52
  • 32. Coarse Grained: Modeling ‱ Long runtime, but: ‱ Memory requirements change dramatically based on modelled frequency ‱ Number of DFEs allocated to a CPU process can be easily varied to increase available memory ‱ Streaming compression ‱ Boundary data exchanged over chassis MaxRing. [Figure: timesteps (thousand), domain points (billion), and total computed points (trillion) vs. peak frequency (Hz)] [Figure: curves for 15Hz, 30Hz, 45Hz, and 70Hz peak frequency vs. number of MAX2 cards] 32/52
  • 33. Fine Grained, Shared Data: Monitoring ‱ DFE DRAM contains the database to be searched ‱ CPUs issue transactions find(x, db) ‱ Complex search function – Text search against documents – Shortest distance to coordinate (multi-dimensional) – Smith Waterman sequence alignment for genomes ‱ Any CPU runs on any DFE that has been loaded with the database – MaxelerOS may add or remove DFEs from the processing group to balance system demands – New DFEs must be loaded with the search DB before use 33/52
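
The find(x, db) transaction on slide 33 can be pictured with a small CPU-only sketch of one of the listed search functions, shortest distance to a multi-dimensional coordinate. This is an assumption-laden illustration, not Maxeler code: the signature of `find`, the representation of the database as a list of tuples, and the brute-force scan are all hypothetical stand-ins for a database held in DFE DRAM.

```python
import math


def find(x, db):
    """Return the database point closest to x (Euclidean distance).

    Hypothetical stand-in for the slide's find(x, db): the whole
    database is scanned once per transaction.
    """
    if not db:
        raise ValueError("database is empty")
    return min(db, key=lambda point: math.dist(x, point))


if __name__ == "__main__":
    database = [(1.0, 1.0), (3.0, 4.0), (0.5, 0.0)]
    print(find((0.0, 0.0), database))
```

Each call is short and only reads the shared database, which matches the slide's description of many short transactions against common data.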
  • 34. Fine Grained, Stateless: The BSOP Control ‱ Analyse > 1,000,000 scenarios ‱ Many CPU processes run on many DFEs – Each transaction executes on any DFE in the assigned group atomically ‱ ~50x MPC-X vs. multi-core x86 node. [Figure: CPU processes send market and instruments data to DFEs; each DFE runs a loop over instruments with a random number generator and sampling of underliers, then prices instruments using Black Scholes; instrument values return for tail analysis on CPU] 34/52
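
The stages named in the slide 34 figure (sampling of underliers, pricing instruments using Black Scholes, tail analysis on CPU) can be sketched in plain Python. This is a minimal sketch under stated assumptions, not the implementation behind the slide: the instruments are assumed to be European calls identified by strike, the scenario shock is an assumed lognormal move of a single underlier, and all function names and parameters are hypothetical.

```python
import math
import random


def norm_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def black_scholes_call(spot, strike, rate, vol, maturity):
    """Black-Scholes price of a European call option."""
    sqrt_t = math.sqrt(maturity)
    d1 = (math.log(spot / strike) + (rate + 0.5 * vol * vol) * maturity) / (vol * sqrt_t)
    d2 = d1 - vol * sqrt_t
    return spot * norm_cdf(d1) - strike * math.exp(-rate * maturity) * norm_cdf(d2)


def scenario_values(strikes, spot, rate, vol, maturity, n_scenarios, shock_vol, seed=0):
    """Sample the underlier per scenario, then loop over instruments to price them."""
    rng = random.Random(seed)
    values = []
    for _ in range(n_scenarios):
        shocked_spot = spot * math.exp(rng.gauss(0.0, shock_vol))  # assumed shock model
        values.append(sum(black_scholes_call(shocked_spot, k, rate, vol, maturity)
                          for k in strikes))
    return values


def tail_value(values, quantile):
    """Tail analysis: the portfolio value at the given lower quantile."""
    ordered = sorted(values)
    return ordered[int(quantile * (len(ordered) - 1))]


if __name__ == "__main__":
    vals = scenario_values([90.0, 100.0, 110.0], 100.0, 0.05, 0.2, 1.0,
                           n_scenarios=1000, shock_vol=0.1)
    print("5% tail value:", tail_value(vals, 0.05))
```

Every scenario is independent of the others, which is the property that lets each transaction run on any DFE in the assigned group.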
  • 37. The CRS Results ‱ Performance of one MAX2 card vs. 1 CPU core ‱ Land case (8 params), speedup of 230x ‱ Marine case (6 params), speedup of 190x. [Figure panels: CPU Coherency, MAX2 Coherency] 37/52
  • 38. Seismic Imaging ‱ Running on MaxNode servers - 8 parallel compute pipelines per chip - 150MHz => low power consumption! - 30x faster than microprocessors. An Implementation of the Acoustic Wave Equation on FPGAs, T. Nemeth†, J. Stefani†, W. Liu†, R. Dimond‡, O. Pell‡, R. Ergas§ († Chevron, ‡ Maxeler, § Formerly Chevron), SEG 2008 38/52
  • 39. 39
  • 40.
  • 41. P. Marchetti et al, 2010. Trace Stacking: Speed-up 217 ‱ DM for Monitoring and Control in Seismic processing ‱ Velocity independent / data driven method to obtain a stack of traces, based on 8 parameters – Search for every sample of each output trace: $t_{hyp}^2 = \left(t_0 + \frac{2}{v_0}\mathbf{w}^T\mathbf{m}\right)^2 + \frac{2t_0}{v_0}\left(\mathbf{m}^T H_{zy} K_N H_{zy}^T \mathbf{m} + \mathbf{h}^T H_{zy} K_{NIP} H_{zy}^T \mathbf{h}\right)$. 2 parameters (emergence angle & azimuth); 3 Normal Wave front parameters ($K_{N,11}$; $K_{N,12}$; $K_{N,22}$); 3 NIP Wave front parameters ($K_{NIP,11}$; $K_{NIP,12}$; $K_{NIP,22}$) 41/52
  • 42. 42
  • 43. 43
  • 44. 44
  • 45. Conclusion: Nota Bene This is about algorithmic changes, to maximize the algorithm to architecture match: Data choreography, process modifications, and decision precision. The winning paradigm of Big Data ExaScale? 45/52
  • 46. The TriPeak: Siena + BSC + Imperial College + Maxeler + Belgrade 46/52
  • 47. The TriPeak. MontBlanc = A ManyCore (NVidia) + a MultiCore (ARM). Maxeler = A FineGrain DataFlow (FPGA). How about a happy marriage? MontBlanc (ompSS) and Maxeler (an accelerator). In each happy marriage, it is known who does what :) The Big Data DM algorithms: What part goes to MontBlanc and what to Maxeler? 47/52
  • 48. Core of the Symbiotic Success: An intelligent DM algorithmic scheduler, partially implemented for compile time, and partially for run time. At compile time: Checking what part of code fits where (MontBlanc or Maxeler): LoC 1M vs 2K vs 20K. At run time: Rechecking the compile time decision, based on the current data values. 48/52
  • 49. Maxeler: Teaching (Google: prof vm) VLSI, PowerPoints, Maxeler: TEACHING, Maxeler Veljko Explanations, August 2012 Maxeler Veljko Anegdotic, Maxeler Oskar Talk, August 2012 Maxeler Forbes Article Flyer by JP Morgan Flyer by Maxeler HPC Tutorial Slides by Sasha and Veljko: Practice (Current Update) Paper, unconditionally accepted for Advances in Computers by Elsevier Paper, unconditionally accepted for Communications of the ACM Tutorial Slides by Oskar: Theory (7 parts) Slides by Jacob, New York Slides by Jacob, Alabama Slides by Sasha: Practice (Current Update) Maxeler in Meteorology Maxeler in Mathematics Examples generated in Belgrade and Worldwide THE COURSE ALSO INCLUDES DARPA METHODOLOGY FOR MICROPROCESSOR DESIGN, with an example 49/52
  • 50. Maxeler: Research (Google: good method) Structure of a Typical Research Paper: Scenario #1 [Comparison of Platforms for One Algorithm] Curve A: MultiCore of approximately the same PurchasePrice; Curve B: ManyCore of approximately the same PurchasePrice; Curve C: Maxeler after a direct algorithm migration; Curve D: Maxeler after algorithmic improvements; Curve E: Maxeler after data choreography; Curve F: Maxeler after precision modifications. Structure of a Typical Research Paper: Scenario #2 [Ranking of Algorithms for One Application] CurveSet A: Comparison of Algorithms on a MultiCore; CurveSet B: Comparison of Algorithms on a ManyCore; CurveSet C: Comparison on Maxeler, after a direct algorithm migration; CurveSet D: Comparison on Maxeler, after algorithmic improvements; CurveSet E: Comparison on Maxeler, after data choreography; CurveSet F: Comparison on Maxeler, after precision modifications 50/52
  • 51. Maxeler: Topics (Google: HiPeac Berlin) SRB (TR): KG: Blood Flow NS: Combinatorial Math BG1: MiSANU Math BG2: Meteos Meteorology BG3: Physics (Gross Pitaevskii 3D real) BG4: Physics (Gross Pitaevskii 3D imaginary) (reusability with MPI/OpenMP vs effort to accelerate) FP7 (Call 11): University of Siena, Italy, ICL, UK, BSC, Spain, QPLAN, Greece, ETF, Serbia, IJS, Slovenia, 
 51/52
  • 52. Q&A balas@drbalas.ro 52/52

Editor's notes

  1. Elastic makes things worse