SlideShare ist ein Scribd-Unternehmen logo
1 von 16
Critical Issues at Exascale
 for Algorithm and Software Design
SC12, Salt Lake City, Utah, Nov 2012
    Jack Dongarra, University of Tennessee, Tennessee, USA
Performance Development in Top500
    1E+11
    1E+10
   1 Eflop/s
   1E+09
10000000
 100 Pflop/s
  10 Pflop/s
10000000
 1000000
   1 Pflop/s                 N=1
  100000
 100 Tflop/s

   10000
  10 Tflop/s

   1 1000
     Tflop/s                 N=500
      100
 100 Gflop/s
        10
  10 Gflop/s
         1
   1 Gflop/s
               1996




                      2002




                                     2008




                                            2014




                                                   2020
       0.1
Potential System Architecture

Systems                      2012                 2022            Difference
                          Titan Computer                         Today & 2022
System peak                27 Pflop/s           1 Eflop/s            O(100)

Power                       8.3 MW              ~20 MW
                           (2 Gflops/W)        (50 Gflops/W)

System memory                710 TB            32 - 64 PB             O(10)
                            (38*18688)

Node performance            1,452 GF/s        1.2 or 15TF/s       O(10) – O(100)
                              (1311+141)


Node memory BW              232 GB/s            2 - 4TB/s            O(1000)
                             (52+180)

Node concurrency          16 cores CPU        O(1k) or 10k       O(100) – O(1000)
                           2688 CUDA
                              cores
Total Node Interconnect      8 GB/s           200-400GB/s             O(10)
BW
System size (nodes)          18,688        O(100,000) or O(1M)   O(100) – O(1000)

Total concurrency             50 M              O(billion)          O(1,000)

MTTI                       ?? unknown           O(<1 day)            - O(10)
Potential System Architecture
    with a cap of $200M and 20MW
Systems                      2012                 2022            Difference
                          Titan Computer                         Today & 2022
System peak                27 Pflop/s           1 Eflop/s            O(100)

Power                       8.3 MW              ~20 MW
                           (2 Gflops/W)        (50 Gflops/W)

System memory                710 TB            32 - 64 PB             O(10)
                            (38*18688)

Node performance            1,452 GF/s        1.2 or 15TF/s       O(10) – O(100)
                              (1311+141)


Node memory BW              232 GB/s            2 - 4TB/s            O(1000)
                             (52+180)

Node concurrency          16 cores CPU        O(1k) or 10k       O(100) – O(1000)
                           2688 CUDA
                              cores
Total Node Interconnect      8 GB/s           200-400GB/s             O(10)
BW
System size (nodes)          18,688        O(100,000) or O(1M)   O(100) – O(1000)

Total concurrency             50 M              O(billion)          O(1,000)

MTTI                       ?? unknown           O(<1 day)            - O(10)
Critical Issues at Peta & Exascale for
Algorithm and Software Design
 Synchronization-reducing algorithms
   Break Fork-Join model
 Communication-reducing algorithms
   Use methods which have lower bound on communication
 Mixed precision methods
   2x speed of ops and 2x speed for data movement
 Autotuning
   Today’s machines are too complicated, build “smarts” into
    software to adapt to the hardware
 Fault resilient algorithms
   Implement algorithms that can recover from failures/bit
    flips
 Reproducibility of results
   Today we can’t guarantee this. We understand the issues,
                            5
    but some of our “colleagues” have a hard time with this.
Major Changes to Algorithms/Software
• Must rethink the design of our
 algorithms and software
  Manycore and Hybrid architectures are
   disruptive technology
    Similar to what happened with cluster
     computing and message passing


  Rethink and rewrite the applications,
   algorithms, and software

  Data movement is expensive
  Flops are cheap 6
Fork-Join Parallelization of LU and QR.
Parallelize the update:                             dgemm
    • Easy and done in any reasonable software.
    • This is the 2/3n3 term in the FLOPs count.               -
    • Can be done efficiently with LAPACK+multithreaded BLAS




                       Cores
Synchronization (in LAPACK LU)



    Step 1             Step 2              Step 3             Step 4   ...




                     synchronous processing
    • Fork-join, bulk fork join                                         27

                     bulk synchronous processing


                                     8
Allowing for delayed update, out of order, asynchronous, dataflow execution
PLASMA/MAGMA: Parallel Linear Algebra
 s/w for Multicore/Hybrid Architectures
Objectives
  High utilization of each core
  Scaling to large number of cores
  Synchronization reducing algorithms
Methodology
    Dynamic DAG scheduling (QUARK)
    Explicit parallelism
    Implicit communication
    Fine granularity / block data layout
Arbitrary DAG with dynamic scheduling
                                               Fork-join
                                               parallelism

                               DAG scheduled
                               parallelism
                                                             9
Communication Avoiding QR
     Example
                                                                 R0                  R0   R0R R
                            D0            Domain_Tile_QR
                                                                                            D0

                                                                 R1
                            D1             Domain_Tile_QR                                   D1


                                                                 R2                  R2
                            D2            Domain_Tile_QR                                    D2


                                                                 R3
                            D3             Domain_Tile_QR                                       D3




A. Pothen and P. Raghavan. Distributed orthogonal factorization. In The 3rd
Conference on Hypercube Concurrent Computers and Applications, volume II, Applications,
pages 1610–1620, Pasadena, CA, Jan. 1988. ACM. Penn. State.


      11/20/2012                                                                           10
Communication Avoiding QR
     Example
                                                                 R0                  R0   R0R R
                            D0            Domain_Tile_QR
                                                                                            D0

                                                                 R1
                            D1             Domain_Tile_QR                                   D1


                                                                 R2                  R2
                            D2            Domain_Tile_QR                                    D2


                                                                 R3
                            D3             Domain_Tile_QR                                       D3




A. Pothen and P. Raghavan. Distributed orthogonal factorization. In The 3rd
Conference on Hypercube Concurrent Computers and Applications, volume II, Applications,
pages 1610–1620, Pasadena, CA, Jan. 1988. ACM. Penn. State.


      11/20/2012                                                                           11
Communication Avoiding QR
     Example
                                                                 R0                  R0   R0R R
                            D0            Domain_Tile_QR
                                                                                            D0

                                                                 R1
                            D1             Domain_Tile_QR                                   D1


                                                                 R2                  R2
                            D2            Domain_Tile_QR                                    D2


                                                                 R3
                            D3             Domain_Tile_QR                                       D3




A. Pothen and P. Raghavan. Distributed orthogonal factorization. In The 3rd
Conference on Hypercube Concurrent Computers and Applications, volume II, Applications,
pages 1610–1620, Pasadena, CA, Jan. 1988. ACM. Penn. State.


      11/20/2012                                                                           12
Communication Avoiding QR
     Example
                                                                 R0                  R0   R0R R
                            D0            Domain_Tile_QR
                                                                                            D0

                                                                 R1
                            D1             Domain_Tile_QR                                   D1


                                                                 R2                  R2
                            D2            Domain_Tile_QR                                    D2


                                                                 R3
                            D3             Domain_Tile_QR                                       D3




A. Pothen and P. Raghavan. Distributed orthogonal factorization. In The 3rd
Conference on Hypercube Concurrent Computers and Applications, volume II, Applications,
pages 1610–1620, Pasadena, CA, Jan. 1988. ACM. Penn. State.


      11/20/2012                                                                           13
Communication Avoiding QR
     Example
                                                                 R0                  R0   R0R R
                            D0            Domain_Tile_QR
                                                                                            D0

                                                                 R1
                            D1             Domain_Tile_QR                                   D1


                                                                 R2                  R2
                            D2            Domain_Tile_QR                                    D2


                                                                 R3
                            D3             Domain_Tile_QR                                       D3




A. Pothen and P. Raghavan. Distributed orthogonal factorization. In The 3rd
Conference on Hypercube Concurrent Computers and Applications, volume II, Applications,
pages 1610–1620, Pasadena, CA, Jan. 1988. ACM. Penn. State.


      11/20/2012                                                                           14
PowerPack 2.0




The PowerPack platform consists of software and hardware instrumentation.
                                                                       15
Kirk Cameron, Virginia Tech; http://scape.cs.vt.edu/software/powerpack-2-0/
Power for QR Factorization
                                LAPACK’s QR Factorization
                                Fork-join based


                                MKL’s QR Factorization
                                Fork-join based

                                PLASMA’s Conventional
                                QR Factorization
                                DAG based
                                PLASMA’s Communication
                                Reducing QR Factorization
                                DAG based
dual-socket quad-core Intel Xeon E5462 (Harpertown) processor
@ 2.80GHz (8 cores total) w / MLK BLAS                          16
matrix size is very tall and skinny (mxn is 1,152,000 by 288)

Weitere ähnliche Inhalte

Was ist angesagt?

Report to the NAC
Report to the NACReport to the NAC
Report to the NACLarry Smarr
 
Top500 november 2017
Top500 november 2017Top500 november 2017
Top500 november 2017top500
 
Snapdragon SoC and ARMv7 Architecture
Snapdragon SoC and ARMv7 ArchitectureSnapdragon SoC and ARMv7 Architecture
Snapdragon SoC and ARMv7 ArchitectureSantosh Verma
 
Top500 SC17 invited talk
Top500 SC17 invited talkTop500 SC17 invited talk
Top500 SC17 invited talktop500
 
33C3: Code BROWN in the Air
33C3: Code BROWN in the Air33C3: Code BROWN in the Air
33C3: Code BROWN in the AirPhilippe Lin
 
Run Simulations and Then Become An Inventor (Best Paper Award in CDNLive Taiw...
Run Simulations and Then Become An Inventor (Best Paper Award in CDNLive Taiw...Run Simulations and Then Become An Inventor (Best Paper Award in CDNLive Taiw...
Run Simulations and Then Become An Inventor (Best Paper Award in CDNLive Taiw...Nansen Chen
 
Universal Reconfigurable Processing Platform For Space Rev Voice4
Universal Reconfigurable Processing Platform For Space Rev Voice4Universal Reconfigurable Processing Platform For Space Rev Voice4
Universal Reconfigurable Processing Platform For Space Rev Voice4dseagrave
 
Top500 11/2011 BOF Slides
Top500 11/2011 BOF SlidesTop500 11/2011 BOF Slides
Top500 11/2011 BOF Slidestop500
 
CELC - Архитектура коммутаторов Catalyst 4500
CELC - Архитектура коммутаторов Catalyst 4500CELC - Архитектура коммутаторов Catalyst 4500
CELC - Архитектура коммутаторов Catalyst 4500Cisco Russia
 
Jetson AGX Xavier and the New Era of Autonomous Machines
Jetson AGX Xavier and the New Era of Autonomous MachinesJetson AGX Xavier and the New Era of Autonomous Machines
Jetson AGX Xavier and the New Era of Autonomous MachinesDustin Franklin
 
Webinar: NVIDIA JETSON – A Inteligência Artificial na palma de sua mão
Webinar: NVIDIA JETSON – A Inteligência Artificial na palma de sua mãoWebinar: NVIDIA JETSON – A Inteligência Artificial na palma de sua mão
Webinar: NVIDIA JETSON – A Inteligência Artificial na palma de sua mãoEmbarcados
 
VR-Zone Tech News for the Geeks Oct 2011 Issue 2
VR-Zone Tech News for the Geeks Oct 2011 Issue 2VR-Zone Tech News for the Geeks Oct 2011 Issue 2
VR-Zone Tech News for the Geeks Oct 2011 Issue 2VR-Zone .com
 
Amat p5000 etcher semi star
Amat p5000 etcher   semi starAmat p5000 etcher   semi star
Amat p5000 etcher semi starEmily Tan
 
Do SI Simulation, Also for Carbon Reduction
Do SI Simulation, Also for Carbon ReductionDo SI Simulation, Also for Carbon Reduction
Do SI Simulation, Also for Carbon ReductionNansen Chen
 
Acer aspire 4252_4552_4552g_quanta_zqa_rev_1a_sch
Acer aspire 4252_4552_4552g_quanta_zqa_rev_1a_schAcer aspire 4252_4552_4552g_quanta_zqa_rev_1a_sch
Acer aspire 4252_4552_4552g_quanta_zqa_rev_1a_schjuan carlos pintos
 
Вопросы балансировки трафика
Вопросы балансировки трафикаВопросы балансировки трафика
Вопросы балансировки трафикаSkillFactory
 

Was ist angesagt? (20)

No[1][1]
No[1][1]No[1][1]
No[1][1]
 
Supercomputers and Cloud Games
Supercomputers and Cloud GamesSupercomputers and Cloud Games
Supercomputers and Cloud Games
 
Sponge v2
Sponge v2Sponge v2
Sponge v2
 
Report to the NAC
Report to the NACReport to the NAC
Report to the NAC
 
Top500 november 2017
Top500 november 2017Top500 november 2017
Top500 november 2017
 
Snapdragon SoC and ARMv7 Architecture
Snapdragon SoC and ARMv7 ArchitectureSnapdragon SoC and ARMv7 Architecture
Snapdragon SoC and ARMv7 Architecture
 
Top500 SC17 invited talk
Top500 SC17 invited talkTop500 SC17 invited talk
Top500 SC17 invited talk
 
33C3: Code BROWN in the Air
33C3: Code BROWN in the Air33C3: Code BROWN in the Air
33C3: Code BROWN in the Air
 
Run Simulations and Then Become An Inventor (Best Paper Award in CDNLive Taiw...
Run Simulations and Then Become An Inventor (Best Paper Award in CDNLive Taiw...Run Simulations and Then Become An Inventor (Best Paper Award in CDNLive Taiw...
Run Simulations and Then Become An Inventor (Best Paper Award in CDNLive Taiw...
 
Universal Reconfigurable Processing Platform For Space Rev Voice4
Universal Reconfigurable Processing Platform For Space Rev Voice4Universal Reconfigurable Processing Platform For Space Rev Voice4
Universal Reconfigurable Processing Platform For Space Rev Voice4
 
Top500 11/2011 BOF Slides
Top500 11/2011 BOF SlidesTop500 11/2011 BOF Slides
Top500 11/2011 BOF Slides
 
CELC - Архитектура коммутаторов Catalyst 4500
CELC - Архитектура коммутаторов Catalyst 4500CELC - Архитектура коммутаторов Catalyst 4500
CELC - Архитектура коммутаторов Catalyst 4500
 
Jetson AGX Xavier and the New Era of Autonomous Machines
Jetson AGX Xavier and the New Era of Autonomous MachinesJetson AGX Xavier and the New Era of Autonomous Machines
Jetson AGX Xavier and the New Era of Autonomous Machines
 
Webinar: NVIDIA JETSON – A Inteligência Artificial na palma de sua mão
Webinar: NVIDIA JETSON – A Inteligência Artificial na palma de sua mãoWebinar: NVIDIA JETSON – A Inteligência Artificial na palma de sua mão
Webinar: NVIDIA JETSON – A Inteligência Artificial na palma de sua mão
 
Zynq ultrascale
Zynq ultrascaleZynq ultrascale
Zynq ultrascale
 
VR-Zone Tech News for the Geeks Oct 2011 Issue 2
VR-Zone Tech News for the Geeks Oct 2011 Issue 2VR-Zone Tech News for the Geeks Oct 2011 Issue 2
VR-Zone Tech News for the Geeks Oct 2011 Issue 2
 
Amat p5000 etcher semi star
Amat p5000 etcher   semi starAmat p5000 etcher   semi star
Amat p5000 etcher semi star
 
Do SI Simulation, Also for Carbon Reduction
Do SI Simulation, Also for Carbon ReductionDo SI Simulation, Also for Carbon Reduction
Do SI Simulation, Also for Carbon Reduction
 
Acer aspire 4252_4552_4552g_quanta_zqa_rev_1a_sch
Acer aspire 4252_4552_4552g_quanta_zqa_rev_1a_schAcer aspire 4252_4552_4552g_quanta_zqa_rev_1a_sch
Acer aspire 4252_4552_4552g_quanta_zqa_rev_1a_sch
 
Вопросы балансировки трафика
Вопросы балансировки трафикаВопросы балансировки трафика
Вопросы балансировки трафика
 

Andere mochten auch

L presentation
L presentationL presentation
L presentationj2chris
 
Presentation1
Presentation1Presentation1
Presentation1kerangps
 
Thunderbird Online GMAT & TOEFL Preparation
Thunderbird Online GMAT & TOEFL PreparationThunderbird Online GMAT & TOEFL Preparation
Thunderbird Online GMAT & TOEFL PreparationElyse Meyer
 
Magazine World Games 2013 en Cali Colombia
Magazine World Games 2013 en Cali ColombiaMagazine World Games 2013 en Cali Colombia
Magazine World Games 2013 en Cali ColombiaClara Luz Roldán
 
10 bagay na dapat malaman tungkol sa run
10 bagay na dapat malaman tungkol sa run10 bagay na dapat malaman tungkol sa run
10 bagay na dapat malaman tungkol sa runUlikblogRunner
 
Scheduling Updates nov 2013
Scheduling Updates nov 2013Scheduling Updates nov 2013
Scheduling Updates nov 2013ClubReady Inc
 
Didakticheskie materialy
Didakticheskie materialyDidakticheskie materialy
Didakticheskie materialyAlisha_Rum
 
áLbum nº 1 xadrez
áLbum nº 1   xadrezáLbum nº 1   xadrez
áLbum nº 1 xadrezcepmaio
 
Turma m8 noturno 2011.1
Turma m8    noturno 2011.1Turma m8    noturno 2011.1
Turma m8 noturno 2011.1cepmaio
 
Anziani chi-li-assiste-attach s299103
Anziani chi-li-assiste-attach s299103Anziani chi-li-assiste-attach s299103
Anziani chi-li-assiste-attach s299103Cagliostro Puntodue
 
øStgaard bokgren on_stress
øStgaard bokgren on_stressøStgaard bokgren on_stress
øStgaard bokgren on_stressJens östgaard
 
Supporting business decisions in the technological enterprise
Supporting business decisions in the technological enterpriseSupporting business decisions in the technological enterprise
Supporting business decisions in the technological enterpriseWilliam Hall
 
Dos and don'ts of job interview
Dos and don'ts of job interviewDos and don'ts of job interview
Dos and don'ts of job interviewPedro Daniel Cosme
 
1993 de f a z ok
1993 de f a z ok1993 de f a z ok
1993 de f a z okcepmaio
 

Andere mochten auch (20)

L presentation
L presentationL presentation
L presentation
 
Presentation1
Presentation1Presentation1
Presentation1
 
Thunderbird Online GMAT & TOEFL Preparation
Thunderbird Online GMAT & TOEFL PreparationThunderbird Online GMAT & TOEFL Preparation
Thunderbird Online GMAT & TOEFL Preparation
 
Magazine World Games 2013 en Cali Colombia
Magazine World Games 2013 en Cali ColombiaMagazine World Games 2013 en Cali Colombia
Magazine World Games 2013 en Cali Colombia
 
10 bagay na dapat malaman tungkol sa run
10 bagay na dapat malaman tungkol sa run10 bagay na dapat malaman tungkol sa run
10 bagay na dapat malaman tungkol sa run
 
Scheduling Updates nov 2013
Scheduling Updates nov 2013Scheduling Updates nov 2013
Scheduling Updates nov 2013
 
Transferencia
TransferenciaTransferencia
Transferencia
 
Didakticheskie materialy
Didakticheskie materialyDidakticheskie materialy
Didakticheskie materialy
 
김민경
김민경김민경
김민경
 
Class 1. ss
Class 1. ssClass 1. ss
Class 1. ss
 
áLbum nº 1 xadrez
áLbum nº 1   xadrezáLbum nº 1   xadrez
áLbum nº 1 xadrez
 
Turma m8 noturno 2011.1
Turma m8    noturno 2011.1Turma m8    noturno 2011.1
Turma m8 noturno 2011.1
 
Automatic enrolment
Automatic enrolmentAutomatic enrolment
Automatic enrolment
 
Anziani chi-li-assiste-attach s299103
Anziani chi-li-assiste-attach s299103Anziani chi-li-assiste-attach s299103
Anziani chi-li-assiste-attach s299103
 
øStgaard bokgren on_stress
øStgaard bokgren on_stressøStgaard bokgren on_stress
øStgaard bokgren on_stress
 
Pilttekstid
PilttekstidPilttekstid
Pilttekstid
 
Supporting business decisions in the technological enterprise
Supporting business decisions in the technological enterpriseSupporting business decisions in the technological enterprise
Supporting business decisions in the technological enterprise
 
Canjs
CanjsCanjs
Canjs
 
Dos and don'ts of job interview
Dos and don'ts of job interviewDos and don'ts of job interview
Dos and don'ts of job interview
 
1993 de f a z ok
1993 de f a z ok1993 de f a z ok
1993 de f a z ok
 

Ähnlich wie Critical Issues at Exascale for Algorithm and Software Design

POLYTEDA PowerDRC/LVS overview
POLYTEDA PowerDRC/LVS overviewPOLYTEDA PowerDRC/LVS overview
POLYTEDA PowerDRC/LVS overviewAlexander Grudanov
 
LEGaTO: Software Stack Runtimes
LEGaTO: Software Stack RuntimesLEGaTO: Software Stack Runtimes
LEGaTO: Software Stack RuntimesLEGATO project
 
New idc architecture
New idc architectureNew idc architecture
New idc architectureMason Mei
 
MARC ONERA Toulouse2012 Altreonic
MARC ONERA Toulouse2012 AltreonicMARC ONERA Toulouse2012 Altreonic
MARC ONERA Toulouse2012 AltreonicEric Verhulst
 
Numascale Product IBM
Numascale Product IBMNumascale Product IBM
Numascale Product IBMIBM Danmark
 
Optical Switching in the Datacenter
Optical Switching in the DatacenterOptical Switching in the Datacenter
Optical Switching in the DatacenterKostas Katrinis
 
DPDK Summit - 08 Sept 2014 - NTT - High Performance vSwitch
DPDK Summit - 08 Sept 2014 - NTT - High Performance vSwitchDPDK Summit - 08 Sept 2014 - NTT - High Performance vSwitch
DPDK Summit - 08 Sept 2014 - NTT - High Performance vSwitchJim St. Leger
 
DPDK summit 2015: It's kind of fun to do the impossible with DPDK
DPDK summit 2015: It's kind of fun  to do the impossible with DPDKDPDK summit 2015: It's kind of fun  to do the impossible with DPDK
DPDK summit 2015: It's kind of fun to do the impossible with DPDKLagopus SDN/OpenFlow switch
 
DPDK Summit 2015 - NTT - Yoshihiro Nakajima
DPDK Summit 2015 - NTT - Yoshihiro NakajimaDPDK Summit 2015 - NTT - Yoshihiro Nakajima
DPDK Summit 2015 - NTT - Yoshihiro NakajimaJim St. Leger
 
Data flow super computing valentina balas
Data flow super computing   valentina balasData flow super computing   valentina balas
Data flow super computing valentina balasValentina Emilia Balas
 
Final presentation [dissertation project], 20192 esv0002
Final presentation [dissertation project], 20192 esv0002Final presentation [dissertation project], 20192 esv0002
Final presentation [dissertation project], 20192 esv0002MOHAMMED FURQHAN
 
underground cable fault location using aruino,gsm&gps
underground cable fault location using aruino,gsm&gps underground cable fault location using aruino,gsm&gps
underground cable fault location using aruino,gsm&gps Mohd Sohail
 
SDNs: hot topics, evolution & research opportunities
SDNs: hot topics, evolution & research opportunitiesSDNs: hot topics, evolution & research opportunities
SDNs: hot topics, evolution & research opportunitiesDiego Kreutz
 
XT Best Practices
XT Best PracticesXT Best Practices
XT Best PracticesJeff Larkin
 
Lagopus presentation on 14th Annual ON*VECTOR International Photonics Workshop
Lagopus presentation on 14th Annual ON*VECTOR International Photonics WorkshopLagopus presentation on 14th Annual ON*VECTOR International Photonics Workshop
Lagopus presentation on 14th Annual ON*VECTOR International Photonics WorkshopLagopus SDN/OpenFlow switch
 
40 Powers of 10 - Simulating the Universe with the DiRAC HPC Facility
40 Powers of 10 - Simulating the Universe with the DiRAC HPC Facility40 Powers of 10 - Simulating the Universe with the DiRAC HPC Facility
40 Powers of 10 - Simulating the Universe with the DiRAC HPC Facilityinside-BigData.com
 
Hummingbird - Open Source for Small Satellites - GSAW 2012
Hummingbird - Open Source for Small Satellites - GSAW 2012Hummingbird - Open Source for Small Satellites - GSAW 2012
Hummingbird - Open Source for Small Satellites - GSAW 2012Logica_hummingbird
 
Study and Emulation of 10G-EPON with Triple Play
Study and Emulation of 10G-EPON with Triple PlayStudy and Emulation of 10G-EPON with Triple Play
Study and Emulation of 10G-EPON with Triple PlaySatya Prakash Rout
 

Ähnlich wie Critical Issues at Exascale for Algorithm and Software Design (20)

POLYTEDA PowerDRC/LVS overview
POLYTEDA PowerDRC/LVS overviewPOLYTEDA PowerDRC/LVS overview
POLYTEDA PowerDRC/LVS overview
 
LEGaTO: Software Stack Runtimes
LEGaTO: Software Stack RuntimesLEGaTO: Software Stack Runtimes
LEGaTO: Software Stack Runtimes
 
New idc architecture
New idc architectureNew idc architecture
New idc architecture
 
MARC ONERA Toulouse2012 Altreonic
MARC ONERA Toulouse2012 AltreonicMARC ONERA Toulouse2012 Altreonic
MARC ONERA Toulouse2012 Altreonic
 
Numascale Product IBM
Numascale Product IBMNumascale Product IBM
Numascale Product IBM
 
Optical Switching in the Datacenter
Optical Switching in the DatacenterOptical Switching in the Datacenter
Optical Switching in the Datacenter
 
DPDK Summit - 08 Sept 2014 - NTT - High Performance vSwitch
DPDK Summit - 08 Sept 2014 - NTT - High Performance vSwitchDPDK Summit - 08 Sept 2014 - NTT - High Performance vSwitch
DPDK Summit - 08 Sept 2014 - NTT - High Performance vSwitch
 
DPDK summit 2015: It's kind of fun to do the impossible with DPDK
DPDK summit 2015: It's kind of fun  to do the impossible with DPDKDPDK summit 2015: It's kind of fun  to do the impossible with DPDK
DPDK summit 2015: It's kind of fun to do the impossible with DPDK
 
DPDK Summit 2015 - NTT - Yoshihiro Nakajima
DPDK Summit 2015 - NTT - Yoshihiro NakajimaDPDK Summit 2015 - NTT - Yoshihiro Nakajima
DPDK Summit 2015 - NTT - Yoshihiro Nakajima
 
Data flow super computing valentina balas
Data flow super computing   valentina balasData flow super computing   valentina balas
Data flow super computing valentina balas
 
Final presentation [dissertation project], 20192 esv0002
Final presentation [dissertation project], 20192 esv0002Final presentation [dissertation project], 20192 esv0002
Final presentation [dissertation project], 20192 esv0002
 
Kubernetes-DX-5G-session
Kubernetes-DX-5G-sessionKubernetes-DX-5G-session
Kubernetes-DX-5G-session
 
underground cable fault location using aruino,gsm&gps
underground cable fault location using aruino,gsm&gps underground cable fault location using aruino,gsm&gps
underground cable fault location using aruino,gsm&gps
 
SDNs: hot topics, evolution & research opportunities
SDNs: hot topics, evolution & research opportunitiesSDNs: hot topics, evolution & research opportunities
SDNs: hot topics, evolution & research opportunities
 
0507036
05070360507036
0507036
 
XT Best Practices
XT Best PracticesXT Best Practices
XT Best Practices
 
Lagopus presentation on 14th Annual ON*VECTOR International Photonics Workshop
Lagopus presentation on 14th Annual ON*VECTOR International Photonics WorkshopLagopus presentation on 14th Annual ON*VECTOR International Photonics Workshop
Lagopus presentation on 14th Annual ON*VECTOR International Photonics Workshop
 
40 Powers of 10 - Simulating the Universe with the DiRAC HPC Facility
40 Powers of 10 - Simulating the Universe with the DiRAC HPC Facility40 Powers of 10 - Simulating the Universe with the DiRAC HPC Facility
40 Powers of 10 - Simulating the Universe with the DiRAC HPC Facility
 
Hummingbird - Open Source for Small Satellites - GSAW 2012
Hummingbird - Open Source for Small Satellites - GSAW 2012Hummingbird - Open Source for Small Satellites - GSAW 2012
Hummingbird - Open Source for Small Satellites - GSAW 2012
 
Study and Emulation of 10G-EPON with Triple Play
Study and Emulation of 10G-EPON with Triple PlayStudy and Emulation of 10G-EPON with Triple Play
Study and Emulation of 10G-EPON with Triple Play
 

Kürzlich hochgeladen

04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024The Digital Insurer
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Enterprise Knowledge
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...gurkirankumar98700
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 

Kürzlich hochgeladen (20)

04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 

Critical Issues at Exascale for Algorithm and Software Design

  • 1. Critical Issues at Exascale for Algorithm and Software Design SC12, Salt Lake City, Utah, Nov 2012 Jack Dongarra, University of Tennessee, Tennessee, USA
  • 2. Performance Development in Top500 1E+11 1E+10 1 Eflop/s 1E+09 10000000 100 Pflop/s 10 Pflop/s 10000000 1000000 1 Pflop/s N=1 100000 100 Tflop/s 10000 10 Tflop/s 1 1000 Tflop/s N=500 100 100 Gflop/s 10 10 Gflop/s 1 1 Gflop/s 1996 2002 2008 2014 2020 0.1
  • 3. Potential System Architecture Systems 2012 2022 Difference Titan Computer Today & 2022 System peak 27 Pflop/s 1 Eflop/s O(100) Power 8.3 MW ~20 MW (2 Gflops/W) (50 Gflops/W) System memory 710 TB 32 - 64 PB O(10) (38*18688) Node performance 1,452 GF/s 1.2 or 15TF/s O(10) – O(100) (1311+141) Node memory BW 232 GB/s 2 - 4TB/s O(1000) (52+180) Node concurrency 16 cores CPU O(1k) or 10k O(100) – O(1000) 2688 CUDA cores Total Node Interconnect 8 GB/s 200-400GB/s O(10) BW System size (nodes) 18,688 O(100,000) or O(1M) O(100) – O(1000) Total concurrency 50 M O(billion) O(1,000) MTTI ?? unknown O(<1 day) - O(10)
  • 4. Potential System Architecture with a cap of $200M and 20MW Systems 2012 2022 Difference Titan Computer Today & 2022 System peak 27 Pflop/s 1 Eflop/s O(100) Power 8.3 MW ~20 MW (2 Gflops/W) (50 Gflops/W) System memory 710 TB 32 - 64 PB O(10) (38*18688) Node performance 1,452 GF/s 1.2 or 15TF/s O(10) – O(100) (1311+141) Node memory BW 232 GB/s 2 - 4TB/s O(1000) (52+180) Node concurrency 16 cores CPU O(1k) or 10k O(100) – O(1000) 2688 CUDA cores Total Node Interconnect 8 GB/s 200-400GB/s O(10) BW System size (nodes) 18,688 O(100,000) or O(1M) O(100) – O(1000) Total concurrency 50 M O(billion) O(1,000) MTTI ?? unknown O(<1 day) - O(10)
  • 5. Critical Issues at Peta & Exascale for Algorithm and Software Design Synchronization-reducing algorithms  Break Fork-Join model Communication-reducing algorithms  Use methods which have lower bound on communication Mixed precision methods  2x speed of ops and 2x speed for data movement Autotuning  Today’s machines are too complicated, build “smarts” into software to adapt to the hardware Fault resilient algorithms  Implement algorithms that can recover from failures/bit flips Reproducibility of results  Today we can’t guarantee this. We understand the issues, 5 but some of our “colleagues” have a hard time with this.
  • 6. Major Changes to Algorithms/Software • Must rethink the design of our algorithms and software Manycore and Hybrid architectures are disruptive technology Similar to what happened with cluster computing and message passing Rethink and rewrite the applications, algorithms, and software Data movement is expensive Flops are cheap 6
  • 7. Fork-Join Parallelization of LU and QR. Parallelize the update: dgemm • Easy and done in any reasonable software. • This is the 2/3n3 term in the FLOPs count. - • Can be done efficiently with LAPACK+multithreaded BLAS Cores
  • 8. Synchronization (in LAPACK LU) Step 1 Step 2 Step 3 Step 4 ...  synchronous processing • Fork-join, bulk fork join 27  bulk synchronous processing 8 Allowing for delayed update, out of order, asynchronous, dataflow execution
  • 9. PLASMA/MAGMA: Parallel Linear Algebra s/w for Multicore/Hybrid Architectures Objectives  High utilization of each core  Scaling to large number of cores  Synchronization reducing algorithms Methodology  Dynamic DAG scheduling (QUARK)  Explicit parallelism  Implicit communication  Fine granularity / block data layout Arbitrary DAG with dynamic scheduling Fork-join parallelism DAG scheduled parallelism 9
  • 10. Communication Avoiding QR Example R0 R0 R0R R D0 Domain_Tile_QR D0 R1 D1 Domain_Tile_QR D1 R2 R2 D2 Domain_Tile_QR D2 R3 D3 Domain_Tile_QR D3 A. Pothen and P. Raghavan. Distributed orthogonal factorization. In The 3rd Conference on Hypercube Concurrent Computers and Applications, volume II, Applications, pages 1610–1620, Pasadena, CA, Jan. 1988. ACM. Penn. State. 11/20/2012 10
  • 11. Communication Avoiding QR Example R0 R0 R0R R D0 Domain_Tile_QR D0 R1 D1 Domain_Tile_QR D1 R2 R2 D2 Domain_Tile_QR D2 R3 D3 Domain_Tile_QR D3 A. Pothen and P. Raghavan. Distributed orthogonal factorization. In The 3rd Conference on Hypercube Concurrent Computers and Applications, volume II, Applications, pages 1610–1620, Pasadena, CA, Jan. 1988. ACM. Penn. State. 11/20/2012 11
  • 12. Communication Avoiding QR Example R0 R0 R0R R D0 Domain_Tile_QR D0 R1 D1 Domain_Tile_QR D1 R2 R2 D2 Domain_Tile_QR D2 R3 D3 Domain_Tile_QR D3 A. Pothen and P. Raghavan. Distributed orthogonal factorization. In The 3rd Conference on Hypercube Concurrent Computers and Applications, volume II, Applications, pages 1610–1620, Pasadena, CA, Jan. 1988. ACM. Penn. State. 11/20/2012 12
  • 13. Communication Avoiding QR Example R0 R0 R0R R D0 Domain_Tile_QR D0 R1 D1 Domain_Tile_QR D1 R2 R2 D2 Domain_Tile_QR D2 R3 D3 Domain_Tile_QR D3 A. Pothen and P. Raghavan. Distributed orthogonal factorization. In The 3rd Conference on Hypercube Concurrent Computers and Applications, volume II, Applications, pages 1610–1620, Pasadena, CA, Jan. 1988. ACM. Penn. State. 11/20/2012 13
  • 14. Communication Avoiding QR Example R0 R0 R0R R D0 Domain_Tile_QR D0 R1 D1 Domain_Tile_QR D1 R2 R2 D2 Domain_Tile_QR D2 R3 D3 Domain_Tile_QR D3 A. Pothen and P. Raghavan. Distributed orthogonal factorization. In The 3rd Conference on Hypercube Concurrent Computers and Applications, volume II, Applications, pages 1610–1620, Pasadena, CA, Jan. 1988. ACM. Penn. State. 11/20/2012 14
  • 15. PowerPack 2.0 The PowerPack platform consists of software and hardware instrumentation. 15 Kirk Cameron, Virginia Tech; http://scape.cs.vt.edu/software/powerpack-2-0/
  • 16. Power for QR Factorization LAPACK’s QR Factorization Fork-join based MKL’s QR Factorization Fork-join based PLASMA’s Conventional QR Factorization DAG based PLASMA’s Communication Reducing QR Factorization DAG based dual-socket quad-core Intel Xeon E5462 (Harpertown) processor @ 2.80GHz (8 cores total) w / MLK BLAS 16 matrix size is very tall and skinny (mxn is 1,152,000 by 288)

Hinweis der Redaktion

  1. Bi-section bandwidth is calculated by knowing the partition sizes and the link bandwidth. For Sequoia, 96 racks will be arranged to16x12x16x16x2.The minimal surface is 12x16x16x2.The bandwidth between the two minimal surfaces is (12x16x16x2) x 2 x 2 x 2  = 49.152 TB/sHere, 2x2 reflects bi-directional 2 GB/s links, another 2 comes from the torus connections that brings a pair of links in each dimension.* Node memory bw for Q is 42.6 gb/s advertised and 28.6 GB/s on STREAM triad.  That&apos;s **NODE**, not per core...For bg/q bisection bw?Missing LatenciesMessage injection ratesFlops/watt cost money and bytes/flop that costs moneyJoule/op in 2009, 2015 and 2018: 2015: 100 pj/opCapacity doesn’t cost as much power as bandwidth how many joules to move a bit 2 picojoule/bit 75pj/bit for accessing DRAM32 Petabytes: with system memory at same fraction of system Need $ numberBest machine for 20MW and best machine for $200MMemory op is 64 bit word of memory 75 picojoule bit for (multiply by 64) (DDR 3 spec) 50 pj/ for an entire 64 bit opMemory technology in 5pj/bit by 2015 if we invest soonAnything more aggressive than 4pj/bit is close to the limit (will not sign up for 2pj/bit)2015 10 pj/flop 5pj/flop in 2018So we are talking 30:1 ratio of memory reference per flop 10pj/operation to bring a byte in8 terabits * 1pj -&gt; 8 wattsJEDEC is fundamentally broken (DDR4 is the end) Low swing differential insertion of known technology20GB/s per component to 1 order of magnitude more 10-12 Gigabits/second per wire16-64 using courant limited scaling of hydro codesCost per DRAM in that timeframe and how much to spend# outstanding memory references per cycle- bandwidth * latency above based on memory reference size memory concurrency 200 cycles from DRAM (2GHz) is 100ns (40ns for memory alone). With queues will be 100ns O(1000) references per node to memory O(10k) for 64 byte cache lines?Need to add system bisection: 2015: whatever local node bandwidth: factor of 4-8 or 2-4 against per-node interconnect bandwidth 2018:Occupancy vs latency: zero occupancy (1 slot for message launch) 5ns per 2-4 in 20152-4 in 201810^4vs 10^9th
  2. Bi-section bandwidth is calculated by knowing the partition sizes and the link bandwidth. For Sequoia, 96 racks will be arranged to16x12x16x16x2.The minimal surface is 12x16x16x2.The bandwidth between the two minimal surfaces is (12x16x16x2) x 2 x 2 x 2  = 49.152 TB/sHere, 2x2 reflects bi-directional 2 GB/s links, another 2 comes from the torus connections that brings a pair of links in each dimension.* Node memory bw for Q is 42.6 gb/s advertised and 28.6 GB/s on STREAM triad.  That&apos;s **NODE**, not per core...For bg/q bisection bw?Missing LatenciesMessage injection ratesFlops/watt cost money and bytes/flop that costs moneyJoule/op in 2009, 2015 and 2018: 2015: 100 pj/opCapacity doesn’t cost as much power as bandwidth how many joules to move a bit 2 picojoule/bit 75pj/bit for accessing DRAM32 Petabytes: with system memory at same fraction of system Need $ numberBest machine for 20MW and best machine for $200MMemory op is 64 bit word of memory 75 picojoule bit for (multiply by 64) (DDR 3 spec) 50 pj/ for an entire 64 bit opMemory technology in 5pj/bit by 2015 if we invest soonAnything more aggressive than 4pj/bit is close to the limit (will not sign up for 2pj/bit)2015 10 pj/flop 5pj/flop in 2018So we are talking 30:1 ratio of memory reference per flop 10pj/operation to bring a byte in8 terabits * 1pj -&gt; 8 wattsJEDEC is fundamentally broken (DDR4 is the end) Low swing differential insertion of known technology20GB/s per component to 1 order of magnitude more 10-12 Gigabits/second per wire16-64 using courant limited scaling of hydro codesCost per DRAM in that timeframe and how much to spend# outstanding memory references per cycle- bandwidth * latency above based on memory reference size memory concurrency 200 cycles from DRAM (2GHz) is 100ns (40ns for memory alone). With queues will be 100ns O(1000) references per node to memory O(10k) for 64 byte cache lines?Need to add system bisection: 2015: whatever local node bandwidth: factor of 4-8 or 2-4 against per-node interconnect bandwidth 2018:Occupancy vs latency: zero occupancy (1 slot for message launch) 5ns per 2-4 in 20152-4 in 201810^4vs 10^9th
  3. The PowerPack platform consists of software and hardware instrumentation. Hardware tools include a WattsUp Pro meter and a NI meter, which connect to the monitored system to measure system-level power and component-level power streams.
  4. The system power rate is measured prior to the AC/DC converter, which has an efficiency of 80-85%So during this process, you are loosing power... Which makes the numbers not adding upIntel Xeon E5462 (Harpertown) processor with 8 DIMMs or 16 GB memory (each DIMM is 2 GB DDR2).