LAMMPS, Dec. 2011 or later

Summary/Conclusions
Benefits of GPU-Accelerated Computing

Faster than CPU-only systems in all tests

Large performance boost with a small marginal price increase

Energy usage cut in half

GPUs scale very well within a node and across multiple nodes

The Tesla K20 GPU is our fastest and lowest-power high-performance GPU to date

Try GPU-accelerated LAMMPS for free: www.nvidia.com/GPUTestDrive
More Science for Your Money

Embedded Atom Model: speedup compared to CPU only

    CPU Only           1.0x
    CPU + 1x K10       1.7x
    CPU + 1x K20       2.47x
    CPU + 1x K20X      2.92x
    CPU + 2x K10       3.3x
    CPU + 2x K20       4.5x
    CPU + 2x K20X      5.5x

Blue node uses 2x E5-2687W (8 cores and 150W per CPU).
Green nodes have 2x E5-2687W and 1 or 2 NVIDIA K10, K20, or K20X GPUs (235W).

Experience performance increases of up to 5.5x with Kepler GPU nodes.
K20X, the Fastest GPU Yet

[Chart: speedup relative to CPU alone for four configurations: CPU only, CPU + 2x M2090, CPU + K20X, and CPU + 2x K20X.]

Blue node uses 2x E5-2687W (8 cores and 150W per CPU).
Green nodes have 2x E5-2687W and 2 NVIDIA M2090s or K20X GPUs (235W).

Experience performance increases of up to 6.2x with Kepler GPU nodes.
One K20X performs as well as two M2090s.
Get a CPU Rebate to Fund Part of Your GPU Budget

Acceleration in loop-time computation by additional GPUs, running NAMD version 2.9. Speedup normalized to CPU only:

    1 Node (CPU only)    1.0x
    1 Node + 1x M2090    5.31x
    1 Node + 2x M2090    9.88x
    1 Node + 3x M2090    12.9x
    1 Node + 4x M2090    18.2x

The blue node contains dual X5670 CPUs (6 cores per CPU).
The green nodes contain dual X5570 CPUs (4 cores per CPU) and 1-4 NVIDIA M2090 GPUs.

Increase performance up to 18x when compared to CPU-only nodes.
The GPU nodes use cheaper CPUs, yet still deliver faster overall performance than the node with more expensive CPUs!
Excellent Strong Scaling on Large Clusters

LAMMPS Gay-Berne, 134M atoms: loop time (seconds) versus node count (300-900 nodes) for the GPU-accelerated XK6 and the CPU-only XE6, with measured speedups of 3.55x, 3.48x, and 3.45x across the range.

From 300 to 900 nodes, the NVIDIA GPU-powered XK6 maintained roughly 3.5x performance compared to XE6 CPU nodes.
Each blue Cray XE6 node has 2x AMD Opteron CPUs (16 cores per CPU).
Each green Cray XK6 node has 1x AMD Opteron 1600 CPU (16 cores per CPU) and 1x NVIDIA X2090.
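The speedup annotations above are simply the ratio of the CPU-only loop time to the GPU-accelerated loop time at the same node count. A minimal sketch of that calculation, using illustrative placeholder loop times rather than the measured XE6/XK6 data:

```python
# Speedup = CPU-only loop time / GPU-accelerated loop time at equal node count.
# These loop times are illustrative placeholders, not the measured values.
cpu_only_loop_s = 426.0    # hypothetical XE6 (CPU-only) loop time
gpu_accel_loop_s = 120.0   # hypothetical XK6+GPU loop time at the same node count

speedup = cpu_only_loop_s / gpu_accel_loop_s
print(f"{speedup:.2f}x")   # prints "3.55x" for these placeholder values
```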
GPUs Sustain 5x Performance for Weak Scaling

Weak scaling with 32K atoms per node: loop time (seconds) at 1, 8, 27, 64, 125, 216, 343, 512, and 729 nodes, with GPU speedups of 6.7x, 5.8x, and 4.8x measured across the range.

Performance of 4.8x-6.7x with GPU-accelerated nodes when compared to CPUs alone.
Each blue Cray XE6 node has 2x AMD Opteron CPUs (16 cores per CPU).
Each green Cray XK6 node has 1x AMD Opteron 1600 CPU (16 cores per CPU) and 1x NVIDIA X2090.
Faster, Greener — Worth It!
Energy expended (kJ) in one loop of EAM; lower is better. Configurations compared: 1 node (CPU only), 1 node + 1x K20X, and 1 node + 2x K20X.

GPU-accelerated computing uses 53% less energy than CPU only.

Energy Expended = Power x Time
Power is calculated by combining the components' TDPs.

Blue node uses 2x E5-2687W (8 cores and 150W per CPU) and CUDA 4.2.9.
Green nodes have 2x E5-2687W and 1 or 2 NVIDIA K20X GPUs (235W) running CUDA 5.0.36.
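The TDP-based estimate is straightforward arithmetic. In the sketch below, the 150W CPU and 235W GPU TDPs come from the slide, but the loop times are hypothetical placeholders chosen only to illustrate how a faster GPU run at higher power can still use roughly half the energy:

```python
# Energy Expended = Power x Time, with power taken as the sum of component TDPs.
CPU_TDP_W = 150.0   # per E5-2687W CPU (from the slide)
GPU_TDP_W = 235.0   # per K20X GPU (from the slide)

def energy_kj(n_cpus: int, n_gpus: int, loop_time_s: float) -> float:
    """Estimate the energy for one loop from combined TDPs, in kJ."""
    power_w = n_cpus * CPU_TDP_W + n_gpus * GPU_TDP_W
    return power_w * loop_time_s / 1000.0  # J -> kJ

# Hypothetical loop times, chosen only to illustrate the arithmetic:
cpu_only = energy_kj(2, 0, loop_time_s=400.0)   # 300 W x 400 s = 120.0 kJ
with_gpus = energy_kj(2, 2, loop_time_s=73.0)   # 770 W x 73 s  = 56.21 kJ
print(f"CPU only: {cpu_only:.1f} kJ; CPU + 2x K20X: {with_gpus:.1f} kJ")
print(f"Energy saving: {1 - with_gpus / cpu_only:.0%}")
```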
Molecular Dynamics with LAMMPS
 on a Hybrid Cray Supercomputer
                    W. Michael Brown
        National Center for Computational Sciences
              Oak Ridge National Laboratory

      NVIDIA Technology Theater, Supercomputing 2012
                     November 14, 2012
Early Kepler Benchmarks on Titan

[Charts: loop time (seconds) versus node count for the Atomic Fluid and Bulk Copper benchmarks, comparing XK6, XK6+GPU, and XK7+GPU node configurations across a wide range of node counts.]
Early Kepler Benchmarks on Titan

[Charts: loop time (seconds) versus node count (1 to 16384) for the Protein and Liquid Crystal benchmarks, comparing XK6, XK6+GPU, and XK7+GPU node configurations.]
Early Titan XK6/XK7 Benchmarks

Speedup with Acceleration on XK6/XK7 Nodes
1 Node = 32K Particles; 900 Nodes = 29M Particles

                       Atomic Fluid      Atomic Fluid      Bulk Copper   Protein   Liquid Crystal
                       (cutoff = 2.5σ)   (cutoff = 5.0σ)
    XK6 (1 Node)            1.92              4.33             2.12        2.60          5.82
    XK7 (1 Node)            2.90              8.38             3.66        3.36         15.70
    XK6 (900 Nodes)         1.68              3.96             2.15        1.56          5.60
    XK7 (900 Nodes)         2.75              7.48             2.86        1.95         10.14
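The table makes it easy to read off how much the Kepler-based XK7 improves on the Fermi-based XK6 for each benchmark: divide the two speedups at the same scale. A short sketch using the single-node column of the table:

```python
# Single-node speedups with acceleration, taken from the table above.
xk6 = {"Atomic Fluid (2.5σ)": 1.92, "Atomic Fluid (5.0σ)": 4.33,
       "Bulk Copper": 2.12, "Protein": 2.60, "Liquid Crystal": 5.82}
xk7 = {"Atomic Fluid (2.5σ)": 2.90, "Atomic Fluid (5.0σ)": 8.38,
       "Bulk Copper": 3.66, "Protein": 3.36, "Liquid Crystal": 15.70}

# XK7-over-XK6 improvement = ratio of the two speedups per benchmark.
for name in xk6:
    print(f"{name}: {xk7[name] / xk6[name]:.2f}x")
```

The Liquid Crystal (Gay-Berne) case benefits most, with the XK7 roughly 2.7x faster than the XK6 at one node.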
Recommended GPU Node Configuration for LAMMPS Computational Chemistry

Workstation or Single-Node Configuration

    # of CPU sockets                 2
    Cores per CPU socket             6+
    CPU speed (GHz)                  2.66+
    System memory per socket (GB)    32
    GPUs                             Kepler K10, K20, K20X; Fermi M2090, M2075, C2075
    # of GPUs per CPU socket         1-2
    GPU memory preference (GB)       6
    GPU-to-CPU connection            PCIe 2.0 or higher
    Server storage                   500 GB or higher
    Network configuration            Gemini, InfiniBand

Scale to multiple nodes with the same single-node configuration.
GPU Test Drive

Experience GPU Acceleration

For Computational Chemistry Researchers and Biophysicists

Preconfigured with Molecular Dynamics Apps

Remotely Hosted GPU Servers

Free & Easy: Sign Up, Log In, and See Results

www.nvidia.com/gputestdrive

Weitere ähnliche Inhalte

Was ist angesagt?

Hd7950 sales kit
Hd7950 sales kitHd7950 sales kit
Hd7950 sales kitPowerColor
 
Turbo duo hd7790 sales kit
Turbo duo hd7790 sales kitTurbo duo hd7790 sales kit
Turbo duo hd7790 sales kitPowerColor
 
intel speed-select-technology-base-frequency-enhancing-performance
intel speed-select-technology-base-frequency-enhancing-performanceintel speed-select-technology-base-frequency-enhancing-performance
intel speed-select-technology-base-frequency-enhancing-performanceDESMOND YUEN
 
PowerColor PCS+ Vortex II sales kit
PowerColor PCS+ Vortex II sales kitPowerColor PCS+ Vortex II sales kit
PowerColor PCS+ Vortex II sales kitPowerColor
 
3rd Generation Intel® Xeon® Scalable Processor - Achieving 1 Tbps IPsec with ...
3rd Generation Intel® Xeon® Scalable Processor - Achieving 1 Tbps IPsec with ...3rd Generation Intel® Xeon® Scalable Processor - Achieving 1 Tbps IPsec with ...
3rd Generation Intel® Xeon® Scalable Processor - Achieving 1 Tbps IPsec with ...DESMOND YUEN
 
CUDA by Example : Atomics : Notes
CUDA by Example : Atomics : NotesCUDA by Example : Atomics : Notes
CUDA by Example : Atomics : NotesSubhajit Sahu
 
The Rise of Parallel Computing
The Rise of Parallel ComputingThe Rise of Parallel Computing
The Rise of Parallel Computingbakers84
 
20160407_GTC2016_PgSQL_In_Place
20160407_GTC2016_PgSQL_In_Place20160407_GTC2016_PgSQL_In_Place
20160407_GTC2016_PgSQL_In_PlaceKohei KaiGai
 
Cascade lake-advanced-performance-press-deck
Cascade lake-advanced-performance-press-deckCascade lake-advanced-performance-press-deck
Cascade lake-advanced-performance-press-deckDESMOND YUEN
 
Exadata db node update
Exadata db node updateExadata db node update
Exadata db node updatepat2001
 

Was ist angesagt? (20)

Mateo valero p2
Mateo valero p2Mateo valero p2
Mateo valero p2
 
Ron perrot
Ron perrotRon perrot
Ron perrot
 
Mateo valero p1
Mateo valero p1Mateo valero p1
Mateo valero p1
 
Hd7950 sales kit
Hd7950 sales kitHd7950 sales kit
Hd7950 sales kit
 
Turbo duo hd7790 sales kit
Turbo duo hd7790 sales kitTurbo duo hd7790 sales kit
Turbo duo hd7790 sales kit
 
intel speed-select-technology-base-frequency-enhancing-performance
intel speed-select-technology-base-frequency-enhancing-performanceintel speed-select-technology-base-frequency-enhancing-performance
intel speed-select-technology-base-frequency-enhancing-performance
 
PowerColor PCS+ Vortex II sales kit
PowerColor PCS+ Vortex II sales kitPowerColor PCS+ Vortex II sales kit
PowerColor PCS+ Vortex II sales kit
 
Google warehouse scale computer
Google warehouse scale computerGoogle warehouse scale computer
Google warehouse scale computer
 
Core I7
Core I7Core I7
Core I7
 
54603 vsp vs300_fl5_ccah
54603 vsp vs300_fl5_ccah54603 vsp vs300_fl5_ccah
54603 vsp vs300_fl5_ccah
 
54647 01 vsp_vs300_fh4_dcaj
54647 01 vsp_vs300_fh4_dcaj54647 01 vsp_vs300_fh4_dcaj
54647 01 vsp_vs300_fh4_dcaj
 
knoSYS_Hardware
knoSYS_HardwareknoSYS_Hardware
knoSYS_Hardware
 
54645 01 vsp_vs300_fh5_dcah
54645 01 vsp_vs300_fh5_dcah54645 01 vsp_vs300_fh5_dcah
54645 01 vsp_vs300_fh5_dcah
 
3rd Generation Intel® Xeon® Scalable Processor - Achieving 1 Tbps IPsec with ...
3rd Generation Intel® Xeon® Scalable Processor - Achieving 1 Tbps IPsec with ...3rd Generation Intel® Xeon® Scalable Processor - Achieving 1 Tbps IPsec with ...
3rd Generation Intel® Xeon® Scalable Processor - Achieving 1 Tbps IPsec with ...
 
CUDA by Example : Atomics : Notes
CUDA by Example : Atomics : NotesCUDA by Example : Atomics : Notes
CUDA by Example : Atomics : Notes
 
The Rise of Parallel Computing
The Rise of Parallel ComputingThe Rise of Parallel Computing
The Rise of Parallel Computing
 
20160407_GTC2016_PgSQL_In_Place
20160407_GTC2016_PgSQL_In_Place20160407_GTC2016_PgSQL_In_Place
20160407_GTC2016_PgSQL_In_Place
 
55506 vsp vs300_av_ccaj
55506 vsp vs300_av_ccaj55506 vsp vs300_av_ccaj
55506 vsp vs300_av_ccaj
 
Cascade lake-advanced-performance-press-deck
Cascade lake-advanced-performance-press-deckCascade lake-advanced-performance-press-deck
Cascade lake-advanced-performance-press-deck
 
Exadata db node update
Exadata db node updateExadata db node update
Exadata db node update
 

Ähnlich wie LAMMPS Molecular Dynamics on GPU

計算力学シミュレーションに GPU は役立つのか?
計算力学シミュレーションに GPU は役立つのか?計算力学シミュレーションに GPU は役立つのか?
計算力学シミュレーションに GPU は役立つのか?Shinnosuke Furuya
 
Boyang gao gpu k-means_gmm_final_v1
Boyang gao gpu k-means_gmm_final_v1Boyang gao gpu k-means_gmm_final_v1
Boyang gao gpu k-means_gmm_final_v1Gao Boyang
 
PG-Strom - GPU Accelerated Asyncr
PG-Strom - GPU Accelerated AsyncrPG-Strom - GPU Accelerated Asyncr
PG-Strom - GPU Accelerated AsyncrKohei KaiGai
 
Kindratenko hpc day 2011 Kiev
Kindratenko hpc day 2011 KievKindratenko hpc day 2011 Kiev
Kindratenko hpc day 2011 KievVolodymyr Saviak
 
NVIDIA GPUs Power HPC & AI Workloads in Cloud with Univa
NVIDIA GPUs Power HPC & AI Workloads in Cloud with UnivaNVIDIA GPUs Power HPC & AI Workloads in Cloud with Univa
NVIDIA GPUs Power HPC & AI Workloads in Cloud with Univainside-BigData.com
 
Cuda 6 performance_report
Cuda 6 performance_reportCuda 6 performance_report
Cuda 6 performance_reportMichael Zhang
 
The Power of One: Supermicro’s High-Performance Single-Processor Blade Systems
The Power of One: Supermicro’s High-Performance Single-Processor Blade SystemsThe Power of One: Supermicro’s High-Performance Single-Processor Blade Systems
The Power of One: Supermicro’s High-Performance Single-Processor Blade SystemsRebekah Rodriguez
 
2.01_Nvidia_NVswitch_HotChips2018_DGX2NVS_Final.pdf
2.01_Nvidia_NVswitch_HotChips2018_DGX2NVS_Final.pdf2.01_Nvidia_NVswitch_HotChips2018_DGX2NVS_Final.pdf
2.01_Nvidia_NVswitch_HotChips2018_DGX2NVS_Final.pdfbui thequan
 
Building the World's Largest GPU
Building the World's Largest GPUBuilding the World's Largest GPU
Building the World's Largest GPURenee Yao
 
Computing Performance: On the Horizon (2021)
Computing Performance: On the Horizon (2021)Computing Performance: On the Horizon (2021)
Computing Performance: On the Horizon (2021)Brendan Gregg
 
MT58 High performance graphics for VDI: A technical discussion
MT58 High performance graphics for VDI: A technical discussionMT58 High performance graphics for VDI: A technical discussion
MT58 High performance graphics for VDI: A technical discussionDell EMC World
 
DCSF 19 Accelerating Docker Containers with NVIDIA GPUs
DCSF 19 Accelerating Docker Containers with NVIDIA GPUsDCSF 19 Accelerating Docker Containers with NVIDIA GPUs
DCSF 19 Accelerating Docker Containers with NVIDIA GPUsDocker, Inc.
 
Amd accelerated computing -ufrj
Amd   accelerated computing -ufrjAmd   accelerated computing -ufrj
Amd accelerated computing -ufrjRoberto Brandao
 
Gpu Systems
Gpu SystemsGpu Systems
Gpu Systemsjpaugh
 
NVIDIA DGX User Group 1st Meet Up_30 Apr 2021.pdf
NVIDIA DGX User Group 1st Meet Up_30 Apr 2021.pdfNVIDIA DGX User Group 1st Meet Up_30 Apr 2021.pdf
NVIDIA DGX User Group 1st Meet Up_30 Apr 2021.pdfMuhammadAbdullah311866
 

Ähnlich wie LAMMPS Molecular Dynamics on GPU (20)

計算力学シミュレーションに GPU は役立つのか?
計算力学シミュレーションに GPU は役立つのか?計算力学シミュレーションに GPU は役立つのか?
計算力学シミュレーションに GPU は役立つのか?
 
Boyang gao gpu k-means_gmm_final_v1
Boyang gao gpu k-means_gmm_final_v1Boyang gao gpu k-means_gmm_final_v1
Boyang gao gpu k-means_gmm_final_v1
 
Introduction to GPU Programming
Introduction to GPU ProgrammingIntroduction to GPU Programming
Introduction to GPU Programming
 
PG-Strom - GPU Accelerated Asyncr
PG-Strom - GPU Accelerated AsyncrPG-Strom - GPU Accelerated Asyncr
PG-Strom - GPU Accelerated Asyncr
 
Kindratenko hpc day 2011 Kiev
Kindratenko hpc day 2011 KievKindratenko hpc day 2011 Kiev
Kindratenko hpc day 2011 Kiev
 
NVIDIA GPUs Power HPC & AI Workloads in Cloud with Univa
NVIDIA GPUs Power HPC & AI Workloads in Cloud with UnivaNVIDIA GPUs Power HPC & AI Workloads in Cloud with Univa
NVIDIA GPUs Power HPC & AI Workloads in Cloud with Univa
 
Latest HPC News from NVIDIA
Latest HPC News from NVIDIALatest HPC News from NVIDIA
Latest HPC News from NVIDIA
 
Cuda 6 performance_report
Cuda 6 performance_reportCuda 6 performance_report
Cuda 6 performance_report
 
Nvidia tesla-k80-overview
Nvidia tesla-k80-overviewNvidia tesla-k80-overview
Nvidia tesla-k80-overview
 
The Power of One: Supermicro’s High-Performance Single-Processor Blade Systems
The Power of One: Supermicro’s High-Performance Single-Processor Blade SystemsThe Power of One: Supermicro’s High-Performance Single-Processor Blade Systems
The Power of One: Supermicro’s High-Performance Single-Processor Blade Systems
 
2.01_Nvidia_NVswitch_HotChips2018_DGX2NVS_Final.pdf
2.01_Nvidia_NVswitch_HotChips2018_DGX2NVS_Final.pdf2.01_Nvidia_NVswitch_HotChips2018_DGX2NVS_Final.pdf
2.01_Nvidia_NVswitch_HotChips2018_DGX2NVS_Final.pdf
 
Building the World's Largest GPU
Building the World's Largest GPUBuilding the World's Largest GPU
Building the World's Largest GPU
 
Nvidia Cuda Apps Jun27 11
Nvidia Cuda Apps Jun27 11Nvidia Cuda Apps Jun27 11
Nvidia Cuda Apps Jun27 11
 
Computing Performance: On the Horizon (2021)
Computing Performance: On the Horizon (2021)Computing Performance: On the Horizon (2021)
Computing Performance: On the Horizon (2021)
 
MT58 High performance graphics for VDI: A technical discussion
MT58 High performance graphics for VDI: A technical discussionMT58 High performance graphics for VDI: A technical discussion
MT58 High performance graphics for VDI: A technical discussion
 
DCSF 19 Accelerating Docker Containers with NVIDIA GPUs
DCSF 19 Accelerating Docker Containers with NVIDIA GPUsDCSF 19 Accelerating Docker Containers with NVIDIA GPUs
DCSF 19 Accelerating Docker Containers with NVIDIA GPUs
 
Amd accelerated computing -ufrj
Amd   accelerated computing -ufrjAmd   accelerated computing -ufrj
Amd accelerated computing -ufrj
 
Gpu Systems
Gpu SystemsGpu Systems
Gpu Systems
 
SDC Server Sao Jose
SDC Server Sao JoseSDC Server Sao Jose
SDC Server Sao Jose
 
NVIDIA DGX User Group 1st Meet Up_30 Apr 2021.pdf
NVIDIA DGX User Group 1st Meet Up_30 Apr 2021.pdfNVIDIA DGX User Group 1st Meet Up_30 Apr 2021.pdf
NVIDIA DGX User Group 1st Meet Up_30 Apr 2021.pdf
 

LAMMPS Molecular Dynamics on GPU

  • 1. LAMMPS, Dec. 2011 or later
  • 2. Summary/Conclusions Benefits of GPU Accelerated Computing Faster than CPU only systems in all tests Large performance boost with small marginal price increase Energy usage cut in half GPUs scale very well within a node and over multiple nodes Tesla K20 GPU is our fastest and lowest power high performance GPU to date Try GPU accelerated LAMMPS for free – www.nvidia.com/GPUTestDrive
  • 3. More Science for Your Money: Embedded Atom Model speedup compared to CPU only. CPU + 1x K10: 1.7x; CPU + 1x K20: 2.47x; CPU + 1x K20X: 2.92x; CPU + 2x K10: 3.3x; CPU + 2x K20: 4.5x; CPU + 2x K20X: 5.5x. The blue node uses 2x E5-2687W (8 cores and 150W per CPU); green nodes have 2x E5-2687W and 1 or 2 NVIDIA K10, K20, or K20X GPUs (235W each). Experience performance increases of up to 5.5x with Kepler GPU nodes.
  • 4. K20X, the Fastest GPU Yet: speedup relative to CPU alone for CPU + 2x M2090, CPU + 1x K20X, and CPU + 2x K20X nodes. The blue node uses 2x E5-2687W (8 cores and 150W per CPU); green nodes have 2x E5-2687W and 2 NVIDIA M2090 or K20X GPUs (235W each). Experience performance increases of up to 6.2x with Kepler GPU nodes; one K20X performs as well as two M2090s.
  • 5. Get a CPU Rebate to Fund Part of Your GPU Budget: acceleration in loop-time computation by additional GPUs, running NAMD version 2.9, normalized to CPU only. 1 node + 1x M2090: 5.31x; + 2x M2090: 9.88x; + 3x M2090: 12.9x; + 4x M2090: 18.2x. The blue node contains dual X5670 CPUs (6 cores per CPU); the green nodes contain dual X5570 CPUs (4 cores per CPU) and 1-4 NVIDIA M2090 GPUs. Increase performance 18x compared to CPU-only nodes: cheaper CPUs used with GPUs still deliver faster overall performance than more expensive CPUs alone.
  • 6. Excellent Strong Scaling on Large Clusters: LAMMPS Gay-Berne, 134M atoms, loop time over 300-900 nodes. From 300 to 900 nodes, the NVIDIA GPU-powered XK6 maintained about 3.5x the performance of XE6 CPU nodes (3.55x at 300 nodes, 3.48x at 600, 3.45x at 900). Each blue Cray XE6 node has 2x AMD Opteron CPUs (16 cores per CPU); each green Cray XK6 node has 1x AMD Opteron 1600 CPU (16 cores per CPU) and 1x NVIDIA X2090.
  • 7. GPUs Sustain 5x Performance for Weak Scaling: weak scaling with 32K atoms per node, from 1 to 729 nodes. GPU-accelerated nodes deliver 4.8x-6.7x the performance of CPUs alone. Each blue Cray XE6 node has 2x AMD Opteron CPUs (16 cores per CPU); each green Cray XK6 node has 1x AMD Opteron 1600 CPU (16 cores per CPU) and 1x NVIDIA X2090.
  • 8. Faster, Greener — Worth It! Energy consumed in one loop of EAM (lower is better): GPU-accelerated computing uses 53% less energy than CPU only. Energy expended = power x time, with power calculated by combining the components' TDPs. The blue node uses 2x E5-2687W (8 cores and 150W per CPU) and CUDA 4.2.9; green nodes have 2x E5-2687W and 1 or 2 NVIDIA K20X GPUs (235W) running CUDA 5.0.36.
  • 9. Molecular Dynamics with LAMMPS on a Hybrid Cray Supercomputer. W. Michael Brown, National Center for Computational Sciences, Oak Ridge National Laboratory. NVIDIA Technology Theater, Supercomputing 2012, November 14, 2012.
  • 10. Early Kepler Benchmarks on Titan: charts of loop time versus node count (1 to 16,384 nodes) for the Atomic Fluid and Bulk Copper benchmarks on XK6 (CPU only), XK6+GPU, and XK7+GPU nodes.
  • 11. Early Kepler Benchmarks on Titan (continued): charts of loop time versus node count for the Protein and Liquid Crystal benchmarks on XK6 (CPU only), XK6+GPU, and XK7+GPU nodes.
  • 12. Early Titan XK6/XK7 Benchmarks: speedup with acceleration on XK6/XK7 nodes (1 node = 32K particles; 900 nodes = 29M particles).
     Benchmark                      XK6 (1 node)  XK7 (1 node)  XK6 (900 nodes)  XK7 (900 nodes)
     Atomic Fluid (cutoff = 2.5σ)   1.92          2.90          1.68             2.75
     Atomic Fluid (cutoff = 5.0σ)   4.33          8.38          3.96             7.48
     Bulk Copper                    2.12          3.66          2.15             2.86
     Protein                        2.60          3.36          1.56             1.95
     Liquid Crystal                 5.82          15.70         5.60             10.14
  • 13. Recommended GPU Node Configuration for LAMMPS Computational Chemistry (workstation or single-node configuration): 2 CPU sockets; 6+ cores per CPU socket; CPU speed 2.66+ GHz; 32 GB system memory per socket; Kepler K10, K20, or K20X GPUs, or Fermi M2090, M2075, or C2075; 1-2 GPUs per CPU socket; 6 GB GPU memory preferred; GPU-to-CPU connection over PCIe 2.0 or higher; server storage of 500 GB or more; Gemini or InfiniBand network. Scale to multiple nodes with the same single-node configuration.
  • 14. GPU Test Drive: experience GPU acceleration. For computational chemistry researchers and biophysicists; preconfigured with molecular dynamics apps; remotely hosted GPU servers. Free and easy: sign up, log in, and see results at www.nvidia.com/gputestdrive.
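
The speedup bars quoted above follow directly from raw loop times. As a minimal sketch (using the EAM loop times recorded in the editor's notes at the end of this transcript), the slide 3 figures can be reproduced as:

```python
# Speedup = CPU-only loop time / accelerated loop time.
# EAM loop times in seconds, from the editor's notes.
cpu_only = 382.13  # 2x E5-2687W, no GPU

loop_times = {
    "CPU + 1x K10":  225.0,
    "CPU + 1x K20":  154.6,
    "CPU + 1x K20X": 130.5,
    "CPU + 2x K10":  115.4,
    "CPU + 2x K20":  84.2,
    "CPU + 2x K20X": 69.9,
}

for config, seconds in loop_times.items():
    print(f"{config}: {cpu_only / seconds:.2f}x")
# CPU + 2x K20X works out to ~5.47x, the "up to 5.5x" cited on slide 3
```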

Editor's notes

  1. EAM loop times (seconds) behind slide 3: CPU only: 382.13; CPU + 1x K10: 225; CPU + 2x K10: 115.4; CPU + 1x K20: 154.6; CPU + 2x K20: 84.2; CPU + 1x K20X: 130.5; CPU + 2x K20X: 69.9.
  3. NAMD configurations and loop times (seconds): 2x X5670 (HP Z800): 2717.63; 1x M2090 (2x X5570): 511.75; 2x M2090 (2x X5570): 274.97; 3x M2090 (2x X5570): 210.43; 4x M2090 (2x X5570): 148.88.
  4. Strong-scaling data behind slide 6:
     Nodes:          300     400     500     600     700     800     900
     CPU-only time:  563.96  423.83  339.62  281.58  260.98  220.83  203.13
     CPU+GPU time:   159.06  118.62  96.44   81.03   71.57   63.76   58.96
     GPU speedup:    3.55    3.57    3.52    3.48    3.65    3.46    3.45
  5. Weak-scaling data behind slide 7:
     Nodes  Box size  Atoms       CPU time (s)  CPU+GPU time (s)  GPU speedup
     1      1x1x1     32,768      42.2          6.33              6.67x
     8      2x2x2     262,144     41.8          6.73              6.21x
     27     3x3x3     884,736     41.5          6.86              6.05x
     64     4x4x4     2,097,152   41.5          7.18              5.78x
     125    5x5x5     4,096,000   41.4          7.18              5.77x
     216    6x6x6     7,077,888   42.0          7.66              5.48x
     343    7x7x7     11,239,424  41.9          8.34              5.02x
     512    8x8x8     16,777,216  42.3          8.41              5.03x
     729    9x9x9     23,887,872  42.5          8.92              4.76x
  6. Energy data behind slide 8 (energy expended = power x time):
     Configuration   Power (W)  Time (s)  Energy (kJ)
     CPU only        300        382       114
     CPU + 1x K20X   535        130       69
     CPU + 2x K20X   770        70        54
  7. Before we end this session, I would like to tell you about GPU Test Drive. It is an excellent resource for computational chemistry researchers such as yourself to evaluate the benefits of GPU computing in speeding up your simulations. Most importantly, it is free. NVIDIA, along with its partners, is offering access to a remotely hosted GPU cluster. You can run applications such as AMBER and NAMD to find out how much your models speed up, and you can also try code you have developed for GPUs and see how it scales on an 8-GPU cluster. All you need to do is sign up and log in; it is really that easy! Several partners are demonstrating the GPU Test Drive on the GTC show floor, so please plan on visiting them. Sign-up forms have been given out; if you are interested, please fill them out and return them to me.
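
The energy comparison in note 6 (and on slide 8) is simple arithmetic over TDP-derived power and loop time. A minimal sketch, assuming the power and timing figures from note 6:

```python
# Energy expended = power x time, with power approximated by
# summing component TDPs (as stated on slide 8).
configs = {
    "CPU only":      (300, 382),   # (watts, loop time in seconds)
    "CPU + 1x K20X": (535, 130),
    "CPU + 2x K20X": (770, 70),
}

energy_kj = {name: watts * seconds / 1000.0
             for name, (watts, seconds) in configs.items()}

for name, kj in energy_kj.items():
    print(f"{name}: {kj:.1f} kJ")

saving = 1 - energy_kj["CPU + 2x K20X"] / energy_kj["CPU only"]
print(f"Energy saving vs. CPU only: {saving:.0%}")
# ~53%, matching the "53% less energy" claim on slide 8
```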