2. Summary/Conclusions
Benefits of GPU Accelerated Computing
Faster than CPU-only systems in all tests
Large performance boost with small marginal price increase
Energy usage cut in half
GPUs scale very well within a node and over multiple nodes
The Tesla K20 GPU is our fastest and lowest-power high-performance GPU to date
Try GPU accelerated LAMMPS for free – www.nvidia.com/GPUTestDrive
3. More Science for Your Money
Figure: Embedded Atom Model, Speedup Compared to CPU Only
CPU Only: 1 (baseline)
CPU + 1x K10: 1.7
CPU + 1x K20: 2.47
CPU + 1x K20X: 2.92
CPU + 2x K10: 3.3
CPU + 2x K20: 4.5
CPU + 2x K20X: 5.5
Blue node uses 2x E5-2687W (8 Cores and 150W per CPU).
Green nodes have 2x E5-2687W and 1 or 2 NVIDIA K10, K20, or K20X GPUs (235W).
Experience performance increases of up to 5.5x with Kepler GPU nodes.
4. K20X, the Fastest GPU Yet
Figure: Speedup Relative to CPU Alone for CPU Only, CPU + 2x M2090, CPU + K20X, and CPU + 2x K20X
Blue node uses 2x E5-2687W (8 Cores and 150W per CPU).
Green nodes have 2x E5-2687W and 2 NVIDIA M2090s or K20X GPUs (235W).
Experience performance increases of up to 6.2x with Kepler GPU nodes.
One K20X performs as well as two M2090s
5. Get a CPU Rebate to Fund Part of Your GPU Budget
Figure: Acceleration in Loop Time Computation by Additional GPUs, Running NAMD version 2.9, Normalized to CPU Only
1 Node: 1 (baseline)
1 Node + 1x M2090: 5.31
1 Node + 2x M2090: 9.88
1 Node + 3x M2090: 12.9
1 Node + 4x M2090: 18.2
The blue node contains Dual X5670 CPUs (6 Cores per CPU).
The green nodes contain Dual X5570 CPUs (4 Cores per CPU) and 1-4 NVIDIA M2090 GPUs.
Increase performance 18x compared to CPU-only nodes
Cheaper CPUs paired with GPUs still deliver faster overall performance than
more expensive CPUs alone!
6. Excellent Strong Scaling on Large Clusters
Figure: LAMMPS Gay-Berne 134M Atoms. Loop Time (seconds) versus Nodes (300 to 900) for GPU Accelerated XK6 and CPU only XE6, with annotated speedups of 3.55x, 3.48x, and 3.45x.
From 300 to 900 nodes, the NVIDIA GPU-powered XK6 maintained about 3.5x the performance
of XE6 CPU nodes
Each blue Cray XE6 Node has 2x AMD Opteron CPUs (16 Cores per CPU)
Each green Cray XK6 Node has 1x AMD Opteron 1600 CPU (16 Cores per CPU) and 1x NVIDIA X2090
7. GPUs Sustain 5x Performance for Weak Scaling
Figure: Weak Scaling with 32K Atoms per Node. Loop Time (seconds) versus Nodes (1, 8, 27, 64, 125, 216, 343, 512, 729), with annotated speedups of 6.7x, 5.8x, and 4.8x.
Performance of 4.8x-6.7x with GPU-accelerated nodes
when compared to CPUs alone
Each blue Cray XE6 Node has 2x AMD Opteron CPUs (16 Cores per CPU)
Each green Cray XK6 Node has 1x AMD Opteron 1600 CPU (16 Cores per CPU) and 1x NVIDIA X2090
8. Faster, Greener — Worth It!
Figure: Energy Consumed in one loop of EAM. Energy Expended (kJ, lower is better) for 1 Node, 1 Node + 1 K20X, and 1 Node + 2x K20X.
GPU-accelerated computing uses 53% less energy than CPU only
Energy Expended = Power x Time
Power calculated by combining the components' TDPs
Blue node uses 2x E5-2687W (8 Cores and 150W per CPU) and CUDA 4.2.9.
Green nodes have 2x E5-2687W and 1 or 2 NVIDIA K20X GPUs (235W) running CUDA 5.0.36.
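The energy estimate on this slide can be reproduced with a minimal sketch. The TDP values (150W per CPU, 235W per K20X) come from the slide. The loop times are taken from the notes at the end of this deck and are assumed to be in seconds. The function and variable names are illustrative only.

```python
# TDP figures quoted on the slide.
CPU_TDP_W = 150.0  # per E5-2687W
GPU_TDP_W = 235.0  # per K20X


def node_power_w(n_cpus, n_gpus):
    """Node power as the sum of component TDPs (the slide's stated method)."""
    return n_cpus * CPU_TDP_W + n_gpus * GPU_TDP_W


def energy_kj(n_cpus, n_gpus, loop_time_s):
    """Energy Expended = Power x Time, converted from joules to kilojoules."""
    return node_power_w(n_cpus, n_gpus) * loop_time_s / 1000.0


# Loop times from the deck's notes; treating them as seconds is an assumption.
cpu_only = energy_kj(2, 0, 382.13)
one_k20x = energy_kj(2, 1, 130.5)
two_k20x = energy_kj(2, 2, 69.9)

# Fraction of energy saved by the 2x K20X node relative to CPU only.
saving = 1.0 - two_k20x / cpu_only

if __name__ == "__main__":
    print(f"CPU only:      {cpu_only:6.1f} kJ")
    print(f"CPU + 1x K20X: {one_k20x:6.1f} kJ")
    print(f"CPU + 2x K20X: {two_k20x:6.1f} kJ")
    print(f"Energy saved with 2x K20X: {saving:.0%}")
```

Under these assumptions the 2x K20X node comes out at roughly 53% less energy than the CPU-only node, in line with the figure quoted on the slide. TDP is an upper bound on component power, so measured energy would likely differ.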
9. Molecular Dynamics with LAMMPS
on a Hybrid Cray Supercomputer
W. Michael Brown
National Center for Computational Sciences
Oak Ridge National Laboratory
NVIDIA Technology Theater, Supercomputing 2012
November 14, 2012
13. Recommended GPU Node Configuration for
LAMMPS Computational Chemistry
Workstation or Single Node Configuration
# of CPU sockets: 2
Cores per CPU socket: 6+
CPU speed (GHz): 2.66+
System memory per socket (GB): 32
GPUs: Kepler K10, K20, K20X; Fermi M2090, M2075, C2075
# of GPUs per CPU socket: 1-2
GPU memory preference (GB): 6
GPU to CPU connection: PCIe 2.0 or higher
Server storage: 500 GB or higher
Network configuration: Gemini, InfiniBand
Scale to multiple nodes with same single node configuration
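As a rough illustration, the recommendations above can be written as a small check against a node description. This is a sketch only: the `NodeSpec` class, its field names, and `check_node` are hypothetical, and reading the 32 GB and 6 GB figures as minimums is an assumption rather than something the slide states.

```python
from dataclasses import dataclass


@dataclass
class NodeSpec:
    """Hypothetical description of one node; field names are illustrative."""
    cpu_sockets: int
    cores_per_socket: int
    cpu_ghz: float
    mem_gb_per_socket: int
    gpus_per_socket: int
    gpu_mem_gb: int
    pcie_gen: float
    storage_gb: int


def check_node(node):
    """Return a list of deviations from the recommended configuration."""
    problems = []
    if node.cpu_sockets != 2:
        problems.append("expected 2 CPU sockets")
    if node.cores_per_socket < 6:
        problems.append("fewer than 6 cores per socket")
    if node.cpu_ghz < 2.66:
        problems.append("CPU slower than 2.66 GHz")
    if node.mem_gb_per_socket < 32:  # assumption: 32 GB read as a minimum
        problems.append("less than 32 GB system memory per socket")
    if not 1 <= node.gpus_per_socket <= 2:
        problems.append("expected 1-2 GPUs per CPU socket")
    if node.gpu_mem_gb < 6:  # assumption: 6 GB preference read as a minimum
        problems.append("less than 6 GB GPU memory")
    if node.pcie_gen < 2.0:
        problems.append("PCIe older than 2.0")
    if node.storage_gb < 500:
        problems.append("less than 500 GB storage")
    return problems


if __name__ == "__main__":
    node = NodeSpec(2, 8, 3.1, 32, 1, 6, 3.0, 1000)
    print(check_node(node) or "meets the recommendations")
```

An empty list means the node matches every row of the table. The GPU model list and network type are left out because they are names rather than thresholds.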
14. GPU Test Drive
Experience GPU Acceleration
For Computational Chemistry
Researchers, Biophysicists
Preconfigured with Molecular
Dynamics Apps
Remotely Hosted GPU Servers
Free & Easy – Sign up, Log in and
See Results
www.nvidia.com/gputestdrive
Editor's Notes
Loop time:
CPU Only: 382.13
CPU + 1x K10: 225
CPU + 2x K10: 115.4
CPU + 1x K20: 154.6
CPU + 2x K20: 84.2
CPU + 1x K20X: 130.5
CPU + 2x K20X: 69.9
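The speedups quoted on the Embedded Atom Model slide follow from these loop times as the CPU-only time divided by each configuration's time. The sketch below assumes the run-together numbers in the note split as shown. The dictionary keys and the `speedups` helper are illustrative names.

```python
# Loop times read from the note above; the split of the digits is an assumption.
LOOP_TIMES = {
    "CPU only": 382.13,
    "CPU + 1x K10": 225.0,
    "CPU + 2x K10": 115.4,
    "CPU + 1x K20": 154.6,
    "CPU + 2x K20": 84.2,
    "CPU + 1x K20X": 130.5,
    "CPU + 2x K20X": 69.9,
}


def speedups(loop_times, baseline="CPU only"):
    """Return baseline_time / time for every configuration."""
    base = loop_times[baseline]
    return {name: base / t for name, t in loop_times.items()}


if __name__ == "__main__":
    for name, s in speedups(LOOP_TIMES).items():
        print(f"{name:14s} {s:5.2f}x")
```

With this reading, the ratios land within rounding of the chart labels (1.7, 2.47, 2.92, 3.3, 4.5, 5.5), which suggests the split is the intended one.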
Nodes, box size, atoms, cpu time, cpu+gpu time, gpu speedup
1, 1x1x1, 32768, 42.2, 6.33, 6.67x
8, 2x2x2, 262144, 41.8, 6.73, 6.21x
27, 3x3x3, 884736, 41.5, 6.86, 6.05x
64, 4x4x4, 2097152, 41.5, 7.18, 5.78x
125, 5x5x5, 4096000, 41.4, 7.18, 5.77x
216, 6x6x6, 7077888, 42, 7.66, 5.48x
343, 7x7x7, 11239424, 41.9, 8.34, 5.02x
512, 8x8x8, 16777216, 42.3, 8.41, 5.03x
729, 9x9x9, 23887872, 42.5, 8.92, 4.76x
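The weak-scaling numbers above can be checked with a short sketch. The GPU speedup is the CPU time divided by the CPU+GPU time. The weak-scaling efficiency, T(1 node) / T(N nodes), is the standard definition and does not appear in the slides. All names are illustrative.

```python
ATOMS_PER_NODE = 32768  # "32K atoms per node" from the weak-scaling slide

# (nodes, cpu_time, cpu_gpu_time) as listed in the note above.
ROWS = [
    (1, 42.2, 6.33),
    (8, 41.8, 6.73),
    (27, 41.5, 6.86),
    (64, 41.5, 7.18),
    (125, 41.4, 7.18),
    (216, 42.0, 7.66),
    (343, 41.9, 8.34),
    (512, 42.3, 8.41),
    (729, 42.5, 8.92),
]


def weak_scaling(rows):
    """Derive atom counts, GPU speedup, and weak-scaling efficiency per row."""
    t1_cpu, t1_gpu = rows[0][1], rows[0][2]
    results = []
    for nodes, t_cpu, t_gpu in rows:
        results.append({
            "nodes": nodes,
            "atoms": nodes * ATOMS_PER_NODE,
            "gpu_speedup": t_cpu / t_gpu,
            # Efficiency of 1.0 means time did not grow with node count.
            "cpu_efficiency": t1_cpu / t_cpu,
            "gpu_efficiency": t1_gpu / t_gpu,
        })
    return results


if __name__ == "__main__":
    for r in weak_scaling(ROWS):
        print(f"{r['nodes']:4d} nodes  {r['atoms']:9d} atoms  "
              f"speedup {r['gpu_speedup']:.2f}x  "
              f"CPU eff {r['cpu_efficiency']:.2f}  "
              f"GPU eff {r['gpu_efficiency']:.2f}")
```

The CPU-only time stays nearly flat across node counts, while the CPU+GPU time grows from 6.33 to 8.92. That growth is why the speedup drifts from 6.67x down to 4.76x.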
Before we end this session, I would like to tell you about GPU Test Drive. It is an excellent resource for computational chemistry researchers such as yourselves to evaluate the benefits of GPU computing in speeding up your simulations. Most importantly, it is free.

NVIDIA, along with its partners, is offering access to a remotely hosted GPU cluster. You can run applications such as AMBER and NAMD to find out how your models speed up. You can also try code that you have developed to run on GPUs and see how it scales on an 8-GPU cluster. All you need to do is sign up and log in; it is really that easy!

We have several partners who are demonstrating the GPU Test Drive on the GTC show floor. Please plan on visiting them. Sign-up forms have been given out. If you are interested, please fill them out and return them to me.