This document provides information on molecular dynamics (MD) and quantum chemistry applications that have been optimized to take advantage of GPU acceleration. It lists the key applications in each category along with notes on supported features, reported GPU speedups compared to CPU performance, and release status including support for single or multiple GPUs. The applications cover a wide range of computational chemistry and bioinformatics domains and demonstrate the growing use of GPUs for scientific computing.
2. Molecular Dynamics (MD) Applications
Features
Application GPU Perf Release Status Notes/Benchmarks
Supported
> 100 ns/day AMBER 12, GPU Revision Support 12.2
PMEMD Explicit Solvent & GB Released
AMBER Implicit Solvent
JAC NVE on 2X
Multi-GPU, multi-node
http://ambermd.org/gpus/benchmarks.
K20s htm#Benchmarks
2x C2070 equals Release C37b1;
Implicit (5x), Explicit (2x) Released
CHARMM Solvent via OpenMM
32-35x X5667
Single & multi-GPU in single node
http://www.charmm.org/news/c37b1.html#po
CPUs stjump
Two-body Forces, Link-cell Source only, Results Published
Release V 4.03
DL_POLY Pairs, Ewald SPME forces, 4x
Multi-GPU, multi-node
http://www.stfc.ac.uk/CSE/randd/ccg/softwa
Shake VV re/DL_POLY/25526.aspx
165 ns/Day
Released
GROMACS Implicit (5x), Explicit (2x) DHFR on
Multi-GPU, multi-node
Release 4.6; 1st Multi-GPU support
4X C2075s
http://lammps.sandia.gov/bench.html#deskto
Lennard-Jones, Gay-Berne, Released.
LAMMPS Tersoff & many more potentials
3.5-18x on Titan
Multi-GPU, multi-node
p and
http://lammps.sandia.gov/bench.html#titan
4.0 ns/days Released
Full electrostatics with PME and
NAMD most simulation features
F1-ATPase on 100M atom capable NAMD 2.9
1x K20X Multi-GPU, multi-node
GPU Perf compared against Multi-core x86 CPU socket.
GPU Perf benchmarked on GPU supported features
and may be a kernel to kernel perf comparison
3. New/Additional MD Applications Ramping
Features
Application GPU Perf Release Status Notes
Supported
4-29X Released, Version 1.8.51
Abalone Simulations (on 1060 GPU)
(on 1060 GPU) Single GPU
Agile Molecule, Inc.
Computation of non-valent 4-29X Released, Version 1.1.4
Ascalaph interactions (on 1060 GPU) Single GPU
Agile Molecule, Inc.
150 ns/day DHFR on Released Production bio-molecular dynamics (MD)
ACEMD Written for use only on GPUs
1x K20 Single and multi-GPUs software specially optimized to run on GPUs
Powerful distributed computing
Depends upon Released; http://folding.stanford.edu
Folding@Home molecular dynamics system;
number of GPUs GPUs and CPUs GPUs get 4X the points of CPUs
implicit solvent and folding
High-performance all-atom
Depends upon Released; http://www.gpugrid.net/
GPUGrid.net biomolecular simulations;
number of GPUs NVIDIA GPUs only
explicit solvent and binding
Simple fluids and binary
mixtures (pair potentials, high- Up to 66x on 2090 Released, Version 0.2.0 http://halmd.org/benchmarks.html#supercool
HALMD precision NVE and NVT, dynamic vs. 1 CPU core Single GPU ed-binary-mixture-kob-andersen
correlations)
Kepler 2X faster Released, Version 0.11.2 http://codeblue.umich.edu/hoomd-blue/
HOOMD-Blue Written for use only on GPUs
than Fermi Single and multi-GPU on 1 node Multi-GPU w/ MPI in March 2013
Implicit: 127-213
Implicit and explicit solvent, Released Version 4.1.1 Library and application for molecular dynamics
OpenMM custom forces
ns/day Explicit: 18-
Multi-GPU on high-performance
55 ns/day DHFR
GPU Perf compared against Multi-core x86 CPU socket.
GPU Perf benchmarked on GPU supported features
and may be a kernel to kernel perf comparison
4. Quantum Chemistry Applications
Application Features Supported GPU Perf Release Status Notes
Local Hamiltonian, non-local
Hamiltonian, LOBPCG algorithm, Released; Version 7.0.5 www.abinit.org
Abinit diagonalization /
1.3-2.7X
Multi-GPU support
orthogonalization
Integrating scheduling GPU into http://www.olcf.ornl.gov/wp-
Under development
ACES III SIAL programming language and 10X on kernels
Multi-GPU support
content/training/electronic-structure-
SIP runtime environment 2012/deumens_ESaccel_2012.pdf
Pilot project completed,
ADF Fock Matrix, Hessians TBD Under development www.scm.com
Multi-GPU support
http://inac.cea.fr/L_Sim/BigDFT/news.html,
http://www.olcf.ornl.gov/wp-
5-25X Released June 2009, content/training/electronic-structure-
DFT; Daubechies wavelets,
BigDFT part of Abinit
(1 CPU core to current release 1.6.0 2012/BigDFT-Formalism.pdf and
GPU kernel) Multi-GPU support http://www.olcf.ornl.gov/wp-
content/training/electronic-structure-
2012/BigDFT-HPC-tues.pdf
Under development,
http://www.tcm.phy.cam.ac.uk/~mdt26/casino.
Casino TBD TBD Spring 2013 release
html
Multi-GPU support
http://www.olcf.ornl.gov/wp-
DBCSR (spare matrix multiply Under development
CP2K library)
2-7X
Multi-GPU support
content/training/ascc_2012/friday/ACSS_2012_V
andeVondele_s.pdf
Libqc with Rys Quadrature
1.3-1.6X, Released Next release Q4 2012.
GAMESS-US Algorithm, Hartree-Fock, MP2
2.3-2.9x HF Multi-GPU support http://www.msg.ameslab.gov/gamess/index.html
and CCSD in Q4 2012
GPU Perf compared against Multi-core x86 CPU socket.
GPU Perf benchmarked on GPU supported features
and may be a kernel to kernel perf comparison
5. Quantum Chemistry Applications
Application Features Supported GPU Perf Release Status Notes
(ss|ss) type integrals within
calculations using Hartree Fock ab
Release in 2012 http://www.ncbi.nlm.nih.gov/pubmed/215419
GAMESS-UK initio methods and density 8x
Multi-GPU support 63
functional theory. Supports
organics & inorganics.
Under development
Joint PGI, NVIDIA & Gaussian Announced Aug. 29, 2011
Gaussian Collaboration
TBD Multi-GPU support http://www.gaussian.com/g_press/nvidia_press.htm
Electrostatic poisson equation,
Released
orthonormalizing of vectors, https://wiki.fysik.dtu.dk/gpaw/devel/projects/gpu.html,
GPAW residual minimization method
8x Multi-GPU support Samuli Hakala (CSC Finland) & Chris O’Grady (SLAC)
(rmm-diis)
Under development
Schrodinger, Inc.
Jaguar Investigating GPU acceleration TBD Multi-GPU support
http://www.schrodinger.com/kb/278
Released, Version 7.8
MOLCAS CU_BLAS support 1.1x Single GPU. Additional GPU www.molcas.org
support coming in Version 8
Density-fitted MP2 (DF-MP2),
1.7-2.3X Under development www.molpro.net
MOLPRO density fitted local correlation
projected Multiple GPU Hans-Joachim Werner
methods (DF-RHF, DF-KS), DFT
GPU Perf compared against Multi-core x86 CPU socket.
GPU Perf benchmarked on GPU supported features
and may be a kernel to kernel perf comparison
6. Quantum Chemistry Applications
Features
Application GPU Perf Release Status Notes
Supported
pseudodiagonalization, full
Under Development Academic port.
MOPAC2009 diagonalization, and density 3.8-14X
Single GPU http://openmopac.net
matrix assembling
Development GPGPU benchmarks:
Triples part of Reg-CCSD(T), www.nwchem-sw.org
Release targeting March 2013
NWChem CCSD & EOMCCSD task 3-10X projected
Multiple GPUs
And http://www.olcf.ornl.gov/wp-
schedulers content/training/electronic-structure-
2012/Krishnamoorthy-ESCMA12.pdf
Octopus DFT and TDDFT TBD Released http://www.tddft.org/programs/octopus/
Density functional theory (DFT) First principles materials code that computes
Released
PEtot plane wave pseudopotential 6-10X
Multi-GPU
the behavior of the electron structures of
calculations materials
http://www.q-
Q-CHEM RI-MP2 8x-14x Released, Version 4.0
chem.com/doc_for_web/qchem_manual_4.0.pdf
GPU Perf compared against Multi-core x86 CPU socket.
GPU Perf benchmarked on GPU supported features
and may be a kernel to kernel perf comparison
7. Quantum Chemistry Applications
Features
Application GPU Perf Release Status Notes
Supported
NCSA
Released University of Illinois at Urbana-Champaign
QMCPACK Main features 3-4x
Multiple GPUs http://cms.mcc.uiuc.edu/qmcpack/index.php
/GPU_version_of_QMCPACK
Created by Irish Centre for
Quantum PWscf package: linear algebra
(matrix multiply), explicit 2.5-3.5x
Released
Version 5.0
High-End Computing
http://www.quantum-espresso.org/index.php
Espresso/PWscf computational kernels, 3D FFTs Multiple GPUs
and http://www.quantum-espresso.org/
Completely redesigned to
exploit GPU parallelism. YouTube:
44-650X vs. Released
http://youtu.be/EJODzk6RFxE?hd=1 and
TeraChem “Full GPU-based solution” GAMESS CPU Version 1.5
http://www.olcf.ornl.gov/wp-
version Multi-GPU/single node content/training/electronic-structure-
2012/Luehr-ESCMA.pdf
2x
Hybrid Hartree-Fock DFT
2 GPUs Available on request By Carnegie Mellon University
VASP functionals including exact
comparable to Multiple GPUs http://arxiv.org/pdf/1111.0716.pdf
exchange
128 CPU cores
Generalized Wang-Landau
3x
Under development GPU Perf Electronic Structure Determination Workshop 2012:
NICS
compared against Multi-core x86 CPU socket.
http://www.olcf.ornl.gov/wp-
WL-LSMS method
with 32 GPUs vs.
Multi-GPU support GPU Perf benchmarked on GPU supported features
content/training/electronic-structure-
32 (16-core) CPUs and2012/Eisenbach_OakRidge_February.pdfcomparison
may be a kernel to kernel perf
8. Viz, ―Docking‖ and Related Applications Growing
Related Features
GPU Perf Release Status Notes
Applications Supported
Visualization from Visage Imaging. Next release, 5.4, will use
3D visualization of volumetric Released, Version 5.3.3
Amira 5® data and surfaces
70x
Single GPU
GPU for general purpose processing in some functions
http://www.visageimaging.com/overview.html
High-Throughput parallel blind Virtual Screening,
Allows fast processing of large Available upon request to
BINDSURF ligand databases
100X
authors; single GPU
http://www.biomedcentral.com/1471-2105/13/S14/S13
Empirical Free Released University of Bristol
BUDE Energy Forcefield
6.5-13.4X
Single GPU http://www.bris.ac.uk/biochemistry/cpfg/bude/bude.htm
Released, Suite 2011 Schrodinger, Inc.
Core Hopping GPU accelerated application 3.75-5000X
Single and multi-GPUs. http://www.schrodinger.com/products/14/32/
Real-time shape similarity Released Open Eyes Scientific Software
FastROCS searching/comparison
800-3000X
Single and multi-GPUs. http://www.eyesopen.com/fastrocs
Lines: 460% increase
Cartoons: 1246% increase
Released, Version 1.5
PyMol Surface: 1746% increase 1700x
Single GPUs
http://pymol.org/
Spheres: 753% increase
Ribbon: 426% increase
High quality rendering, GPU Perf compared against Multi-core x86 CPU socket.
large structures (100 million atoms),
100-125X or greater GPU Perf benchmarked on GPU supported features
Visualization from University of Illinois at Urbana-Champaign
VMD analysis and visualization tasks, multiple
on kernels
Released, Version 1.9
and mayhttp://www.ks.uiuc.edu/Research/vmd/
be a kernel to kernel perf comparison
GPU support for display of molecular
9. Bioinformatics Applications
Features GPU
Application Release Status Website
Supported Speedup
Alignment of short sequencing Version 0.6.2 – 3/2012
BarraCUDA reads
6-10x
Multi-GPU, multi-node
http://seqbarracuda.sourceforge.net/
Parallel search of Smith- Version 2.0.8 – Q1/2012
CUDASW++ Waterman database
10-50x
Multi-GPU, multi-node
http://sourceforge.net/projects/cudasw/
Parallel, accurate long read Version 1.0.40 – 6/2012
CUSHAW aligner for large genomes
10x
Multiple-GPU
http://cushaw.sourceforge.net/
Protein alignment according to Version 2.2.26 – 3/2012 http://eudoxus.cheme.cmu.edu/gpublast/gpu
GPU-BLAST BLASTP
3-4x
Single GPU blast.html
Parallel local and global
Version 2.3.2 – Q1/2012 http://www.mpihmmer.org/installguideGPUH
GPU-HMMER search of Hidden Markov 60-100x
Multi-GPU, multi-node MMER.htm
Models
Scalable motif discovery Version 3.0.12 https://sites.google.com/site/yongchaosoftwa
mCUDA-MEME algorithm based on MEME
4-10x
Multi-GPU, multi-node re/mcuda-meme
Hardware and software for
Released.
SeqNFind reference assembly, blast, SW, 400x
Multi-GPU, multi-node
http://www.seqnfind.com/
HMM, de novo assembly
Version 1.11 – 5/2012
UGENE Fast short read alignment 6-8x
Multi-GPU, multi-node
http://ugene.unipro.ru/
GPU Perf compared against same or similar code running on single CPU machine
Parallel linear regression on Performance measured internally or independently
10.
11. MD Average Speedups
The blue node contains Dual E5-2687W CPUs
10 (8 Cores per CPU).
The green nodes contain Dual E5-2687W CPUs (8
Cores per CPU) and 1 or 2 NVIDIA K10, K20, or
Performance Relative to CPU Only
8 K20X GPUs.
6
4
2
0
CPU CPU + K10 CPU + K20 CPU + K20X CPU + 2x K10 CPU + 2x K20 CPU + 2x K20X
Average speedup calculated from 4 AMBER, 3 NAMD, 3 LAMMPS, and 1 GROMACS test cases.
Error bars show the maximum and minimum speedup for each hardware configuration.
12. Molecular Dynamics (MD) Applications
Features
Application GPU Perf Release Status Notes/Benchmarks
Supported
> 100 ns/day AMBER 12, GPU Revision Support 12.2
PMEMD Explicit Solvent & GB Released
AMBER Implicit Solvent
JAC NVE on 2X
Multi-GPU, multi-node
http://ambermd.org/gpus/benchmarks.
K20s htm#Benchmarks
2x C2070 equals Release C37b1;
Implicit (5x), Explicit (2x) Released
CHARMM Solvent via OpenMM
32-35x X5667
Single & multi-GPU in single node
http://www.charmm.org/news/c37b1.html#po
CPUs stjump
Two-body Forces, Link-cell Source only, Results Published
Release V 4.03
DL_POLY Pairs, Ewald SPME forces, 4x
Multi-GPU, multi-node
http://www.stfc.ac.uk/CSE/randd/ccg/softwa
Shake VV re/DL_POLY/25526.aspx
165 ns/Day
Released
GROMACS Implicit (5x), Explicit (2x) DHFR on
Multi-GPU, multi-node
Release 4.6; 1st Multi-GPU support
4X C2075s
http://lammps.sandia.gov/bench.html#deskto
Lennard-Jones, Gay-Berne, Released.
LAMMPS Tersoff & many more potentials
3.5-18x on Titan
Multi-GPU, multi-node
p and
http://lammps.sandia.gov/bench.html#titan
4.0 ns/days Released
Full electrostatics with PME and
NAMD most simulation features
F1-ATPase on 100M atom capable NAMD 2.9
1x K20X Multi-GPU, multi-node
GPU Perf compared against Multi-core x86 CPU socket.
GPU Perf benchmarked on GPU supported features
and may be a kernel to kernel perf comparison
13. New/Additional MD Applications Ramping
Features
Application GPU Perf Release Status Notes
Supported
4-29X Released, Version 1.8.51
Abalone Simulations (on 1060 GPU)
(on 1060 GPU) Single GPU
Agile Molecule, Inc.
Computation of non-valent 4-29X Released, Version 1.1.4
Ascalaph interactions (on 1060 GPU) Single GPU
Agile Molecule, Inc.
150 ns/day DHFR on Released Production bio-molecular dynamics (MD)
ACEMD Written for use only on GPUs
1x K20 Single and multi-GPUs software specially optimized to run on GPUs
Powerful distributed computing
Depends upon Released; http://folding.stanford.edu
Folding@Home molecular dynamics system;
number of GPUs GPUs and CPUs GPUs get 4X the points of CPUs
implicit solvent and folding
High-performance all-atom
Depends upon Released; http://www.gpugrid.net/
GPUGrid.net biomolecular simulations;
number of GPUs NVIDIA GPUs only
explicit solvent and binding
Simple fluids and binary
mixtures (pair potentials, high- Up to 66x on 2090 Released, Version 0.2.0 http://halmd.org/benchmarks.html#supercool
HALMD precision NVE and NVT, dynamic vs. 1 CPU core Single GPU ed-binary-mixture-kob-andersen
correlations)
Kepler 2X faster Released, Version 0.11.2 http://codeblue.umich.edu/hoomd-blue/
HOOMD-Blue Written for use only on GPUs
than Fermi Single and multi-GPU on 1 node Multi-GPU w/ MPI in March 2013
Implicit: 127-213
Implicit and explicit solvent, Released Version 4.1.1 Library and application for molecular dynamics
OpenMM custom forces
ns/day Explicit: 18-
Multi-GPU on high-performance
55 ns/day DHFR
GPU Perf compared against Multi-core x86 CPU socket.
GPU Perf benchmarked on GPU supported features
and may be a kernel to kernel perf comparison
14. Built from Ground Up for GPUs
Computational Chemistry
Study disease & discover drugs
What
Predict drug and protein interactions
GPU READY
Speed of simulations is critical APPLICATIONS
Why Enables study of:
Abalone
ACEMD
Longer timeframes AMBER
Larger systems DL_PLOY
More simulations GAMESS
How GPUs increase throughput & accelerate simulations
GROMACS
LAMMPS
NAMD
AMBER 11 Application NWChem
4.6x performance increase with 2 GPUs with Q-CHEM
only a 54% added cost* Quantum Espresso
TeraChem
• AMBER 11 Cellulose NPT on 2x E5670 CPUS + 2x Tesla C2090s (per node) vs. 2xcE5670 CPUs (per node)
• Cost of CPU node assumed to be $9333. Cost of adding two (2) 2090s to single node is assumed to be $5333
15. AMBER 12
GPU Support Revision 12.2
1/22/2013
15
16. Kepler - Our Fastest Family of GPUs Yet
30.00
Factor IX Running AMBER 12 GPU Support Revision 12.1
25.39
25.00 The blue node contains Dual E5-2687W CPUs
22.44 (8 Cores per CPU).
7.4x The green nodes contain Dual E5-2687W CPUs (8
20.00 18.90 Cores per CPU) and either 1x NVIDIA M2090, 1x K10
Nanoseconds / Day
or 1x K20 for the GPU
6.6x
15.00
11.85 5.6x
10.00
3.5x
5.00
3.42
0.00
Factor IX
1 CPU Node 1 CPU Node + 1 CPU Node + K10 1 CPU Node + K20 1 CPU Node + K20X
M2090
GPU speedup/throughput increased from 3.5x (with M2090) to 7.4x (with K20X)
when compared to a CPU only node
16
17. K10 Accelerates Simulations of All Sizes
30
Running AMBER 12 GPU Support Revision 12.1
The blue node contains Dual E5-2687W CPUs
25 24.00 (8 Cores per CPU).
Speedup Compared to CPU Only
The green nodes contain Dual E5-2687W CPUs (8
19.98
20 Cores per CPU) and 1x NVIDIA K10 GPU
15
10
5.50 5.53 5.04
5
2.00
0
CPU TRPcage JAC NVE Factor IX NVE Cellulose NVE Myoglobin Nucleosome
All Molecules GB PME PME PME GB GB
Gain 24x performance by adding just 1 GPU
Nucleosome
when compared to dual CPU performance
18. K20 Accelerates Simulations of All Sizes
30.00
28.00
Running AMBER 12 GPU Support Revision 12.1
25.56 SPFP with CUDA 4.2.9 ECC Off
25.00
The blue node contains 2x Intel E5-2687W CPUs
Speedup Compared to CPU Only
(8 Cores per CPU)
20.00
Each green nodes contains 2x Intel E5-2687W
CPUs (8 Cores per CPU) plus 1x NVIDIA K20 GPUs
15.00
10.00
7.28
6.50 6.56
5.00
2.66
1.00
0.00
CPU All TRPcage GB JAC NVE PME Factor IX NVE Cellulose NVE Myoglobin GB Nucleosome
Molecules PME PME GB
Gain 28x throughput/performance by adding just one K20 GPU
Nucleosome
when compared to dual CPU performance
18 AMBER Benchmark Report, Revision 2.0, dated Nov. 5, 2012
19. K20X Accelerates Simulations of All Sizes
35
31.30 Running AMBER 12 GPU Support Revision 12.1
30 28.59
The blue node contains Dual E5-2687W CPUs
(8 Cores per CPU).
Speedup Compared to CPU Only
25
The green nodes contain Dual E5-2687W CPUs (8
Cores per CPU) and 1x NVIDIA K20X GPU
20
15
10 8.30
7.15 7.43
5
2.79
0
CPU TRPcage JAC NVE Factor IX NVE Cellulose NVE Myoglobin Nucleosome
All Molecules GB PME PME PME GB GB
Gain 31x performance by adding just one K20X GPU
Nucleosome
when compared to dual CPU performance
19 AMBER Benchmark Report, Revision 2.0, dated Nov. 5, 2012
20. K10 Strong Scaling over Nodes
Cellulose 408K Atoms (NPT) Running AMBER 12 with CUDA 4.2 ECC Off
6 The blue nodes contains 2x Intel X5670
CPUs (6 Cores per CPU)
5 The green nodes contains 2x Intel X5670
CPUs (6 Cores per CPU) plus 2x NVIDIA
K10 GPUs
4
Nanoseconds / Day
2.4x
3
CPU Only
3.6x With GPU
2
5.1x
1
Cellulose
0
1 2 4
Number of Nodes
GPUs significantly outperform CPUs while scaling over multiple nodes
21. Kepler – Universally Faster
9
Running AMBER 12 GPU Support Revision 12.1
8 The CPU Only node contains Dual E5-2687W CPUs
(8 Cores per CPU).
Speedups Compared to CPU Only
7
The Kepler nodes contain Dual E5-2687W CPUs (8
6 Cores per CPU) and 1x NVIDIA K10, K20, or K20X
GPUs
5
JAC
4 Factor IX
Cellulose
3
2
1
0
CPU Only CPU + K10 CPU + K20 CPU + K20X Cellulose
The Kepler GPUs accelerated all simulations, up to 8x
22. K10 Extreme Performance
Running AMBER 12 GPU Support Revision 12.1
JAC 23K Atoms (NVE)
120 The blue node contains Dual E5-2687W CPUs
(8 Cores per CPU).
97.99 The green node contain Dual E5-2687W CPUs (8
100
Cores per CPU) and 2x NVIDIA K10 GPUs
Nanoseconds / Day
80
60
40
20
12.47
0
1 Node 1 Node
DHFR
Gain 7.8X performance by adding just 2 GPUs
when compared to dual CPU performance
23. K20 Extreme Performance
DHRF JAC 23K Atoms (NVE) Running AMBER 12 GPU Support Revision 12.1
SPFP with CUDA 4.2.9 ECC Off
120
The blue node contains 2x Intel E5-2687W CPUs
95.59 (8 Cores per CPU)
100
Each green node contains 2x Intel E5-2687W
CPUs (8 Cores per CPU) plus 2x NVIDIA K20 GPU
Nanoseconds / Day
80
60
40
20 12.47
0
1 Node 1 Node
DHFR
Gain > 7.5X throughput/performance by adding just 2 K20 GPUs
when compared to dual CPU performance
23 AMBER Benchmark Report, Revision 2.0, dated Nov. 5, 2012
24. Replace 8 Nodes with 1 K20 GPU
90.00 35000
$32,000.00
Running AMBER 12 GPU Support Revision 12.1
81.09 SPFP with CUDA 4.2.9 ECC Off
80.00
30000
The eight (8) blue nodes each contain 2x Intel
70.00 E5-2687W CPUs (8 Cores per CPU)
65.00
25000
Each green node contains 2x Intel E5-2687W
60.00
CPUs (8 Cores per CPU) plus 1x NVIDIA K20
GPU
50.00 20000
Note: Typical CPU and GPU node pricing used.
40.00 Pricing may vary depending on node
15000
configuration. Contact your preferred HW vendor
for actual pricing.
30.00
10000
20.00 $6,500.00
5000
10.00
0.00 0
Nanoseconds/Day Cost
DHFR
Cut down simulation costs to ¼ and gain higher performance
24 AMBER Benchmark Report, Revision 2.0, dated Nov. 5, 2012
25. Replace 7 Nodes with 1 K10 GPU
Performance on JAC NVE Cost Running AMBER 12 GPU Support Revision 12.1
SPFP with CUDA 4.2.9 ECC Off
80 $35,000.00
$32,000
The eight (8) blue nodes each contain 2x Intel
70 $30,000.00 E5-2687W CPUs (8 Cores per CPU)
60
The green node contains 2x Intel E5-2687W
$25,000.00 CPUs (8 Cores per CPU) plus 1x NVIDIA K10
Nanoseconds / Day
GPU
50
$20,000.00 Note: Typical CPU and GPU node pricing used.
40 Pricing may vary depending on node
$15,000.00 configuration. Contact your preferred HW vendor
30 for actual pricing.
$10,000.00
20 $7,000
10 $5,000.00
0 $0.00
CPU Only GPU Enabled CPU Only GPU Enabled
DHFR
Cut down simulation costs to ¼ and increase performance by 70%
25 AMBER Benchmark Report, Revision 2.0, dated Nov. 5, 2012
26. Extra CPUs decrease Performance
Cellulose NVE Running AMBER 12 GPU Support Revision 12.1
8 The orange bars contains one E5-2687W CPUs
(8 Cores per CPU).
7
The blue bars contain Dual E5-2687W CPUs (8
6 Cores per CPU)
Nanoseconds / Day
2 CPUs 2 GPUs
1 CPU 2 GPUs
5
4 1 E5-2687W
2 E5-2687W
3
2
1
0 Cellulose
CPU Only CPU with dual K20s
When used with GPUs, dual CPU sockets perform worse than single CPU sockets.
27. Kepler - Greener Science
Running AMBER 12 GPU Support Revision 12.1
Energy used in simulating 1 ns of DHFR JAC
2500 The blue node contains Dual E5-2687W CPUs
(150W each, 8 Cores per CPU).
The green nodes contain Dual E5-2687W CPUs (8
2000 Cores per CPU) and 1x NVIDIA K10, K20, or K20X
Lower is better GPUs (235W each).
Energy Expended (kJ)
1500
Energy Expended
1000
= Power x Time
500
0
CPU Only CPU + K10 CPU + K20 CPU + K20X
The GPU Accelerated systems use 65-75% less energy
28. Recommended GPU Node Configuration for
AMBER Computational Chemistry
Workstation or Single Node Configuration
# of CPU sockets 2
Cores per CPU socket 4+ (1 CPU core drives 1 GPU)
CPU speed (Ghz) 2.66+
System memory per node (GB) 16
Kepler K10, K20, K20X
GPUs
Fermi M2090, M2075, C2075
1-2
# of GPUs per CPU socket (4 GPUs on 1 socket is good
to do 4 fast serial GPU runs)
GPU memory preference (GB) 6
GPU to CPU connection PCIe 2.0 16x or higher
Server storage 2 TB
28 Scale to multiple nodes with same single node configuration AMBER Benchmark Report, Revision 2.0, dated Nov. 5, 2012
29. Benefits of GPU AMBER Accelerated Computing
Faster than CPU only systems in all tests
Most major compute intensive aspects of classical MD ported
Large performance boost with marginal price increase
Energy usage cut by more than half
GPUs scale well within a node and over multiple nodes
K20 GPU is our fastest and lowest power high performance GPU yet
Try GPU accelerated AMBER for free – www.nvidia.com/GPUTestDrive
29 AMBER Benchmark Report, Revision 2.0, dated Nov. 5, 2012
31. Kepler - Our Fastest Family of GPUs Yet
4.50
ApoA1 Running NAMD version 2.9
4.00
4.00 The blue node contains Dual E5-2687W CPUs
3.57 (8 Cores per CPU).
3.45
3.50
The green nodes contain Dual E5-2687W CPUs (8
2.9x Cores per CPU) and either 1x NVIDIA M2090, 1x K10
3.00 or 1x K20 for the GPU
Nanoseconds/Day
2.63
2.6x
2.50
2.5x
2.00
1.50 1.37 1.9x
1.00
0.50
0.00
1 CPU Node 1 CPU Node + 1 CPU Node + K10 1 CPU Node + K20 1 CPU Node + K20X
Apolipoprotein A1
M2090
GPU speedup/throughput increased from 1.9x (with M2090) to 2.9x (with K20X)
when compared to a CPU only node
31 NAMD Benchmark Report, Revision 2.0, dated Nov. 5, 2012
32. Accelerates Simulations of All Sizes
3
Running NAMD 2.9 with CUDA 4.0 ECC Off
2.7
2.6
The blue node contains 2x Intel E5-2687W CPUs
2.5 2.4
(8 Cores per CPU)
Speedup Compared to CPU Only
Each green node contains 2x Intel E5-2687W
2 CPUs (8 Cores per CPU) plus 1x NVIDIA K20 GPUs
1.5
1
0.5
0
CPU All Molecules ApoA1 F1-ATPase STMV
Apolipoprotein A1
Gain 2.5x throughput/performance by adding just 1 GPU
when compared to dual CPU performance
32 NAMD Benchmark Report, Revision 2.0, dated Nov. 5, 2012
33. Kepler – Universally Faster
6
Running NAMD version 2.9
The CPU Only node contains Dual E5-2687W CPUs
5 (8 Cores per CPU).
Speedup Compared to CPU Only
5.1x The Kepler nodes contain Dual E5-2687W CPUs (8
4 4.7x Cores per CPU) and 1 or two NVIDIA K10, K20, or
K20X GPUs.
4.3x
F1-ATPase
3
ApoA1
STMV
2.9x
2
2.6x
2.4x
1
0
CPU Only 1x K10 1x K20 1x K20X 2x K10 2x K20 2x K20X
F1-ATPase
| Kepler nodes use Dual CPUs |
The Kepler GPUs accelerate all simulations, up to 5x
Average acceleration printed in bars
34. Outstanding Strong Scaling with Multi-STMV
Running NAMD version 2.9
Each blue XE6 CPU node contains 1x AMD
100 STMV on Hundreds of Nodes 1600 Opteron (16 Cores per CPU).
1.2
Fermi XK6 Each green XK6 CPU+GPU node contains
1x AMD 1600 Opteron (16 Cores per CPU)
1 and an additional 1x NVIDIA X2090 GPU.
CPU XK6
2.7x
Nanoseconds / Day
0.8
2.9x
0.6
0.4
0.2
3.6x
3.8x Concatenation of 100
0 Satellite Tobacco Mosaic Virus
32 64 128 256 512 640 768
# of Nodes
Accelerate your science by 2.7-3.8x when compared to CPU-based supercomputers
35. Replace 3 Nodes with 1 2090 GPU
Running NAMD version 2.9
Each blue node contains 2x Intel Xeon X5550 CPUs
F1-ATPase (4 Cores per CPU).
4 CPU Nodes
0.8 9000
0.74 The green node contains 2x Intel Xeon X5550 CPUs
$8,000
1 CPU Node +8000 (4 Cores per CPU) and 1x NVIDIA M2090 GPU
0.7 1x M2090 GPUs
0.63
7000 Note: Typical CPU and GPU node pricing used. Pricing
0.6 may vary depending on node configuration. Contact your
6000 preferred HW vendor for actual pricing.
0.5
5000
0.4 $4,000
4000
0.3
3000
0.2
2000
0.1 1000
0 0
Nanoseconds/Day Cost
Speedup of 1.2x for 50% the cost F1-ATPase
36. K20 - Greener: Twice The Science Per Watt
1200000
Energy Used in Simulating 1 Nanosecond of ApoA1
Running NAMD version 2.9
1000000 Each blue node contains Dual E5-2687W
CPUs (95W, 4 Cores per CPU).
Each green node contains 2x Intel Xeon X5550
Energy Expended (kJ)
800000
CPUs (95W, 4 Cores per CPU) and 2x NVIDIA
Lower is better K20 GPUs (225W per GPU)
600000
Energy Expended
400000
= Power x Time
200000
0
1 Node 1 Node + 2x K20
Cut down energy usage by ½ with GPUs
36 NAMD Benchmark Report, Revision 2.0, dated Nov. 5, 2012
37. Kepler - Greener: Twice The Science/Joule
Energy used in simulating 1 ns of SMTV
250000
Running NAMD version 2.9
The blue node contains Dual E5-2687W CPUs
200000 (150W each, 8 Cores per CPU).
Energy Expended (kJ)
Lower is better The green nodes contain Dual E5-2687W CPUs
(8 Cores per CPU) and 2x NVIDIA K10, K20, or
150000
K20X GPUs (235W each).
Energy Expended
100000
= Power x Time
50000
0
CPU Only CPU + 2 K10s CPU + 2 K20s CPU + 2 K20Xs
Cut down energy usage by ½ with GPUs
Satellite Tobacco Mosaic Virus
38. Recommended GPU Node Configuration for
NAMD Computational Chemistry
Workstation or Single Node Configuration
# of CPU sockets 2
Cores per CPU socket 6+
CPU speed (Ghz) 2.66+
System memory per socket (GB) 32
Kepler K10, K20, K20X
GPUs
Fermi M2090, M2075, C2075
# of GPUs per CPU socket 1-2
GPU memory preference (GB) 6
GPU to CPU connection PCIe 2.0 or higher
Server storage 500 GB or higher
Network configuration Gemini, InfiniBand
38 Scale to multiple nodes with same single node configuration NAMD Benchmark Report, Revision 2.0, dated Nov. 5, 2012
39. Summary/Conclusions
Benefits of GPU Accelerated Computing
Faster than CPU only systems in all tests
Large performance boost with small marginal price increase
Energy usage cut in half
GPUs scale very well within a node and over multiple nodes
Tesla K20 GPU is our fastest and lowest power high performance GPU to date
Try GPU accelerated NAMD for free – www.nvidia.com/GPUTestDrive
39 NAMD Benchmark Report, Revision 2.0, dated Nov. 5, 2012
41. More Science for Your Money
Embedded Atom Model Blue node uses 2x E5-2687W (8 Cores
6 and 150W per CPU).
5.5
Green nodes have 2x E5-2687W and 1
5 or 2 NVIDIA K10, K20, or K20X GPUs (235W).
Speedup Compared to CPU Only
4.5
4
3.3
2.92
3
2.47
2 1.7
1
0
CPU Only CPU + 1x CPU + 1x CPU + 1x CPU + 2x CPU + 2x CPU + 2x
K10 K20 K20X K10 K20 K20X
Experience performance increases of up to 5.5x with Kepler GPU nodes.
42. K20X, the Fastest GPU Yet
7 Blue node uses 2x E5-2687W (8 Cores
and 150W per CPU).
6
Green nodes have 2x E5-2687W and 2
NVIDIA M2090s or K20X GPUs (235W).
Speedup Relative to CPU Alone
5
4
3
2
1
0
CPU Only CPU + 2x M2090 CPU + K20X CPU + 2x K20X
Experience performance increases of up to 6.2x with Kepler GPU nodes.
One K20X performs as well as two M2090s
43. Get a CPU Rebate to Fund Part of Your GPU Budget
Acceleration in Loop Time Computation by
Additional GPUs
Running NAMD version 2.9
20
18.2
The blue node contains Dual X5670 CPUs
18
(6 Cores per CPU).
16
The green nodes contain Dual X5570 CPUs
Normalized to CPU Only
14 12.9 (4 Cores per CPU) and 1-4 NVIDIA M2090
GPUs.
12
9.88
10
8
6 5.31
4
2
0
1 Node 1 Node + 1x M20901 Node + 2x M20901 Node + 3x M20901 Node + 4x M2090
Increase performance 18x when compared to CPU-only nodes
Cheaper CPUs used with GPUs AND still faster overall performance when
compared to more expensive CPUs!
44. Excellent Strong Scaling on Large Clusters
LAMMPS Gay-Berne 134M Atoms
600
GPU Accelerated XK6
500
CPU only XE6
Loop Time (seconds)
400
3.55x
300
200
3.48x
3.45x
100
0
300 400 500 600 700 800 900
Nodes
From 300-900 nodes, the NVIDIA GPU-powered XK6 maintained 3.5x performance
compared to XE6 CPU nodes
Each blue Cray XE6 Nodes have 2x AMD Opteron CPUs (16 Cores per CPU)
Each green Cray XK6 Node has 1x AMD Opteron 1600 CPU (16 Cores per CPU) and 1x NVIDIA X2090
45. GPUs Sustain 5x Performance for Weak Scaling
Weak Scaling with 32K Atoms per Node
45
40
Loop Time (seconds) 35
30
6.7x 5.8x 4.8x
25
20
15
10
5
0
1 8 27 64 125 216 343 512 729
Nodes
Performance of 4.8x-6.7x with GPU-accelerated nodes
when compared to CPUs alone
Each blue Cray XE6 Node have 2x AMD Opteron CPUs (16 Cores per CPU)
Each green Cray XK6 Node has 1x AMD Opteron 1600 CPU (16 Core per CPU) and 1x NVIDIA X2090
46. Faster, Greener — Worth It!
Energy Consumed in one loop of EAM
140
120 GPU-accelerated computing uses
Lower is better 53% less energy than CPU only
100
Energy Expended (kJ)
80
60
Energy Expended = Power x Time
Power calculated by combining the component’s TDPs
40
20
0
1 Node 1 Node + 1 K20X 1 Node + 2x K20X
Blue node uses 2x E5-2687W (8 Cores and 150W per CPU) and CUDA 4.2.9.
Green nodes have 2x E5-2687W and 1 or 2 NVIDIA K20X GPUs (235W) running CUDA 5.0.36.
Try GPU accelerated LAMMPS for free – www.nvidia.com/GPUTestDrive
47. Molecular Dynamics with LAMMPS
on a Hybrid Cray Supercomputer
W. Michael Brown
National Center for Computational Sciences
Oak Ridge National Laboratory
NVIDIA Technology Theater, Supercomputing 2012
November 14, 2012
Note the rise of GPU only applications and GPU-grid applications. This indicates that GPUs are a sweet spot for MD.
Due to the great interest in speeding up quantum chemistry applications, NVIDIA has begin reaching out to these developers to see how it may assist their development on GPUs. These developers either have active GPU development projects or have released application.
Due to the great interest in speeding up quantum chemistry applications, NVIDIA has begin reaching out to these developers to see how it may assist their development on GPUs. These developers either have active GPU development projects or have released application.
Due to the great interest in speeding up quantum chemistry applications, NVIDIA has begin reaching out to these developers to see how it may assist their development on GPUs. These developers either have active GPU development projects or have released application.
Due to the great interest in speeding up quantum chemistry applications, NVIDIA has begin reaching out to these developers to see how it may assist their development on GPUs. These developers either have active GPU development projects or have released application.
Nodes, box size, atoms, cpu time, cpu+gpu time, gpu speedup11x1x13276842.26.336.67 x82x2x226214441.86.736.21 x273x3x388473641.56.866.05 x644x4x4209715241.57.185.78 x1255x5x5409600041.47.185.77 x2166x6x67077888427.665.48 x3437x7x71123942441.98.345.02 x5128x8x81677721642.38.415.03 x7299x9x92388787242.58.924.76 x
44 cpus2 cpu 1 gpu2 cpu 2 gpuNs/day6025.342.4price6000030004000
44 cpus2 cpu 1 gpu2 cpu 2 gpuNs/day 6025.342.4price6000030004000scaled price1 0.050.066666667perf/price18.43333333310.6
64 cpu2 cpu 1 gpu2 cpu 2 gpuNs/day 318.915.1tdp6080428666sec/ns2787.09679707.86515721.8543energy/ns16945.5484154.96623810.7549
Due to the great interest in speeding up quantum chemistry applications, NVIDIA has begin reaching out to these developers to see how it may assist their development on GPUs. These developers either have active GPU development projects or have released application.
Due to the great interest in speeding up quantum chemistry applications, NVIDIA has begin reaching out to these developers to see how it may assist their development on GPUs. These developers either have active GPU development projects or have released application.
Due to the great interest in speeding up quantum chemistry applications, NVIDIA has begin reaching out to these developers to see how it may assist their development on GPUs. These developers either have active GPU development projects or have released application.
Due to the great interest in speeding up quantum chemistry applications, NVIDIA has begin reaching out to these developers to see how it may assist their development on GPUs. These developers either have active GPU development projects or have released application.
Test case not specified in perf lab run
Test case not specified in perf lab run
I am here today to talk to you about the value of seamlessly adding GPUs to the computer which you use to run Quantum Espresso/PWscf and achieving phenomenal performance improvements. This small incremental investment will yield significant performance payback.What is Quantum Espresso/PWscf:-A set of programs used to calculate the electron configuration of atoms or molecules-Uses plane wave basis sets and quantum mechanical principles-Highly compute intensiveBenefits of GPU-accelerated Computing:-Faster than CPU only systems in all tests-Performance boost much larger than marginal price increase-Power consumption more than halved in all simulations-GPUs scale very well on clusters with dozens of nodes, and beyond===============Price assumes FERMI workstation ~$4000 and C2050 $1000shilu 3 water on calcite6 OpenMP CPU nodes 1025 15606 OpenMP CPU nodes 1 gpu 275 480FERMI (ICHEC): assembled workstationCPU: 2 * Intel Xeon X5650 (6-core), 24 GByte RAMGPU: 2 x C2050, GTX480, C2075SW: CUDA 4.1, Intel compilers
ausurf k ptausurf gamma 6 OpenMP CPU nodes 7100 s 7000 s 6 OpenMP CPU nodes 1 gpu 2350 s 2000 s FERMI (ICHEC): assembled workstationCPU: 2 * Intel Xeon X5650 (6-core), 24 GByte RAMGPU: 2 x C2050, GTX480, C2075SW: CUDA 4.1, Intel compilers
CPU: Intel X5550, TDP of 95W, priced at $350GPU: NVIDIA M2070, TDP of 225W, priced at $2000PLX (CINECA): IBM iDataPlex DX360M3, 264 GPU nodesCPU: 2 Intel Westmere X5550 (6-core), 48 GByte RAMGPU: 2 x M2070SW: CUDA4.0, Intel compilers, PGI (11.x)
National average 9.83 cents/kWh kWh/sim tests/year $/test yearly energy billCPU 42.372357 4.16 $9816GPU/CPU 23.214 2400 2.28 $5476CPU: Intel X5550, TDP of 95W, priced at $350GPU: NVIDIA M2070, TDP of 225W, priced at $2000PLX (CINECA): IBM iDataPlex DX360M3, 264 GPU nodesCPU: 2 Intel Westmere X5550 (6-core), 48 GByte RAMGPU: 2 x M2070SW: CUDA4.0, Intel compilers, PGI (11.x)
total # of core2 (16) 4 (32)6 (48)8 (64)10 (80)12 (96)14 (112)time (s) cpu3100016500110009500750060005500time gpu+cpu12500700050004500350030002500SPEEDUP2.482.3571432.22.1111112.14285722.2STONEY (ICHEC): Bull Novascale R422-E2, 24 GPU nodesCPU: 2 Intel (Nehalem EP) Xeon X5560, 48 GByte RAMGPU: 2 x M2090SW: CUDA 4.0, Intel compilers
# of cores4(48)8(96)12(144)16(192)24(288)32(384)44(528)time cpu3925265025252450174012901337time gpu+cpu242514371075900737675637SPEEDUP1.6185571.844122.3488372.7222222.3609231.9111112.098901PLX (CINECA): IBM iDataPlex DX360M3, 264 GPU nodesCPU: 2 Intel Westmere (6-core), 48 GByte RAMGPU: 2 x M2070SW: CUDA4.0, Intel compilers, PGI (11.x)
I am here today to talk to you about the value of seamlessly adding GPUs to the computer which you use to run TeraChem and achieving phenomenal performance improvements. This small incremental investment will yield significant performance payback.Benefits of GPU Acceleration with TeraChem-Compete with Supercomputers-More powerful hardware-Significantly lower energy usage
Before we end this session I would like to tell you about GPU Test Drive. It is an excellent resource for computational chemistry researchers such as yourself to evaluate benefits of GPU computing in speeding up your simulations. Most importantly it is free.NVIDIA along with its partners is offering access to remotely hosted GPU cluster. You can run applications such as AMBER and NAMD to find out how your models speed up. You can also try code that you have developed to run on GPU and see how it scales on a 8 GPU cluster. All you need to do is sign up and log in – it is really that easy! We have several partners who are demonstrating the GPU Test Drive on the GTC show floor. Please plan on visiting them.Sign up forms have been given out. If you are interested please fill them out and return them to me.