Updated: February 4, 2013
Molecular Dynamics (MD) Applications

AMBER
  Features supported: PMEMD explicit solvent & GB implicit solvent
  GPU perf: > 100 ns/day (JAC NVE on 2x K20s)
  Release status: Released; multi-GPU, multi-node
  Notes/benchmarks: AMBER 12, GPU support revision 12.2; http://ambermd.org/gpus/benchmarks.htm#Benchmarks

CHARMM
  Features supported: Implicit (5x) and explicit (2x) solvent via OpenMM
  GPU perf: 2x C2070 equals 32-35x X5667 CPUs
  Release status: Released; single and multi-GPU in a single node
  Notes/benchmarks: Release C37b1; http://www.charmm.org/news/c37b1.html#postjump

DL_POLY
  Features supported: Two-body forces, link-cell pairs, Ewald SPME forces, Shake VV
  GPU perf: 4x
  Release status: Release V 4.03; multi-GPU, multi-node
  Notes/benchmarks: Source only, results published; http://www.stfc.ac.uk/CSE/randd/ccg/software/DL_POLY/25526.aspx

GROMACS
  Features supported: Implicit (5x) and explicit (2x) solvent
  GPU perf: 165 ns/day (DHFR on 4x C2075s)
  Release status: Released; multi-GPU, multi-node
  Notes/benchmarks: Release 4.6; first multi-GPU support

LAMMPS
  Features supported: Lennard-Jones, Gay-Berne, Tersoff, and many more potentials
  GPU perf: 3.5-18x on Titan
  Release status: Released; multi-GPU, multi-node
  Notes/benchmarks: http://lammps.sandia.gov/bench.html#desktop and http://lammps.sandia.gov/bench.html#titan

NAMD
  Features supported: Full electrostatics with PME and most simulation features
  GPU perf: 4.0 ns/day (F1-ATPase on 1x K20X)
  Release status: Released; 100M-atom capable; multi-GPU, multi-node
  Notes/benchmarks: NAMD 2.9

GPU perf compared against a multi-core x86 CPU socket. GPU perf benchmarked on GPU-supported features and may be a kernel-to-kernel perf comparison.
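Pair potentials such as the Lennard-Jones form listed for LAMMPS (and HALMD below) are the inner loop these GPU kernels accelerate. A minimal sketch of the standard 12-6 form in reduced units (illustrative only; not the implementation of any package above):

```python
def lennard_jones(r, epsilon=1.0, sigma=1.0):
    """12-6 Lennard-Jones pair energy: U(r) = 4*eps*((sigma/r)**12 - (sigma/r)**6)."""
    sr6 = (sigma / r) ** 6
    return 4.0 * epsilon * (sr6 * sr6 - sr6)

# The well minimum sits at r = 2**(1/6) * sigma with depth -epsilon.
print(lennard_jones(2.0 ** (1.0 / 6.0)))  # -> -1.0
```

An MD engine evaluates this (plus its derivative, the force) for every neighboring pair each timestep, which is why the pair loop dominates runtime and maps well to GPUs.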
New/Additional MD Applications Ramping

Abalone
  Features supported: Simulations
  GPU perf: 4-29x (on 1060 GPU)
  Release status: Released, Version 1.8.51; single GPU
  Notes: Agile Molecule, Inc.

Ascalaph
  Features supported: Computation of non-valent interactions
  GPU perf: 4-29x (on 1060 GPU)
  Release status: Released, Version 1.1.4; single GPU
  Notes: Agile Molecule, Inc.

ACEMD
  Features supported: Written for use only on GPUs
  GPU perf: 150 ns/day (DHFR on 1x K20)
  Release status: Released; single and multi-GPU
  Notes: Production bio-molecular dynamics (MD) software specially optimized to run on GPUs

Folding@Home
  Features supported: Powerful distributed-computing molecular dynamics system; implicit solvent and folding
  GPU perf: Depends upon number of GPUs
  Release status: Released; GPUs and CPUs
  Notes: http://folding.stanford.edu; GPUs get 4x the points of CPUs

GPUGrid.net
  Features supported: High-performance all-atom biomolecular simulations; explicit solvent and binding
  GPU perf: Depends upon number of GPUs
  Release status: Released; NVIDIA GPUs only
  Notes: http://www.gpugrid.net/

HALMD
  Features supported: Simple fluids and binary mixtures (pair potentials, high-precision NVE and NVT, dynamic correlations)
  GPU perf: Up to 66x on 2090 vs. 1 CPU core
  Release status: Released, Version 0.2.0; single GPU
  Notes: http://halmd.org/benchmarks.html#supercooled-binary-mixture-kob-andersen

HOOMD-Blue
  Features supported: Written for use only on GPUs
  GPU perf: Kepler 2x faster than Fermi
  Release status: Released, Version 0.11.2; single and multi-GPU on one node
  Notes: http://codeblue.umich.edu/hoomd-blue/; multi-GPU with MPI in March 2013

OpenMM
  Features supported: Implicit and explicit solvent, custom forces
  GPU perf: Implicit: 127-213 ns/day; explicit: 18-55 ns/day (DHFR)
  Release status: Released, Version 4.1.1; multi-GPU
  Notes: Library and application for molecular dynamics on high-performance hardware
Quantum Chemistry Applications

Abinit
  Features supported: Local Hamiltonian, non-local Hamiltonian, LOBPCG algorithm, diagonalization/orthogonalization
  GPU perf: 1.3-2.7x
  Release status: Released, Version 7.0.5; multi-GPU support
  Notes: www.abinit.org

ACES III
  Features supported: Integrating GPU scheduling into the SIAL programming language and SIP runtime environment
  GPU perf: 10x on kernels
  Release status: Under development; multi-GPU support
  Notes: http://www.olcf.ornl.gov/wp-content/training/electronic-structure-2012/deumens_ESaccel_2012.pdf

ADF
  Features supported: Fock matrix, Hessians
  GPU perf: TBD
  Release status: Pilot project completed, under development; multi-GPU support
  Notes: www.scm.com

BigDFT
  Features supported: DFT; Daubechies wavelets, part of Abinit
  GPU perf: 5-25x (1 CPU core to GPU kernel)
  Release status: Released June 2009, current release 1.6.0; multi-GPU support
  Notes: http://inac.cea.fr/L_Sim/BigDFT/news.html, http://www.olcf.ornl.gov/wp-content/training/electronic-structure-2012/BigDFT-Formalism.pdf, and http://www.olcf.ornl.gov/wp-content/training/electronic-structure-2012/BigDFT-HPC-tues.pdf

Casino
  Features supported: TBD
  GPU perf: TBD
  Release status: Under development, Spring 2013 release; multi-GPU support
  Notes: http://www.tcm.phy.cam.ac.uk/~mdt26/casino.html

CP2K
  Features supported: DBCSR (sparse matrix multiply library)
  GPU perf: 2-7x
  Release status: Under development; multi-GPU support
  Notes: http://www.olcf.ornl.gov/wp-content/training/ascc_2012/friday/ACSS_2012_VandeVondele_s.pdf

GAMESS-US
  Features supported: Libqc with Rys quadrature algorithm, Hartree-Fock; MP2 and CCSD in Q4 2012
  GPU perf: 1.3-1.6x; 2.3-2.9x HF
  Release status: Released; multi-GPU support
  Notes: Next release Q4 2012. http://www.msg.ameslab.gov/gamess/index.html

GAMESS-UK
  Features supported: (ss|ss)-type integrals within calculations using Hartree-Fock ab initio methods and density functional theory. Supports organics and inorganics.
  GPU perf: 8x
  Release status: Release in 2012; multi-GPU support
  Notes: http://www.ncbi.nlm.nih.gov/pubmed/21541963

Gaussian
  Features supported: Joint PGI, NVIDIA & Gaussian collaboration
  GPU perf: TBD
  Release status: Under development; multi-GPU support
  Notes: Announced Aug. 29, 2011; http://www.gaussian.com/g_press/nvidia_press.htm

GPAW
  Features supported: Electrostatic Poisson equation, orthonormalization of vectors, residual minimization method (RMM-DIIS)
  GPU perf: 8x
  Release status: Released; multi-GPU support
  Notes: https://wiki.fysik.dtu.dk/gpaw/devel/projects/gpu.html; Samuli Hakala (CSC Finland) & Chris O'Grady (SLAC)

Jaguar
  Features supported: Investigating GPU acceleration
  GPU perf: TBD
  Release status: Under development; multi-GPU support
  Notes: Schrodinger, Inc.; http://www.schrodinger.com/kb/278

MOLCAS
  Features supported: cuBLAS support
  GPU perf: 1.1x
  Release status: Released, Version 7.8; single GPU, additional GPU support coming in Version 8
  Notes: www.molcas.org

MOLPRO
  Features supported: Density-fitted MP2 (DF-MP2), density-fitted local correlation methods (DF-RHF, DF-KS), DFT
  GPU perf: 1.7-2.3x projected
  Release status: Under development; multiple GPUs
  Notes: www.molpro.net; Hans-Joachim Werner

MOPAC2009
  Features supported: Pseudodiagonalization, full diagonalization, and density matrix assembly
  GPU perf: 3.8-14x
  Release status: Under development; single GPU
  Notes: Academic port. http://openmopac.net

NWChem
  Features supported: Triples part of Reg-CCSD(T), CCSD & EOMCCSD task schedulers
  GPU perf: 3-10x projected
  Release status: Release targeting March 2013; multiple GPUs
  Notes: Development GPGPU benchmarks: www.nwchem-sw.org and http://www.olcf.ornl.gov/wp-content/training/electronic-structure-2012/Krishnamoorthy-ESCMA12.pdf

Octopus
  Features supported: DFT and TDDFT
  GPU perf: TBD
  Release status: Released
  Notes: http://www.tddft.org/programs/octopus/

PEtot
  Features supported: Density functional theory (DFT) plane-wave pseudopotential calculations
  GPU perf: 6-10x
  Release status: Released; multi-GPU
  Notes: First-principles materials code that computes the behavior of the electron structures of materials

Q-CHEM
  Features supported: RI-MP2
  GPU perf: 8-14x
  Release status: Released, Version 4.0
  Notes: http://www.q-chem.com/doc_for_web/qchem_manual_4.0.pdf

QMCPACK
  Features supported: Main features
  GPU perf: 3-4x
  Release status: Released; multiple GPUs
  Notes: NCSA, University of Illinois at Urbana-Champaign; http://cms.mcc.uiuc.edu/qmcpack/index.php/GPU_version_of_QMCPACK

Quantum Espresso/PWscf
  Features supported: PWscf package: linear algebra (matrix multiply), explicit computational kernels, 3D FFTs
  GPU perf: 2.5-3.5x
  Release status: Released, Version 5.0; multiple GPUs
  Notes: Created by the Irish Centre for High-End Computing; http://www.quantum-espresso.org/index.php and http://www.quantum-espresso.org/

TeraChem
  Features supported: "Full GPU-based solution"
  GPU perf: 44-650x vs. GAMESS CPU version
  Release status: Released, Version 1.5; multi-GPU/single node
  Notes: Completely redesigned to exploit GPU parallelism. YouTube: http://youtu.be/EJODzk6RFxE?hd=1 and http://www.olcf.ornl.gov/wp-content/training/electronic-structure-2012/Luehr-ESCMA.pdf

VASP
  Features supported: Hybrid Hartree-Fock DFT functionals including exact exchange
  GPU perf: 2x; 2 GPUs comparable to 128 CPU cores
  Release status: Available on request; multiple GPUs
  Notes: By Carnegie Mellon University; http://arxiv.org/pdf/1111.0716.pdf

WL-LSMS
  Features supported: Generalized Wang-Landau method
  GPU perf: 3x with 32 GPUs vs. 32 (16-core) CPUs
  Release status: Under development; multi-GPU support
  Notes: NICS Electronic Structure Determination Workshop 2012: http://www.olcf.ornl.gov/wp-content/training/electronic-structure-2012/Eisenbach_OakRidge_February.pdf
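Several codes above (Abinit, MOPAC2009, GPAW) list matrix diagonalization and orthogonalization among their GPU-accelerated operations. A minimal NumPy sketch of that core operation on a symmetric matrix, as a CPU stand-in for the GPU kernels (illustrative only; not any package's actual code):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
H = (A + A.T) / 2.0  # symmetrize to mimic a Hamiltonian matrix

# Diagonalization: H = V @ diag(w) @ V.T, with orthonormal eigenvectors V.
w, V = np.linalg.eigh(H)
print(np.allclose(V.T @ V, np.eye(4)))       # eigenvectors are orthonormal
print(np.allclose(V @ np.diag(w) @ V.T, H))  # H is reconstructed from (w, V)
```

In production codes the matrices are far larger and the same `eigh`-style solve (or an iterative variant such as LOBPCG) is offloaded to GPU linear-algebra libraries.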
Viz, "Docking" and Related Applications Growing

Amira 5®
  Features supported: 3D visualization of volumetric data and surfaces
  GPU perf: 70x
  Release status: Released, Version 5.3.3; single GPU
  Notes: Visualization from Visage Imaging. Next release, 5.4, will use the GPU for general-purpose processing in some functions. http://www.visageimaging.com/overview.html

BINDSURF
  Features supported: Allows fast processing of large ligand databases
  GPU perf: 100x
  Release status: Available upon request to authors; single GPU
  Notes: High-throughput parallel blind virtual screening. http://www.biomedcentral.com/1471-2105/13/S14/S13

BUDE
  Features supported: Empirical free-energy forcefield
  GPU perf: 6.5-13.4x
  Release status: Released; single GPU
  Notes: University of Bristol; http://www.bris.ac.uk/biochemistry/cpfg/bude/bude.htm

Core Hopping
  Features supported: GPU-accelerated application
  GPU perf: 3.75-5000x
  Release status: Released, Suite 2011; single and multi-GPU
  Notes: Schrodinger, Inc.; http://www.schrodinger.com/products/14/32/

FastROCS
  Features supported: Real-time shape similarity searching/comparison
  GPU perf: 800-3000x
  Release status: Released; single and multi-GPU
  Notes: OpenEye Scientific Software; http://www.eyesopen.com/fastrocs

PyMOL
  Features supported: Lines: 460% increase; cartoons: 1246% increase; surface: 1746% increase; spheres: 753% increase; ribbon: 426% increase
  GPU perf: 1700x
  Release status: Released, Version 1.5; single GPU
  Notes: http://pymol.org/

VMD
  Features supported: High-quality rendering, large structures (100 million atoms), analysis and visualization tasks, multi-GPU support for molecular display
  GPU perf: 100-125x or greater on kernels
  Release status: Released, Version 1.9
  Notes: Visualization from University of Illinois at Urbana-Champaign; http://www.ks.uiuc.edu/Research/vmd/
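Similarity screens of the FastROCS/BINDSURF kind rank candidate molecules against a query by an overlap score. A minimal sketch using the Tanimoto coefficient on bit fingerprints, a common and simpler stand-in (FastROCS itself uses Gaussian shape overlap; the molecule names below are hypothetical):

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    union = fp_a | fp_b
    return len(fp_a & fp_b) / len(union) if union else 0.0

# Rank a (hypothetical) database against a query fingerprint, best match first.
query = {1, 2, 3, 4}
database = {"mol_a": {2, 3, 4, 5}, "mol_b": {1, 2}, "mol_c": {7, 8}}
ranked = sorted(database, key=lambda m: tanimoto(query, database[m]), reverse=True)
print(ranked)  # -> ['mol_a', 'mol_b', 'mol_c']
```

The score is embarrassingly parallel across database entries, which is why such screens report large multi-GPU speedups.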
Bioinformatics Applications

BarraCUDA
  Features supported: Alignment of short sequencing reads
  GPU speedup: 6-10x
  Release status: Version 0.6.2, 3/2012; multi-GPU, multi-node
  Website: http://seqbarracuda.sourceforge.net/

CUDASW++
  Features supported: Parallel Smith-Waterman database search
  GPU speedup: 10-50x
  Release status: Version 2.0.8, Q1/2012; multi-GPU, multi-node
  Website: http://sourceforge.net/projects/cudasw/

CUSHAW
  Features supported: Parallel, accurate long-read aligner for large genomes
  GPU speedup: 10x
  Release status: Version 1.0.40, 6/2012; multiple GPUs
  Website: http://cushaw.sourceforge.net/

GPU-BLAST
  Features supported: Protein alignment according to BLASTP
  GPU speedup: 3-4x
  Release status: Version 2.2.26, 3/2012; single GPU
  Website: http://eudoxus.cheme.cmu.edu/gpublast/gpublast.html

GPU-HMMER
  Features supported: Parallel local and global search of hidden Markov models
  GPU speedup: 60-100x
  Release status: Version 2.3.2, Q1/2012; multi-GPU, multi-node
  Website: http://www.mpihmmer.org/installguideGPUHMMER.htm

mCUDA-MEME
  Features supported: Scalable motif discovery algorithm based on MEME
  GPU speedup: 4-10x
  Release status: Version 3.0.12; multi-GPU, multi-node
  Website: https://sites.google.com/site/yongchaosoftware/mcuda-meme

SeqNFind
  Features supported: Hardware and software for reference assembly, BLAST, Smith-Waterman, HMM, de novo assembly
  GPU speedup: 400x
  Release status: Released; multi-GPU, multi-node
  Website: http://www.seqnfind.com/

UGENE
  Features supported: Fast short-read alignment
  GPU speedup: 6-8x
  Release status: Version 1.11, 5/2012; multi-GPU, multi-node
  Website: http://ugene.unipro.ru/

GPU speedup compared against the same or similar code running on a single-CPU machine. Performance measured internally or independently.
MD Average Speedups

[Bar chart: performance relative to CPU only (y-axis 0–10x) for the configurations
CPU, CPU + K10, CPU + K20, CPU + K20X, CPU + 2x K10, CPU + 2x K20, CPU + 2x K20X]

The blue node contains Dual E5-2687W CPUs (8 Cores per CPU).
The green nodes contain Dual E5-2687W CPUs (8 Cores per CPU) and 1 or 2 NVIDIA K10, K20, or K20X GPUs.

Average speedup calculated from 4 AMBER, 3 NAMD, 3 LAMMPS, and 1 GROMACS test cases.
Error bars show the maximum and minimum speedup for each hardware configuration.
Molecular Dynamics (MD) Applications

Application | Features Supported | GPU Perf | Release Status | Notes/Benchmarks
AMBER | PMEMD Explicit Solvent & GB Implicit Solvent | > 100 ns/day JAC NVE on 2x K20s | Released, Multi-GPU, multi-node | AMBER 12, GPU Revision Support 12.2; http://ambermd.org/gpus/benchmarks.htm#Benchmarks
CHARMM | Implicit (5x) and Explicit (2x) Solvent via OpenMM | 2x C2070 equals 32-35x X5667 CPUs | Released, Single & multi-GPU in single node | Release C37b1; http://www.charmm.org/news/c37b1.html#postjump
DL_POLY | Two-body Forces, Link-cell Pairs, Ewald SPME forces, Shake VV | 4x | Release V 4.03, Multi-GPU, multi-node | Source only, Results Published; http://www.stfc.ac.uk/CSE/randd/ccg/software/DL_POLY/25526.aspx
GROMACS | Implicit (5x), Explicit (2x) | 165 ns/day DHFR on 4x C2075s | Released, Multi-GPU, multi-node | Release 4.6; 1st multi-GPU support
LAMMPS | Lennard-Jones, Gay-Berne, Tersoff & many more potentials | 3.5-18x on Titan | Released, Multi-GPU, multi-node | http://lammps.sandia.gov/bench.html#desktop and http://lammps.sandia.gov/bench.html#titan
NAMD | Full electrostatics with PME and most simulation features | 4.0 ns/day F1-ATPase on 1x K20X | Released, 100M atom capable, Multi-GPU, multi-node | NAMD 2.9

GPU Perf compared against multi-core x86 CPU socket.
GPU Perf benchmarked on GPU-supported features and may be a kernel-to-kernel perf comparison.
New/Additional MD Applications Ramping

Application | Features Supported | GPU Perf | Release Status | Notes
Abalone | Simulations | 4-29x (on 1060 GPU) | Released, Version 1.8.51, Single GPU | Agile Molecule, Inc.
Ascalaph | Computation of non-valent interactions | 4-29x (on 1060 GPU) | Released, Version 1.1.4, Single GPU | Agile Molecule, Inc.
ACEMD | Written for use only on GPUs | 150 ns/day DHFR on 1x K20 | Released, Single and multi-GPU | Production bio-molecular dynamics (MD) software specially optimized to run on GPUs
Folding@Home | Powerful distributed computing molecular dynamics system; implicit solvent and folding | Depends upon number of GPUs | Released, GPUs and CPUs | http://folding.stanford.edu; GPUs get 4x the points of CPUs
GPUGrid.net | High-performance all-atom biomolecular simulations; explicit solvent and binding | Depends upon number of GPUs | Released, NVIDIA GPUs only | http://www.gpugrid.net/
HALMD | Simple fluids and binary mixtures (pair potentials, high-precision NVE and NVT, dynamic correlations) | Up to 66x on 2090 vs. 1 CPU core | Released, Version 0.2.0, Single GPU | http://halmd.org/benchmarks.html#supercooled-binary-mixture-kob-andersen
HOOMD-Blue | Written for use only on GPUs | Kepler 2x faster than Fermi | Released, Version 0.11.2, Single and multi-GPU on 1 node | http://codeblue.umich.edu/hoomd-blue/; Multi-GPU w/ MPI in March 2013
OpenMM | Implicit and explicit solvent, custom forces | Implicit: 127-213 ns/day, Explicit: 18-55 ns/day (DHFR) | Released, Version 4.1.1, Multi-GPU | Library and application for molecular dynamics on high-performance
Built from Ground Up for GPUs
Computational Chemistry

What  Study disease & discover drugs
      Predict drug and protein interactions

Why   Speed of simulations is critical
      Enables study of:
        - Longer timeframes
        - Larger systems
        - More simulations

How   GPUs increase throughput & accelerate simulations

      AMBER 11 Application
      4.6x performance increase with 2 GPUs at only 54% added cost*

GPU READY APPLICATIONS: Abalone, ACEMD, AMBER, DL_POLY, GAMESS, GROMACS, LAMMPS, NAMD, NWChem, Q-CHEM, Quantum Espresso, TeraChem

•  AMBER 11 Cellulose NPT on 2x E5670 CPUs + 2x Tesla C2090s (per node) vs. 2x E5670 CPUs (per node)
•  Cost of CPU node assumed to be $9,333. Cost of adding two (2) C2090s to a single node is assumed to be $5,333.
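The cost/benefit claim above can be checked with the slide's own stated assumptions (CPU node $9,333; adding two C2090s $5,333; 4.6x speedup). A minimal sketch of the arithmetic — the exact pricing is the slide's assumption, not measured data:

```python
# Price/performance sketch using this slide's assumptions:
# CPU node $9,333; 2x C2090 GPU add-on $5,333; overall speedup 4.6x.
cpu_node_cost = 9333.0
gpu_addon_cost = 5333.0
speedup = 4.6

# Fractional cost increase from adding the GPUs
# (computes ~57%; the slide quotes 54%, presumably from slightly
# different pricing or rounding).
added_cost_fraction = gpu_addon_cost / cpu_node_cost

# Performance per dollar of the GPU node relative to the CPU-only node
perf_per_dollar_gain = speedup / ((cpu_node_cost + gpu_addon_cost) / cpu_node_cost)

print(f"Added cost: {added_cost_fraction:.0%}")
print(f"Performance per dollar vs. CPU-only: {perf_per_dollar_gain:.1f}x")
```

In other words, under these assumptions the GPU-equipped node delivers roughly 2.9x more simulated time per dollar than the CPU-only node.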
AMBER 12
     GPU Support Revision 12.2
            1/22/2013




Kepler - Our Fastest Family of GPUs Yet

Factor IX benchmark, nanoseconds/day (speedup vs. CPU-only node):
  1 CPU Node            3.42
  1 CPU Node + M2090   11.85  (3.5x)
  1 CPU Node + K10     18.90  (5.6x)
  1 CPU Node + K20     22.44  (6.6x)
  1 CPU Node + K20X    25.39  (7.4x)

Running AMBER 12 GPU Support Revision 12.1
The blue node contains Dual E5-2687W CPUs (8 Cores per CPU).
The green nodes contain Dual E5-2687W CPUs (8 Cores per CPU) and either 1x NVIDIA M2090, 1x K10, or 1x K20 for the GPU.

GPU speedup/throughput increased from 3.5x (with M2090) to 7.4x (with K20X) when compared to a CPU-only node.
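The labeled speedups are simply ratios of the ns/day figures against the CPU-only node. A minimal sketch recomputing them from the chart's numbers (the chart's rounded labels can differ in the last digit):

```python
# Recompute Factor IX speedups from the raw ns/day figures on this slide.
cpu_only = 3.42  # ns/day on the CPU-only node

gpu_configs = {
    "M2090": 11.85,
    "K10": 18.90,
    "K20": 22.44,
    "K20X": 25.39,
}

for gpu, ns_per_day in gpu_configs.items():
    speedup = ns_per_day / cpu_only
    print(f"CPU + {gpu}: {speedup:.1f}x")
```

For example, the K20X figure works out to 25.39 / 3.42, about 7.4x, matching the slide's headline number.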
K10 Accelerates Simulations of All Sizes

Speedup compared to CPU only (1x NVIDIA K10 GPU added):
  TRPcage GB           2.00x
  JAC NVE PME          5.50x
  Factor IX NVE PME    5.53x
  Cellulose NVE PME    5.04x
  Myoglobin GB        19.98x
  Nucleosome GB       24.00x

Running AMBER 12 GPU Support Revision 12.1
The blue node contains Dual E5-2687W CPUs (8 Cores per CPU).
The green nodes contain Dual E5-2687W CPUs (8 Cores per CPU) and 1x NVIDIA K10 GPU.

Gain 24x performance by adding just 1 GPU when compared to dual CPU performance.
K20 Accelerates Simulations of All Sizes

Speedup compared to CPU only (1x NVIDIA K20 GPU added):
  CPU, all molecules   1.00x
  TRPcage GB           2.66x
  JAC NVE PME          6.50x
  Factor IX NVE PME    6.56x
  Cellulose NVE PME    7.28x
  Myoglobin GB        25.56x
  Nucleosome GB       28.00x

Running AMBER 12 GPU Support Revision 12.1, SPFP with CUDA 4.2.9 ECC Off
The blue node contains 2x Intel E5-2687W CPUs (8 Cores per CPU).
Each green node contains 2x Intel E5-2687W CPUs (8 Cores per CPU) plus 1x NVIDIA K20 GPU.

Gain 28x throughput/performance by adding just one K20 GPU when compared to dual CPU performance.
AMBER Benchmark Report, Revision 2.0, dated Nov. 5, 2012
K20X Accelerates Simulations of All Sizes

Speedup compared to CPU only (1x NVIDIA K20X GPU added):
  TRPcage GB           2.79x
  JAC NVE PME          7.15x
  Factor IX NVE PME    7.43x
  Cellulose NVE PME    8.30x
  Myoglobin GB        28.59x
  Nucleosome GB       31.30x

Running AMBER 12 GPU Support Revision 12.1
The blue node contains Dual E5-2687W CPUs (8 Cores per CPU).
The green nodes contain Dual E5-2687W CPUs (8 Cores per CPU) and 1x NVIDIA K20X GPU.

Gain 31x performance by adding just one K20X GPU when compared to dual CPU performance.
K10 Strong Scaling over Nodes

Cellulose 408K Atoms (NPT), nanoseconds/day vs. number of nodes (1, 2, 4):
GPU-accelerated nodes outperform CPU-only nodes by 5.1x on 1 node, 3.6x on 2 nodes, and 2.4x on 4 nodes.

Running AMBER 12 with CUDA 4.2 ECC Off
The blue nodes contain 2x Intel X5670 CPUs (6 Cores per CPU).
The green nodes contain 2x Intel X5670 CPUs (6 Cores per CPU) plus 2x NVIDIA K10 GPUs.

GPUs significantly outperform CPUs while scaling over multiple nodes.
Kepler – Universally Faster

[Bar chart: speedups compared to CPU only (y-axis 0–9x) for JAC, Factor IX, and
Cellulose across the configurations CPU Only, CPU + K10, CPU + K20, CPU + K20X]

Running AMBER 12 GPU Support Revision 12.1
The CPU Only node contains Dual E5-2687W CPUs (8 Cores per CPU).
The Kepler nodes contain Dual E5-2687W CPUs (8 Cores per CPU) and 1x NVIDIA K10, K20, or K20X GPU.

The Kepler GPUs accelerated all simulations, up to 8x.
K10 Extreme Performance

JAC 23K Atoms (NVE), nanoseconds/day:
  1 CPU node              12.47
  1 node + 2x K10 GPUs    97.99

Running AMBER 12 GPU Support Revision 12.1
The blue node contains Dual E5-2687W CPUs (8 Cores per CPU).
The green node contains Dual E5-2687W CPUs (8 Cores per CPU) and 2x NVIDIA K10 GPUs.

Gain 7.8x performance by adding just 2 GPUs when compared to dual CPU performance.
K20 Extreme Performance

DHFR JAC 23K Atoms (NVE), nanoseconds/day:
  1 CPU node              12.47
  1 node + 2x K20 GPUs    95.59

Running AMBER 12 GPU Support Revision 12.1, SPFP with CUDA 4.2.9 ECC Off
The blue node contains 2x Intel E5-2687W CPUs (8 Cores per CPU).
Each green node contains 2x Intel E5-2687W CPUs (8 Cores per CPU) plus 2x NVIDIA K20 GPUs.

Gain > 7.5x throughput/performance by adding just 2 K20 GPUs when compared to dual CPU performance.
Replace 8 Nodes with 1 K20 GPU

DHFR benchmark, performance and cost:
  8 CPU nodes:           65.00 ns/day, $32,000
  1 node + 1x K20 GPU:   81.09 ns/day, $6,500

Running AMBER 12 GPU Support Revision 12.1, SPFP with CUDA 4.2.9 ECC Off
The eight (8) blue nodes each contain 2x Intel E5-2687W CPUs (8 Cores per CPU).
Each green node contains 2x Intel E5-2687W CPUs (8 Cores per CPU) plus 1x NVIDIA K20 GPU.

Note: Typical CPU and GPU node pricing used. Pricing may vary depending on node configuration. Contact your preferred HW vendor for actual pricing.

Cut down simulation costs to ¼ and gain higher performance.
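The chart's two axes (ns/day and dollars) can be combined into a single throughput-per-dollar figure. A quick sketch using the slide's numbers — the pricing is the slide's "typical" assumption, not a quoted price:

```python
# Throughput-per-dollar sketch from this slide's figures:
# 8 CPU nodes: 65.00 ns/day for $32,000; 1 node + K20: 81.09 ns/day for $6,500.
cpu_ns_day, cpu_cost = 65.00, 32000.0
gpu_ns_day, gpu_cost = 81.09, 6500.0

cpu_perf_per_dollar = cpu_ns_day / cpu_cost
gpu_perf_per_dollar = gpu_ns_day / gpu_cost

print(f"GPU node: {gpu_perf_per_dollar / cpu_perf_per_dollar:.1f}x "
      f"more ns/day per dollar than the CPU cluster")
print(f"GPU node cost is {gpu_cost / cpu_cost:.0%} of the CPU cluster cost")
```

Under these assumptions the single GPU-equipped node delivers roughly 6x the simulated time per dollar of the eight-node CPU cluster, while also running faster in absolute terms.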
Replace 7 Nodes with 1 K10 GPU

Performance on JAC NVE and cost:
  CPU only (8 nodes):             $32,000
  GPU enabled (1 node + 1x K10):  $7,000

Running AMBER 12 GPU Support Revision 12.1, SPFP with CUDA 4.2.9 ECC Off
The eight (8) blue nodes each contain 2x Intel E5-2687W CPUs (8 Cores per CPU).
The green node contains 2x Intel E5-2687W CPUs (8 Cores per CPU) plus 1x NVIDIA K10 GPU.

Note: Typical CPU and GPU node pricing used. Pricing may vary depending on node configuration. Contact your preferred HW vendor for actual pricing.

Cut down simulation costs to ¼ and increase performance by 70%.
25                                                                                              AMBER Benchmark Report, Revision 2.0, dated Nov. 5, 2012
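The cost claim above can be checked with a quick throughput-per-dollar calculation. A minimal sketch: the $32,000 and $7,000 prices and the +70% speedup come from the slide, while the absolute CPU-only throughput (40 ns/day) is a placeholder assumption; note the final ratio does not depend on that placeholder, since it cancels out.

```python
# Cost-per-performance comparison using the DHFR slide's figures.
def cost_per_ns_day(node_cost_usd, ns_per_day):
    """Dollars of hardware per nanosecond/day of sustained throughput."""
    return node_cost_usd / ns_per_day

cpu_ns_day = 40.0               # assumed CPU-only throughput (placeholder)
gpu_ns_day = cpu_ns_day * 1.70  # slide reports +70% with the GPU node

cpu_eff = cost_per_ns_day(32_000, cpu_ns_day)  # eight CPU-only nodes
gpu_eff = cost_per_ns_day(7_000, gpu_ns_day)   # one CPU + K10 node

# Roughly 7.8x more throughput per dollar for the GPU-enabled node
print(round(cpu_eff / gpu_eff, 1))
```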
Extra CPUs Decrease Performance

[Chart: Nanoseconds/Day, Cellulose NVE, CPU Only vs. CPU with dual K20s, for one and two E5-2687W CPUs]

Running AMBER 12 GPU Support Revision 12.1

The orange bars use one E5-2687W CPU (8 Cores per CPU).
The blue bars use dual E5-2687W CPUs (8 Cores per CPU).

When used with GPUs, dual CPU sockets perform worse than single CPU sockets.
Kepler - Greener Science

[Chart: Energy Expended (kJ) in simulating 1 ns of DHFR JAC; CPU Only vs. CPU + K10/K20/K20X; lower is better]

Running AMBER 12 GPU Support Revision 12.1

The blue node contains Dual E5-2687W CPUs (150W each, 8 Cores per CPU).
The green nodes contain Dual E5-2687W CPUs (8 Cores per CPU) and 1x NVIDIA K10, K20, or K20X GPU (235W each).

Energy Expended = Power x Time
                              The GPU Accelerated systems use 65-75% less energy
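The savings follow directly from the slide's Energy Expended = Power x Time definition: a GPU node draws more power but finishes much sooner. A minimal sketch; the TDP wattages are from the slide notes, but the runtimes are illustrative assumptions, not the report's measurements:

```python
# Energy Expended = Power x Time, per the slide's definition.
def energy_kj(power_watts, seconds):
    """Energy in kilojoules to complete a fixed amount of simulation."""
    return power_watts * seconds / 1000.0

CPU_TDP = 150.0  # E5-2687W TDP, from the slide notes
GPU_TDP = 235.0  # Kepler K10/K20/K20X TDP, from the slide notes

# Illustrative runtimes for 1 ns of DHFR JAC (assumed, not measured):
cpu_only = energy_kj(2 * CPU_TDP, seconds=7200)            # ~2 h, CPU node
cpu_gpu  = energy_kj(2 * CPU_TDP + GPU_TDP, seconds=1400)  # ~23 min, +GPU

savings = 1 - cpu_gpu / cpu_only
print(f"{savings:.0%} less energy")  # lands in the slide's 65-75% range
```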
Recommended GPU Node Configuration for AMBER Computational Chemistry

Workstation or Single Node Configuration
    # of CPU sockets               2
    Cores per CPU socket           4+ (1 CPU core drives 1 GPU)
    CPU speed (GHz)                2.66+
    System memory per node (GB)    16
    GPUs                           Kepler K10, K20, K20X; Fermi M2090, M2075, C2075
    # of GPUs per CPU socket       1-2 (4 GPUs on 1 socket works well for 4 fast serial GPU runs)
    GPU memory preference (GB)     6
    GPU to CPU connection          PCIe 2.0 x16 or higher
    Server storage                 2 TB

Scale to multiple nodes with the same single-node configuration.
Benefits of GPU-Accelerated AMBER Computing
     Faster than CPU-only systems in all tests
     Most major compute-intensive aspects of classical MD ported
     Large performance boost with marginal price increase
     Energy usage cut by more than half
     GPUs scale well within a node and over multiple nodes
     K20 is our fastest and lowest-power high-performance GPU yet

        Try GPU accelerated AMBER for free – www.nvidia.com/GPUTestDrive
NAMD 2.9
Kepler - Our Fastest Family of GPUs Yet

[Chart: Nanoseconds/Day, Apolipoprotein A1 (ApoA1). 1 CPU Node: 1.37; + M2090: 2.63 (1.9x); + K10: 3.45 (2.5x); + K20: 3.57 (2.6x); + K20X: 4.00 (2.9x)]

Running NAMD version 2.9

The blue node contains Dual E5-2687W CPUs (8 Cores per CPU).
The green nodes contain Dual E5-2687W CPUs (8 Cores per CPU) and either 1x NVIDIA M2090, 1x K10, 1x K20, or 1x K20X GPU.

GPU speedup/throughput increased from 1.9x (with M2090) to 2.9x (with K20X) when compared to a CPU-only node.

NAMD Benchmark Report, Revision 2.0, dated Nov. 5, 2012
Accelerates Simulations of All Sizes

[Chart: Speedup vs. CPU Only with one K20 GPU: ApoA1 2.6x, F1-ATPase 2.4x, STMV 2.7x]

Running NAMD 2.9 with CUDA 4.0, ECC Off

The blue node contains 2x Intel E5-2687W CPUs (8 Cores per CPU).
Each green node contains 2x Intel E5-2687W CPUs (8 Cores per CPU) plus 1x NVIDIA K20 GPU.

Gain ~2.5x throughput/performance by adding just 1 GPU when compared to dual-CPU performance.
Kepler – Universally Faster

[Chart: Average speedup vs. CPU Only across F1-ATPase, ApoA1, and STMV: 1x K10 2.4x, 1x K20 2.6x, 1x K20X 2.9x, 2x K10 4.3x, 2x K20 4.7x, 2x K20X 5.1x]

Running NAMD version 2.9

The CPU Only node contains Dual E5-2687W CPUs (8 Cores per CPU).
The Kepler nodes contain Dual E5-2687W CPUs (8 Cores per CPU) and one or two NVIDIA K10, K20, or K20X GPUs.

The Kepler GPUs accelerate all simulations, up to 5x (average acceleration shown in the chart).
Outstanding Strong Scaling with Multi-STMV

[Chart: Nanoseconds/Day for 100 STMV (a concatenation of 100 Satellite Tobacco Mosaic Virus systems) on 32-768 nodes; GPU speedup ranges from 3.8x at 32 nodes to 2.7x at 768 nodes]

Running NAMD version 2.9

Each blue XE6 CPU node contains 1x AMD Opteron 1600 (16 Cores per CPU).
Each green XK6 CPU+GPU node contains 1x AMD Opteron 1600 (16 Cores per CPU) plus 1x NVIDIA X2090 GPU.

Accelerate your science by 2.7-3.8x when compared to CPU-based supercomputers.
Replace 3 Nodes with 1x M2090 GPU

[Chart: F1-ATPase. 4 CPU Nodes: 0.63 Nanoseconds/Day at $8,000. 1 CPU Node + 1x M2090 GPU: 0.74 Nanoseconds/Day at $4,000]

Running NAMD version 2.9

Each blue node contains 2x Intel Xeon X5550 CPUs (4 Cores per CPU).
The green node contains 2x Intel Xeon X5550 CPUs (4 Cores per CPU) and 1x NVIDIA M2090 GPU.

Note: Typical CPU and GPU node pricing used. Pricing may vary depending on node configuration. Contact your preferred HW vendor for actual pricing.

Speedup of 1.2x for 50% of the cost.
K20 - Greener: Twice The Science Per Watt

[Chart: Energy Expended (kJ) in simulating 1 ns of ApoA1; 1 Node vs. 1 Node + 2x K20; lower is better]

Running NAMD version 2.9

Each blue node contains Dual E5-2687W CPUs (95W, 4 Cores per CPU).
Each green node contains 2x Intel Xeon X5550 CPUs (95W, 4 Cores per CPU) and 2x NVIDIA K20 GPUs (225W per GPU).

Energy Expended = Power x Time

Cut energy usage in half with GPUs.
Kepler - Greener: Twice The Science/Joule

[Chart: Energy Expended (kJ) in simulating 1 ns of STMV (Satellite Tobacco Mosaic Virus); CPU Only vs. CPU + 2x K10/K20/K20X; lower is better]

Running NAMD version 2.9

The blue node contains Dual E5-2687W CPUs (150W each, 8 Cores per CPU).
The green nodes contain Dual E5-2687W CPUs (8 Cores per CPU) and 2x NVIDIA K10, K20, or K20X GPUs (235W each).

Energy Expended = Power x Time

Cut energy usage in half with GPUs.
Recommended GPU Node Configuration for NAMD Computational Chemistry

Workstation or Single Node Configuration
    # of CPU sockets                2
    Cores per CPU socket            6+
    CPU speed (GHz)                 2.66+
    System memory per socket (GB)   32
    GPUs                            Kepler K10, K20, K20X; Fermi M2090, M2075, C2075
    # of GPUs per CPU socket        1-2
    GPU memory preference (GB)      6
    GPU to CPU connection           PCIe 2.0 or higher
    Server storage                  500 GB or higher
    Network configuration           Gemini, InfiniBand

Scale to multiple nodes with the same single-node configuration.
Summary/Conclusions
     Benefits of GPU-Accelerated Computing
     Faster than CPU-only systems in all tests
     Large performance boost with small marginal price increase
     Energy usage cut in half
     GPUs scale very well within a node and over multiple nodes
     Tesla K20 GPU is our fastest and lowest-power high-performance GPU to date

       Try GPU accelerated NAMD for free – www.nvidia.com/GPUTestDrive
LAMMPS, Jan. 2013 or later
More Science for Your Money

[Chart: Speedup vs. CPU Only, Embedded Atom Model: 1x K10 1.7x, 1x K20 2.47x, 1x K20X 2.92x, 2x K10 3.3x, 2x K20 4.5x, 2x K20X 5.5x]

The blue node uses 2x E5-2687W CPUs (8 Cores and 150W per CPU).
The green nodes have 2x E5-2687W CPUs and 1 or 2 NVIDIA K10, K20, or K20X GPUs (235W each).

Experience performance increases of up to 5.5x with Kepler GPU nodes.
K20X, the Fastest GPU Yet

[Chart: Speedup vs. CPU alone for CPU + 2x M2090, CPU + 1x K20X, and CPU + 2x K20X; up to 6.2x]

The blue node uses 2x E5-2687W CPUs (8 Cores and 150W per CPU).
The green nodes have 2x E5-2687W CPUs and either 2x NVIDIA M2090 GPUs or 1-2x K20X GPUs (235W each).

Experience performance increases of up to 6.2x with Kepler GPU nodes.
One K20X performs as well as two M2090s.
Get a CPU Rebate to Fund Part of Your GPU Budget

[Chart: Acceleration in loop-time computation, normalized to CPU only: 1 Node + 1x M2090: 5.31x; + 2x M2090: 9.88x; + 3x M2090: 12.9x; + 4x M2090: 18.2x]

Running NAMD version 2.9

The blue node contains Dual X5670 CPUs (6 Cores per CPU).
The green nodes contain Dual X5570 CPUs (4 Cores per CPU) and 1-4 NVIDIA M2090 GPUs.

Increase performance 18x when compared to CPU-only nodes.

Cheaper CPUs used with GPUs still deliver faster overall performance than more expensive CPUs alone!
Excellent Strong Scaling on Large Clusters

[Chart: Loop time (seconds), LAMMPS Gay-Berne, 134M atoms, 300-900 nodes; GPU-accelerated XK6 vs. CPU-only XE6: 3.55x at 300 nodes, 3.48x at 600 nodes, 3.45x at 900 nodes]

From 300 to 900 nodes, the NVIDIA GPU-powered XK6 maintained ~3.5x performance compared to XE6 CPU nodes.

Each blue Cray XE6 node has 2x AMD Opteron CPUs (16 Cores per CPU).
Each green Cray XK6 node has 1x AMD Opteron 1600 CPU (16 Cores per CPU) and 1x NVIDIA X2090 GPU.
GPUs Sustain 5x Performance for Weak Scaling

[Chart: Loop time (seconds), weak scaling with 32K atoms per node, 1-729 nodes; GPU speedup ranges from 6.7x at small node counts to 4.8x at large node counts]

Performance of 4.8x-6.7x with GPU-accelerated nodes when compared to CPUs alone.

Each blue Cray XE6 node has 2x AMD Opteron CPUs (16 Cores per CPU).
Each green Cray XK6 node has 1x AMD Opteron 1600 CPU (16 Cores per CPU) and 1x NVIDIA X2090 GPU.
Faster, Greener — Worth It!

[Chart: Energy Expended (kJ) in one loop of EAM; 1 Node vs. 1 Node + 1x K20X vs. 1 Node + 2x K20X; lower is better]

GPU-accelerated computing uses 53% less energy than CPU only.

Energy Expended = Power x Time
Power calculated by combining the components' TDPs.

The blue node uses 2x E5-2687W CPUs (8 Cores and 150W per CPU) and CUDA 4.2.9.
The green nodes have 2x E5-2687W CPUs and 1 or 2 NVIDIA K20X GPUs (235W) running CUDA 5.0.36.

     Try GPU accelerated LAMMPS for free – www.nvidia.com/GPUTestDrive
Molecular Dynamics with LAMMPS
 on a Hybrid Cray Supercomputer
                    W. Michael Brown
        National Center for Computational Sciences
              Oak Ridge National Laboratory

      NVIDIA Technology Theater, Supercomputing 2012
                     November 14, 2012
Early Kepler Benchmarks on Titan

[Charts: LAMMPS loop time (seconds, log scale) vs. node count (1 to 1024+) for Atomic Fluid and Bulk Copper, comparing XK6, XK6+GPU, and XK7+GPU configurations]
GPU Accelerated Computational Chemistry Applications


GPU Accelerated Computational Chemistry Applications

  • 1. Updated: February 4, 2013
  • 2. Molecular Dynamics (MD) Applications
    - AMBER — Features: PMEMD explicit solvent & GB implicit solvent. GPU perf: >100 ns/day (JAC NVE on 2x K20s). Status: Released; multi-GPU, multi-node. Notes: AMBER 12, GPU Support Revision 12.2; http://ambermd.org/gpus/benchmarks.htm#Benchmarks
    - CHARMM — Features: implicit (5x) and explicit (2x) solvent via OpenMM. GPU perf: 2x C2070 equals 32-35x X5667 CPUs. Status: Released; single & multi-GPU in single node. Notes: Release C37b1; http://www.charmm.org/news/c37b1.html#postjump
    - DL_POLY — Features: two-body forces, link-cell pairs, Ewald SPME forces, Shake VV. GPU perf: 4x. Status: Release V 4.03; multi-GPU, multi-node. Notes: source only, results published; http://www.stfc.ac.uk/CSE/randd/ccg/software/DL_POLY/25526.aspx
    - GROMACS — Features: implicit (5x) and explicit (2x) solvent. GPU perf: 165 ns/day (DHFR on 4x C2075s). Status: Released; multi-GPU, multi-node. Notes: Release 4.6; first multi-GPU support
    - LAMMPS — Features: Lennard-Jones, Gay-Berne, Tersoff & many more potentials. GPU perf: 3.5-18x on Titan. Status: Released; multi-GPU, multi-node. Notes: http://lammps.sandia.gov/bench.html#desktop and http://lammps.sandia.gov/bench.html#titan
    - NAMD — Features: full electrostatics with PME and most simulation features. GPU perf: 4.0 ns/day (F1-ATPase on 1x K20X). Status: Released; 100M-atom capable; multi-GPU, multi-node. Notes: NAMD 2.9
    GPU perf compared against multi-core x86 CPU socket; benchmarked on GPU-supported features and may be a kernel-to-kernel perf comparison.
  • 3. New/Additional MD Applications Ramping
    - Abalone — Features: simulations. GPU perf: 4-29x (on 1060 GPU). Status: Released, Version 1.8.51; single GPU. Notes: Agile Molecule, Inc.
    - Ascalaph — Features: computation of non-valent interactions. GPU perf: 4-29x (on 1060 GPU). Status: Released, Version 1.1.4; single GPU. Notes: Agile Molecule, Inc.
    - ACEMD — Features: written for use only on GPUs. GPU perf: 150 ns/day (DHFR on 1x K20). Status: Released; single and multi-GPUs. Notes: production bio-molecular dynamics (MD) software specially optimized to run on GPUs
    - Folding@Home — Features: powerful distributed-computing molecular dynamics system; implicit solvent and folding. GPU perf: depends upon number of GPUs. Status: Released; GPUs and CPUs. Notes: http://folding.stanford.edu; GPUs get 4x the points of CPUs
    - GPUGrid.net — Features: high-performance all-atom biomolecular simulations; explicit solvent and binding. GPU perf: depends upon number of GPUs. Status: Released; NVIDIA GPUs only. Notes: http://www.gpugrid.net/
    - HALMD — Features: simple fluids and binary mixtures (pair potentials, high-precision NVE and NVT, dynamic correlations). GPU perf: up to 66x on 2090 vs. 1 CPU core. Status: Released, Version 0.2.0; single GPU. Notes: http://halmd.org/benchmarks.html#supercooled-binary-mixture-kob-andersen
    - HOOMD-Blue — Features: written for use only on GPUs. GPU perf: Kepler 2x faster than Fermi. Status: Released, Version 0.11.2; single and multi-GPU on 1 node; multi-GPU w/ MPI in March 2013. Notes: http://codeblue.umich.edu/hoomd-blue/
    - OpenMM — Features: implicit and explicit solvent, custom forces. GPU perf: implicit 127-213 ns/day, explicit 18-55 ns/day (DHFR). Status: Released, Version 4.1.1; multi-GPU. Notes: library and application for high-performance molecular dynamics
    GPU perf compared against multi-core x86 CPU socket; benchmarked on GPU-supported features and may be a kernel-to-kernel perf comparison.
  • 4. Quantum Chemistry Applications
    - Abinit — Features: local Hamiltonian, non-local Hamiltonian, LOBPCG algorithm, diagonalization/orthogonalization. GPU perf: 1.3-2.7x. Status: Released, Version 7.0.5; multi-GPU support. Notes: www.abinit.org
    - ACES III — Features: integrating GPU scheduling into SIAL programming language and SIP runtime environment. GPU perf: 10x on kernels. Status: under development; multi-GPU support. Notes: http://www.olcf.ornl.gov/wp-content/training/electronic-structure-2012/deumens_ESaccel_2012.pdf
    - ADF — Features: Fock matrix, Hessians. GPU perf: TBD. Status: pilot project completed, under development; multi-GPU support. Notes: www.scm.com
    - BigDFT — Features: DFT; Daubechies wavelets; part of Abinit. GPU perf: 5-25x (1 CPU core to GPU kernel). Status: released June 2009, current release 1.6.0; multi-GPU support. Notes: http://inac.cea.fr/L_Sim/BigDFT/news.html, http://www.olcf.ornl.gov/wp-content/training/electronic-structure-2012/BigDFT-Formalism.pdf and http://www.olcf.ornl.gov/wp-content/training/electronic-structure-2012/BigDFT-HPC-tues.pdf
    - Casino — Features: TBD. GPU perf: TBD. Status: under development, Spring 2013 release; multi-GPU support. Notes: http://www.tcm.phy.cam.ac.uk/~mdt26/casino.html
    - CP2K — Features: DBCSR (sparse matrix multiply library). GPU perf: 2-7x. Status: under development; multi-GPU support. Notes: http://www.olcf.ornl.gov/wp-content/training/ascc_2012/friday/ACSS_2012_VandeVondele_s.pdf
    - GAMESS-US — Features: Libqc with Rys quadrature algorithm, Hartree-Fock; MP2 and CCSD in Q4 2012. GPU perf: 1.3-1.6x; 2.3-2.9x HF. Status: Released; multi-GPU support; next release Q4 2012. Notes: http://www.msg.ameslab.gov/gamess/index.html
    GPU perf compared against multi-core x86 CPU socket; benchmarked on GPU-supported features and may be a kernel-to-kernel perf comparison.
  • 5. Quantum Chemistry Applications (continued)
    - GAMESS-UK — Features: (ss|ss)-type integrals within calculations using Hartree-Fock ab initio methods and density functional theory; supports organics & inorganics. GPU perf: 8x. Status: release in 2012; multi-GPU support. Notes: http://www.ncbi.nlm.nih.gov/pubmed/21541963
    - Gaussian — Features: joint PGI, NVIDIA & Gaussian collaboration. GPU perf: TBD. Status: under development; multi-GPU support. Notes: announced Aug. 29, 2011; http://www.gaussian.com/g_press/nvidia_press.htm
    - GPAW — Features: electrostatic Poisson equation, orthonormalizing of vectors, residual minimization method (rmm-diis). GPU perf: 8x. Status: Released; multi-GPU support. Notes: https://wiki.fysik.dtu.dk/gpaw/devel/projects/gpu.html, Samuli Hakala (CSC Finland) & Chris O'Grady (SLAC)
    - Jaguar — Features: investigating GPU acceleration. GPU perf: TBD. Status: under development; multi-GPU support. Notes: Schrodinger, Inc.; http://www.schrodinger.com/kb/278
    - MOLCAS — Features: CU_BLAS support. GPU perf: 1.1x. Status: Released, Version 7.8; single GPU; additional GPU support coming in Version 8. Notes: www.molcas.org
    - MOLPRO — Features: density-fitted MP2 (DF-MP2), density-fitted local correlation methods (DF-RHF, DF-KS), DFT. GPU perf: 1.7-2.3x projected. Status: under development; multiple GPU. Notes: www.molpro.net; Hans-Joachim Werner
    GPU perf compared against multi-core x86 CPU socket; benchmarked on GPU-supported features and may be a kernel-to-kernel perf comparison.
  • 6. Quantum Chemistry Applications (continued)
    - MOPAC2009 — Features: pseudodiagonalization, full diagonalization, and density matrix assembling. GPU perf: 3.8-14x. Status: under development; single GPU. Notes: academic port; http://openmopac.net
    - NWChem — Features: triples part of Reg-CCSD(T), CCSD & EOMCCSD task schedulers. GPU perf: 3-10x projected. Status: development GPGPU benchmarks; release targeting March 2013; multiple GPUs. Notes: www.nwchem-sw.org and http://www.olcf.ornl.gov/wp-content/training/electronic-structure-2012/Krishnamoorthy-ESCMA12.pdf
    - Octopus — Features: DFT and TDDFT. GPU perf: TBD. Status: Released. Notes: http://www.tddft.org/programs/octopus/
    - PEtot — Features: density functional theory (DFT) plane-wave pseudopotential calculations. GPU perf: 6-10x. Status: Released; multi-GPU. Notes: first-principles materials code that computes the behavior of the electron structures of materials
    - Q-CHEM — Features: RI-MP2. GPU perf: 8x-14x. Status: Released, Version 4.0. Notes: http://www.q-chem.com/doc_for_web/qchem_manual_4.0.pdf
    GPU perf compared against multi-core x86 CPU socket; benchmarked on GPU-supported features and may be a kernel-to-kernel perf comparison.
  • 7. Quantum Chemistry Applications (continued)
    - QMCPACK — Features: main features. GPU perf: 3-4x. Status: Released; multiple GPUs. Notes: NCSA, University of Illinois at Urbana-Champaign; http://cms.mcc.uiuc.edu/qmcpack/index.php/GPU_version_of_QMCPACK
    - Quantum Espresso/PWscf — Features: PWscf package — linear algebra (matrix multiply), explicit computational kernels, 3D FFTs. GPU perf: 2.5-3.5x. Status: Released, Version 5.0; multiple GPUs. Notes: created by Irish Centre for High-End Computing; http://www.quantum-espresso.org/index.php and http://www.quantum-espresso.org/
    - TeraChem — Features: "full GPU-based solution"; completely redesigned to exploit GPU parallelism. GPU perf: 44-650x vs. GAMESS CPU version. Status: Released, Version 1.5; multi-GPU/single node. Notes: YouTube: http://youtu.be/EJODzk6RFxE?hd=1 and http://www.olcf.ornl.gov/wp-content/training/electronic-structure-2012/Luehr-ESCMA.pdf
    - VASP — Features: hybrid Hartree-Fock DFT functionals including exact exchange. GPU perf: 2 GPUs comparable to 128 CPU cores. Status: available on request; multiple GPUs. Notes: by Carnegie Mellon University; http://arxiv.org/pdf/1111.0716.pdf
    - WL-LSMS — Features: generalized Wang-Landau method. GPU perf: 3x with 32 GPUs vs. 32 (16-core) CPUs. Status: under development; multi-GPU support. Notes: NICS Electronic Structure Determination Workshop 2012; http://www.olcf.ornl.gov/wp-content/training/electronic-structure-2012/Eisenbach_OakRidge_February.pdf
    GPU perf compared against multi-core x86 CPU socket; benchmarked on GPU-supported features and may be a kernel-to-kernel perf comparison.
  • 8. Viz, "Docking" and Related Applications Growing
    - Amira 5® — Features: 3D visualization of volumetric data and surfaces. GPU perf: 70x. Status: Released, Version 5.3.3; single GPU. Notes: visualization from Visage Imaging; next release, 5.4, will use GPU for general-purpose processing in some functions; http://www.visageimaging.com/overview.html
    - BINDSURF — Features: high-throughput parallel blind virtual screening; allows fast processing of large ligand databases. GPU perf: 100x. Status: available upon request to authors; single GPU. Notes: http://www.biomedcentral.com/1471-2105/13/S14/S13
    - BUDE — Features: empirical free-energy forcefield. GPU perf: 6.5-13.4x. Status: Released; single GPU. Notes: University of Bristol; http://www.bris.ac.uk/biochemistry/cpfg/bude/bude.htm
    - Core Hopping — Features: GPU-accelerated application. GPU perf: 3.75-5000x. Status: Released, Suite 2011; single and multi-GPUs. Notes: Schrodinger, Inc.; http://www.schrodinger.com/products/14/32/
    - FastROCS — Features: real-time shape similarity searching/comparison. GPU perf: 800-3000x. Status: Released; single and multi-GPUs. Notes: OpenEye Scientific Software; http://www.eyesopen.com/fastrocs
    - PyMol — Features: lines 460%, cartoons 1246%, surface 1746%, spheres 753%, ribbon 426% increase. GPU perf: 1700x. Status: Released, Version 1.5; single GPU. Notes: http://pymol.org/
    - VMD — Features: high-quality rendering, large structures (100 million atoms), analysis and visualization tasks, multiple GPU support for display of molecular … GPU perf: 100-125x or greater on kernels. Status: Released, Version 1.9. Notes: visualization from University of Illinois at Urbana-Champaign; http://www.ks.uiuc.edu/Research/vmd/
    GPU perf compared against multi-core x86 CPU socket; benchmarked on GPU-supported features and may be a kernel-to-kernel perf comparison.
  • 9. Bioinformatics Applications
    - BarraCUDA — Features: alignment of short sequencing reads. Speedup: 6-10x. Status: Version 0.6.2 – 3/2012; multi-GPU, multi-node. Website: http://seqbarracuda.sourceforge.net/
    - CUDASW++ — Features: parallel Smith-Waterman database search. Speedup: 10-50x. Status: Version 2.0.8 – Q1/2012; multi-GPU, multi-node. Website: http://sourceforge.net/projects/cudasw/
    - CUSHAW — Features: parallel, accurate long-read aligner for large genomes. Speedup: 10x. Status: Version 1.0.40 – 6/2012; multiple GPU. Website: http://cushaw.sourceforge.net/
    - GPU-BLAST — Features: protein alignment according to BLASTP. Speedup: 3-4x. Status: Version 2.2.26 – 3/2012; single GPU. Website: http://eudoxus.cheme.cmu.edu/gpublast/gpublast.html
    - GPU-HMMER — Features: parallel local and global search of hidden Markov models. Speedup: 60-100x. Status: Version 2.3.2 – Q1/2012; multi-GPU, multi-node. Website: http://www.mpihmmer.org/installguideGPUHMMER.htm
    - mCUDA-MEME — Features: scalable motif discovery algorithm based on MEME. Speedup: 4-10x. Status: Version 3.0.12; multi-GPU, multi-node. Website: https://sites.google.com/site/yongchaosoftware/mcuda-meme
    - SeqNFind — Features: hardware and software for reference assembly, BLAST, SW, HMM, de novo assembly. Speedup: 400x. Status: Released; multi-GPU, multi-node. Website: http://www.seqnfind.com/
    - UGENE — Features: fast short-read alignment. Speedup: 6-8x. Status: Version 1.11 – 5/2012; multi-GPU, multi-node. Website: http://ugene.unipro.ru/
    - Parallel linear regression on …
    GPU speedup compared against same or similar code running on a single-CPU machine; performance measured internally or independently.
  • 10.
  • 11. MD Average Speedups
    The blue node contains dual E5-2687W CPUs (8 cores per CPU). The green nodes contain dual E5-2687W CPUs (8 cores per CPU) and 1 or 2 NVIDIA K10, K20, or K20X GPUs.
    Chart: performance relative to CPU only, for CPU, CPU + K10, CPU + K20, CPU + K20X, CPU + 2x K10, CPU + 2x K20, and CPU + 2x K20X.
    Average speedup calculated from 4 AMBER, 3 NAMD, 3 LAMMPS, and 1 GROMACS test cases. Error bars show the maximum and minimum speedup for each hardware configuration.
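The averaging and error-bar construction described on this slide can be sketched in a few lines. The speedup values below are illustrative placeholders, not the measured data behind the chart:

```python
# Sketch of the chart's statistics: per-test-case speedups for one hardware
# configuration are averaged, and the error bar spans the min and max.
# The numbers here are made-up placeholders, not NVIDIA's measurements.
cases = [11.2, 6.8, 4.1, 3.5, 9.0, 5.2, 7.7, 2.9, 6.1, 4.8]  # e.g. 10 test cases

avg = sum(cases) / len(cases)
low, high = min(cases), max(cases)
print(f"avg {avg:.2f}x, error bar [{low:.1f}x, {high:.1f}x]")
```

In the real chart this is repeated once per hardware configuration (one bar plus one error bar each).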
  • 12. Molecular Dynamics (MD) Applications
    - AMBER — Features: PMEMD explicit solvent & GB implicit solvent. GPU perf: >100 ns/day (JAC NVE on 2x K20s). Status: Released; multi-GPU, multi-node. Notes: AMBER 12, GPU Support Revision 12.2; http://ambermd.org/gpus/benchmarks.htm#Benchmarks
    - CHARMM — Features: implicit (5x) and explicit (2x) solvent via OpenMM. GPU perf: 2x C2070 equals 32-35x X5667 CPUs. Status: Released; single & multi-GPU in single node. Notes: Release C37b1; http://www.charmm.org/news/c37b1.html#postjump
    - DL_POLY — Features: two-body forces, link-cell pairs, Ewald SPME forces, Shake VV. GPU perf: 4x. Status: Release V 4.03; multi-GPU, multi-node. Notes: source only, results published; http://www.stfc.ac.uk/CSE/randd/ccg/software/DL_POLY/25526.aspx
    - GROMACS — Features: implicit (5x) and explicit (2x) solvent. GPU perf: 165 ns/day (DHFR on 4x C2075s). Status: Released; multi-GPU, multi-node. Notes: Release 4.6; first multi-GPU support
    - LAMMPS — Features: Lennard-Jones, Gay-Berne, Tersoff & many more potentials. GPU perf: 3.5-18x on Titan. Status: Released; multi-GPU, multi-node. Notes: http://lammps.sandia.gov/bench.html#desktop and http://lammps.sandia.gov/bench.html#titan
    - NAMD — Features: full electrostatics with PME and most simulation features. GPU perf: 4.0 ns/day (F1-ATPase on 1x K20X). Status: Released; 100M-atom capable; multi-GPU, multi-node. Notes: NAMD 2.9
    GPU perf compared against multi-core x86 CPU socket; benchmarked on GPU-supported features and may be a kernel-to-kernel perf comparison.
  • 13. New/Additional MD Applications Ramping
    - Abalone — Features: simulations. GPU perf: 4-29x (on 1060 GPU). Status: Released, Version 1.8.51; single GPU. Notes: Agile Molecule, Inc.
    - Ascalaph — Features: computation of non-valent interactions. GPU perf: 4-29x (on 1060 GPU). Status: Released, Version 1.1.4; single GPU. Notes: Agile Molecule, Inc.
    - ACEMD — Features: written for use only on GPUs. GPU perf: 150 ns/day (DHFR on 1x K20). Status: Released; single and multi-GPUs. Notes: production bio-molecular dynamics (MD) software specially optimized to run on GPUs
    - Folding@Home — Features: powerful distributed-computing molecular dynamics system; implicit solvent and folding. GPU perf: depends upon number of GPUs. Status: Released; GPUs and CPUs. Notes: http://folding.stanford.edu; GPUs get 4x the points of CPUs
    - GPUGrid.net — Features: high-performance all-atom biomolecular simulations; explicit solvent and binding. GPU perf: depends upon number of GPUs. Status: Released; NVIDIA GPUs only. Notes: http://www.gpugrid.net/
    - HALMD — Features: simple fluids and binary mixtures (pair potentials, high-precision NVE and NVT, dynamic correlations). GPU perf: up to 66x on 2090 vs. 1 CPU core. Status: Released, Version 0.2.0; single GPU. Notes: http://halmd.org/benchmarks.html#supercooled-binary-mixture-kob-andersen
    - HOOMD-Blue — Features: written for use only on GPUs. GPU perf: Kepler 2x faster than Fermi. Status: Released, Version 0.11.2; single and multi-GPU on 1 node; multi-GPU w/ MPI in March 2013. Notes: http://codeblue.umich.edu/hoomd-blue/
    - OpenMM — Features: implicit and explicit solvent, custom forces. GPU perf: implicit 127-213 ns/day, explicit 18-55 ns/day (DHFR). Status: Released, Version 4.1.1; multi-GPU. Notes: library and application for high-performance molecular dynamics
    GPU perf compared against multi-core x86 CPU socket; benchmarked on GPU-supported features and may be a kernel-to-kernel perf comparison.
• 14. Built from the Ground Up for GPUs: Computational Chemistry
What: Study disease & discover drugs; predict drug and protein interactions.
Why: Speed of simulations is critical. Enables study of longer timeframes, larger systems, and more simulations.
How: GPUs increase throughput & accelerate simulations.
GPU-ready applications: Abalone, ACEMD, AMBER, DL_POLY, GAMESS, GROMACS, LAMMPS, NAMD, NWChem, Q-CHEM, Quantum Espresso, TeraChem.
AMBER 11 application: 4.6x performance increase with 2 GPUs at only a 54% added cost.*
* AMBER 11 Cellulose NPT on 2x E5670 CPUs + 2x Tesla C2090s (per node) vs. 2x E5670 CPUs (per node). Cost of a CPU node assumed to be $9333; cost of adding two (2) C2090s to a single node assumed to be $5333.
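The price/performance claim above is simple arithmetic. The deck contains no code, so the following is only an illustrative Python sketch using the node cost, GPU add-on cost, and speedup stated on the slide:

```python
# Illustrative sketch of the slide's price/performance arithmetic.
# All three input figures come from the slide itself.
cpu_node_cost = 9333.0    # assumed cost of the dual-CPU node ($)
gpu_addon_cost = 5333.0   # assumed cost of adding 2x Tesla C2090 ($)
speedup = 4.6             # AMBER 11 Cellulose NPT, 2 GPUs vs. CPU-only

added_cost_fraction = gpu_addon_cost / cpu_node_cost  # added cost vs. base node
perf_per_dollar = speedup / (1.0 + added_cost_fraction)

print(f"Added cost fraction: {added_cost_fraction:.2f}")
print(f"Performance per dollar vs. CPU-only node: {perf_per_dollar:.1f}x")
```

With these inputs the GPU-equipped node delivers roughly 2.9x the performance per dollar of the CPU-only node.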
• 15. AMBER 12 GPU Support Revision 12.2 (1/22/2013)
• 16. Kepler: Our Fastest Family of GPUs Yet
Factor IX running AMBER 12 GPU Support Revision 12.1. The blue node contains dual E5-2687W CPUs (8 cores per CPU); the green nodes contain dual E5-2687W CPUs plus 1x NVIDIA M2090, K10, K20, or K20X.
Nanoseconds/day: CPU node 3.42; + M2090 11.85 (3.5x); + K10 18.90 (5.6x); + K20 22.44 (6.6x); + K20X 25.39 (7.4x).
GPU speedup/throughput increased from 3.5x (with M2090) to 7.4x (with K20X) when compared to a CPU-only node.
• 17. K10 Accelerates Simulations of All Sizes
Running AMBER 12 GPU Support Revision 12.1. The blue node contains dual E5-2687W CPUs (8 cores per CPU); the green nodes add 1x NVIDIA K10 GPU.
Speedup vs. CPU only: TRPcage GB 2.00; JAC NVE PME 5.50; Factor IX NVE PME 5.53; Cellulose NVE PME 5.04; Myoglobin GB 19.98; Nucleosome GB 24.00.
Gain 24x performance (Nucleosome) by adding just 1 GPU when compared to dual-CPU performance.
• 18. K20 Accelerates Simulations of All Sizes
Running AMBER 12 GPU Support Revision 12.1, SPFP with CUDA 4.2.9, ECC off. The blue node contains 2x Intel E5-2687W CPUs (8 cores per CPU); each green node adds 1x NVIDIA K20 GPU.
Speedup vs. CPU only: TRPcage GB 2.66; JAC NVE PME 6.50; Factor IX NVE PME 6.56; Cellulose NVE PME 7.28; Myoglobin GB 25.56; Nucleosome GB 28.00.
Gain 28x throughput/performance (Nucleosome) by adding just one K20 GPU when compared to dual-CPU performance. AMBER Benchmark Report, Revision 2.0, dated Nov. 5, 2012.
• 19. K20X Accelerates Simulations of All Sizes
Running AMBER 12 GPU Support Revision 12.1. The blue node contains dual E5-2687W CPUs (8 cores per CPU); the green nodes add 1x NVIDIA K20X GPU.
Speedup vs. CPU only: TRPcage GB 2.79; JAC NVE PME 7.15; Factor IX NVE PME 7.43; Cellulose NVE PME 8.30; Myoglobin GB 28.59; Nucleosome GB 31.30.
Gain 31x performance (Nucleosome) by adding just one K20X GPU when compared to dual-CPU performance.
• 20. K10 Strong Scaling over Nodes
Cellulose 408K atoms (NPT), running AMBER 12 with CUDA 4.2, ECC off. The blue nodes contain 2x Intel X5670 CPUs (6 cores per CPU); the green nodes add 2x NVIDIA K10 GPUs.
Nanoseconds/day, CPU only vs. with GPUs: 1 node, 0.65 vs. 3.31 (5.1x); 2 nodes, 1.14 vs. 4.13 (3.6x); 4 nodes, 2.01 vs. 4.8 (2.4x).
GPUs significantly outperform CPUs while scaling over multiple nodes.
• 21. Kepler: Universally Faster
Running AMBER 12 GPU Support Revision 12.1. The CPU-only node contains dual E5-2687W CPUs (8 cores per CPU); the Kepler nodes add 1x NVIDIA K10, K20, or K20X GPU.
The chart plots speedup vs. the CPU-only node for JAC, Factor IX, and Cellulose on each GPU configuration.
The Kepler GPUs accelerated all simulations, up to 8x.
• 22. K10 Extreme Performance
JAC 23K atoms (NVE), running AMBER 12 GPU Support Revision 12.1. The blue node contains dual E5-2687W CPUs (8 cores per CPU); the green node adds 2x NVIDIA K10 GPUs.
Nanoseconds/day (DHFR): 12.47 (CPU node) vs. 97.99 (CPU node + 2x K10).
Gain 7.8x performance by adding just 2 GPUs when compared to dual-CPU performance.
• 23. K20 Extreme Performance
DHFR JAC 23K atoms (NVE), running AMBER 12 GPU Support Revision 12.1, SPFP with CUDA 4.2.9, ECC off. The blue node contains 2x Intel E5-2687W CPUs (8 cores per CPU); the green node adds 2x NVIDIA K20 GPUs.
Nanoseconds/day (DHFR): 12.47 (CPU node) vs. 95.59 (CPU node + 2x K20).
Gain >7.5x throughput/performance by adding just 2 K20 GPUs when compared to dual-CPU performance.
• 24. Replace 8 Nodes with 1 K20 GPU
Running AMBER 12 GPU Support Revision 12.1, SPFP with CUDA 4.2.9, ECC off. The eight blue nodes each contain 2x Intel E5-2687W CPUs (8 cores per CPU); the green node contains 2x Intel E5-2687W CPUs plus 1x NVIDIA K20 GPU.
DHFR: the 8 CPU nodes deliver 65.00 ns/day at $32,000; 1 CPU node + K20 delivers 81.09 ns/day at $6,500.
Note: Typical CPU and GPU node pricing used. Pricing may vary depending on node configuration; contact your preferred HW vendor for actual pricing.
Cut simulation costs to ¼ and gain higher performance.
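The node-replacement argument reduces to two ratios. A minimal Python sketch, using only the ns/day and dollar figures from the slide (the deck itself contains no code):

```python
# Sketch of the slide's node-replacement arithmetic; all figures are the
# slide's "typical" numbers, not price quotes.
cpu_cluster_cost = 32000.0   # eight dual-CPU nodes ($)
cpu_cluster_perf = 65.00     # ns/day, DHFR JAC NVE
gpu_node_cost = 6500.0       # one dual-CPU node + 1x K20 ($)
gpu_node_perf = 81.09        # ns/day

cost_ratio = gpu_node_cost / cpu_cluster_cost   # fraction of the cluster's cost
perf_ratio = gpu_node_perf / cpu_cluster_perf   # relative throughput

print(f"GPU node cost: {cost_ratio:.2f} of the 8-node cluster")
print(f"GPU node performance: {perf_ratio:.2f}x the 8-node cluster")
```

The single GPU node costs about a fifth of the cluster while delivering roughly 1.25x its throughput, which is the basis for the slide's "costs to ¼, higher performance" claim.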
• 25. Replace 7 Nodes with 1 K10 GPU
Performance on JAC NVE vs. cost, running AMBER 12 GPU Support Revision 12.1, SPFP with CUDA 4.2.9, ECC off. The eight blue nodes each contain 2x Intel E5-2687W CPUs (8 cores per CPU); the green node contains 2x Intel E5-2687W CPUs plus 1x NVIDIA K10 GPU.
DHFR: the CPU-only cluster costs $32,000; the GPU-enabled node costs $7,000.
Note: Typical CPU and GPU node pricing used. Pricing may vary depending on node configuration; contact your preferred HW vendor for actual pricing.
Cut simulation costs to ¼ and increase performance by 70%.
• 26. Extra CPUs Decrease Performance
Cellulose NVE, running AMBER 12 GPU Support Revision 12.1. The orange bars use one E5-2687W CPU (8 cores per CPU); the blue bars use dual E5-2687W CPUs. Each configuration is shown CPU-only and with dual K20s.
When used with GPUs, dual CPU sockets perform worse than single CPU sockets.
• 27. Kepler: Greener Science
Energy used in simulating 1 ns of DHFR JAC, running AMBER 12 GPU Support Revision 12.1. The blue node contains dual E5-2687W CPUs (150W each, 8 cores per CPU); the green nodes add 1x NVIDIA K10, K20, or K20X GPU (235W each). Lower is better.
Energy expended = power x time.
The GPU-accelerated systems use 65-75% less energy.
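The energy metric used throughout these slides is simply power multiplied by wall-clock time. A small Python sketch of that calculation, using the TDP and seconds-per-nanosecond figures recorded for this slide in the editor's notes at the end of the deck:

```python
# Energy = Power x Time, per simulated nanosecond of DHFR JAC.
# (TDP in watts, seconds per simulated ns; figures from the editor's notes.)
configs = {
    "2x E5-2687W":        (300.0, 6928.0),
    "2x E5-2687W + K10":  (535.0, 1259.0),
    "2x E5-2687W + K20":  (535.0, 1065.0),
    "2x E5-2687W + K20X": (535.0,  969.0),
}

for name, (watts, seconds_per_ns) in configs.items():
    kilojoules = watts * seconds_per_ns / 1000.0  # W * s = J; /1000 -> kJ
    print(f"{name}: {kilojoules:.0f} kJ per simulated ns")
```

The CPU-only node spends about 2078 kJ per nanosecond versus roughly 518 kJ with a K20X, which is where the 65-75% savings figure comes from: the GPU node draws more power but finishes so much sooner that total energy drops sharply.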
• 28. Recommended GPU Node Configuration for AMBER Computational Chemistry
Workstation or single-node configuration:
- CPU sockets: 2
- Cores per CPU socket: 4+ (1 CPU core drives 1 GPU)
- CPU speed: 2.66 GHz+
- System memory per node: 16 GB
- GPUs: Kepler K10, K20, K20X; Fermi M2090, M2075, C2075
- GPUs per CPU socket: 1-2 (4 GPUs on 1 socket is good for 4 fast serial GPU runs)
- GPU memory preference: 6 GB
- GPU-to-CPU connection: PCIe 2.0 x16 or higher
- Server storage: 2 TB
Scale to multiple nodes with the same single-node configuration.
• 29. Benefits of GPU-Accelerated AMBER Computing
- Faster than CPU-only systems in all tests
- Most major compute-intensive aspects of classical MD ported
- Large performance boost with marginal price increase
- Energy usage cut by more than half
- GPUs scale well within a node and over multiple nodes
- The K20 GPU is our fastest and lowest-power high-performance GPU yet
Try GPU-accelerated AMBER for free: www.nvidia.com/GPUTestDrive
• 31. Kepler: Our Fastest Family of GPUs Yet
ApoA1 (Apolipoprotein A1) running NAMD version 2.9. The blue node contains dual E5-2687W CPUs (8 cores per CPU); the green nodes add 1x NVIDIA M2090, K10, K20, or K20X.
Nanoseconds/day: CPU node 1.37; + M2090 2.63 (1.9x); + K10 3.45 (2.5x); + K20 3.57 (2.6x); + K20X 4.00 (2.9x).
GPU speedup/throughput increased from 1.9x (with M2090) to 2.9x (with K20X) when compared to a CPU-only node. NAMD Benchmark Report, Revision 2.0, dated Nov. 5, 2012.
• 32. Accelerates Simulations of All Sizes
Running NAMD 2.9 with CUDA 4.0, ECC off. The blue node contains 2x Intel E5-2687W CPUs (8 cores per CPU); each green node adds 1x NVIDIA K20 GPU.
Speedup vs. CPU only: ApoA1 (Apolipoprotein A1) 2.6; F1-ATPase 2.4; STMV 2.7.
Gain 2.5x throughput/performance by adding just 1 GPU when compared to dual-CPU performance.
• 33. Kepler: Universally Faster
Running NAMD version 2.9. The CPU-only node contains dual E5-2687W CPUs (8 cores per CPU); the Kepler nodes add 1 or 2 NVIDIA K10, K20, or K20X GPUs.
Average speedup vs. CPU only (across F1-ATPase, ApoA1, and STMV): 1x K10 2.4x; 1x K20 2.6x; 1x K20X 2.9x; 2x K10 4.3x; 2x K20 4.7x; 2x K20X 5.1x.
The Kepler GPUs accelerate all simulations, up to 5x.
• 34. Outstanding Strong Scaling with Multi-STMV
100 STMV (a concatenation of 100 Satellite Tobacco Mosaic Virus systems) on hundreds of nodes, running NAMD version 2.9. Each blue XE6 CPU node contains 1x AMD Opteron 1600 (16 cores per CPU); each green XK6 CPU+GPU node adds 1x NVIDIA X2090 (Fermi) GPU.
From 32 to 768 nodes, the GPU nodes sustain 2.7-3.8x the nanoseconds/day of the CPU nodes.
Accelerate your science by 2.7-3.8x when compared to CPU-based supercomputers.
• 35. Replace 3 Nodes with 1 M2090 GPU
F1-ATPase running NAMD version 2.9. Each blue node contains 2x Intel Xeon X5550 CPUs (4 cores per CPU); the green node adds 1x NVIDIA M2090 GPU.
4 CPU nodes: 0.63 ns/day at $8,000. 1 CPU node + 1x M2090: 0.74 ns/day at $4,000.
Note: Typical CPU and GPU node pricing used. Pricing may vary depending on node configuration; contact your preferred HW vendor for actual pricing.
Speedup of 1.2x for 50% of the cost.
• 36. K20: Greener, Twice the Science per Watt
Energy used in simulating 1 nanosecond of ApoA1, running NAMD version 2.9. Each blue node contains dual E5-2687W CPUs (95W, 4 cores per CPU); each green node contains 2x Intel Xeon X5550 CPUs (95W, 4 cores per CPU) and 2x NVIDIA K20 GPUs (225W per GPU). Lower is better.
Energy expended = power x time.
Cut energy usage by half with GPUs. NAMD Benchmark Report, Revision 2.0, dated Nov. 5, 2012.
• 37. Kepler: Greener, Twice the Science per Joule
Energy used in simulating 1 ns of STMV (Satellite Tobacco Mosaic Virus), running NAMD version 2.9. The blue node contains dual E5-2687W CPUs (150W each, 8 cores per CPU); the green nodes add 2x NVIDIA K10, K20, or K20X GPUs (235W each). Lower is better.
Energy expended = power x time.
Cut energy usage by half with GPUs.
• 38. Recommended GPU Node Configuration for NAMD Computational Chemistry
Workstation or single-node configuration:
- CPU sockets: 2
- Cores per CPU socket: 6+
- CPU speed: 2.66 GHz+
- System memory per socket: 32 GB
- GPUs: Kepler K10, K20, K20X; Fermi M2090, M2075, C2075
- GPUs per CPU socket: 1-2
- GPU memory preference: 6 GB
- GPU-to-CPU connection: PCIe 2.0 or higher
- Server storage: 500 GB or higher
- Network configuration: Gemini, InfiniBand
Scale to multiple nodes with the same single-node configuration.
• 39. Summary/Conclusions: Benefits of GPU-Accelerated Computing
- Faster than CPU-only systems in all tests
- Large performance boost with a small marginal price increase
- Energy usage cut in half
- GPUs scale very well within a node and over multiple nodes
- The Tesla K20 GPU is our fastest and lowest-power high-performance GPU to date
Try GPU-accelerated NAMD for free: www.nvidia.com/GPUTestDrive
  • 40. LAMMPS, Jan. 2013 or later
• 41. More Science for Your Money
Embedded Atom Model (EAM). The blue node uses 2x E5-2687W CPUs (8 cores and 150W per CPU); the green nodes add 1 or 2 NVIDIA K10, K20, or K20X GPUs (235W).
Speedup vs. CPU only: + 1x K10 1.7; + 1x K20 2.47; + 1x K20X 2.92; + 2x K10 3.3; + 2x K20 4.5; + 2x K20X 5.5.
Experience performance increases of up to 5.5x with Kepler GPU nodes.
• 42. K20X, the Fastest GPU Yet
The blue node uses 2x E5-2687W CPUs (8 cores and 150W per CPU); the green nodes add 2x NVIDIA M2090s, 1x K20X, or 2x K20X GPUs (235W).
Experience performance increases of up to 6.2x with Kepler GPU nodes. One K20X performs as well as two M2090s.
• 43. Get a CPU Rebate to Fund Part of Your GPU Budget
Acceleration in loop-time computation by additional GPUs, running NAMD version 2.9. The blue node contains dual X5670 CPUs (6 cores per CPU); the green nodes contain dual X5570 CPUs (4 cores per CPU) and 1-4 NVIDIA M2090 GPUs.
Speedup normalized to CPU only: + 1x M2090 5.31; + 2x M2090 9.88; + 3x M2090 12.9; + 4x M2090 18.2.
Increase performance 18x when compared to CPU-only nodes. Cheaper CPUs used with GPUs are still faster overall than more expensive CPUs alone.
• 44. Excellent Strong Scaling on Large Clusters
LAMMPS Gay-Berne, 134M atoms. Each blue Cray XE6 node has 2x AMD Opteron CPUs (16 cores per CPU); each green Cray XK6 node has 1x AMD Opteron 1600 CPU (16 cores per CPU) and 1x NVIDIA X2090.
From 300 to 900 nodes, the NVIDIA GPU-powered XK6 maintained roughly 3.5x the performance of the XE6 CPU nodes (loop-time speedups of 3.45-3.65x).
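The per-node-count speedups quoted on the scaling slides are just the ratio of CPU-only loop time to GPU-accelerated loop time at the same node count. A short Python sketch using the XE6/XK6 Gay-Berne timings recorded for this slide in the editor's notes:

```python
# Strong-scaling speedup = CPU-only loop time / GPU-accelerated loop time,
# at matched node counts (seconds; figures from the editor's notes).
nodes    = [300, 400, 500, 600, 700, 800, 900]
cpu_time = [563.96, 423.83, 339.62, 281.58, 260.98, 220.83, 203.13]
gpu_time = [159.06, 118.62, 96.44, 81.03, 71.57, 63.76, 58.96]

for n, cpu_s, gpu_s in zip(nodes, cpu_time, gpu_time):
    print(f"{n} nodes: {cpu_s / gpu_s:.2f}x speedup")
```

The ratio stays between about 3.45x and 3.65x across the whole 300-900 node range, which is what "maintained 3.5x" means here: the GPU advantage does not erode as the job spreads over more nodes.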
• 45. GPUs Sustain 5x Performance for Weak Scaling
Weak scaling with 32K atoms per node, from 1 to 729 nodes. Each blue Cray XE6 node has 2x AMD Opteron CPUs (16 cores per CPU); each green Cray XK6 node has 1x AMD Opteron 1600 CPU (16 cores per CPU) and 1x NVIDIA X2090.
Performance of 4.8x-6.7x with GPU-accelerated nodes when compared to CPUs alone.
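In a weak-scaling run the problem grows with the machine (here, 32K atoms per node), so the ideal loop time is flat and the interesting number is again the CPU/GPU time ratio at each size. A sketch in Python using a subset of the node/time figures recorded for this slide in the editor's notes:

```python
# Weak scaling: atom count grows with node count (32K atoms per node),
# so loop times should stay roughly constant as nodes are added.
# (nodes, CPU-only loop time, CPU+GPU loop time); from the editor's notes.
runs = [(1, 42.2, 6.33), (8, 41.8, 6.73), (27, 41.5, 6.86),
        (64, 41.5, 7.18), (729, 42.5, 8.92)]

for nodes, cpu_s, gpu_s in runs:
    atoms = 32768 * nodes  # 32K atoms per node
    print(f"{nodes} nodes ({atoms} atoms): {cpu_s / gpu_s:.2f}x speedup")
```

The speedup drifts from about 6.7x at 1 node down to about 4.8x at 729 nodes (23.9M atoms), matching the slide's 4.8x-6.7x range.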
• 46. Faster, Greener — Worth It!
Energy consumed in one loop of EAM. Lower is better. GPU-accelerated computing uses 53% less energy than CPU only.
Energy expended = power x time; power calculated by combining the components' TDPs.
The blue node uses 2x E5-2687W CPUs (8 cores and 150W per CPU) and CUDA 4.2.9; the green nodes add 1 or 2 NVIDIA K20X GPUs (235W) running CUDA 5.0.36.
Try GPU-accelerated LAMMPS for free: www.nvidia.com/GPUTestDrive
• 47. Molecular Dynamics with LAMMPS on a Hybrid Cray Supercomputer. W. Michael Brown, National Center for Computational Sciences, Oak Ridge National Laboratory. NVIDIA Technology Theater, Supercomputing 2012, November 14, 2012.
• 48. Early Kepler Benchmarks on Titan
Log-scale charts of loop time (s) vs. node count for an atomic fluid and for bulk copper, comparing XK7+GPU, XK6+GPU, and CPU-only XK6 configurations, from 1 to 128 nodes and from 1 up to 16,384 nodes.

Editor's Notes

  1. Note the rise of GPU only applications and GPU-grid applications. This indicates that GPUs are a sweet spot for MD.
  2. Due to the great interest in speeding up quantum chemistry applications, NVIDIA has begun reaching out to these developers to see how it may assist their development on GPUs. These developers either have active GPU development projects or have released applications.
  6. Performance by configuration (columns: CPU only, + K10, + K20, + K20X, + 2x K10, + 2x K20, + 2x K20X):
AMBER (ns/day): cellulose 0.74, 4.34, 5.39, 6.14, 6.4, 6.78, 7.5; factor9 nve 3.42, 18.9, 22.4, 25.4, 29.2, 28.1, 31.4; jac nve 12.47, 68.6, 81.1, 89.1, 98, 95.6, 102.1; trpcage 210, 420, 559, 585, 418, 451, 475.
NAMD (ns/day): apoa1 1.37, 3.45, 3.57, 4, 6.25, 6.67, 7.14; atpase 0.46, 0.96, 1.12, 1.25, 1.78, 2.04, 2.22; stmv 0.115, 0.29, 0.31, 0.35, 0.52, 0.56, 0.61.
LAMMPS (speedup): fluid lj 5: 1, 1.95, 1.11, 2.82, 4.38, 2.22, 3.62; eam 1, 1.7, 2.47, 2.92, 3.3, 4.5, 5.5; rhodopsin 1, 1.33, 0.77, 1.6, 2.35, 1.48, 2.28.
GROMACS: rnase 46.7, 109, 120.
  8. ns/day: Dual E5-2687W CPUs 3.4; + M2090 11.9; + K10 18.9; + K20 22.4; + K20X 25.39.
  9. CPU ns/day vs. GPU ns/day: trpcage 210 / 420; jac nve 12.47 / 68.6; factor 9 3.42 / 18.9; cellulose 0.74 / 3.73; myoglobin 6.12 / 122.3; nucleosome 0.1 / 2.4.
  10. CPU ns/day vs. GPU ns/day: TRPcage GB 210.32 / 559.32; JAC NVE PME 12.47 / 81.09; Factor IX NVE PME 3.42 / 22.44; Cellulose NVE PME 0.74 / 5.39; Myoglobin GB 6.12 / 156.45; Nucleosome GB 0.10 / 2.80. SPFP, ECC off.
  11. CPU ns/day vs. GPU ns/day: trpcage 210 / 585; jac nve 12.47 / 89.13; factor 9 3.42 / 25.4; cellulose 0.74 / 6.14; myoglobin 6.12 / 175.77; nucleosome 0.1 / 3.13.
  12. Nodes, CPU ns/day vs. GPU ns/day: 1 node 0.65 / 3.31; 2 nodes 1.14 / 4.13; 4 nodes 2.01 / 4.8.
  13. Energy (TDP in W, sec/ns, energy in kJ): 2x E5-2687: 300, 6928, 2078; 2x E5-2687 + K10: 535, 1259, 673; 2x E5-2687 + K20: 535, 1065, 569; 2x E5-2687 + K20X: 535, 969, 518.
  14. 1 CPU node (dual CPUs) = 12.47 ns/day; 1 CPU + GPU node (dual CPUs and GPUs) = 95.59 ns/day.
  15. Perf lab (columns: no GPU, K10, K20, K20X, 2x K10, 2x K20, 2x K20X): cellulose, 1 CPU: 0.37, 4.44, 5.4, 6.16, 6.37, 6.93, 7.67; cellulose, 2 CPU: 0.74, 4.34, 5.39, 6.14, 6.4, 6.78, 7.5.
  16. Energy (TDP in W, sec/ns, energy in kJ): 2x E5-2687: 300, 6928, 2078; 2x E5-2687 + K10: 535, 1259, 673; 2x E5-2687 + K20: 535, 1065, 569; 2x E5-2687 + K20X: 535, 969, 518.
  17. ns/day: Dual E5-2687W CPUs 1.370; + M2090 2.632; + K10 3.448; + K20 3.571; + K20X 4.000.
  18. CPU ns/day vs. GPU ns/day: ApoA1 1.370 / 3.571; F1-ATPase 0.461 / 1.124; STMV 0.116 / 0.314. ECC off.
  19. All #s are days/ns (ApoA1, ATPase, STMV): CPU only 0.73, 2.17, 8.64; 1x K10 0.29, 1.04, 3.5; 1x K20 0.28, 0.89, 3.18; 1x K20X 0.25, 0.8, 2.87; 2x K10 0.16, 0.56, 1.93; 2x K20 0.15, 0.49, 1.77; 2x K20X 0.14, 0.45, 1.63.
  20. Nodes: 32, 64, 128, 256, 512, 640, 768. s/step GPU XK6: 1.2414, 0.660887, 0.342743, 0.199465, 0.10837, 0.089752, 0.0774948. s/step CPU XK6: 4.62633, 2.36707, 1.19722, 0.609124, 0.314745, 0.255016, 0.209511. ns/day Fermi XK6: 0.069599, 0.130733, 0.252084, 0.433159, 0.797269, 0.962655, 1.114914. ns/day CPU XK6: 0.018676, 0.036501, 0.072167, 0.141843, 0.274508, 0.338802, 0.412389.
  21. Config (TDP, sec/ns, energy): 2x E5-2687W: 150, 63,072.0, 9,460,800.0; 2x E5-2687W + 2x K20: 600, 24,192.0, 14,515,200. TDP = Thermal Design Power.
  22. ns/day, TDP, energy: CPU 0.115, 300, 223k; K10s 0.518, 770, 128k; K20s 0.565, 770, 117k; K20Xs 0.613, 770, 108k.
  23. Loop time (CPU only, + K10, + 2x K10, 1x K20, 2x K20, 1x K20X, 2x K20X): 382.13, 225, 115.4, 154.6, 84.2, 130.5, 69.9.
  24. Loop time (CPU only, + K10, + 2x K10, 1x K20, 2x K20, 1x K20X, 2x K20X): 382.13, 225, 115.4, 154.6, 84.2, 130.5, 69.9.
  25. Config / loop time: 2x X5670 (HP Z800): 2717.630; 1x M2090 (2x X5570): 511.750; 2x M2090 (2x X5570): 274.970; 3x M2090 (2x X5570): 210.430; 4x M2090 (2x X5570): 148.880.
  26. Nodes: 300, 400, 500, 600, 700, 800, 900. CPU-only time: 563.96, 423.83, 339.62, 281.58, 260.98, 220.83, 203.13. CPU+GPU time: 159.06, 118.62, 96.44, 81.03, 71.57, 63.76, 58.96. GPU speedup ratio: 3.55, 3.57, 3.52, 3.48, 3.65, 3.46, 3.45.
  27. Nodes, box size, atoms, CPU time, CPU+GPU time, GPU speedup: 1, 1x1x1, 32768, 42.2, 6.33, 6.67x; 8, 2x2x2, 262144, 41.8, 6.73, 6.21x; 27, 3x3x3, 884736, 41.5, 6.86, 6.05x; 64, 4x4x4, 2097152, 41.5, 7.18, 5.78x; 125, 5x5x5, 4096000, 41.4, 7.18, 5.77x; 216, 6x6x6, 7077888, 42, 7.66, 5.48x; 343, 7x7x7, 11239424, 41.9, 8.34, 5.02x; 512, 8x8x8, 16777216, 42.3, 8.41, 5.03x; 729, 9x9x9, 23887872, 42.5, 8.92, 4.76x.
  28. Power (W), time, energy spent: CPU 300, 382, 114; CPU + 1 K20X 535, 130, 69; CPU + 2 K20X 770, 70, 54.
  29. ns/day: Single E5-2687W CPU 4.35 (1.0x); Dual E5-2687W CPUs 7.32 (1.7x); Single + M2090 7.33 (1.7x); Dual + M2090 7.54 (1.7x); Single + K10 13.24 (3.0x); Dual + K10 13.24 (3.0x); Single + K20 11.6 (2.7x); Dual + K20 12.26 (2.8x); Single + K20X 11.99 (2.7x); Dual + K20X 12.27 (2.8x).
  30. Nodes, CPU only vs. GPU: 1: 2.26 / 8.36; 2: 3.58 / 13.01; 4: 6.7 / 21.68.
  31. Nodes, CPU vs. GPU: 8: 6.613 / 20.335; 16: 11.282 / 37.016; 32: 23.067 / 63.876; 64: 42.284 / 96.628; 128: 72.694 / 144.424.
  32. nanoseconds/day: 8x X5550: 6.72; M2090 + 2x X5550: 8.36. CPU node: 4 x 2 x $1000 = $8000. CPU + GPU node: 1 x 2 x $1000 + 2 x $2000 = $6000.
  33. GPU: 640 (watts) x 10,334 (seconds/nanosecond) = 6.6 megajoules. CPU: 760 (watts) x 12,895 (seconds/nanosecond) = 9.8 megajoules.
  34. Configs: 44 CPUs; 2 CPU 1 GPU; 2 CPU 2 GPU. ns/day: 60, 25.3, 42.4. Price: 60000, 3000, 4000.
  35. Configs: 44 CPUs; 2 CPU 1 GPU; 2 CPU 2 GPU. ns/day: 60, 25.3, 42.4. Price: 60000, 3000, 4000. Scaled price: 1, 0.05, 0.0667. Perf/price: 1, 8.433, 10.6.
  36. Configs: 64 CPU; 2 CPU 1 GPU; 2 CPU 2 GPU. ns/day: 31, 8.9, 15.1. TDP: 6080, 428, 666. sec/ns: 2787.0967, 9707.8651, 5721.8543. Energy/ns: 16945.548, 4154.966, 3810.7549.
  41. Test case not specified in perf lab run
  42. Test case not specified in perf lab run
  43. I am here today to talk to you about the value of seamlessly adding GPUs to the computer on which you run Quantum Espresso/PWscf and achieving phenomenal performance improvements. This small incremental investment will yield a significant performance payback. What is Quantum Espresso/PWscf: a set of programs used to calculate the electron configuration of atoms or molecules; uses plane-wave basis sets and quantum mechanical principles; highly compute-intensive. Benefits of GPU-accelerated computing: faster than CPU-only systems in all tests; performance boost much larger than the marginal price increase; power consumption more than halved in all simulations; GPUs scale very well on clusters with dozens of nodes, and beyond. Price assumes a FERMI workstation ~$4000 and C2050 $1000. Shilu-3 water on calcite: 6 OpenMP CPU nodes: 1025, 1560; 6 OpenMP CPU nodes + 1 GPU: 275, 480. FERMI (ICHEC): assembled workstation. CPU: 2x Intel Xeon X5650 (6-core), 24 GB RAM. GPU: 2x C2050, GTX480, C2075. SW: CUDA 4.1, Intel compilers.
  44. ausurf k-pt: 6 OpenMP CPU nodes 7100 s; + 1 GPU 2350 s. ausurf gamma: 6 OpenMP CPU nodes 7000 s; + 1 GPU 2000 s. FERMI (ICHEC): assembled workstation. CPU: 2x Intel Xeon X5650 (6-core), 24 GB RAM. GPU: 2x C2050, GTX480, C2075. SW: CUDA 4.1, Intel compilers.
  45. CPU: Intel X5550, TDP of 95W, priced at $350. GPU: NVIDIA M2070, TDP of 225W, priced at $2000. PLX (CINECA): IBM iDataPlex DX360M3, 264 GPU nodes. CPU: 2x Intel Westmere X5550 (6-core), 48 GB RAM. GPU: 2x M2070. SW: CUDA 4.0, Intel compilers, PGI (11.x).
  46. National average 9.83 cents/kWh. kWh/sim, tests/year, $/test, yearly energy bill: CPU 42.37, 2357, $4.16, $9816; GPU/CPU 23.214, 2400, $2.28, $5476. CPU: Intel X5550, TDP of 95W, priced at $350. GPU: NVIDIA M2070, TDP of 225W, priced at $2000. PLX (CINECA): IBM iDataPlex DX360M3, 264 GPU nodes. CPU: 2x Intel Westmere X5550 (6-core), 48 GB RAM. GPU: 2x M2070. SW: CUDA 4.0, Intel compilers, PGI (11.x).
  47. FERMI (ICHEC): assembled workstation. CPU: 2x Intel Xeon X5650 (6-core), 24 GB RAM. GPU: 2x C2050, GTX480, C2075. SW: CUDA 4.1, Intel compilers.
  48. Total # of cores: 2 (16), 4 (32), 6 (48), 8 (64), 10 (80), 12 (96), 14 (112). Time (s) CPU: 31000, 16500, 11000, 9500, 7500, 6000, 5500. Time GPU+CPU: 12500, 7000, 5000, 4500, 3500, 3000, 2500. Speedup: 2.48, 2.357, 2.2, 2.111, 2.143, 2, 2.2. STONEY (ICHEC): Bull Novascale R422-E2, 24 GPU nodes. CPU: 2x Intel (Nehalem EP) Xeon X5560, 48 GB RAM. GPU: 2x M2090. SW: CUDA 4.0, Intel compilers.
  49. # of cores: 4 (48), 8 (96), 12 (144), 16 (192), 24 (288), 32 (384), 44 (528). Time CPU: 3925, 2650, 2525, 2450, 1740, 1290, 1337. Time GPU+CPU: 2425, 1437, 1075, 900, 737, 675, 637. Speedup: 1.619, 1.844, 2.349, 2.722, 2.361, 1.911, 2.099. PLX (CINECA): IBM iDataPlex DX360M3, 264 GPU nodes. CPU: 2x Intel Westmere (6-core), 48 GB RAM. GPU: 2x M2070. SW: CUDA 4.0, Intel compilers, PGI (11.x).
  50. I am here today to talk to you about the value of seamlessly adding GPUs to the computer on which you run TeraChem and achieving phenomenal performance improvements. This small incremental investment will yield a significant performance payback. Benefits of GPU acceleration with TeraChem: compete with supercomputers; more powerful hardware; significantly lower energy usage.
  51. Before we end this session, I would like to tell you about the GPU Test Drive. It is an excellent resource for computational chemistry researchers such as yourself to evaluate the benefits of GPU computing in speeding up your simulations. Most importantly, it is free. NVIDIA, along with its partners, is offering access to a remotely hosted GPU cluster. You can run applications such as AMBER and NAMD to find out how much your models speed up. You can also try code that you have developed to run on GPUs and see how it scales on an 8-GPU cluster. All you need to do is sign up and log in; it is really that easy! We have several partners demonstrating the GPU Test Drive on the GTC show floor. Please plan on visiting them. Sign-up forms have been given out. If you are interested, please fill them out and return them to me.