This document summarizes a collaboration between the Technische Universität Dresden and Indiana University to analyze the performance of a molecular dynamics simulation code. The code was tested using different parallelization methods, including OpenMP, MPI, and hybrid MPI/OpenMP implementations. Initial tests on a Cray XT5m supercomputer identified the best-performing serial configuration. Further parallel analysis aimed to measure scalability and detect bottlenecks limiting efficiency.
1. PERFORMANCE STUDIES OF A MOLECULAR DYNAMICS CODE
Evaluating Serial, Thread- and Process-Parallel Performance

Molecular Dynamics
A molecular dynamics code simulating the diffusion in dense nuclear matter in white dwarf stars is analyzed in this collaboration. The code is highly configurable, allowing MPI, OpenMP, or hybrid runs and additional fine-tuning with a range of parameters.

Serial Analysis
The first step in the code analysis is to identify the best-performing parameter set. This configuration represents the most promising candidate for further parallel analysis.

Parallel Analysis
Aim of the parallel analysis is to measure the scalability limits of the different parallel code implementations and to detect bottlenecks possibly preventing further parallel efficiency. This work has been done with the parallel analysis framework Vampir.

[Diagram: the analysis cycle Planning → Implementation with VampirTrace → Testing → Evaluation with Vampir]

Collaboration between PTI and ZIH
In 2008 a collaboration between the Technische Universität Dresden, Center for Information Services and High Performance Computing (ZIH), in Germany, and the Indiana University Pervasive Technology Institute (PTI) was founded to mutually benefit from common research areas. These research topics include:
‣ Data-centric computing
‣ Computing for biological and life sciences
‣ Wide area distributed file systems
‣ Parallel computing performance

This collaboration involves ZIH maintaining a stack of performance evaluation tools inside the NSF FutureGrid project. For the analysis, an HPC system called Xray, a Cray XT5m that is part of the FutureGrid hardware, was used. Xray consists of quad-core 64-bit AMD Opteron series 2000 processors. PGI compilers version 9.0.4 and the optimized xtpe-barcelona module provided by Cray were used.

Authors
T. William, M. Weber, D. Röhrig (ZIH, TU Dresden)
D. K. Berry, R. Henschel (UITS, IU Bloomington)
J. Hughto, A. S. Schneider, C. J. Horowitz (Physics, IU Bloomington)
This document was developed with support from the National Science Foundation (NSF) under Grant No. 0910812 to Indiana University for "FutureGrid: An Experimental, High-Performance Grid Test-bed." Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the NSF.
2. MOLECULAR DYNAMICS CODE
Overview
The analyzed molecular dynamics code (MD) has been developed at Indiana University for simulating dense nuclear matter in white dwarf and neutron stars.

Matter in these compact stellar objects consists of either completely ionized atoms, or else free neutrons and protons.

The MD code models these systems as classical particles interacting via a screened Coulomb interaction. The electrons are not modeled explicitly, but are treated as a uniform background charge that serves to screen the Coulomb interaction.

The movie shows a 5125-particle (5k) simulation of an ion mixture forming microcrystals:
‣ 1280 carbon ions
‣ 3740 oxygen ions
‣ 100 neon ions

By studying the motion of ions over a long simulation time, one can calculate the self-diffusion coefficient of the ions, an important quantity for understanding the behavior of white dwarfs.

[Movie: MD-Code Ion Mix Movements]
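For reference, a screened Coulomb (Yukawa) pair potential of the kind described above is conventionally written as follows; the notation is the standard textbook convention, not taken from the MD source:

$$V_{ij}(r) = \frac{Z_i Z_j e^2}{r}\, e^{-r/\lambda}$$

where $Z_i$ and $Z_j$ are the ion charge numbers and $\lambda$ is the screening length set by the uniform electron background.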
3. MOLECULAR DYNAMICS CODE
Implementation Details
MD is a hybrid MPI/OpenMP parallel code written in Fortran. Its main features are:

‣ Simple particle-particle algorithm
‣ Long-range interaction with a large cut-off sphere
‣ Cut-off radius is half the box edge length
‣ Each ion interacts with about half the others
‣ Interactions are calculated in a pair of nested do loops
‣ The outer loop goes over "target" particles
‣ The inner loop goes over "source" particles
‣ Targets are assigned in a round-robin fashion to MPI processes
‣ Within each MPI process, the work is shared among OpenMP threads (see the sketch below)

The MD code is highly modular, consisting of several different implementations of the same simulation logic. The particle-particle interaction semantics have been implemented multiple times in different ways. Each implementation is stored in an individual source file, named PP01 to PP03.

During compilation, the input parameter PP0x decides which file is used for the simulation. Inside the PP0x files, the parameters Ax and Bx denote different code blocks that implement equal semantics in different ways. Additionally, the file PP03 provides the define NBS to influence the vector access length. These parameters have been used over the years to re-implement code using newer Fortran 95 constructs that were not available in Fortran 77.

PP01:
‣ Original implementation
‣ No division into Ax, Bx, or NBS
‣ Baseline for other measurements

PP02:
‣ A0, A1, A2
‣ B0, B1, B2
‣ No NBS

PP03:
‣ A0, A1, A2
‣ B0, B1, B2, B3, B4, B5, B6, B7
‣ NBS

[Figure: code block in the PP03 file showing different implementations of the same loop, selectable by Ax preprocessor macros]
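A minimal sketch of the decomposition described above: a round-robin outer loop over targets across MPI ranks, with the inner source loop shared among OpenMP threads. All names, the one-dimensional distance, and the pairwise term are illustrative assumptions, not the actual MD source.

! roundrobin_sketch.f90 - illustrative decomposition sketch, not the MD source
! Build (assumption, adjust to the local toolchain): mpif90 -fopenmp roundrobin_sketch.f90
program roundrobin_sketch
   use mpi
   implicit none
   integer, parameter :: n = 5120
   integer :: my_rank, n_ranks, ierr, i, j
   real(8) :: x(n), f(n), fi, r

   call MPI_Init(ierr)
   call MPI_Comm_rank(MPI_COMM_WORLD, my_rank, ierr)
   call MPI_Comm_size(MPI_COMM_WORLD, n_ranks, ierr)
   call random_number(x)
   f = 0.0d0

   ! Outer loop: "target" particles, assigned round-robin to MPI processes.
   do i = 1 + my_rank, n, n_ranks
      fi = 0.0d0
      ! Inner loop: "source" particles, shared among OpenMP threads.
      !$omp parallel do private(r) reduction(+:fi)
      do j = 1, n
         if (j /= i) then
            r = abs(x(i) - x(j)) + 1.0d-12   ! placeholder 1-D distance
            fi = fi + 1.0d0 / (r * r)        ! placeholder pairwise term
         end if
      end do
      !$omp end parallel do
      f(i) = fi
   end do

   call MPI_Finalize(ierr)
end program roundrobin_sketch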
4. SCOPE OF THE CODE ANALYSIS
Hardware Details & Parameter Sweep
All measurements were done on the FutureGrid machine Xray:
‣ Cray XT5m
‣ 672 cores
‣ 84 nodes
‣ 2 CPUs per node
‣ Quad-Core AMD Opteron(tm) Processor 23 (C2)
‣ 2.4 GHz
‣ ...

Time estimate per measurement and particle count:
‣ 5k: ~300 seconds (~5 minutes)
‣ 27k: ~4,000 seconds (~1 hour)
‣ 55k: ~35,000 seconds (~10 hours)

Serial analysis:
‣ 55k particles
‣ Compared optimization flags: O2, O3, fastsse
‣ fastsse == -fast
‣ fast enables common optimizations:
  -O2 -Munroll=c:1 -Mnoframe -Mlre -Mautoinline -Mvect=sse -Mscalarsse -Mcache_align -Mflushz

Sweep dimensions:
‣ Particle type: nucleon-nucleon, ion pure, ion mix
‣ Loop combination: Block A (A0-A2), Block B (B1-B7), code blocking (NBS)
‣ Parallelism: serial (md), OpenMP (md_omp), MPI (md_mpi), hybrid (md_mpi_omp)
‣ Input parameters: simulation type, number of particles, number of time steps, ...
‣ Compiler flags: O2, O3, SSE
‣ Code version: original (PP01), production (PP02), research (PP03)

>8100 possible combinations
5. SERIAL ANALYSIS
Identifying The Best Configuration For Parallel Runs
[Chart: runtime for all code combinations]
6. SERIAL ANALYSIS
Identifying The Best Configuration For Parallel Runs
[Chart: runtime in seconds]
7. SERIAL ANALYSIS
PAPI-Aided Analysis of PP02 Code Versions
[Chart: PAPI_FP_INS (floating-point instructions)]
8. SERIAL ANALYSIS
PAPI-Aided Analysis of PP02 Code Versions
[Chart: FPU idle cycles]
9. SERIAL ANALYSIS
PAPI-Aided Analysis of PP02 Code Versions
[Chart: branches mispredicted]
10. SERIAL ANALYSIS
PAPI-Aided Analysis of PP02 Code Versions
[Chart: PAPI_VEC_INS (vector/SIMD instructions)]
11. SERIAL ANALYSIS
PAPI-Aided Analysis of PP02 Code Versions
[Chart: L1 cache hit ratio]
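The hit ratio is a derived metric rather than a native PAPI event; a common derivation (an assumption here, since the poster does not state its exact formula) uses the data-cache miss and access presets:

$$\mathrm{L1\ hit\ ratio} = 1 - \frac{\mathrm{PAPI\_L1\_DCM}}{\mathrm{PAPI\_L1\_DCA}}$$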
12. SERIAL ANALYSIS
PAPI-Aided Analysis of PP02 Code Versions
[Chart: PAPI_TOT_CYC (total cycles)]
13. SOURCE CODE ANALYSIS
Codeblocks Ax & Bx For Ion-Mix (PP02)
Block A:
‣ A0: branching
‣ A1: arithmetic
‣ A2: array syntax

[Figure: source listings of the Block A variants]
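To make the three styles concrete, the following self-contained sketch contrasts them in the spirit of the Block A variants; all names, constants, and the energy expression are illustrative assumptions, not the MD source.

! block_a_sketch.f90 - three equal-semantics styles, in the spirit of A0/A1/A2
program block_a_sketch
   implicit none
   integer, parameter :: n = 1000
   real(8), parameter :: rcut2 = 0.25d0, q2 = 1.0d0
   real(8) :: r2(n), ev(n), e0, e1, e2, mask
   integer :: j

   call random_number(r2)              ! stand-in squared pair distances

   ! A0 style - branching: only pairs inside the cut-off sphere contribute
   e0 = 0.0d0
   do j = 1, n
      if (r2(j) < rcut2) e0 = e0 + q2 / sqrt(r2(j))
   end do

   ! A1 style - arithmetic: the branch is replaced by a computed 0/1 mask
   e1 = 0.0d0
   do j = 1, n
      mask = 0.5d0 * (1.0d0 + sign(1.0d0, rcut2 - r2(j)))
      e1 = e1 + mask * q2 / sqrt(r2(j))
   end do

   ! A2 style - Fortran 95 array syntax
   ev = 0.0d0
   where (r2 < rcut2) ev = q2 / sqrt(r2)
   e2 = sum(ev)

   print '(3f16.6)', e0, e1, e2        ! all three totals should agree
end program block_a_sketch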
14. SOURCE CODE ANALYSIS
Codeblocks Ax & Bx For Ion-Mix (PP02)
Block B:
‣ B0: loop
‣ B1: no cut-off sphere
‣ B2: cut-off sphere

[Figure: source listings of the Block B variants]
15. SOURCE CODE ANALYSIS
Codeblocks Ax & Bx For Ion-Mix (PP02)
Block B:
‣ B0: loop
‣ B1: no cut-off sphere
‣ B2: cut-off sphere

Result: "Calculation does not beat branching" - always computing the interaction (as in B1) is not faster than branching on the cut-off test (as in B2).

[Figure: source listings of the Block B variants]
16. PARALLEL ANALYSIS
First Scalability Evaluation
The MD code is used in production with particle counts between 27k and 55k. Production runs currently use the PP02 implementation of the particle-particle interaction semantics. The code blocks selected are A0 and B2, and the vector length (NBS) is set to 32. Measurements show that this configuration is very efficient and yields good performance.
So far, runtimes of larger runs have only been analyzed using the hybrid version. The chart shows speedups on a large Cray XT5 (Kraken) and the resulting efficiency for 27k- and 55k-particle runs. Up to 1152 cores, the efficiency is higher than 80%. At 4608 cores the efficiency drops below 50% for 27k particles. Increasing the particle count shows a weak-scaling effect that improves efficiency: the larger runs still achieve 58% efficiency even on 4608 cores.
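Here speedup and parallel efficiency follow the usual strong-scaling definitions (the choice of reference time $T_{\mathrm{ref}}$ as the smallest measured run is an assumption, since the poster does not state its baseline):

$$S(p) = \frac{T_{\mathrm{ref}}}{T(p)}, \qquad E(p) = \frac{S(p)}{p}$$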
Although these results are very promising, Kraken has 112,896 compute cores. The mid-term goal is therefore to optimize the MD code to achieve 50% parallel efficiency on 16,000 cores. To achieve this, we will take a more detailed look at the four different versions of the code (serial, OpenMP, MPI, and hybrid).
The code is very efficient for multi-core processors such as the 6-core processors on Kraken, the
Cray XT5 at The National Institute for Computational Sciences. Each processor is assigned one
MPI process. The work of each process is then shared among 6 OpenMP threads.
17. PARALLEL ANALYSIS WITH VAMPIR
ZIH has more than a decade of expertise in analyzing sequential and parallel codes. As part of the collaboration with IU, ZIH applied its analysis software Vampir to the different versions of MD.

The Vampir toolset includes a tracing framework, VampirTrace, that is able to monitor all levels of parallelism of the MD code. Additionally, PAPI counters are recorded in order to analyze the cache behavior of the main calculation loop.

In a first step, traces of the four versions were generated using only 5k particles to get a broad overview. The next step will be to refine the tracing mechanism using manually instrumented code (see the sketch below) to avoid the rather large overhead introduced by OpenMP tracing. This will enable monitoring of at least 4096 cores in order to understand why the efficiency drops below 50% at this core count.
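A minimal sketch of such manual instrumentation, following the region-marker pattern documented in the VampirTrace manual; the region name and loop body are illustrative assumptions, not the MD source.

! md_region.F90 - manual VampirTrace instrumentation sketch
! Build (assumption, adjust to the local toolchain): vtf90 -DVTRACE md_region.F90
#include "vt_user.inc"
program md_region
   implicit none
   integer :: i
   real(8) :: s
   s = 0.0d0
   VT_USER_START('force_loop')         ! start of the instrumented region
   do i = 1, 10000000                  ! stand-in for the main calculation loop
      s = s + 1.0d0 / real(i, 8)
   end do
   VT_USER_END('force_loop')           ! end of the instrumented region
   print *, 's =', s
end program md_region

If VTRACE is not defined at compile time, the markers expand to nothing, so the instrumentation can remain in the production source.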
18. OPENMP CODE OPTIMIZATION
Comparison of OpenMP Versions Using 1 Node and 1-8 Cores
19. OPENMP CODE OPTIMIZATION
Comparison of OpenMP Versions Using 1 Node and 1-8 Cores
20. PARALLEL ANALYSIS WITH VAMPIR
55k particles, MPI only, 8 nodes, 1 process per node
21. PARALLEL ANALYSIS WITH VAMPIR
20+ loops of the A0_B2 block
22. PARALLEL ANALYSIS WITH VAMPIR
55k particles, MPI only, 1 node, 8 processes per node
23. PARALLEL ANALYSIS WITH VAMPIR
20+ loops of the A0_B2 block
24. PARALLEL ANALYSIS WITH VAMPIR
55k particles, MPI only, 84 nodes, 8 processes per node (672 cores == full machine run)
25. PARALLEL ANALYSIS WITH VAMPIR
Switching of groups and flushing of VampirTrace buffers
26. PARALLEL ANALYSIS WITH VAMPIR
55k particles, OpenMP only, 1 node, 8 threads
27. PARALLEL ANALYSIS WITH VAMPIR
55k particles, hybrid, 84 MPI processes (1 per node), 8 threads per process (672 cores == full machine run)
28. PARALLEL ANALYSIS WITH VAMPIR
Second group, 100 runs
29. SCALABILITY STUDIES ON XRAY
Parallel Efficiency (MPI-only runs)
[Charts: parallel efficiency for 1 node and for 8 nodes]
30. SCALABILITY STUDIES ON XRAY
Parallel Efficiency (Hybrid runs)
31. CONCLUSION AND FUTURE WORK
To find the optimal serial version of MD, different combinations of the code blocks and compiler flags were tested. The results of the serial analysis formed the starting point for the parallel analysis.

Vampir was then successfully applied to the code, making it possible to visualize MPI and OpenMP parallelism.

An important step for future parallel analysis is to embed PAPI counters directly in the MD code to avoid the overhead caused by VampirTrace. This enables the comparison of the actually achieved performance with the theoretical peak performance at larger core counts; a sketch of such direct embedding is shown below.
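A hedged sketch of direct counter embedding, using PAPI's classic high-level Fortran API (PAPIF_*); the event choice and the loop are illustrative, and the call names should be checked against the installed PAPI version.

! papi_sketch.F90 - direct PAPI counter embedding sketch, not the MD source
program papi_sketch
   implicit none
   include 'f90papi.h'                  ! PAPI Fortran presets (PAPI_FP_INS, ...)
   integer, parameter :: nev = 1
   integer :: events(nev), check, i
   integer*8 :: values(nev)
   real(8) :: s
   events(1) = PAPI_FP_INS              ! floating-point instructions
   call PAPIF_start_counters(events, nev, check)
   s = 0.0d0
   do i = 1, 1000000                    ! stand-in for the region of interest
      s = s + 1.0d0 / real(i, 8)
   end do
   call PAPIF_stop_counters(values, nev, check)
   print *, 'PAPI_FP_INS =', values(1), '  s =', s
end program papi_sketch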
Additional tests with larger particle counts are planned to exploit the weak-scaling capabilities of the code. The hybrid implementation is to be run in different process/thread configurations to investigate whether fitting the MPI/OpenMP usage to the hardware characteristics can further increase performance.

Analysis tests will be re-run on Kraken with up to 16,386 cores. The goal is to detect scalability issues and to significantly improve parallel efficiency.

The MD code is currently being ported to GPGPUs. Vampir already supports the analysis of CUDA-parallelized code. This support will enable efficient analysis of the ported code, allowing the maximum potential of GPGPUs to be exploited.

[Diagram: the analysis cycle Planning → Implementation with VampirTrace → Testing → Evaluation with Vampir]
Technische Universität Dresden
Zentrum für Informationsdienste und Hochleistungsrechnen (ZIH)
01062 Dresden, Germany
E-mail: contact@vampir.eu
Web: http://www.vampir.eu/

IU Bloomington
Pervasive Technology Institute
2719 East 10th Street
Bloomington, Indiana 47408
E-mail: pti@indiana.edu
Web: http://pti.iu.edu/