3. Molecular models/force fields
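Typical energy function
$$
\begin{aligned}
E ={} & \sum_{b \in \text{bonds}} k_b \left(r_b - r_{\text{eq},b}\right)^2
      + \sum_{a \in \text{angles}} \kappa_a \left(\theta_a - \theta_{\text{eq},a}\right)^2
      + \sum_{d \in \text{dihedrals}} \sum_{n} l_{nd} \cos(n \phi_d) \\
      & + \sum_{i<j \in \text{atoms}} \frac{q_i q_j}{r_{ij}}
      + \sum_{i<j \in \text{atoms}} \epsilon_{ij} \left[ \left(\frac{\sigma_{ij}}{r_{ij}}\right)^{12} - \left(\frac{\sigma_{ij}}{r_{ij}}\right)^{6} \right]
\end{aligned}
$$
Terms, in order: bond stretch, angle, torsion (dihedrals), electrostatics, dispersion
computation cost = O(N^2)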
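To make the O(N^2) cost concrete, here is a minimal Python sketch of the two pairwise sums (electrostatics and dispersion). It is an illustration only, not the code profiled in this talk: the function name `nonbonded_energy` is hypothetical, a single `sigma` and `epsilon` are assumed for all pairs, and the Coulomb prefactor is omitted as in the formula on the slide.

```python
import math
from itertools import combinations

def nonbonded_energy(positions, charges, sigma, epsilon):
    """Return (energy, number_of_pairs) for the electrostatic and dispersion sums.

    Assumption: one sigma and one epsilon for every pair (hypothetical
    simplification); a real force field uses per-pair parameters.
    """
    energy = 0.0
    n_pairs = 0
    # Loop over all pairs i < j: N(N-1)/2 iterations, hence O(N^2).
    for i, j in combinations(range(len(positions)), 2):
        r = math.dist(positions[i], positions[j])
        coulomb = charges[i] * charges[j] / r
        s6 = (sigma / r) ** 6
        dispersion = epsilon * (s6 * s6 - s6)
        energy += coulomb + dispersion
        n_pairs += 1
    return energy, n_pairs
```

For N = 6 atoms the loop visits 15 pairs; doubling N roughly quadruples the work, which is why these two terms dominate for larger systems.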
4. Problem description
• The state of the system is given by the position and
momentum of every atom (of mass mi)
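$$(x_1, p_1, x_2, p_2, \cdots, x_N, p_N) \in \mathbb{R}^{3 \times 2 \times N}$$
• Solve the system of differential equations
$$\frac{\partial x_i}{\partial t} = \frac{p_i}{m_i}, \qquad \frac{\partial p_i}{\partial t} = -\frac{\partial E}{\partial x_i}, \qquad i = 1, \cdots, N$$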
• with user-specified initial conditions (e.g. with
constant temperature and pressure)
• Subject to (user-specified) constraints, e.g. fixed
bond angles
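One common way to integrate equations of motion of this form is the velocity Verlet scheme, which splits the momentum update into two half steps around a position update. The sketch below is an assumption for illustration, not necessarily the integrator used in the code discussed here; `velocity_verlet_step` and `harmonic_force` are hypothetical names, and constraints, temperature and pressure control are left out.

```python
def velocity_verlet_step(x, p, m, force, dt):
    """Advance positions x and momenta p by one time step dt.

    x, p, m are lists of equal length; force(x) returns -dE/dx as a list.
    """
    f = force(x)
    # Half-step momentum update with the current forces.
    p_half = [pi + 0.5 * dt * fi for pi, fi in zip(p, f)]
    # Full-step position update: dx/dt = p/m.
    x_new = [xi + dt * ph / mi for xi, ph, mi in zip(x, p_half, m)]
    # Second half-step momentum update with forces at the new positions.
    f_new = force(x_new)
    p_new = [ph + 0.5 * dt * fi for ph, fi in zip(p_half, f_new)]
    return x_new, p_new

def harmonic_force(x, k=1.0):
    """Force for the toy energy E = (k/2) * sum(x_i^2)."""
    return [-k * xi for xi in x]
```

A toy usage: starting from `x = [1.0]`, `p = [0.0]`, `m = [1.0]` and stepping with `dt = 0.01`, the trajectory follows cos(t) closely and the total energy stays near its initial value.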
5. Many parallel and serial implementations
Package name   Threads   MPI   Global Arrays
NAMD CHARM++
GROMACS ✓ ✓
TINKER
AMBER partly ✓ ✓
CHARMM ✓
LAMMPS ✓
NWChem ✓ ✓
6. Things I tried
• Compiler flags optimization
• Cache miss reduction
• Lookup tables
• Parallelization with OpenMP
7. Compiler flag optimization
flags        gfortran 4.1.2                  ifort 10.0.023
             time         speedup            time         speedup
-O0          29.95(2) s   -                  36.30(2) s   -
-Os          29.92(3) s   +0.77(3) %         32.59(4) s   +10.22(2) %
-O1          30.22(1) s   -0.90(4) %         32.12(3) s   +11.51(1) %
-O2          29.66(3) s   +0.96(1) %         30.30(2) s   +16.54(2) %
-O3          29.84(2) s   +0.38(2) %         30.83(2) s   +15.06(2) %
CE search    28.77(2) s   +3.62(3) % [1]     28.96(2) s   +20.22(1) % [2]
[1] FFLAGS="-falign-functions -falign-jumps -falign-labels -falign-loops -fvpt -fcse-skip-blocks
    -fdelete-null-pointer-checks -ffast-math -fforce-addr -fgcse -fgcse-lm -fgcse-sm -floop-optimize
    -fkeep-static-consts -fmerge-constants -fno-defer-pop -fno-guess-branch-probability -fno-math-errno
    -funsafe-math-optimizations -fno-trapping-math -foptimize-register-move -fregmove -freorder-blocks
    -freorder-functions -frerun-cse-after-loop -fno-sched-spec -fsched-spec-load -fsched-stalled-insns
    -fsignaling-nans -fsingle-precision-constant -fstrength-reduce -fthread-jumps -funroll-all-loops"
[2] FFLAGS="-xN -no-prec-div -static -inline-level=1 -ip -fno-alias -fno-fnalias -fno-omit-frame-pointer
    -fkeep-static-consts -nolib-inline -heap-arrays 1 -pad -O3 -scalar-rep -funroll-loops
    -complex-limited-range"
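As a worked check of how the percentages can be read, the sketch below assumes that a speedup is the reduction in run time relative to the -O0 run. Under that assumption the ifort 10.0.023 column of slide 7 is reproduced to within rounding. The helper name `speedup_percent` is hypothetical.

```python
def speedup_percent(t_reference, t_new):
    """Percentage reduction in run time relative to the reference run."""
    return 100.0 * (t_reference - t_new) / t_reference

# ifort 10.0.023 run times in seconds, copied from slide 7.
T_O0 = 36.30
ifort_times = {
    "-Os": 32.59,
    "-O1": 32.12,
    "-O2": 30.30,
    "-O3": 30.83,
    "CE search": 28.96,
}

for flags, t in ifort_times.items():
    print(f"{flags:10s} {speedup_percent(T_O0, t):+6.2f} %")
```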
8. Algorithm and time profile
[Flowchart: algorithm and time profile, N = 6, gfortran 4.1.2]
Top level: Initialize model and parameters → Move one time step → Remove unphysical motions → Flush I/O → End. Labels: "for each time step", >98%.
Move one time step, boxes: Update state by Δt/2, Calculate potential energy and forces, Update state by Δt/2, Calculate & record kinetic energy and properties, Enforce temp. & pressure (twice). Labels: O(N), O(N^2), >59%, <31%.
Calculate potential energy and forces, boxes: Calculate bond interactions (9%), angle interactions (12%), dihedral interactions (8%), dispersion interactions (37%), charge interactions (26%), ..., Add up all components. Labels: O(N), O(N^2).
9. An unexpected cost
Q: Why is 15% of total execution time spent adding numbers!?
[Flowchart repeated from slide 8 (N = 6); the question concerns the "Add up all components" step.]
10. A: many L2 cache misses
L2 cache-miss counts per line are shown in the left margin.

        c     zero out each of the first derivative components
    7         do i = 1, n
                 do j = 1, 3
   42               deb(j,i) = 0.0d0
                    ...                            (22 other terms)
                 end do
              end do
              ...
        c     sum up to get the total energy and first derivatives
              energy = eb + ...
              do i = 1, n
                 do j = 1, 3
   19               desum(j,i) = deb(j,i) + ...    (22 other terms)
    2               derivs(j,i) = desum(j,i)
                 end do
              end do

70 of 91 cache misses per time step (n = 6) shown
11. A simple solution
L2 cache-miss counts per line are shown in the left margin, with the count before the change in parentheses.

             c     zero out each of the first derivative components
    7              do i = 1, n
                      do j = 1, 3
   26 (42)               deb(j,i) = 0.0d0
                         ...
                      end do
                   end do
                   ...
             c     sum up to get the total energy and first derivatives
                   energy = eb + ...
                   do i = 1, n
                      do j = 1, 3
    6                    temp = deb(j,i) + ...
    1 (19)               desum(j,i) = temp
    1 (2)                derivs(j,i) = temp
                      end do
                   end do

reduced cache misses from 92 to 41 per time step
12. Speedup from reducing L2 cache misses
flags                      gfortran 4.1.2   ifort 10.0.023
original                   29.95(2) s       28.96(2) s
with scalar replacement    27.43(3) s       28.95(1) s
speedup                    +8.44(1) %       +0.03(2) %
ifort already called with scalar replacement flag
13. Lookup tables (LUTs)
• Calculations of sqrt() and exp() take up
23.8% of execution time
• Idea: pre-compute values of sqrt() and
exp() in an array and recall them from
memory when needed
• Caution: LUT should not displace too much
data from L2 cache
15. exp() with LUT
• direct LUT
• LUT with first-order Taylor series refinement*
* $e^{x} = e^{x_0} + (x - x_0)\, e^{x_0} + O\left((x - x_0)^2\right)$
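The two variants can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation from the talk: the function names, the uniform grid, and the nearest-entry lookup are all hypothetical choices, and the caller is assumed to keep `x` inside the tabulated range.

```python
import math

def build_exp_table(x_min, x_max, n):
    """Tabulate exp() at n evenly spaced points on [x_min, x_max]."""
    step = (x_max - x_min) / (n - 1)
    table = [math.exp(x_min + k * step) for k in range(n)]
    return table, step

def exp_direct(x, table, x_min, step):
    """Direct LUT: return the nearest tabulated value."""
    k = round((x - x_min) / step)
    return table[k]

def exp_taylor(x, table, x_min, step):
    """LUT with first-order Taylor refinement around the nearest grid point x0."""
    k = round((x - x_min) / step)
    x0 = x_min + k * step
    e0 = table[k]
    return e0 + (x - x0) * e0
```

With a grid spacing h, the direct lookup has a relative error of roughly h/2, while the refined lookup has roughly h^2/8, at the cost of one extra multiplication and addition. For example, 1001 points on [-1, 0] give about 5e-4 and 1.3e-7 respectively.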
16. Choice of implementation
function   desired precision   table size (doubles)   refinement   expected speedup
sqrt()     10^-4               10,764                 none         +118%
exp()      10^-8               6,836                  Taylor       +151%
LUT aligned to 128 bits
L2 cache = 4 MB = 512K doubles
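A quick arithmetic check of the cache footprint, using the table sizes from this slide. It assumes 8-byte doubles and 1 MB = 2^20 bytes (consistent with "4 MB = 512K doubles"); the variable names are illustrative.

```python
BYTES_PER_DOUBLE = 8                 # assumption: IEEE 754 double precision
L2_BYTES = 4 * 1024 * 1024           # 4 MB L2 cache

# Table sizes in doubles, from slide 16.
table_sizes = {"sqrt": 10_764, "exp": 6_836}

l2_doubles = L2_BYTES // BYTES_PER_DOUBLE      # capacity of L2 in doubles
lut_doubles = sum(table_sizes.values())        # both tables together
lut_bytes = lut_doubles * BYTES_PER_DOUBLE
lut_fraction = lut_doubles / l2_doubles        # share of L2 taken by the LUTs

print(f"L2 holds {l2_doubles} doubles; LUTs use {lut_doubles} doubles "
      f"({lut_bytes} bytes, {100 * lut_fraction:.1f} % of L2)")
```

Both tables together occupy a few percent of the L2 cache, which fits the caution that a LUT should not displace too much other data.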
17. Speedup from LUT use
flags gfortran 4.1.2 ifort 10.0.023
original 29.95(2) s 28.96(2) s
with lookup tables 26.89(1) s 25.87(2) s
speedup +10.23(2) % +7.22(3) %
18. Summary of serial improvements
Improvement               gfortran 4.1.2   ifort 10.0.023
Best compiler flags       +3.62(3) %       +20.22(1) %
L2 cache miss reduction   +8.44(2) %       +0.03(1) %
Lookup tables             +10.23(1) %      +7.22(2) %
Total                     23.91(3) s       26.86(2) s
                          +20.17(4) %      +26.00(2) %
19. Parallelization targets
[Flowchart repeated from slide 8 (N = 6). Shares within "Calculate potential energy and forces": bond 9%, angle 12%, dihedral 8%, dispersion 37%, charge 26% interactions, followed by "Add up all components".]
20. Parallelization strategy
[Diagram: "Calculate potential energy and forces" is wrapped in omp sections (100%) and split into two omp section blocks (50% each). Boxes: Calculate charge, angle, dihedral, dispersion and bond interactions, ..., Add up all components. Each of the five interaction calculations is marked omp parallel do. Percentage labels: 50%, 16%, 2%, 12%, 11%.]
22. Summary
• Free software can sometimes be better
than non-free software
• L2 cache misses can significantly degrade
performance
• Lookup tables effectively trade memory and
precision for speed
• Simple OpenMP parallelization is effective
for small numbers of processors