Molecular models, threads and you

Optimizing the TINKER classical molecular dynamics code
while maintaining code readability

Jiahao Chen

Martínez Group
Dept. Chemistry, CATMS, MRL and Beckman

CS 498 MG presentation: 2007-12-07

Molecular models/force fields
Typical energy function

E = covalent bond effects + noncovalent interactions

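Written out term by term:

$$
E = \sum_{b \in \text{bonds}} k_b \left( r_b - r_{\text{eq},b} \right)^2
  + \sum_{a \in \text{angles}} \kappa_a \left( \theta_a - \theta_{\text{eq},a} \right)^2
  + \sum_{d \in \text{dihedrals}} \sum_n l_{nd} \cos\left( n \phi_d \right)
  + \sum_{i<j \in \text{atoms}} \frac{q_i q_j}{r_{ij}}
  + \sum_{i<j \in \text{atoms}} \epsilon_{ij} \left[ \left( \frac{\sigma_{ij}}{r_{ij}} \right)^{12} - \left( \frac{\sigma_{ij}}{r_{ij}} \right)^{6} \right]
$$

The five sums are, in order: bond stretch, angle torsion, dihedrals (with \phi_d the dihedral angle), electrostatics and dispersion.
Computation cost of the pair sums (electrostatics, dispersion) = O(N^2)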
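The two sums over atom pairs (electrostatics and dispersion) are what drive the O(N^2) cost. A minimal Python sketch, assuming reduced units and a single hypothetical `sigma` and `epsilon` shared by every pair; the name `nonbonded_energy` is illustrative and not taken from TINKER:

```python
import math


def nonbonded_energy(positions, charges, sigma, epsilon):
    """Coulomb and 12-6 dispersion energy summed over all pairs i < j."""
    n = len(positions)
    e_coulomb = 0.0
    e_dispersion = 0.0
    n_pairs = 0
    for i in range(n):
        for j in range(i + 1, n):          # each pair is visited once
            r = math.dist(positions[i], positions[j])
            e_coulomb += charges[i] * charges[j] / r
            sr6 = (sigma / r) ** 6
            e_dispersion += epsilon * (sr6 * sr6 - sr6)
            n_pairs += 1
    return e_coulomb, e_dispersion, n_pairs
```

The loop visits N(N-1)/2 pairs: 15 for N = 6 and 499,500 for N = 1000.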

Problem description
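• The state of the system is given by the position and
momentum of every atom (of mass m_i)

$$
(x_1, p_1, x_2, p_2, \cdots, x_N, p_N) \in \mathbb{R}^{3 \times 2 \times N}
$$

• Solve the system of differential equations

$$
\frac{\partial x_i}{\partial t} = \frac{p_i}{m_i}, \qquad
\frac{\partial p_i}{\partial t} = -\frac{\partial E}{\partial x_i}, \qquad
i = 1, \cdots, N
$$
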
• with user-specified initial conditions (e.g. with
constant temperature and pressure)
• Subject to (user-specified) constraints, e.g. fixed
bond angles
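The equations of motion above are integrated numerically, one time step at a time. A minimal Python sketch, assuming a velocity Verlet scheme (two half-step momentum updates around one force evaluation) and a one-dimensional harmonic potential E = k x^2 / 2 as a stand-in for the force field. The slides do not name the integrator TINKER uses, so the scheme, the potential and all function names here are assumptions:

```python
def force(x, k=1.0):
    """Force -dE/dx for the stand-in potential E = k x^2 / 2."""
    return -k * x


def energy(x, p, m=1.0, k=1.0):
    """Total energy: kinetic plus stand-in potential."""
    return p * p / (2.0 * m) + 0.5 * k * x * x


def velocity_verlet(x, p, m=1.0, dt=0.01, steps=1000):
    """Advance (x, p) by `steps` time steps of length `dt`."""
    f = force(x)
    for _ in range(steps):
        p += 0.5 * dt * f      # half-step momentum update: dp/dt = -dE/dx
        x += dt * p / m        # full-step position update: dx/dt = p/m
        f = force(x)           # one force evaluation per time step
        p += 0.5 * dt * f      # second half-step momentum update
    return x, p


x0, p0 = 1.0, 0.0
x1, p1 = velocity_verlet(x0, p0)
drift = abs(energy(x1, p1) - energy(x0, p0))
```

With these settings the total energy stays close to its initial value, which is a quick sanity check on any integrator of this kind.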

Many parallel and serial implementations

Package name   Threads   MPI   Global Arrays
NAMD CHARM++
GROMACS ✓ ✓
TINKER
AMBER partly ✓ ✓
CHARMM ✓
LAMMPS ✓
NWChem ✓ ✓

Things I tried

• Compiler flag optimization
• Cache miss reduction
• Lookup tables
• Parallelization with OpenMP

Compiler flag optimization

             gfortran 4.1.2                  ifort 10.0.023
flags        time         speedup            time         speedup
-O0          29.95(2) s   -                  36.30(2) s   -
-Os          29.92(3) s   +0.77(3) %         32.59(4) s   +10.22(2) %
-O1          30.22(1) s   -0.90(4) %         32.12(3) s   +11.51(1) %
-O2          29.66(3) s   +0.96(1) %         30.30(2) s   +16.54(2) %
-O3          29.84(2) s   +0.38(2) %         30.83(2) s   +15.06(2) %
CE search    28.77(2) s   +3.62(3) % [1]     28.96(2) s   +20.22(1) % [2]

[1] FFLAGS="-falign-functions -falign-jumps -falign-labels -falign-loops -fvpt -fcse-skip-blocks
-fdelete-null-pointer-checks -ffast-math -fforce-addr -fgcse -fgcse-lm -fgcse-sm -floop-optimize
-fkeep-static-consts -fmerge-constants -fno-defer-pop -fno-guess-branch-probability -fno-math-errno
-funsafe-math-optimizations -fno-trapping-math -foptimize-register-move -fregmove -freorder-blocks
-freorder-functions -frerun-cse-after-loop -fno-sched-spec -fsched-spec-load -fsched-stalled-insns
-fsignaling-nans -fsingle-precision-constant -fstrength-reduce -fthread-jumps -funroll-all-loops"
[2] FFLAGS="-xN -no-prec-div -static -inline-level=1 -ip -fno-alias -fno-fnalias
-fno-omit-frame-pointer -fkeep-static-consts -nolib-inline -heap-arrays 1 -pad -O3 -scalar-rep
-funroll-loops -complex-limited-range"
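The ifort speedup percentages appear consistent with the relative reduction in run time against the -O0 baseline. A small Python check, assuming that definition (the slides do not state the formula, and `speedup_percent` is an illustrative name):

```python
def speedup_percent(t_ref, t_new):
    """Relative reduction in run time, as a percentage of the reference time."""
    return 100.0 * (t_ref - t_new) / t_ref


IFORT_O0 = 36.30  # seconds, ifort 10.0.023 with -O0

# ifort run times in seconds, copied from the table
ifort_times = {"-Os": 32.59, "-O1": 32.12, "-O2": 30.30, "-O3": 30.83, "CE search": 28.96}

ifort_speedups = {flag: speedup_percent(IFORT_O0, t) for flag, t in ifort_times.items()}
```

Under this assumption each computed value agrees with the tabulated ifort speedup to within about 0.01 percentage points.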

Algorithm and time profile

[Figure: flowchart of the simulation with its time profile, for N = 6 compiled with gfortran 4.1.2. Top-level boxes: Initialize model and parameters; Move one time step; Remove unphysical motions; Flush I/O; End, with a loop labelled "for each time step" (>98%). Boxes inside the time step: Update state by t/2; Calculate potential energy and forces; Update state by t/2; Enforce temp. & pressure (twice); Calculate & record kinetic energy and properties. Boxes inside the potential energy and force calculation: Calculate bond, angle, dihedral, dispersion and charge interactions, ..., Add up all components. Cost labels shown: O(N), O(N^2). Time labels shown: >59%, <31%; 9%, 12%, 8%, 37%, 26%.]

An unexpected cost

Q: Why is 15% of total execution time spent adding numbers!?

[Figure: the algorithm flowchart and time profile for N = 6, repeated.]

A: many L2 cache misses

L2 cache misses per time step are shown in brackets at the end of each line.

c     zero out each of the first derivative components
      do i = 1, n                                  ! [7]
         do j = 1, 3
            deb(j,i) = 0.0d0                       ! [42]
            ...                                    ! 22 other terms
         end do
      end do
      ...
c     sum up to get the total energy and first derivatives
      energy = eb + ...
      do i = 1, n
         do j = 1, 3
            desum(j,i) = deb(j,i) + ...            ! [19] 22 other terms
            derivs(j,i) = desum(j,i)               ! [2]
         end do
      end do

70 of 91 cache misses per time step (n = 6) shown

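A simple solution

Accumulate each sum in a scalar temporary and store it to both arrays, so that desum(j,i) is not read back right after being written. L2 cache misses per time step are shown in brackets (before -> after).

c     zero out each of the first derivative components
      do i = 1, n                                  ! [7]
         do j = 1, 3
            deb(j,i) = 0.0d0                       ! [42 -> 26]
            ...
         end do
      end do
      ...
c     sum up to get the total energy and first derivatives
      energy = eb + ...
      do i = 1, n
         do j = 1, 3
            temp = deb(j,i) + ...                  ! [6]
            desum(j,i) = temp                      ! [19 -> 1]
            derivs(j,i) = temp                     ! [2 -> 1]
         end do
      end do

reduced cache misses from 92 to 41 per time step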

Speedup from reducing L2 cache misses

                           gfortran 4.1.2   ifort 10.0.023
original                   29.95(2) s       28.96(2) s
with scalar replacement    27.43(3) s       28.95(1) s
speedup                    +8.44(1) %       +0.03(2) %

ifort already called with scalar replacement flag

Lookup tables (LUTs)

• Calculations of sqrt() and exp() take up
23.8% of execution time
• Idea: pre-compute values of sqrt() and
exp() in an array and recall them from
memory when needed
• Caution: LUT should not displace too much
data from L2 cache

sqrt() with LUT

[Figure: two panels, "direct LUT" and "LUT with linear interpolation".]
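The two sqrt() variants can be sketched as follows. This is a minimal Python illustration, assuming a uniform grid on a hypothetical input range [1, 4] with 1024 entries; the range, table size and function names are illustrative and are not the values used in the talk:

```python
import math

X_MIN, X_MAX, SIZE = 1.0, 4.0, 1024          # hypothetical range and table size
STEP = (X_MAX - X_MIN) / (SIZE - 1)
SQRT_TABLE = [math.sqrt(X_MIN + k * STEP) for k in range(SIZE)]


def sqrt_direct(x):
    """Direct LUT: return the entry at the nearest grid point."""
    return SQRT_TABLE[int((x - X_MIN) / STEP + 0.5)]


def sqrt_interp(x):
    """LUT with linear interpolation between the two neighbouring entries."""
    t = (x - X_MIN) / STEP
    k = min(int(t), SIZE - 2)                # keep k + 1 inside the table
    frac = t - k
    return SQRT_TABLE[k] + frac * (SQRT_TABLE[k + 1] - SQRT_TABLE[k])
```

Both functions are only valid for X_MIN <= x <= X_MAX. Interpolation costs one extra table read and a multiply, but reaches a given precision with a much smaller table.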

exp() with LUT

[Figure: two panels, "direct LUT" and "LUT with first-order Taylor series refinement*".]

* $$e^{x} = e^{x_0} + (x - x_0)\, e^{x_0} + O\left( (x - x_0)^2 \right)$$
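The Taylor refinement reuses the table entry as its own derivative, since the derivative of exp is exp. A minimal Python sketch, assuming a uniform grid on a hypothetical range [-10, 0] with 2048 entries (range, size and names are illustrative, not the values used in the talk):

```python
import math

X_MIN, X_MAX, SIZE = -10.0, 0.0, 2048        # hypothetical range and table size
STEP = (X_MAX - X_MIN) / (SIZE - 1)
EXP_TABLE = [math.exp(X_MIN + k * STEP) for k in range(SIZE)]


def exp_direct(x):
    """Direct LUT: return exp(x0) at the nearest grid point x0."""
    return EXP_TABLE[int((x - X_MIN) / STEP + 0.5)]


def exp_taylor(x):
    """LUT with first-order Taylor refinement about the nearest grid point."""
    k = int((x - X_MIN) / STEP + 0.5)
    x0 = X_MIN + k * STEP
    e0 = EXP_TABLE[k]
    return e0 + (x - x0) * e0                # exp(x0) + (x - x0) exp(x0)
```

The refinement needs no second table: one multiply and one add turn a first-order error in the grid spacing into a second-order one.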

Choice of implementation

function   desired precision   table size (doubles)   refinement   expected speedup
sqrt()     10^-4               10,764                 none         +118%
exp()      10^-8               6,836                  Taylor       +151%

LUT aligned to 128 bits
L2 cache = 4 MB = 512K doubles
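The table sizes can be checked against the L2 cache capacity with a little arithmetic. A short Python calculation, assuming 8-byte doubles and 1 MB = 2^20 bytes (which is what makes 4 MB equal 512K doubles):

```python
BYTES_PER_DOUBLE = 8
L2_BYTES = 4 * 1024 * 1024                   # 4 MB L2 cache, as stated on the slide

# table sizes in doubles, copied from the slide
table_doubles = {"sqrt": 10_764, "exp": 6_836}

l2_doubles = L2_BYTES // BYTES_PER_DOUBLE    # capacity of L2 in doubles
total_doubles = sum(table_doubles.values())  # both tables together
total_bytes = total_doubles * BYTES_PER_DOUBLE
fraction_of_l2 = total_doubles / l2_doubles
```

Together the two tables hold 17,600 doubles (140,800 bytes), a little over 3% of the L2 cache, so they leave most of the cache for simulation data.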

Speedup from LUT use

                      gfortran 4.1.2   ifort 10.0.023
original              29.95(2) s       28.96(2) s
with lookup tables    26.89(1) s       25.87(2) s
speedup               +10.23(2) %      +7.22(3) %

Summary of serial improvements

Improvement                gfortran 4.1.2   ifort 10.0.023
Best compiler flags        +3.62(3) %       +20.22(1) %
L2 cache miss reduction    +8.44(2) %       +0.03(1) %
Lookup tables              +10.23(1) %      +7.22(2) %
Total                      23.91(3) s       26.86(2) s
                           +20.17(4) %      +26.00(2) %

Parallelization targets

[Figure: the algorithm flowchart and time profile for N = 6, repeated.]

Parallelization strategy

[Figure: parallelization of "Calculate potential energy and forces". The step is wrapped in omp sections (100%) containing two omp section blocks (50% each). Component boxes: charge, angle, dihedral, dispersion, bond, ..., Add up all components. Percentage labels shown beneath the components: 50%, 16%, 2%, 12%, 11%. Five omp parallel do labels appear beneath the components.]
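The two-section split can be mimicked outside OpenMP. The sketch below is only a structural analogue in Python using `concurrent.futures`, not OpenMP itself: two independent groups of energy terms run as separate tasks and their results are added at the end. The functions `charge_terms` and `bonded_terms` are hypothetical stand-ins, not TINKER routines:

```python
from concurrent.futures import ThreadPoolExecutor


def charge_terms(n):
    """Stand-in for the charge interactions: a sum over all pairs i < j."""
    return sum(1.0 / (j - i) for i in range(n) for j in range(i + 1, n))


def bonded_terms(n):
    """Stand-in for the cheaper remaining terms: a sum over neighbours."""
    return sum(0.5 * (0.1 * i) ** 2 for i in range(n - 1))


def potential_energy(n):
    """Run the two groups as independent tasks, then add up all components."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        section_a = pool.submit(charge_terms, n)   # plays the role of one omp section
        section_b = pool.submit(bonded_terms, n)   # plays the role of the other
        return section_a.result() + section_b.result()
```

In CPython the global interpreter lock prevents pure-Python arithmetic from speeding up with threads, so this shows the task structure only. The design point carries over: the split pays off when the two groups take similar time, which is what the 50%/50% labels on the slide suggest.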

Parallelization results

[Figure: execution time (s) versus number of cores, compiled with gfortran 4.1.2. Series: N=6, N=1000, Ideal.]

Summary
• Free software can sometimes be better
than non-free software
• L2 cache misses can significantly degrade
performance
• Lookup tables are an effective way to trade
memory and precision for speed
• Simple OpenMP parallelization is effective
for small numbers of processors
