3. ABOUT NX NASTRAN
• Industry-standard finite element package from Siemens PLM
• Analysis options include:
  ‒ Stress, vibration, structural failure
  ‒ Heat transfer, acoustics, rotor dynamics, and more
• Advanced numerical capabilities and proven scalability:
  ‒ Problem sizes approaching 1 billion DOFs
  ‒ SMP to 24 cores
  ‒ DMP to 2048 nodes
3 | FAST MODAL ANALYSIS WITH NX NASTRAN AND GPUS | NOVEMBER 12, 2013 | CONFIDENTIAL

4. MODAL FREQUENCY RESPONSE OVERVIEW: NASTRAN SOL 111
• Bread-and-butter industrial computation: modal frequency response
• Widely used in automotive and aerospace to determine response under varying excitations
  ‒ Optimize weight, rigidity
  ‒ Minimize noise, resonance
• Two-phase calculation, more efficient than direct:
  ‒ Modal analysis
  ‒ Frequency response calculation

5. MODAL FREQUENCY RESPONSE: COMPUTATIONAL STEPS
• Eigensolution: $h$ normal modes of $f \times f$ structural matrices:
  $$K_{ff}\,\Phi_{fh} = M_{ff}\,\Phi_{fh}\,\Lambda_{hh}$$
• Frequency response: $h \times h$ complex linear solution at each of $n_{resp}$ frequencies:
  $$\left(K_{hh} + i\,\omega_k B_{hh} - \omega_k^2 M_{hh}\right) x_k = b_k, \qquad k = 1, \dots, n_{resp}$$
• All parameters large in typical customer usage:
  ‒ $f$-size 10-30M for model fidelity
  ‒ $h$-size 10-60K for modal accuracy
  ‒ $n_{resp}$ 20K for detailed response graph

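The two phases can be illustrated on a toy problem. The sketch below uses a hypothetical 2-DOF spring-mass system whose eigenpairs are known in closed form, projects $K$ and $M$ onto the modal basis, and then solves the reduced system at one frequency. The model, the function names, and the modal damping assumption ($B_{hh}$ diagonal) are illustrative only and are not taken from NX Nastran.

```python
import math

def matmul(A, B):
    """Dense matrix product for small list-of-lists matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

# Hypothetical 2-DOF system (f = 2): two unit masses, three unit springs.
K = [[2.0, -1.0], [-1.0, 2.0]]
M = [[1.0, 0.0], [0.0, 1.0]]

# Phase 1 (modal analysis): eigenpairs known in closed form for this toy model.
s = 1.0 / math.sqrt(2.0)
Phi = [[s, s], [s, -s]]          # mass-normalized mode shapes, f x h
Lam = [1.0, 3.0]                 # eigenvalues (squared natural frequencies)

# Project onto the modal basis: K_hh = Phi^T K Phi, M_hh = Phi^T M Phi.
PhiT = transpose(Phi)
K_hh = matmul(PhiT, matmul(K, Phi))
M_hh = matmul(PhiT, matmul(M, Phi))

def modal_response(omega, load, zeta=0.02):
    """Phase 2: solve (K_hh + i*omega*B_hh - omega^2*M_hh) x = b at one frequency.

    Assumes modal damping, B_hh = diag(2*zeta*sqrt(lambda_j)), so the h x h
    system is diagonal here and needs no factorization.
    """
    b = [sum(PhiT[j][i] * load[i] for i in range(len(load))) for j in range(len(Lam))]
    x = []
    for j, lam in enumerate(Lam):
        damping = 2.0 * zeta * math.sqrt(lam)
        x.append(b[j] / complex(lam - omega ** 2, omega * damping))
    # Recover physical displacements u = Phi x.
    return [sum(Phi[i][j] * x[j] for j in range(len(x))) for i in range(len(Phi))]

u = modal_response(omega=0.5, load=[1.0, 0.0])
```

With general (non-modal) damping $B_{hh}$ is dense, the reduced system no longer decouples, and each frequency needs a dense complex factorization.
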
6. PERFORMANCE CASE STUDY: PR MODEL – FREQUENCY RESPONSE COST
• Shell-dominated SOL 111 model
  ‒ 245K degrees of freedom ($f$-size)
  ‒ 1200 eigenpairs ($h$-size)
  ‒ 20K frequency responses ($n_{resp}$)
• Eigensolution time: 30 minutes
• Frequency response: 127 minutes
• Frequency response cost $O(n_{resp} \cdot h^3)$
  ‒ Estimated run time in decades as $h \to 60\text{K}$

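The "decades" estimate follows from scaling the measured point with the cubic cost model. The sketch below does that arithmetic, assuming the constant in $O(n_{resp} \cdot h^3)$ stays fixed as $h$ grows (which ignores memory and cache effects); the function name is illustrative.

```python
def scaled_minutes(base_minutes, base_h, base_nresp, h, nresp):
    """Scale a measured run time with the O(nresp * h^3) cost model."""
    return base_minutes * (nresp / base_nresp) * (h / base_h) ** 3

# Measured point from the slide: 127 minutes at h = 1200, nresp = 20K.
minutes = scaled_minutes(127.0, 1200, 20_000, h=60_000, nresp=20_000)
years = minutes / (60 * 24 * 365.25)
print(f"{minutes:.0f} minutes, about {years:.1f} years")
```

Going from $h = 1200$ to $h = 60\text{K}$ multiplies the cost by $50^3 = 125{,}000$, which turns 127 minutes into roughly 30 years.
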
7. PERFORMANCE CASE STUDY: CUSTOMER BENCHMARK
• More typical industrial model:
  ‒ 11 million degrees of freedom ($f$-size)
  ‒ Shell-dominated model
  ‒ Approximately 3000 eigenpairs ($h$-size)
  ‒ 300 frequency responses ($n_{resp}$)
• Frequency response is expensive, and the modal calculation is still expensive even with RDMODES:
  ‒ Modal calculation: 375 minutes
  ‒ Frequency response time: 22 minutes
• Need to improve performance in both phases

9. FREQUENCY RESPONSE IMPLEMENTATION: DETAILS OF ORIGINAL METHOD
• NX Nastran implementation uses symmetric $LDL^T$ factorization and forward-backward substitution:
    For $k = 1, \dots, n_{resp}$
      Assemble $A = K_{hh} + i\,\omega_k B_{hh} - \omega_k^2 M_{hh}$
      Factor $A = LDL^T$
      Solve $x_k = A^{-1} b_k = L^{-T} D^{-1} L^{-1} b_k$
    End for
• NX Nastran sparse factorization difficult to adapt to GPU:
  ‒ Disk oriented
  ‒ Tuned for sparse matrices
  ‒ Symmetric pivoting required for stability (indefiniteness)

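The loop above can be sketched in a few lines for small dense matrices. This is an unpivoted $LDL^T$ in pure Python, written only to show the structure of the assemble/factor/solve loop. It is not the NX Nastran sparse solver, and it omits the symmetric pivoting that the slide says is required for stability. All function names are illustrative.

```python
def ldlt_factor(A):
    """Factor a complex symmetric matrix A = L D L^T without pivoting.

    Returns (L, d): L is unit lower triangular, d holds the diagonal of D.
    Uses the plain transpose, not the conjugate transpose.
    """
    n = len(A)
    L = [[1.0 + 0j if i == j else 0j for j in range(n)] for i in range(n)]
    d = [0j] * n
    for j in range(n):
        d[j] = A[j][j] - sum(L[j][k] * L[j][k] * d[k] for k in range(j))
        if d[j] == 0:
            raise ZeroDivisionError("zero pivot: symmetric pivoting would be needed")
        for i in range(j + 1, n):
            s = sum(L[i][k] * L[j][k] * d[k] for k in range(j))
            L[i][j] = (A[i][j] - s) / d[j]
    return L, d

def ldlt_solve(L, d, b):
    """Solve L D L^T x = b by forward substitution, scaling, back substitution."""
    n = len(b)
    y = list(b)
    for i in range(n):                      # L y = b
        y[i] -= sum(L[i][k] * y[k] for k in range(i))
    z = [y[i] / d[i] for i in range(n)]     # D z = y
    x = list(z)
    for i in reversed(range(n)):            # L^T x = z
        x[i] -= sum(L[k][i] * x[k] for k in range(i + 1, n))
    return x

def frequency_sweep(K, B, M, b, omegas):
    """Assemble, factor, and solve once per excitation frequency."""
    n = len(K)
    out = []
    for w in omegas:
        A = [[K[i][j] + 1j * w * B[i][j] - w * w * M[i][j] for j in range(n)]
             for i in range(n)]
        L, d = ldlt_factor(A)
        out.append(ldlt_solve(L, d, b))
    return out
```

Because $A$ is complex symmetric rather than Hermitian, the factorization uses $L^T$, which is why the LAPACK counterpart is the `zsytrf` family rather than the Hermitian routines.
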
10. FREQUENCY RESPONSE IMPLEMENTATION: DETAILS OF REVISED METHOD
• For GPU code, use LU factorization instead:
    For $k = 1, \dots, n_{resp}$
      Assemble $A = K_{hh} + i\,\omega_k B_{hh} - \omega_k^2 M_{hh}$
      Factor $A = LU$
      Solve $x_k = A^{-1} b_k = U^{-1} L^{-1} b_k$
    End for
• OpenCL port of LAPACK zgesv available with clMAGMA and clBLAS
  ‒ In-core storage
  ‒ Dense oriented (okay for this application)
  ‒ Benefit mainly in factorization step (cubic operation count)

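For reference, the factor-and-solve step that `zgesv` performs can be written as Gaussian elimination with partial (row) pivoting. The pure-Python sketch below fuses the $L^{-1}$ forward step into the elimination and is meant only to show the algorithm. A real implementation would call LAPACK or clMAGMA, and the function name here is illustrative.

```python
def lu_solve(A, b):
    """Solve A x = b by LU factorization with partial (row) pivoting.

    Same algorithmic family as LAPACK zgesv; works on copies of the inputs.
    """
    n = len(A)
    a = [list(map(complex, row)) for row in A]
    x = list(map(complex, b))
    for j in range(n):
        # Pick the largest-magnitude pivot in column j and swap it into place.
        p = max(range(j, n), key=lambda r: abs(a[r][j]))
        if a[p][j] == 0:
            raise ZeroDivisionError("matrix is singular")
        a[j], a[p] = a[p], a[j]
        x[j], x[p] = x[p], x[j]
        # Eliminate below the pivot; apply the same row operations to the RHS.
        for i in range(j + 1, n):
            m = a[i][j] / a[j][j]
            for k in range(j, n):
                a[i][k] -= m * a[j][k]
            x[i] -= m * x[j]
    # Back substitution with the upper triangular factor U.
    for i in reversed(range(n)):
        x[i] = (x[i] - sum(a[i][k] * x[k] for k in range(i + 1, n))) / a[i][i]
    return x

# Example that needs a row swap because the leading entry is zero.
x = lu_solve([[0.0, 2.0], [1.0, 1j]], [2.0, 1.0])
```

The elimination loop costs on the order of $h^3$ operations while the back substitution costs on the order of $h^2$, which is why the GPU benefit concentrates in the factorization step.
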
11. FREQUENCY RESPONSE IMPLEMENTATION: LINEAR SOLVER SELECTION STRATEGY
• Original NX Nastran sparse symmetric solver
  ‒ Spills to disk, requires minimal memory
  ‒ Minimizes flops by utilizing symmetry
  ‒ Takes advantage of sparsity
• Improved SMP method (system462=1 in NXN9.0)
  ‒ In core, based on LAPACK zsytrf/zsytrs
  ‒ Efficient parallelization of $n_{resp}$ loop
  ‒ Large memory requirements
• OpenCL method (to appear in NXN9 MP)
  ‒ In core, based on clMAGMA zgesv (LU factorization)
  ‒ Utilizing GPU for best performance

12. FREQUENCY RESPONSE: INITIAL PERFORMANCE COMPARISON
• Test machine:
  ‒ Magny-Cours 2.1 GHz, 24 cores
  ‒ 32 GB memory
  ‒ 4 GB Tahiti GPU
• GPU roughly 40% faster than 24-way SMP

  Model   Modes
  e10k    1785
  e20k    3631
  e30k    5576
  e40k    7646

[Figure: run time (h:mm:ss, axis from 0:00:00 to 2:24:00) for models e10k, e20k, e30k, e40k; series: serial, smp=8, smp=24, GPU]

13. FREQUENCY RESPONSE – FURTHER IMPROVEMENTS: SINGLE PRECISION ARITHMETIC
• Use single precision on GPU for improved performance
  ‒ Higher flop rate (typically 4-5 times)
  ‒ Lower memory utilization (larger dimension problems possible)
  ‒ Better scaling with larger systems
• Single precision disadvantage: lower precision
  ‒ Accuracy acceptable for most engineering purposes (largest relative error of $10^{-5}$)

[Figure: relative error (log axis from 1E-08 to 1) for double precision and single precision]

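To put the $10^{-5}$ figure in context, the sketch below measures the error from storing a single value in IEEE 754 single precision, using only the standard library. The helper names are illustrative. The unit roundoff of $2^{-24} \approx 6 \times 10^{-8}$ bounds a single rounding, so a solution error near $10^{-5}$ is plausibly the result of rounding errors accumulating through factorization and substitution. The slides themselves do not break the error down.

```python
import struct

UNIT_ROUNDOFF_SINGLE = 2.0 ** -24     # IEEE 754 binary32, round to nearest

def to_single(x):
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]

def relative_rounding_error(x):
    """Relative error introduced by storing x in single precision."""
    return abs(to_single(x) - x) / abs(x)

err = relative_rounding_error(0.1)
print(f"relative error storing 0.1 in single precision: {err:.2e}")
```
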
14. FREQUENCY RESPONSE – FURTHER IMPROVEMENTS: SINGLE PRECISION ACCURACY AND PERFORMANCE
• 40-50% reduction in run time
• Largest example only possible in single precision

  Model   Modes
  e10k    1785
  e20k    3631
  e30k    5576
  e40k    7646
  e60k    12088

[Figure: run time (h:mm:ss, axis from 0:00:00 to 0:17:17) for models e10k, e20k, e30k, e40k, e60k; series: Double, Single]

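A rough in-core memory estimate is consistent with the largest case fitting only in single precision. The sketch below computes the footprint of dense $h \times h$ complex matrices. The `n_buffers` parameter is hypothetical, since the slides do not state how many matrix-sized buffers the solver holds on the 4 GB GPU.

```python
def dense_complex_gib(h, bytes_per_real, n_buffers=1):
    """Memory in GiB for n_buffers dense h x h complex matrices held in core."""
    bytes_per_entry = 2 * bytes_per_real        # real and imaginary parts
    return n_buffers * h * h * bytes_per_entry / 2 ** 30

h = 12088                                        # e60k mode count from the table
double_gib = dense_complex_gib(h, 8)             # complex double, 16 bytes per entry
single_gib = dense_complex_gib(h, 4)             # complex single, 8 bytes per entry
print(f"double: {double_gib:.2f} GiB, single: {single_gib:.2f} GiB per matrix")
```

One double-precision matrix at $h = 12088$ takes about 2.18 GiB, so two such buffers already exceed 4 GB, while the single-precision equivalents take half the space.
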
15. FREQUENCY RESPONSE – FURTHER IMPROVEMENTS: MATRIX SUMMATION ON GPU
• Perform addition of matrices at each frequency on GPU (assembly step):
  $$A = K_{hh} + i\,\omega_k B_{hh} - \omega_k^2 M_{hh}$$
• I.e., store $K_{hh}$, $B_{hh}$, $M_{hh}$ in GPU buffers and sum using zaxpy/caxpy kernels (double/single precision complex):
  $$A := K_{hh}$$
  $$A := A + i\,\omega_k B_{hh}$$
  $$A := A - \omega_k^2 M_{hh}$$
• Minimizes data transfer to/from main memory
• Additional GPU memory consumption

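The three-step summation maps directly onto the BLAS axpy operation ($y := \alpha x + y$). The sketch below mimics it on the CPU with flat Python lists standing in for GPU buffers. The function names and the 2 x 2 data are illustrative.

```python
def axpy(alpha, x, y):
    """In-place y := alpha * x + y on flat lists (the BLAS axpy operation)."""
    for i, xi in enumerate(x):
        y[i] += alpha * xi

def assemble(K, B, M, omega):
    """Form A = K + i*omega*B - omega^2*M with one copy and two axpy updates."""
    A = [complex(v) for v in K]      # A := K
    axpy(1j * omega, B, A)           # A := A + i*omega*B
    axpy(-omega * omega, M, A)       # A := A - omega^2*M
    return A

# Hypothetical 2 x 2 matrices stored row-major as flat buffers.
K = [4.0, 1.0, 1.0, 3.0]
B = [0.1, 0.0, 0.0, 0.2]
M = [1.0, 0.0, 0.0, 1.0]
A = assemble(K, B, M, omega=2.0)
```

Only the scalar $\omega_k$ changes between iterations, so with $K_{hh}$, $B_{hh}$, and $M_{hh}$ resident on the device no matrix data has to cross the bus during assembly.
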
16. FREQUENCY RESPONSE – FURTHER IMPROVEMENTS: MATRIX SUMMATION ON GPU PERFORMANCE
• Double precision best result (e30k):
  ‒ Time reduced 30% from 6:52 to 4:50
  ‒ 2x faster than best CPU time
• Single precision best result (e40k):
  ‒ Time reduced 22% from 6:23 to 4:58
  ‒ 4x faster than best CPU time
• Best scaling with largest problems
  ‒ Limited by GPU memory

[Figure: run time (h:mm:ss, axis from 0:00:00 to 0:12:58) for models e10k, e20k, e30k, e40k; series: Double, Double + zaxpy, Single, Single + caxpy]

18. MODAL ANALYSIS WITH RDMODES: OVERVIEW
• RDMODES – proprietary high-performance approximate eigensolver
• Tuned for typical customer use cases:
  ‒ Larger models (10 million+ DOFs)
  ‒ Many modes (300+)
  ‒ Accelerated computation when few output DOFs required
  ‒ Sufficient accuracy for frequency response calculations
• Performance up to 20x faster than Lanczos
• Demonstrated DMP scalability to 2048 nodes

19. MODAL ANALYSIS WITH RDMODES: COST BREAKDOWN
• RDMODES method is composed of multiple smaller operations – five areas listed below
• Costs for customer benchmark:
  ‒ 11 million DOFs
  ‒ Shell dominated
  ‒ 3000 modes below 400 Hz
  ‒ 300 frequency responses

  Operation                        Wall time
  Sparse factorization             18:40
  Dense factorization              24:00
  Sparse eigensolution             9:33
  Dense eigensolution              65:00
  Reduced (dense) eigensolution    21:16
  Total                            250:06

• Dense operations good candidates for GPU
  ‒ Factorization, eigensolution

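The table can be turned into shares of the total with a few lines of arithmetic. The sketch below assumes the entries are minutes:seconds, which the slide does not state explicitly, and uses illustrative names.

```python
def to_seconds(stamp):
    """Convert an 'mm:ss' wall-time string to seconds (minutes may exceed 59)."""
    minutes, seconds = stamp.split(":")
    return int(minutes) * 60 + int(seconds)

# Values copied from the table; the mm:ss reading is an assumption.
costs = {
    "Sparse factorization": "18:40",
    "Dense factorization": "24:00",
    "Sparse eigensolution": "9:33",
    "Dense eigensolution": "65:00",
    "Reduced (dense) eigensolution": "21:16",
}
total = to_seconds("250:06")

itemized = sum(to_seconds(t) for t in costs.values())
dense = sum(to_seconds(t) for name, t in costs.items() if "dense" in name.lower())

print(f"dense operations: {100 * dense / total:.1f}% of total")
print(f"five listed areas: {100 * itemized / total:.1f}% of total")
```

Under that reading, the three dense rows add up to 110:16, about 44% of the total, and the five listed areas add up to 138:29, so the rest of the 250:06 lies outside them.
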
20. RDMODES FACTORIZATION: CLASSIFICATION
• Fairly large quantity of each type
• Sparse factorizations:
  ‒ Typically too large to treat efficiently as dense
  ‒ NXN multifrontal solver very efficient
  ‒ Efficient sparse solution on GPU difficult (active research)
• Dense factorizations:
  ‒ Model dependent, typically small
  ‒ Symmetric positive definite, may use clMAGMA dposv
  ‒ Candidate for GPU

21. RDMODES FACTORIZATION: DENSE FACTORIZATION COST COMPARISON
• Dense factorization wall times
  ‒ Costs include factorization and miscellaneous assembly
• As with frequency response, GPU suitable above threshold
  ‒ Threshold of 5000 for this example
• Dense in-core methods helpful
• GPU ineffective for this model
  ‒ (all linear solutions relatively small)

[Figure: "Dense factorization times", wall time (h:mm:ss, axis from 0:00:00 to 0:25:55) for Serial and SMP=24; series: NXN, LAPACK, GPU]

22. RDMODES EIGENSOLUTION: CLASSIFICATION
• Sparse eigensolutions:
  ‒ Large number
  ‒ Sparse, relatively large dimension
  ‒ Inexpensive with NXN sparse eigensolvers
• Dense eigensolutions:
  ‒ Large number
  ‒ Dense, small-medium dimension
  ‒ Candidate for GPU
• Reduced eigensolution:
  ‒ Only one instance
  ‒ Dense, fairly large, many modes
  ‒ Strong candidate for GPU

23. RDMODES EIGENSOLUTION: DENSE SOLUTION METHODS
• Householder-type solution for real symmetric problem (dsyev):
  ‒ Reduce to tridiagonal: $Q^T A Q = T$
  ‒ Eigenvalues of tridiagonal: $Z^T T Z = \Lambda$
  ‒ Compute eigenvectors: $\Phi = QZ$
  ‒ Then $A\Phi = \Phi\Lambda$
• Efficient choice for dense problems, and/or many eigenvectors needed
  ‒ High memory consumption
• Transform generalized eigenvalue problem as follows:
  ‒ Factor: $M = LL^T$
  ‒ Solve: $L^{-1} K L^{-T} X = X\Lambda$
  ‒ Generalized eigensolution: $K(L^{-T}X) = M(L^{-T}X)\Lambda$

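The Cholesky-based reduction of $K\Phi = M\Phi\Lambda$ to a standard symmetric problem can be sketched for small dense matrices as follows. To keep the example short, the standard problem is solved with cyclic Jacobi rotations, which is a stand-in and not the Householder tridiagonalization that `dsyev` uses. All names and the 2 x 2 data are illustrative.

```python
import math

def cholesky(M):
    """Lower triangular L with M = L L^T for a symmetric positive definite M."""
    n = len(M)
    L = [[0.0] * n for _ in range(n)]
    for j in range(n):
        L[j][j] = math.sqrt(M[j][j] - sum(L[j][k] ** 2 for k in range(j)))
        for i in range(j + 1, n):
            L[i][j] = (M[i][j] - sum(L[i][k] * L[j][k] for k in range(j))) / L[j][j]
    return L

def forward_solve(L, b):
    """Solve L y = b for lower triangular L."""
    y = []
    for i in range(len(b)):
        y.append((b[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i])
    return y

def back_solve_t(L, z):
    """Solve L^T x = z for lower triangular L."""
    n = len(z)
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (z[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x

def to_standard(K, L):
    """Form C = L^{-1} K L^{-T}."""
    n = len(K)
    cols = [forward_solve(L, [K[i][j] for i in range(n)]) for j in range(n)]
    W = [[cols[j][i] for j in range(n)] for i in range(n)]      # W = L^{-1} K
    # Row i of C is L^{-1} applied to row i of W, because C^T = L^{-1} W^T.
    return [forward_solve(L, W[i]) for i in range(n)]

def jacobi_eigh(A, sweeps=50, tol=1e-14):
    """Eigenpairs of a real symmetric matrix by cyclic Jacobi rotations.

    Returns (eigenvalues, V) with eigenvectors in the columns of V.
    """
    n = len(A)
    a = [row[:] for row in A]
    V = [[float(i == j) for j in range(n)] for i in range(n)]
    for _ in range(sweeps):
        off = sum(a[i][j] ** 2 for i in range(n) for j in range(n) if i != j)
        if off < tol:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if a[p][q] == 0.0:
                    continue
                theta = 0.5 * math.atan2(2.0 * a[p][q], a[q][q] - a[p][p])
                c, s = math.cos(theta), math.sin(theta)
                for k in range(n):      # rotate columns p and q: a := a J
                    akp, akq = a[k][p], a[k][q]
                    a[k][p], a[k][q] = c * akp - s * akq, s * akp + c * akq
                for k in range(n):      # rotate rows p and q: a := J^T a
                    apk, aqk = a[p][k], a[q][k]
                    a[p][k], a[q][k] = c * apk - s * aqk, s * apk + c * aqk
                for k in range(n):      # accumulate eigenvectors: V := V J
                    vkp, vkq = V[k][p], V[k][q]
                    V[k][p], V[k][q] = c * vkp - s * vkq, s * vkp + c * vkq
    return [a[i][i] for i in range(n)], V

# Hypothetical 2 x 2 stiffness and mass matrices.
K = [[2.0, -1.0], [-1.0, 2.0]]
M = [[2.0, 0.0], [0.0, 1.0]]
L = cholesky(M)                                   # Factor: M = L L^T
lam, X = jacobi_eigh(to_standard(K, L))           # Solve: C X = X Lambda
n = len(K)
# Back-transform each eigenvector: Phi[j] = L^{-T} X[:, j].
Phi = [back_solve_t(L, [X[i][j] for i in range(n)]) for j in range(n)]
```

Each `Phi[j]` then satisfies $K\phi_j = \lambda_j M\phi_j$ up to rounding, which is the third step of the transformation above.
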
24. RDMODES EIGENSOLUTION: DENSE EIGENSOLUTION SCALABILITY
• Dimensions range from 2800 to 8800
  ‒ Dense problems, modes variable
• GPU beneficial for larger sizes
• Total times (serial) – 50% reduction (GPU vs. all LAPACK):
  ‒ 56:29 (all Lanczos)
  ‒ 15:30 (all LAPACK)
  ‒ 7:29 (using GPU)
• Total times (SMP) – 36% reduction (GPU vs. all LAPACK):
  ‒ 52:22 (all Lanczos)
  ‒ 4:41 (all LAPACK)
  ‒ 3:00 (using GPU)

[Figure: two panels, Serial and SMP=24; solution time (h:mm:ss, log axis from 0:00:01 to 2:24:00) versus dimension (2000, 4000, 8000); series: Lanczos, LAPACK, GPU]

25. RDMODES EIGENSOLUTION: GPU SUPPORT
• Householder methods well suited (as expected)
• Larger dimension dense problems benefit from the GPU
  ‒ And are the most time consuming
• Send most expensive problems to GPU
• Threshold set to 3800 for this test
  ‒ Note: optimal threshold depends on hardware and SMP

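The threshold rule amounts to a dispatch on problem dimension. The sketch below is illustrative: the function names, the use of `>=` at the boundary, and the example batch sizes are assumptions, and only the value 3800 comes from the slide.

```python
def choose_backend(dimension, gpu_threshold=3800):
    """Route one dense eigenproblem to 'gpu' or 'cpu' by its dimension."""
    return "gpu" if dimension >= gpu_threshold else "cpu"

def partition(dimensions, gpu_threshold=3800):
    """Split a batch of problem dimensions into CPU and GPU work lists."""
    work = {"cpu": [], "gpu": []}
    for n in dimensions:
        work[choose_backend(n, gpu_threshold)].append(n)
    return work

# Hypothetical batch of dense eigenproblem dimensions.
batch = [2800, 3500, 3800, 5200, 8800]
work = partition(batch)
```

Keeping the threshold as a parameter matters because, as noted above, the best value depends on the hardware and the SMP setting.
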
26. RDMODES EIGENSOLUTION: MOST SIGNIFICANT COST COMPONENTS
• Reduced eigensolution
  ‒ Not ideally suited to NXN Lanczos eigensolver
  ‒ Unique, but large (14K DOFs)
  ‒ Many eigenvectors needed
  ‒ GPU 30% speedup (both SMP and serial)
• GPU in RDMODES conclusions
  ‒ Dense and reduced eigensolutions benefit
  ‒ Threshold for dense eigensolution
  ‒ Dense factorization benefits from LAPACK: little additional benefit on GPU
• Sparse methods not supported yet

[Figure: "Reduced Eigensolution", wall time (h:mm:ss, axis from 0:00:00 to 0:57:36) for Serial and SMP=24; series: NXN, LAPACK, GPU]

27. RDMODES AND FREQUENCY RESPONSE: BENCHMARK PERFORMANCE RESULTS
• SMP=24, customer benchmark
• Compared to NXN system:
  ‒ Frequency response 3x faster
  ‒ Reduced eigensolution 2.8x faster
  ‒ Factorization 28% faster
  ‒ Dense eigensolution 9x faster
  ‒ 30% reduction in total run time
• Compared to LAPACK:
  ‒ Frequency response 3x faster
  ‒ Reduced eigensolution 2x faster
  ‒ 10% reduction in total run time

[Figure: stacked run time (h:mm:ss, axis from 0:00:00 to 8:24:00) for NXN, LAPACK, GPU; components: Frequency response, Reduced eigensolution, Dense eigensolution, Factorization, Other]

28. RDMODES EIGENSOLUTION: SINGLE PRECISION
• Performance advantages with single precision eigensolution
  ‒ As with linear solution in frequency response, single precision faster on GPU
  ‒ Lower GPU memory consumption (larger problems)
• Dense eigensolutions (customer benchmark) – 35-40% speedup:

            Double precision   Single precision
  Serial    7:01               4:16
  SMP=24    3:41               2:23

• Reduced eigensolution also benefits – 20% speedup:
  ‒ 3:05 to 2:29

30. CONCLUSIONS
• Significant benefit with GPU for certain computation types
  ‒ Frequency response calculation 2x-3x faster, dense eigensolution 2x faster
  ‒ Additional 35-50% improvement possible with single precision
  ‒ 30% lower turnaround time for typical customer benchmark
• Efficient dense matrix algebra on GPU with clMath, clMAGMA
• Many thanks to: Ben-Shan Liao, Wei Zhang (Siemens PLM), Antoine Reymond (AMD)
Thank you!