Parallel iterative solution of the hermite collocation equations on gpus
1. Parallel Iterative solution of the
Hermite Collocation Equations
on GPUs
Emmanuel N. Mathioudakis
Co-Authors : Elena Papadopoulou – Yiannis Saridakis – Nikolaos Vilanakis
TECHNICAL UNIVERSITY OF CRETE
DEPARTMENT OF SCIENCES
APPLIED MATHEMATICS AND COMPUTERS LABORATORY
73100 CHANIA - CRETE - GREECE
2. TECHNICAL UNIVERSITY OF CRETE
APPLIED MATHEMATICS AND COMPUTERS LABORATORY
Talk Overview
Hermite Collocation for elliptic BVPs &
Derivation of the Collocation linear system
Development of a parallel algorithm for the
Schur Complement method on Shared
Memory Parallel Architectures
Parallel implementation on multicore computers
with GPUs
3. TECHNICAL UNIVERSITY OF CRETE
APPLIED MATHEMATICS AND COMPUTERS LABORATORY
( , ) ( , ) , ( , )
( , ) ( , ) , ( , )
u x y f x y x y
u x y g x y x y
L
BBVP
( , ) ( , ) 0 , ( , )
:
( , ) ( , ) 0 , (
,
, )
y y yx x x
i j i j i j
y y yx x x
i j i j i j
a
i j
u f
u g
L
B
Hermite Collocation Method
4. TECHNICAL UNIVERSITY OF CRETE
APPLIED MATHEMATICS AND COMPUTERS LABORATORY
Hermite Collocation Method…
5. TECHNICAL UNIVERSITY OF CRETE
APPLIED MATHEMATICS AND COMPUTERS LABORATORY
Hermite Collocation Method…
6. TECHNICAL UNIVERSITY OF CRETE
APPLIED MATHEMATICS AND COMPUTERS LABORATORY
Red – Black Collocation Linear system
7. TECHNICAL UNIVERSITY OF CRETE
APPLIED MATHEMATICS AND COMPUTERS LABORATORY
Red – Black Collocation Linear system
8. TECHNICAL UNIVERSITY OF CRETE
APPLIED MATHEMATICS AND COMPUTERS LABORATORY
Red – Black Collocation Linear system
9. TECHNICAL UNIVERSITY OF CRETE
APPLIED MATHEMATICS AND COMPUTERS LABORATORY
Red – Black Collocation Linear system
10. TECHNICAL UNIVERSITY OF CRETE
APPLIED MATHEMATICS AND COMPUTERS LABORATORY
Red – Black Collocation Linear system
11. with 0
( , ) ( , ) ( , ) , ( , )
( , ) ( , ) , ( , )
u x y u x y f x y x y
u x y g x y x y
2
Model
Problem
TECHNICAL UNIVERSITY OF CRETE
APPLIED MATHEMATICS AND COMPUTERS LABORATORY
13. TECHNICAL UNIVERSITY OF CRETE
APPLIED MATHEMATICS AND COMPUTERS LABORATORY
Red – Black Collocation Linear system
14. The collocation matrix is large, sparse and enjoys no pleasant
properties (e.g. symmetric, definite)
Iterative
+
Parallel
TECHNICAL UNIVERSITY OF CRETE
APPLIED MATHEMATICS AND COMPUTERS LABORATORY
15. TECHNICAL UNIVERSITY OF CRETE
APPLIED MATHEMATICS AND COMPUTERS LABORATORY
Iterative Solution
with
16. TECHNICAL UNIVERSITY OF CRETE
APPLIED MATHEMATICS AND COMPUTERS LABORATORY
Iterative Solution
17. TECHNICAL UNIVERSITY OF CRETE
APPLIED MATHEMATICS AND COMPUTERS LABORATORY
Eigenvalues of Collocation matrix
18. TECHNICAL UNIVERSITY OF CRETE
APPLIED MATHEMATICS AND COMPUTERS LABORATORY
Schur Complement Iterative Solution
19. TECHNICAL UNIVERSITY OF CRETE
APPLIED MATHEMATICS AND COMPUTERS LABORATORY
Parallel Iterative Solution of Collocation Linear system
on Shared Memory Architectures
Uniform Load Balancing between core threads
Minimal Idle Cycles of core threads
Minimal Communication
20. TECHNICAL UNIVERSITY OF CRETE
APPLIED MATHEMATICS AND COMPUTERS LABORATORY
case of ns = 2p
( )
1
Rx ( )
2
Rx ( )
3
Rx ( )
4
Rx
21. TECHNICAL UNIVERSITY OF CRETE
APPLIED MATHEMATICS AND COMPUTERS LABORATORY
case of ns = 2p
( )
1
B
p
x
( )
2
B
p
x
( )
3
B
p
x
( )
4
B
p
x
22. TECHNICAL UNIVERSITY OF CRETE
APPLIED MATHEMATICS AND COMPUTERS LABORATORY
case of ns = 2p
V1
( )
1
Rx ( )
2
Rx ( )
3
Rx ( )
4
Rx( )
1
B
p
x
( )
2
B
p
x
( )
3
B
p
x
( )
4
B
p
x
V2 V3 V4 V5 V6 V7 V8
24. Mapping into a fixed size Architecture of N Cores
case of k = 2p/N even 2k virtual threads
l=(j-1)k+1,…,jk
l=2p+(j-1)k+1,…,2p+jk
j=1,…,N
( )B
l
x
( )R
l
x
V2 . . .
Pj
25. TECHNICAL UNIVERSITY OF CRETE
APPLIED MATHEMATICS AND COMPUTERS LABORATORY
Parallel Schur Complement Iterative Solution
26. TECHNICAL UNIVERSITY OF CRETE
APPLIED MATHEMATICS AND COMPUTERS LABORATORY
Parallel BiCGSTAB
27. TECHNICAL UNIVERSITY OF CRETE
APPLIED MATHEMATICS AND COMPUTERS LABORATORY
Parallel BiCGSTAB
28. TECHNICAL UNIVERSITY OF CRETE
APPLIED MATHEMATICS AND COMPUTERS LABORATORY
The Dirichlet Helmholtz Problem
2100( 0.1)2with
( , ) 10 ( ) ( ) , ( , ) [0,1] [0,1]
( ) ( ) x
u x y x y x y
x x x e
29. TECHNICAL UNIVERSITY OF CRETE
APPLIED MATHEMATICS AND COMPUTERS LABORATORY
HP SL390s
6 core Xeon@2.8GHz
24GB memory
Oracle Linux 6.3 x64
PGI 13.5 Fortran
PCI-e gen2 x16
HP SL390s – Tesla M2070 GPUs
+ 2 x
30. TECHNICAL UNIVERSITY OF CRETE
APPLIED MATHEMATICS AND COMPUTERS LABORATORY
Realization on HP SL390s Tesla GPU machine
Iterations / Error measurements
ns
BiCGSTAB
Iterations
|| b – Axn||2
256 294 6.06e-11
512 589 2.85e-11
1024 1161 1.39e-11
2048 3726 9.59e-12
31. TECHNICAL UNIVERSITY OF CRETE
APPLIED MATHEMATICS AND COMPUTERS LABORATORY
Realization on HP SL390s Tesla GPU machine
Time measurements
32. TECHNICAL UNIVERSITY OF CRETE
APPLIED MATHEMATICS AND COMPUTERS LABORATORY
Realization on HP SL390s Tesla GPU machine
Time measurements
33. TECHNICAL UNIVERSITY OF CRETE
APPLIED MATHEMATICS AND COMPUTERS LABORATORY
Realization on HP SL390s Tesla GPU machine
Time measurements
34. TECHNICAL UNIVERSITY OF CRETE
APPLIED MATHEMATICS AND COMPUTERS LABORATORY
Realization on HP SL390s Tesla GPU machine
Time measurements
35. TECHNICAL UNIVERSITY OF CRETE
APPLIED MATHEMATICS AND COMPUTERS LABORATORY
Realization on HP SL390s Tesla GPU machine
Time measurements
36. TECHNICAL UNIVERSITY OF CRETE
APPLIED MATHEMATICS AND COMPUTERS LABORATORY
Realization on HP SL390s Tesla GPU machine
Time measurements
37. Conclusions
• A new parallel algorithm implementing the Schur
complement with BiCGSTAB iterative method for
Hermite Collocation equations has been developed.
• The algorithm is realized on Shared Memory multi-core
machines with GPUs .
• A performance acceleration of up to 30% is observed.
TECHNICAL UNIVERSITY OF CRETE
APPLIED MATHEMATICS AND COMPUTERS LABORATORY
38. Future work
• Design an efficient parallel Schur complement
algorithm of the Hermite Collocation equations
for Multiprocessor /Grid machines with GPUs.
TECHNICAL UNIVERSITY OF CRETE
APPLIED MATHEMATICS AND COMPUTERS LABORATORY