Simulation Informatics: Analyzing Large Scientific Datasets
1. Simulation Informatics: Analyzing Large Datasets from Scientific Simulations
David F. Gleich, Purdue University, Computer Science Department
Paul G. Constantine, Stanford University
Joe Ruthruff & Jeremy Templeton, Sandia National Labs
2. This talk is a story …
3. How I learned to stop worrying and love the simulation
4. I asked… Can we do UQ on PageRank?
5. PageRank by Google
[Figure: a small directed graph on six nodes illustrating the random-surfer model.]
The model:
1. Follow edges uniformly with probability α, and
2. randomly jump with probability 1 − α; we'll assume everywhere is equally likely.
The places we find the surfer most often are important pages.
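As a point of reference for the model above, here is a minimal power-iteration sketch (not from the slides; it assumes a column-stochastic transition matrix P built from the graph):

import numpy as np

def pagerank(P, alpha=0.85, v=None, tol=1e-10, max_iter=1000):
    # Power iteration for x = alpha*P*x + (1 - alpha)*v, where P is
    # column-stochastic and v is the teleportation distribution.
    n = P.shape[0]
    if v is None:
        v = np.full(n, 1.0 / n)          # everywhere equally likely
    x = v.copy()
    for _ in range(max_iter):
        x_new = alpha * (P @ x) + (1 - alpha) * v
        delta = np.abs(x_new - x).sum()
        x = x_new
        if delta < tol:
            break
    return x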
6. What about sensitivity?
Random alpha PageRank (RAPr), or PageRank meets UQ:

(I − A P) x(A) = (1 − A) v

Model PageRank as the random variable x(A), where the jump parameter A is itself random, and look at E[x(A)] and Std[x(A)]. The sensitivity to the random jump can then be examined and understood.
Explored in Constantine and Gleich, WAW2007; and Constantine and Gleich, J. Internet Mathematics, 2011.
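A minimal Monte Carlo sketch of these quantities (a standard two-parameter Beta distribution on [0, 1) stands in for the four-parameter Beta used in the papers; the parameter values here are illustrative):

import numpy as np

def rapr_monte_carlo(P, v, n_samples=100, a=2.0, b=16.0, seed=0):
    # Estimate E[x(A)] and Std[x(A)] by sampling A ~ Beta(a, b)
    # and solving one deterministic PageRank system per sample.
    rng = np.random.default_rng(seed)
    n = P.shape[0]
    xs = np.empty((n_samples, n))
    for i in range(n_samples):
        alpha = rng.beta(a, b)
        # Solve (I - alpha*P) x = (1 - alpha) v directly;
        # the power iteration sketched earlier also works.
        xs[i] = np.linalg.solve(np.eye(n) - alpha * P, (1 - alpha) * v)
    return xs.mean(axis=0), xs.std(axis=0)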
7. Random alpha PageRank has a rigorous convergence theory.

Method                 | Conv.    | Work required                | What is N?
Monte Carlo            | 1/√N     | N samples from A             | number of PageRank systems
Path damping           | r^(N+2)  | N + 1 matrix-vector products | terms of Neumann series
  (without Std[x(A)])  |          |                              |
Gaussian quadrature    | r^(2N)   | N quadrature points          | number of PageRank systems

The convergence rates involve parameters of the Beta(a, b, l, r) distribution for A.
9. We studied parameterized matrices.

A(s)x(s) = b(s)   — a discretized PDE with explicit parameters

⇒ A(s_1)x(s_1) = b(s_1), …, A(s_N)x(s_N) = b(s_N)   (sampled solutions)
or
⇒ A_N(s_1)x_N(s_1) = b_N(s_1), …   (parameterized solution)

Constantine, Gleich, and Iaccarino. Spectral Methods for Parameterized Matrix Equations, SIMAX, 2010.
Constantine, Gleich, and Iaccarino. A factorization of the spectral Galerkin system for parameterized matrix equations: derivation and applications, SISC, 2011 — how to compute the Galerkin solution in a weakly intrusive manner.
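A sketch of the sampling side of this setup (the callables A and b are a hypothetical interface for illustration; they return the matrix and right-hand side at a parameter value s):

import numpy as np

def sample_solutions(A, b, samples):
    # Solve A(s) x(s) = b(s) independently at each parameter sample.
    return [np.linalg.solve(A(s), b(s)) for s in samples]

# Example with a 2x2 system whose entries depend on s:
# xs = sample_solutions(lambda s: np.array([[2.0 + s, 1.0], [1.0, 3.0]]),
#                       lambda s: np.array([1.0, s]),
#                       samples=np.linspace(0, 1, 5))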
10. Simulation: The Third Pillar of Science
21st Century Science in a nutshell: experiments are not practical or feasible, so simulate things instead.
But do we trust the simulations? We're trying:
Model Fidelity
Verification & Validation (V&V)
Uncertainty Quantification (UQ)
11. The message
Insight and confidence require multiple runs.
14. Large-scale nonlinear, time-dependent heat transfer problem
10^5 nodes
10^3 time steps
30 minutes on 16 cores
Questions:
What is the probability of failure?
Which input values cause failure?
15. It's time to ask "What can science learn from Google?"
– Wired Magazine (2008)
16. "We can throw the numbers into the biggest computing clusters the world has ever seen and let statistical algorithms find patterns where science cannot." – Wired (again)
21st Century Science in a nutshell? Simulations are too expensive. Let data provide a surrogate.
17. Our approach
Construct an interpolating reduced order model from a budget-constrained ensemble of runs for uncertainty and optimization studies.
18. That is, we store the runs and build the interpolant from the pre-computed data.
Supercomputer → data computing cluster → engineer: each multi-day HPC simulation generates gigabytes of data; a data cluster can hold hundreds or thousands of old simulations, enabling engineers to query and analyze months of simulation data for statistical studies and uncertainty quantification.
19. The Database
Input parameters s map to the time history of the simulation f: s_1 → f_1, s_2 → f_2, …, s_k → f_k.

The simulation as a vector, where q(x_i, t_j, s) is the value of a single simulation at one mesh point and one time step:

f(s) = [q(x_1, t_1, s), …, q(x_n, t_1, s), q(x_1, t_2, s), …, q(x_n, t_2, s), …, q(x_n, t_k, s)]^T

The database as a matrix:

X = [f(s_1)  f(s_2)  …  f(s_p)]
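A minimal sketch of assembling the database matrix (it assumes each run is stored as an array of shape (time steps, mesh points), so flattening in row-major order matches the time-then-space stacking above):

import numpy as np

def build_database(runs):
    # Stack each run's flattened space-time field as a column of X;
    # X has shape (n * k, p) for p runs.
    return np.column_stack([q.reshape(-1) for q in runs])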
20. The interpolant
Let the data give you the basis:

X = [f(s_1)  f(s_2)  …  f(s_p)]

Then find the right combination:

f(s) ≈ Σ_{j=1}^{r} u_j α_j(s)

where the u_j are the left singular vectors from X!

Motivation: this idea was inspired by the success of other reduced order models like POD, and Paul's residual minimizing idea.
21. Why the SVD? Let's study a simple case.

X = [g(x_i, s_j)]  (an m × p matrix of samples)
  = U Σ V^T

g(x_i, s_j) = Σ_{ℓ=1}^{r} U_{i,ℓ} σ_ℓ V_{j,ℓ} = Σ_{ℓ=1}^{r} u_ℓ(x_i) σ_ℓ v_ℓ(s_j)   — split x and s

For a general parameter s:

g(x_i, s) = Σ_{ℓ=1}^{r} u_ℓ(x_i) σ_ℓ v_ℓ(s),   where v_ℓ(s) ≈ Σ_{j=1}^{p} v_ℓ(s_j) φ_j^(ℓ)(s)

Treat each right singular vector as samples of an unknown basis function, and interpolate v any way you wish.
22. Method summary (a sketch follows below)
Compute the SVD of X.
Compute an interpolant of the right singular vectors.
Approximate a new value of f(s).
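A minimal end-to-end sketch of this method, assuming a single scalar parameter and linear interpolation of the right singular vectors (any interpolant works):

import numpy as np
from scipy.interpolate import interp1d

def fit_rom(X, s_train, rank):
    # SVD of the database: columns of U are the basis,
    # rows of Vt are samples of the parameter functions v_l(s).
    U, sigma, Vt = np.linalg.svd(X, full_matrices=False)
    U, sigma, Vt = U[:, :rank], sigma[:rank], Vt[:rank]
    v_interp = [interp1d(s_train, Vt[l]) for l in range(rank)]

    def predict(s):
        # f(s) ~ sum_j u_j * alpha_j(s) with alpha_j(s) = sigma_j v_j(s)
        alpha = sigma * np.array([float(v(s)) for v in v_interp])
        return U @ alpha

    return predict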
23. A quiz
Which section would you rather try to interpolate, A or B?
[Figure: two sections, labeled A and B, of a curve to interpolate.]
24. How predictable is a singular vector?
Folk Theorem (O'Leary 2011): the singular vectors of a matrix of "smooth" data become more oscillatory as the index increases.
Implication: the gradient of the singular vectors increases as the index increases.
v_1(s), v_2(s), …, v_t(s): predictable
v_{t+1}(s), …, v_r(s): unpredictable
25. A refined method with an error model
Don't even try to interpolate the unpredictable modes; model them as noise instead:

f(s) ≈ Σ_{j=1}^{t(s)} u_j α_j(s)  +  Σ_{j=t(s)+1}^{r} u_j σ_j η_j,   η_j ~ N(0, 1)
         (predictable)                 (unpredictable)

Variance[f] = diag( Σ_{j=t(s)+1}^{r} σ_j^2 u_j u_j^T )

But now, how to choose t(s)?
26. Our current approach to choosing the predictability
t(s) is the largest τ such that

Σ_{i=1}^{τ} σ_i ‖∂v_i/∂s‖_1 < threshold
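A sketch of the refined prediction, combining the two slides above. It assumes a scalar parameter, and dv_ds[i] is a hypothetical helper returning the derivative of the i-th interpolated singular vector at s:

import numpy as np

def predict_with_error_model(U, sigma, v_interp, dv_ds, s, threshold):
    r = len(sigma)
    # t(s): largest t keeping sum_i sigma_i * |dv_i/ds| below the threshold.
    t, total = 0, 0.0
    for i in range(r):
        total += sigma[i] * abs(dv_ds[i](s))
        if total >= threshold:
            break
        t = i + 1
    # Interpolate the predictable modes; model the rest as noise.
    alpha = sigma[:t] * np.array([float(v(s)) for v in v_interp[:t]])
    mean = U[:, :t] @ alpha
    noise = U[:, t:] @ (sigma[t:] * np.random.standard_normal(r - t))
    # diag of sum_j sigma_j^2 u_j u_j^T, computed column-wise.
    variance = (U[:, t:] ** 2) @ (sigma[t:] ** 2)
    return mean + noise, variance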
27. An experimental test case
A heat equation problem with two parameters that control the material properties.
29. [Figure: our reduced order model vs. the truth, compared where the error is the worst.]
30. A Large Scale Example
Nonlinear heat transfer model: 80k nodes, 300 time-steps, 104 basis runs.
SVD of a 24M × 104 data matrix.
500× reduction in wall clock time (100× including the SVD).
32. Quick review of QR
QR factorization: let A be a real m × n matrix; then A = QR, where Q is orthogonal (Q^T Q = I) and R is upper triangular.
Using QR for regression: the solution of min_x ‖Ax − b‖ is given by the solution of Rx = Q^T b.
QR is block normalization: "normalizing" a vector usually generalizes to computing Q in the QR factorization.
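A small worked example of QR for regression (standard numpy calls; the data is synthetic):

import numpy as np

A = np.random.standard_normal((1000, 5))   # tall-and-skinny matrix
b = np.random.standard_normal(1000)

Q, R = np.linalg.qr(A)            # A = QR, Q orthogonal, R upper triangular
x = np.linalg.solve(R, Q.T @ b)   # solves min_x ||Ax - b||

# Check against the library least-squares solver:
assert np.allclose(x, np.linalg.lstsq(A, b, rcond=None)[0])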
33. Intro to MapReduce
Originated at Google for indexing web pages and computing PageRank.
The idea: bring the computations to the data. Express algorithms in data-local operations. Implement one type of communication: the shuffle, which moves all data with the same key to the same reducer.
[Diagram: mappers feed a shuffle, which feeds reducers.]
Data scalable.
Fault-tolerance by design: input stored in triplicate; map output persisted to disk before the shuffle; reduce input/output on disk.
34. Mesh point variance in MapReduce
[Figure: three runs, each with time steps T=1, T=2, T=3, stored as separate records.]
35. Mesh point variance in MapReduce
[Figure: the same three runs, now flowing through mappers and reducers.]
1. Each mapper outputs the mesh points with the same key.
2. The shuffle moves all values from the same mesh point to the same reducer.
3. Reducers just compute a numerical variance.
Bring the computations to the data!
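A minimal stand-in for the three steps above (pure-Python mapper and reducer; the record layout is an assumption for illustration):

import numpy as np

def mapper(run_id, timestep_values):
    # timestep_values: dict of mesh point id -> value at that point.
    for point_id, value in timestep_values.items():
        yield point_id, value          # key on the mesh point

def reducer(point_id, values):
    # All values for one mesh point arrive at one reducer.
    yield point_id, np.var(list(values))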
36. Communication-avoiding TSQR (Demmel et al. 2008)
First, do QR factorizations of each local matrix; second, compute a QR factorization of the new "R".
Demmel et al. 2008. Communication-avoiding parallel and sequential QR.
37. Serial QR factorizations: fully serial TSQR (Demmel et al. 2008)
Compute the QR of the first block; read the next block and update the QR; repeat.
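A sketch of the serial TSQR update in numpy (R-only; block names are illustrative):

import numpy as np

def serial_tsqr(blocks):
    # Maintain a running R; stack it with each new block and re-factor.
    R = np.empty((0, blocks[0].shape[1]))
    for A_i in blocks:
        R = np.linalg.qr(np.vstack([R, A_i]), mode='r')
    return R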
38. Tall-and-skinny matrix storage in MapReduce
Key: an arbitrary row-id.
Value: the array for a row.
Each submatrix A_1, …, A_4 is an input split.
39. Algorithm
Data: rows of a matrix. Map: QR factorization of rows. Reduce: QR factorization of rows.
[Diagram: Mapper 1 runs serial TSQR over blocks A1–A4 and emits R4; Mapper 2 runs serial TSQR over blocks A5–A8 and emits R8; Reducer 1 runs serial TSQR over R4 and R8 and emits the final R.]
40. Key Limitations
Computes only R and not Q.
Can get Q via Q = AR⁺ with another MapReduce iteration (we currently use this for computing the SVD). Dubious numerical stability; iterative refinement helps.
Working on better ways to compute Q (with Austin Benson, Jim Demmel).
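The Q = AR⁺ step in spirit, as a naive dense sketch (each row block can be processed independently, which is what the extra MapReduce pass exploits):

import numpy as np

def q_from_r(blocks, R):
    # Naive Q = A * inv(R), block by block; a triangular solve is
    # preferable numerically, matching the stability caveat above.
    R_inv = np.linalg.inv(R)
    return [A_i @ R_inv for A_i in blocks]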
41. Full code in hadoopy

import random, numpy, hadoopy

class SerialTSQR:
    def __init__(self, blocksize, isreducer):
        self.bsize = blocksize
        self.data = []
        if isreducer: self.__call__ = self.reducer
        else: self.__call__ = self.mapper

    def compress(self):
        R = numpy.linalg.qr(numpy.array(self.data), 'r')
        # reset data and re-initialize to R
        self.data = []
        for row in R:
            self.data.append([float(v) for v in row])

    def collect(self, key, value):
        self.data.append(value)
        if len(self.data) > self.bsize * len(self.data[0]):
            self.compress()

    def close(self):
        self.compress()
        for row in self.data:
            key = random.randint(0, 2000000000)
            yield key, row

    def mapper(self, key, value):
        self.collect(key, value)

    def reducer(self, key, values):
        for value in values: self.mapper(key, value)

if __name__ == '__main__':
    mapper = SerialTSQR(blocksize=3, isreducer=False)
    reducer = SerialTSQR(blocksize=3, isreducer=True)
    hadoopy.run(mapper, reducer)
42. Too much data? Lots of maps? Add an iteration!
[Diagram, iteration 1: Mappers 1-1 through 1-4 run serial TSQR over blocks A1–A4, emitting R1–R4; a shuffle sends these to Reducers 1-1 through 1-3, which run serial TSQR and emit R2,1–R2,3. Iteration 2: an identity map and a final Reducer 2-1 run serial TSQR to produce the final R.]
43. mrtsqr – summary of parameters
Blocksize: how many rows to read before computing a QR factorization, expressed as a multiple of the number of columns (see paper).
Splitsize: the size of each local matrix.
Reduction tree: the number of reducers and iterations to use.
[Diagram: the reduction tree across iterations.]
44. Varying splitsize and the tree
Data: synthetic.

Cols. | Iters. | Split (MB) | Maps | Secs.
   50 |      1 |         64 | 8000 |   388
    – |      – |        256 | 2000 |   184
    – |      – |        512 | 1000 |   149
    – |      2 |         64 | 8000 |   425
    – |      – |        256 | 2000 |   220
    – |      – |        512 | 1000 |   191
 1000 |      1 |        512 | 1000 |   666
    – |      2 |         64 | 6000 |   590
    – |      – |        256 | 2000 |   432
    – |      – |        512 | 1000 |   337

Increasing split size improves performance (accounts for Hadoop data movement). Increasing iterations helps for problems with many columns. (1000 columns with a 64-MB split size overloaded the single reducer.)
45. MapReduce TSQR summary
MapReduce is great for TSQR!
Data: a tall-and-skinny (TS) matrix by rows. Map: QR factorization of local rows. Reduce: QR factorization of local rows. Demmel et al. showed that this construction works to compute a QR factorization with minimal communication.
Input: 500,000,000-by-100 matrix; each record is a 1-by-100 row; HDFS size 423.3 GB.
Time to compute the norm of each column: 161 sec.
Time to compute R in the QR factorization: 387 sec.
On a 64-node Hadoop cluster with 4x2TB disks, one Core i7-920, and 12GB RAM per node.
46. Our vision
To enable analysts and engineers to hypothesize from data computations instead of expensive HPC computations.
Paul G. Constantine
Sandia: Jeremy Templeton, Joe Ruthruff
… and you? …