1. Big Iron and Parallel Processing
USArray Data Processing Workshop
Scott Teige, PhD
July 30, 2009
2. Overview
• How big is “Big Iron”?
• Where is it, what is it?
• One system, the details
• Parallelism, the way forward
• Scaling and what it means to you
• Programming techniques
• Examples
• Exercises
3. What is the TeraGrid?
• “… a nationally distributed
cyberinfrastructure that provides leading
edge computational and data services for
scientific discovery through research and
education…”
• A document exists in your training
account home directories.
4. Some TeraGrid Systems

System     Site    Vendor  Peak (TFLOPS)  Memory (TB)
Kraken     NICS    Cray    608            128
Ranger     TACC    Sun     579            123
Abe        NCSA    Dell    89             9.4
Lonestar   TACC    Dell    62             11.6
Steele     Purdue  Dell    60             12.4
Queen Bee  LONI    Dell    50             5.3
Lincoln    NCSA    Dell    47             3.0
Big Red    IU      IBM     30             6.0
5. System Layout

System    Clock (GHz)  Cores
Kraken    2.30         66048
Ranger    2.66         62976
Abe       2.33         9600
Lonestar  2.66         5840
Steele    2.33         7144
6. Availability

System     Peak (TFLOPS)  Utilization  Idle (TFLOPS)
Kraken     608            96%          24.3
Ranger     579            91%          52.2
Abe        89             90%          8.9
Lonestar   62             92%          5.0
Steele     60             67%          19.8
Queen Bee  51             95%          2.5
Lincoln    48             4%           45.6
Big Red    31             83%          5.2
7. Research Cyberinfrastructure
The Big Picture:
• Compute
Big Red (IBM e1350 Blade Center JS21)
Quarry (IBM e1350 Blade Center HS21)
• Storage
HPSS
GPFS
OpenAFS
Lustre
Lustre/WAN
8. High Performance Systems
• Big Red [TeraGrid System]
30 TFLOPS IBM JS21 SuSE Cluster
768 blades/3072 cores: 2.5 GHz PPC 970MP
8GB Memory, 4 cores per blade
Myrinet 2000
LoadLeveler & Moab
• Quarry [Future TeraGrid System]
7 TFLOPS IBM HS21 RHEL Cluster
140 blades/1120 cores: 2.0 GHz Intel Xeon 5335
8GB Memory, 8 cores per blade
1Gb Ethernet (upgrading to 10Gb)
PBS (Torque) & Moab
10. Data Capacitor (AKA Lustre)
• High-performance parallel file system
  - ca. 1.2 PB spinning disk
  - local and WAN capabilities
• SC07 Bandwidth Challenge winner
  - moved 18.2 Gbps across a single 10 Gbps link
11. HPSS
• High Performance Storage System
• ca. 3 PB tape storage
• 75 TB front-side disk cache
• Ability to mirror data between IUPUI and
IUB campuses
12. Serial vs. Parallel
• Serial programs involve: Calculation, Flow Control, I/O
• Parallel programs involve all of these, plus Synchronization and Communication
13. Amdahl’s Law
[Figure: a serial program of unit run time; a fraction F of the work can be
spread over N cores (taking time F/N), while the remaining fraction 1-F stays serial.]
Amdahl’s Law: S = 1 / ((1 - F) + F/N)
Special case, F = 1: S = N, ideal scaling
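A quick worked example (the numbers are illustrative, not from the slides): with F = 0.9, S = 1/(0.1 + 0.9/N), so 16 cores give S = 6.4 and no number of cores can push S past 10. A minimal C sketch of the same arithmetic:

#include <stdio.h>

/* Amdahl speedup for parallel fraction f on n cores */
static double amdahl(double f, int n) {
    return 1.0 / ((1.0 - f) + f / n);
}

int main(void) {
    double f = 0.9;                          /* assumed parallel fraction */
    int n;
    for (n = 1; n <= 1024; n *= 4)
        printf("N = %4d   S = %6.2f\n", n, amdahl(f, n));
    return 0;
}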
14. Speed for various scaling rules
• “Paralyzable process”: S = N * exp(-(N-1)/q)
• “Superlinear scaling”: S > N
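One way to read the first rule (a worked example, not from the slides): differentiating S = N * exp(-(N-1)/q) with respect to N shows the speedup peaks at N = q and then falls, so with q = 100 the best you can reach is S(100) = 100 * exp(-0.99) ≈ 37, and adding cores beyond that actually slows the job down.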
15. MPI vs. OpenMP
MPI:
• Code may execute across many nodes
• The entire program is replicated for each core (sections may or may not execute)
• Variables are not shared
• Typically requires structural modification to the code
OpenMP:
• Code executes only on the set of cores sharing memory
• Sections of code may be parallel or serial
• Variables may be shared
• Incremental parallelization is easy
16. Other methods exist:
• Sockets
• Explicit shared memory calls/operations
• Pthreads
• None are recommended
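For comparison only (these approaches are not recommended here), this is roughly what the “hello” example on the next slide looks like with raw Pthreads; the thread count is made up for illustration. Note how much explicit bookkeeping replaces a single OpenMP pragma.

#include <stdio.h>
#include <pthread.h>

#define NTHREADS 4                        /* illustrative thread count */

static void *hello(void *arg) {
    long id = (long)arg;
    printf("Hello from pthread %ld\n", id);
    return NULL;
}

int main(void) {
    pthread_t threads[NTHREADS];
    long i;
    for (i = 0; i < NTHREADS; i++)        /* create each worker by hand */
        pthread_create(&threads[i], NULL, hello, (void *)i);
    for (i = 0; i < NTHREADS; i++)        /* and wait for each one by hand */
        pthread_join(threads[i], NULL);
    return 0;
}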
17. export OMP_NUM_THREADS=8
icc mp_baby.c -openmp -o mp_baby
./mp_baby

#include <stdio.h>
#include <omp.h>
int main(int argc, char *argv[]) {
  int iam = 0, np = 1;
  #pragma omp parallel default(shared) private(iam, np)
  {                                   /* fork: the thread team starts here */
#if defined (_OPENMP)
    np = omp_get_num_threads();
    iam = omp_get_thread_num();
#endif
    printf("Hello from thread %d out of %d\n", iam, np);
  }                                   /* join: the team ends here */
  return 0;
}
18.   PROGRAM DOT_PRODUCT
      INTEGER N, CHUNKSIZE, CHUNK, I
      PARAMETER (N=100)
      PARAMETER (CHUNKSIZE=10)
      REAL A(N), B(N), RESULT

!     Some initializations
      DO I = 1, N
        A(I) = I * 1.0
        B(I) = I * 2.0
      ENDDO
      RESULT = 0.0
      CHUNK = CHUNKSIZE

!$OMP PARALLEL DO
!$OMP& DEFAULT(SHARED) PRIVATE(I)
!$OMP& SCHEDULE(STATIC,CHUNK)
!$OMP& REDUCTION(+:RESULT)
!     Fork: loop iterations are divided among the threads
      DO I = 1, N
        RESULT = RESULT + (A(I) * B(I))
      ENDDO
!$OMP END PARALLEL DO NOWAIT
!     Join: threads finish and RESULT holds the combined reduction

      PRINT *, 'Final Result= ', RESULT
      END
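For the C-examples track, a roughly equivalent dot product (a sketch, not one of the workshop handouts) might look like this:

#include <stdio.h>
#include <omp.h>

#define N 100
#define CHUNK 10

int main(void) {
    float a[N], b[N], result = 0.0f;
    int i;

    /* some initializations */
    for (i = 0; i < N; i++) {
        a[i] = (i + 1) * 1.0f;
        b[i] = (i + 1) * 2.0f;
    }

    /* fork: iterations split among threads, partial sums combined at the join */
    #pragma omp parallel for default(shared) private(i) \
            schedule(static, CHUNK) reduction(+:result)
    for (i = 0; i < N; i++)
        result += a[i] * b[i];

    printf("Final Result= %f\n", result);
    return 0;
}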
19. Synchronization Constructs
• MASTER: block executed only by master
thread
• CRITICAL: block executed by one thread
at a time
• BARRIER: each thread waits until all
threads reach the barrier
• ORDERED: block executed sequentially
by threads
20. Data Scope Attribute Clauses
• SHARED: variable is shared across all
threads
• PRIVATE: variable is replicated in each
thread
• DEFAULT: change the default scoping of
all variables in a region
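A small sketch of the difference (illustrative only): count is shared, so every thread updates the same copy; tid is private, so each thread gets its own.

#include <stdio.h>
#include <omp.h>

int main(void) {
    int count = 0;
    int tid;
    #pragma omp parallel default(none) shared(count) private(tid)
    {
        tid = omp_get_thread_num();       /* private: one copy per thread */
        #pragma omp critical
        count++;                          /* shared: all threads touch the same count */
    }
    printf("threads counted: %d\n", count);
    return 0;
}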
21. Some Useful Library routines
• omp_set_num_threads(integer)
• omp_get_num_threads()
• omp_get_max_threads()
• omp_get_thread_num()
• Others are implementation dependent
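A short sketch of the first three routines in use (the requested thread count is illustrative):

#include <stdio.h>
#include <omp.h>

int main(void) {
    omp_set_num_threads(4);                            /* request 4 threads */
    printf("max threads available: %d\n", omp_get_max_threads());
    #pragma omp parallel
    {
        if (omp_get_thread_num() == 0)                 /* only thread 0 reports */
            printf("team size inside region: %d\n", omp_get_num_threads());
    }
    return 0;
}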
22. OpenMP Advice
• Always explicitly scope variables
• Never branch into/out of a parallel region
• Never put a barrier in an if block
• Quarry is at an OpenMP version below 3.0; the TASK
construct, for example, is not available there
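To make the barrier warning concrete, here is a sketch of the pattern to avoid (deliberately wrong code): only even-numbered threads reach the barrier, so they wait forever for threads that will never arrive.

#include <omp.h>

void bad_pattern(void) {
    #pragma omp parallel
    {
        if (omp_get_thread_num() % 2 == 0) {
            /* ... some work ... */
            #pragma omp barrier   /* DON'T: odd-numbered threads never get here */
        }
    }
}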
23. Exercise: OpenMP
• The example programs are in ~/OMP_F_examples or
~/OMP_C_examples
• Go to https://computing.llnl.gov/tutorials/openMP/excercise.html
• Skip to step 4, compiler is “icc” or “ifort”
• There is no evaluation form
24. #include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int myrank;
int ntasks;

int main(int argc, char **argv)
{
  /* Initialize MPI */
  MPI_Init(&argc, &argv);

  /* get number of workers */
  MPI_Comm_size(MPI_COMM_WORLD, &ntasks);

  /* Find out my identity in the default communicator;
     each task gets a unique rank between 0 and ntasks-1 */
  MPI_Comm_rank(MPI_COMM_WORLD, &myrank);

  MPI_Barrier(MPI_COMM_WORLD);
  fprintf(stdout, "Hello from MPI_BABY=%d\n", myrank);
  MPI_Finalize();
  exit(0);
}
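Following the pattern of the OpenMP slide, this example might be built and launched along these lines (the file name is illustrative; see the exercise slide for the exact workflow used in the training accounts):

mpicc mpi_baby.c -o mpi_baby
mpirun -np 8 ./mpi_baby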
26. From the man page: MPI_Scatter - Sends data from one task to all tasks in a
group; the message is split into n equal segments, and the ith segment is sent
to the ith process in the group.

C AUTHOR: Blaise Barney
      program scatter
      include 'mpif.h'

      integer SIZE
      parameter(SIZE=4)
      integer numtasks, rank, sendcount, recvcount, source, ierr
      real*4 sendbuf(SIZE,SIZE), recvbuf(SIZE)

C     Fortran stores this array in column major order, so the
C     scatter will actually scatter columns, not rows.
      data sendbuf /1.0,  2.0,  3.0,  4.0,
     &              5.0,  6.0,  7.0,  8.0,
     &              9.0, 10.0, 11.0, 12.0,
     &             13.0, 14.0, 15.0, 16.0 /

      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, numtasks, ierr)

      if (numtasks .eq. SIZE) then
        source = 1
        sendcount = SIZE
        recvcount = SIZE
        call MPI_SCATTER(sendbuf, sendcount, MPI_REAL, recvbuf,
     &                   recvcount, MPI_REAL, source, MPI_COMM_WORLD, ierr)
        print *, 'rank= ', rank, ' Results: ', recvbuf
      else
        print *, 'Must specify', SIZE, ' processors. Terminating.'
      endif

      call MPI_FINALIZE(ierr)
      end
27. Some linux tricks to get more information:
man -w MPI
ls /N/soft/linux-rhel4-x86_64/openmpi/1.3.1/intel-64/share/man/man3
MPI_Abort
MPI_Allgather
MPI_Allreduce
MPI_Alltoall
...
MPI_Wait
MPI_Waitall
MPI_Waitany
MPI_Waitsome
mpicc --showme
/N/soft/linux-rhel4-x86_64/intel/cce/10.1.022/bin/icc
-I/N/soft/linux-rhel4-x86_64/openmpi/1.3.1/intel-64/include
-pthread -L/N/soft/linux-rhel4-x86_64/openmpi/1.3.1/intel-64/lib
-lmpi -lopen-rte -lopen-pal -ltorque -lnuma -ldl
-Wl,--export-dynamic -lnsl -lutil -ldl -Wl,-rpath -Wl,/usr/lib64
28. MPI cool stuff:
• Bi-directional communication
• Non-blocking communication
• User defined types
• Virtual topologies
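A small sketch of non-blocking communication (it assumes exactly two ranks; buffer names are illustrative): the receive and send are posted, useful work can overlap the transfer, and MPI_Waitall completes both requests.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    int rank, other, sendval, recvval;
    MPI_Request reqs[2];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    other = 1 - rank;                 /* partner rank; assumes -np 2 */
    sendval = rank;

    MPI_Irecv(&recvval, 1, MPI_INT, other, 0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(&sendval, 1, MPI_INT, other, 0, MPI_COMM_WORLD, &reqs[1]);

    /* ... useful work can go here while the messages are in flight ... */

    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
    printf("rank %d received %d\n", rank, recvval);
    MPI_Finalize();
    return 0;
}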
29. MPI Advice
• Never put a barrier in an if block
• Use care with non-blocking
communication, things can pile up fast
30. So, can I use MPI with OpenMP?
• Yes you can; extreme care is advised
• Some implementations of MPI forbid it
• You can get killed by “oversubscription” very fast; I’ve seen run time grow like N²
• But sometimes you must… some FFTW libraries are OpenMP multithreaded, for example
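If you do combine them, a common shape (a sketch, assuming the MPI library supports threads) is one MPI rank per node with an OpenMP team inside each rank; keeping ranks times OMP_NUM_THREADS at or below the cores on a node avoids the oversubscription mentioned above.

#include <stdio.h>
#include <mpi.h>
#include <omp.h>

int main(int argc, char **argv) {
    int provided, rank;

    /* ask for funneled thread support instead of plain MPI_Init */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    #pragma omp parallel
    printf("rank %d, thread %d of %d\n",
           rank, omp_get_thread_num(), omp_get_num_threads());

    MPI_Finalize();
    return 0;
}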
31. Exercise: MPI
• Examples are in ~/MPI_F_examples or ~/MPI_C_examples
• Go to https://computing.llnl.gov/tutorials/mpi/exercise.html
• Skip to step 6. MPI compilers are “mpif90” and “mpicc”, normal
(serial) compilers are “ifort” and “icc”.
• Compile your code: “make all” (Overrides section 9)
• To run an MPI code: “mpirun -np 8 <exe>” …or…
• “mpirun -np 16 -machinefile <ask me> <exe>”
• Skip section 12
• There is no evaluation form.
32. Where were those again?
• https://computing.llnl.gov/tutorials/openMP/excercise.html
• https://computing.llnl.gov/tutorials/mpi/exercise.html
33. Acknowledgements
• This material is based upon work supported by the National Science
Foundation under Grant Numbers 0116050 and 0521433. Any opinions,
findings and conclusions or recommendations expressed in this material are
those of the author and do not necessarily reflect the views of the National
Science Foundation (NSF).
• This work was supported in part by the Indiana Metabolomics and Cytomics
Initiative (METACyt). METACyt is supported in part by Lilly Endowment, Inc.
• This work was supported in part by the Indiana Genomics Initiative. The Indiana
Genomics Initiative of Indiana University is supported in part by Lilly
Endowment, Inc.
• This work was supported in part by Shared University Research grants from
IBM, Inc. to Indiana University.