The era of genomic evaluation has brought the need to
perform computations involving large, dense matrices. Particular tasks are the computation and inversion of the genomic relationship matrix. This paper investigates the suitability of Graphics Processing Units together with highly optimised software libraries for these computations, using blocked algorithms. It is shown that calculations
are readily sped up by parallel processing, using freely available library routines, and that reductions in time by factors of 4 to 5 are achievable even for `consumer' grade graphics cards.
1. Utility of Graphics Processing Units
for dense matrix calculations in
computing and inverting genomic
relationship matrices
Karin Meyer and Bruce Tier
Animal Genetics and Breeding Unit,
University of New England,
Armidale, Australia
AAABG 2013
2. GPUs for GRMs | Introduction
Computing in the genomics age
Requirements for mixed model analyses have changed!
Pre-genomics:
mixed model equations sparse (few non-zero elements)
– inverse of pedigree based relationship matrix (NRM)
concentrated on exploiting sparsity in computations
Now: Genomic relationship matrix (GRM)
dense (large proportion of elements non-zero)
require manipulation of large, dense matrices to
– compute the GRM
– invert the GRM
– predict breeding values in single-step approach
computationally demanding
exploit new hardware and software libraries
– computing in parallel
K. M. | 2 / 16
3. GPUs for GRMs | Introduction
Objectives
A first look to examine scope for
speeding up computations to calculate & invert GRM
using parallel computations
exploiting a Graphics Processing Unit (GPU)
“An excursion into the land of computer gaming,
modern compilers and software libraries”
K. M. | 3 / 16
4. GPUs for GRMs | Introduction
What is a GPU?
Graphical Processing Unit
initially: co-processor for computer gaming
– graphics card
– fast recalculation of pixel values
– remember: i387 math processor for i386
now: GP-GPU adapted to General Purpose computing
– hundreds to thousands of cores
– capable of double precision calculations
– turns desktop PC into personal super-computer
– but: limited memory (currently 6GB max.)
specialised interface for application programming
– CUDA (proprietary of NVIDIA) −→ Fortran interface
– OPENCL (C++ like)
K. M. | 4 / 16
5. GPUs for GRMs | Introduction
Tesla K20X
2688 cores
6 GB RAM
Peak: 3.95/1.31 Tflops (single/double precision)
K. M. | 5 / 16
6. GPUs for GRMs | Calculating the GRM
Memory needed: G = ZZ
Gigabytes – single precision matrices
n Gn×n Zn×s
s = 100K 500K 1000K
5 000 0.1 1.9 9.3 18.6
10 000 0.4 3.7 18.6 37.3
20 000 1.5 7.5 37.3 74.5
30 000 3.4 11.2 55.9 111.8
50 000 9.3 18.6 93.1 186.3
−→ break calculations into blocks
−→ use out-of-core storage
→ parallel computing literature
K. M. | 6 / 16
7. GPUs for GRMs | Calculating the GRM
Blockwise multiplication
All-in-one
=Z
Z
G
G = ZZ
Column-wise division
= + · · · +Z.1 · · · · · · Z.4
Z.1
· · ·
· · ·
Z.4
Z.1Z.1 Z.4Z.4
G =
k
Z.kZ.k
K. M. | 7 / 16
Z: n animals
× s SNPs
G: symmetric
sn(n + 1)/2
flops
8. GPUs for GRMs | Calculating the GRM
Blockwise multiplication - cont’
Row-wise division
=Zi.
Zj. Gij = Zi.Zj.
Row- and columnwise division
=
Gij =
k
ZikZjk
K. M. | 8 / 16
9. GPUs for GRMs | Inverting the GRM
Blockwise inversion
G =
GPP GPC GPT
GCP GCC GCT
GTP GTC GTT
1. Subdivide matrix into blocks
C current −→ choosen size
P previous
T trailing
2. Factor & invert chol(GCC)
3. Adjust GPP & GPC
4. ‘Absorb’ into GTT & GCT
5. Calculate inverse
6. Redefine C, P & T; repeat
Gauss-Jordan Elimination
GCC := chol(GCC)
GCC := G−1
CC
GPC := GPCGCC
GPP := GPP + GPCGPC
GCT := GCCGCT
GTT := GTT − GCTGCT
GPT := GPT − GPCGCT
GCT := −(GCCGCT)
GPC := GPCGCC
GCC := GCCGCC
−→ break matrix products into blocks as needed
K. M. | 9 / 16
10. GPUs for GRMs | Computing environment
Hardware
‘Standard’ multi-core CPU
– Intel I7-3930K
– 3.2 GHz, 12M cache
– 64GB RAM
– 6 cores
GPU
– Nvidia GTX 580
– 512 cores
– 3GB RAM
– capable of double precision calculations
→ high end gaming card, not GP-GPU
K. M. | 10 / 16
11. GPUs for GRMs | Computing environment
Software
Linux (CentOS 6)
gfortran, gcc 4.4.7; Intel composer XE 13.0.1
Intel MKL library 11.0
– BLAS: SSYRK, SGEMM, STRTRI, STRMM
– LAPACK: SPOTRF, SPOTRI
CUDA 5.0
– CUBLAS: CUBLAS_SSYRK, CUBLAS_SGEMM,
CUBLAS_STRTRI, CUBLAS_STRMM
CULA dense R16a
– CULA_SPOTRF, CULA_SPOTRI, CULA_STRTRI
MAGMA 1.3
– MAGMAF_SLAUUM
K. M. | 11 / 16