Presentation by Jonathan Cohen & Mark Berger at Bioinformatics conference July 2013. It covers
- GPU Programming in 10 slides
- GPUs in Bioinformatics
- Porting SeqAn to CUDA
- Resources for developers and bioinformatics professionals
4. CUDA – Programming for Throughput
CPU threads:
Large amount of memory per thread
Full-featured instruction set
1-16 execute simultaneously
CUDA threads:
Lightweight footprint
Full-featured instruction set
10,000 execute simultaneously
CPU (Host) executes functions: run few threads, each one very fast
GPU (Device) executes kernels: run many threads, each one slow => total throughput high
5. CUDA Kernels: Parallel Threads
A kernel is an array of threads, executed in parallel
All threads execute the same code
Each thread has an ID, used to
- Select input/output data
- Make control decisions
float x = input[threadID];
float y = func(x);
output[threadID] = y;
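The three-line fragment above can be fleshed out into a complete kernel. This is a minimal sketch, not the presenters' code: the body of `func`, the helper name `run_apply_func`, the block size of 128, and the use of managed memory are all assumptions made for illustration, and error checking is left out. The thread ID is built from the block and thread indices covered on the following slides.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical stand-in for the slide's func(); any per-element function works.
__device__ float func(float x) { return 2.0f * x + 1.0f; }

// All threads execute this same code; the ID selects each thread's data.
__global__ void apply_func(const float* input, float* output, int n)
{
    int threadID = blockIdx.x * blockDim.x + threadIdx.x;
    if (threadID < n) {              // control decision: surplus threads do nothing
        float x = input[threadID];
        float y = func(x);
        output[threadID] = y;
    }
}

// Host-side helper (assumed name): runs the kernel over a std::vector.
std::vector<float> run_apply_func(const std::vector<float>& in)
{
    int n = static_cast<int>(in.size());
    std::vector<float> out(n);
    if (n == 0) return out;

    float *input = nullptr, *output = nullptr;
    cudaMallocManaged(&input, n * sizeof(float));   // memory visible to CPU and GPU
    cudaMallocManaged(&output, n * sizeof(float));
    for (int i = 0; i < n; ++i) input[i] = in[i];

    int threads = 128;                              // threads per block (assumed)
    int blocks = (n + threads - 1) / threads;       // enough blocks to cover n
    apply_func<<<blocks, threads>>>(input, output, n);
    cudaDeviceSynchronize();                        // wait before reading results

    for (int i = 0; i < n; ++i) out[i] = output[i];
    cudaFree(input);
    cudaFree(output);
    return out;
}
```

Calling `run_apply_func({0.0f, 1.0f, 2.0f})` with this stand-in `func` applies 2x + 1 to each element, one thread per element.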
8. CUDA Kernels: Subdivide into Blocks
Threads are grouped into blocks
Blocks are grouped into a grid
A kernel is executed as a grid of blocks of threads
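To make the grid/block/thread hierarchy concrete, the sketch below has every thread record which block it belongs to and its index inside that block. The names `record_ids`, `GridLayout`, and `run_record_ids` are hypothetical, and the launch is one-dimensional for simplicity; error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Each thread records its block index and its thread index within the block.
__global__ void record_ids(int* block_of, int* thread_of)
{
    int global = blockIdx.x * blockDim.x + threadIdx.x;  // position within the grid
    block_of[global]  = blockIdx.x;
    thread_of[global] = threadIdx.x;
}

// Assumed helper type holding the recorded layout.
struct GridLayout {
    std::vector<int> block_of;
    std::vector<int> thread_of;
};

// Launches a grid of num_blocks blocks, each with threads_per_block threads.
GridLayout run_record_ids(int num_blocks, int threads_per_block)
{
    int n = num_blocks * threads_per_block;
    int *block_of = nullptr, *thread_of = nullptr;
    cudaMallocManaged(&block_of, n * sizeof(int));
    cudaMallocManaged(&thread_of, n * sizeof(int));

    record_ids<<<num_blocks, threads_per_block>>>(block_of, thread_of);
    cudaDeviceSynchronize();                 // wait for all blocks to finish

    GridLayout layout;
    layout.block_of.assign(block_of, block_of + n);
    layout.thread_of.assign(thread_of, thread_of + n);
    cudaFree(block_of);
    cudaFree(thread_of);
    return layout;
}
```

With `run_record_ids(4, 8)`, global position 9 is filled in by thread 1 of block 1, and position 31 by thread 7 of block 3.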
11. Accelerated Computing
Multi-core plus Many-cores
CPU: Optimized for Serial Tasks
GPU Accelerator: Optimized for Many Parallel Tasks
3-10x+ Compute Throughput
7x Memory Bandwidth
5x Energy Efficiency
12. How GPU Acceleration Works
Application Code = GPU part + CPU part
GPU: Compute-Intensive Functions (5% of Code)
CPU: Rest of Sequential CPU Code
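The split described above can be sketched with SAXPY (y = a*x + y) standing in for the compute-intensive function. The choice of SAXPY, the names `saxpy_kernel` and `saxpy_offloaded`, and the block size are assumptions for illustration. The surrounding host function stays ordinary sequential CPU code, and only the hot loop moves to the GPU. Error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Compute-intensive part: one GPU thread per element computes y = a*x + y.
__global__ void saxpy_kernel(int n, float a, const float* x, float* y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

// Sequential CPU code that hands only the hot loop to the GPU.
// Assumes x and y have the same length.
std::vector<float> saxpy_offloaded(float a, const std::vector<float>& x,
                                   std::vector<float> y)
{
    int n = static_cast<int>(x.size());
    if (n == 0) return y;
    size_t bytes = n * sizeof(float);

    float *d_x = nullptr, *d_y = nullptr;
    cudaMalloc(&d_x, bytes);
    cudaMalloc(&d_y, bytes);
    cudaMemcpy(d_x, x.data(), bytes, cudaMemcpyHostToDevice);  // CPU -> GPU
    cudaMemcpy(d_y, y.data(), bytes, cudaMemcpyHostToDevice);

    int threads = 256;                                         // assumed block size
    int blocks = (n + threads - 1) / threads;
    saxpy_kernel<<<blocks, threads>>>(n, a, d_x, d_y);

    // Blocking copy on the default stream: waits for the kernel, then GPU -> CPU.
    cudaMemcpy(y.data(), d_y, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_x);
    cudaFree(d_y);
    return y;                                                  // back in CPU code
}
```

The two host-to-device copies and the copy back are part of the cost of offloading, which is why the approach pays off mainly for functions that do a lot of arithmetic per byte moved.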
13. Hello World in CUDA
#include <stdio.h>

__global__
void parallel_hello_world()
{
    printf("Hello, world. This is thread %d, block %d!\n",
           threadIdx.x, blockIdx.x);
}
int main()
{
    parallel_hello_world<<<128,128>>>();
    cudaDeviceSynchronize();  // wait for the kernel so its printf output is flushed
    return 0;
}
> nvcc -o hello_world -arch=sm_30 main.cu
> ./hello_world
Hello, world. This is thread 0, block 0!
Hello, world. This is thread 1, block 0!
...
15. Life Technologies
Ion Proton
3 GPUs per Device
S3229 - GPU Accelerated Signal Processing in Ion Proton
Whole Genome Sequencer
Mohit Gupta (Life Technologies)
Jakob Siegel (Life Technologies)
https://registration.gputechconf.com/form/session-listing
16. BGI & NVIDIA
Joint Innovation Lab
SOAP3 Aligner
S3257 - Tackling Big Data in Genomics with GPU
BingQiang Wang (Beijing Genomics Institute)
https://registration.gputechconf.com/form/session-listing
17. CUDASW++
From Bertil Schmidt’s group: http://cudasw.sourceforge.net/homepage.htm
Y. Liu, A. Wirawan, B. Schmidt: "CUDASW++ 3.0: accelerating Smith-Waterman protein database search by coupling CPU and GPU SIMD instructions". BMC Bioinformatics, 2013, 14:117.
Performance comparisons on the Swiss-Prot database.
“On GTX680 (GTX690), CUDASW++ 3.0 yields an average performance of 109.4 (169.7) GCUPS, with a maximum of 119.0 (185.6) GCUPS.”
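GCUPS stands for giga (10^9) cell updates per second: the number of dynamic-programming matrix cells filled per second during a database search. A minimal sketch of how the figure is usually computed follows; the function name and the example numbers are hypothetical and are not taken from the paper.

```cuda
// GCUPS = (query length x total database residues) / (seconds x 10^9).
// Plain host code; it also compiles as part of a .cu file.
double gcups(long long query_len, long long db_residues, double seconds)
{
    // Aligning the query against every database residue fills this many cells.
    double cells = static_cast<double>(query_len) * static_cast<double>(db_residues);
    return cells / (seconds * 1e9);
}
```

For instance, a 1,000-residue query searched against a hypothetical database of 100 million residues in 1 second would correspond to 100 GCUPS.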
18. NVIDIA GPU Life Science Focus
Molecular Dynamics: All codes are available
AMBER, CHARMM, DESMOND, DL_POLY,
GROMACS, LAMMPS, NAMD
Great multi-GPU performance
GPU codes: Abalone, ACEMD, HOOMD-Blue
Focus: scaling to large numbers of GPUs
Quantum Chemistry: key codes ported or optimizing
Active GPU acceleration projects:
VASP, NWChem, Gaussian, GAMESS, ABINIT,
Quantum Espresso, BigDFT, CP2K, GPAW, etc.
GPU code: TeraChem
Analytical and Medical Imaging Instruments
19. NVBIO
A GPU-based C++ framework for High-Throughput Sequence Analysis
Short & Long Read Alignment
Variant Calling
Compression
…
Overall Design:
flexibility & customizability – a templated library
parallelism at every level
optimize throughput, server-like design
optimize the whole pipeline, not just a single component
(e.g. including data transfers, SAM, BAM, CRAM I/O, …)
20. A modular library
Tries: FM-index, Suffix Trie, Radix Tree, Sorted Dictionary
DP Alignment: Edit Distance, Smith-Waterman, Needleman-Wunsch, Gotoh, Banded/Full DP
Text Search: Exact Search, Backtracking
Sequence I/O: FASTQ, FASTA
Alignment I/O: SAM, BAM, CRAM
Support Tools: HTML report generators
GPU: O(1k-10k) threads
CPU: O(10-100) threads
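To illustrate how a DP alignment primitive maps onto thousands of GPU threads, here is a minimal sketch that assigns one thread per read and computes a plain Levenshtein edit distance against a shared reference. It does not use NVBIO's API: every name here (`MAX_LEN`, `edit_distance`, `edit_distance_kernel`, `batch_edit_distance`) is hypothetical, sequences are assumed to be at most `MAX_LEN` characters, and error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <string>
#include <vector>

constexpr int MAX_LEN = 64;  // assumed cap on sequence length for this sketch

// Levenshtein edit distance with one rolling DP row (usable on host and device).
__host__ __device__ int edit_distance(const char* a, int la, const char* b, int lb)
{
    int row[MAX_LEN + 1];
    for (int j = 0; j <= lb; ++j) row[j] = j;
    for (int i = 1; i <= la; ++i) {
        int diag = row[0];                       // D[i-1][0]
        row[0] = i;
        for (int j = 1; j <= lb; ++j) {
            int up  = row[j];                    // D[i-1][j]
            int sub = diag + (a[i - 1] != b[j - 1]);
            int del = up + 1;
            int ins = row[j - 1] + 1;            // D[i][j-1] + 1
            int best = sub < del ? sub : del;
            row[j] = best < ins ? best : ins;
            diag = up;
        }
    }
    return row[lb];
}

// One thread per read: thread i aligns read i against the shared reference.
__global__ void edit_distance_kernel(const char* reads, const int* read_len,
                                     const char* ref, int ref_len, int* dist, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        dist[i] = edit_distance(reads + i * MAX_LEN, read_len[i], ref, ref_len);
}

// Assumed host helper: packs reads into fixed-size slots and launches the kernel.
std::vector<int> batch_edit_distance(const std::vector<std::string>& reads,
                                     const std::string& ref)
{
    int n = static_cast<int>(reads.size());
    std::vector<int> out(n);
    if (n == 0) return out;
    int ref_len = static_cast<int>(ref.size());          // must be <= MAX_LEN

    char *d_reads = nullptr, *d_ref = nullptr;
    int *d_len = nullptr, *d_dist = nullptr;
    cudaMallocManaged(&d_reads, n * MAX_LEN);
    cudaMallocManaged(&d_ref, MAX_LEN);
    cudaMallocManaged(&d_len, n * sizeof(int));
    cudaMallocManaged(&d_dist, n * sizeof(int));
    for (int i = 0; i < n; ++i) {
        d_len[i] = static_cast<int>(reads[i].size());    // must be <= MAX_LEN
        for (int j = 0; j < d_len[i]; ++j) d_reads[i * MAX_LEN + j] = reads[i][j];
    }
    for (int j = 0; j < ref_len; ++j) d_ref[j] = ref[j];

    int threads = 128;
    int blocks = (n + threads - 1) / threads;
    edit_distance_kernel<<<blocks, threads>>>(d_reads, d_len, d_ref, ref_len, d_dist, n);
    cudaDeviceSynchronize();

    for (int i = 0; i < n; ++i) out[i] = d_dist[i];
    cudaFree(d_reads); cudaFree(d_ref); cudaFree(d_len); cudaFree(d_dist);
    return out;
}
```

Because `edit_distance` is marked `__host__ __device__`, the same function can run in the O(10-100) CPU threads or the O(1k-10k) GPU threads. A production library would likely use smarter data layouts and banded DP rather than this one-thread-per-pair layout.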
21. nvBowtie2 - Real Datasets
Ion Proton, 100M x 175bp (8-350), end-to-end: speedup 4.3x, alignment rate +0.5%, disagreement 0.002%
Illumina Genome Analyzer II (ERR161544), 10M x 100bp x 2, end-to-end: speedup 2.4x, alignment rate =, disagreement 0.006%
Ion Proton, 100M x 175bp (8-350), local: speedup 7.6x, alignment rate -0.6%, disagreement 0.03%
Illumina Genome Analyzer II (ERR161544), 10M x 100bp x 2, local: speedup 2.6x, alignment rate =, disagreement 0.022%
25. 3 Ways to Accelerate Applications
Applications
Libraries: “Drop-in” Acceleration
OpenACC Directives: Easily Accelerate Applications
Programming Languages: Maximum Flexibility
26. GPU Accelerated Libraries
“Drop-in” Acceleration for your Applications
Categories:
- Linear Algebra: FFT, BLAS, SPARSE, Matrix
- Numerical & Math: RAND, Statistics
- Data Struct. & AI: Sort, Scan, Zero Sum
- Visual Processing: Image & Video
Libraries:
- NVIDIA cuFFT, cuBLAS, cuSPARSE
- NVIDIA Math Lib
- NVIDIA cuRAND
- NVIDIA NPP
- NVIDIA Video Encode
- GPU AI – Board Games
- GPU AI – Path Finding
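As an illustration of the "drop-in" idea for the Sort and Scan category, the sketch below uses Thrust, the C++ template library that ships with the CUDA Toolkit. Picking Thrust for this slide is an assumption, and the function name `sort_then_prefix_sum` is hypothetical. No kernel is written by hand; the library calls run on the GPU.

```cuda
#include <thrust/copy.h>
#include <thrust/device_vector.h>
#include <thrust/scan.h>
#include <thrust/sort.h>
#include <vector>

// Sorts the values on the GPU, then replaces them with their running totals.
std::vector<int> sort_then_prefix_sum(const std::vector<int>& values)
{
    // Constructing a device_vector from host iterators copies the data to the GPU.
    thrust::device_vector<int> d(values.begin(), values.end());

    thrust::sort(d.begin(), d.end());                       // parallel sort
    thrust::inclusive_scan(d.begin(), d.end(), d.begin());  // in-place prefix sum

    std::vector<int> out(values.size());
    thrust::copy(d.begin(), d.end(), out.begin());          // GPU -> CPU
    return out;
}
```

For the input {3, 1, 2}, the sort yields {1, 2, 3} and the inclusive scan turns that into {1, 3, 6}.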
27. OpenACC: Open, Simple, Portable
• Open Standard
• Easy, Compiler-Driven Approach
• Portable on GPUs and Xeon Phi
main() {
…
<serial code>
…
#pragma acc kernels
{
<compute intensive code>
}
…
}
Compiler hint: #pragma acc kernels
CAM-SE Climate: 6x Faster on GPU, 2x Faster on CPU only
Top Kernel: 50% of Runtime
28. GPU Programming Languages
Fortran: OpenACC, CUDA Fortran
C: OpenACC, CUDA C
C++: Thrust, CUDA C++
Python: PyCUDA, Anaconda Accelerate
C#: GPU.NET
Numerical analytics: R, MATLAB, Mathematica, LabVIEW
29. Reaching New Developers - CUDA Python
Python Productivity + GPU Performance
Easy to Learn
Powerful Libraries
Popular with New Developers
HPC & Data Analytics
Data from CodeEval.com, based on 100k+ code samples
30. Easiest Way to Learn CUDA
50K Registered
127 Countries
Learn from the Best
Anywhere, Any Time
It’s Free!
Engage with an Active Community