More at http://sites.google.com/site/cudaiap2009 and http://pinto.scripts.mit.edu/Classes/CUDAIAP2009
Note that some slides were borrowed from Matthew Bolitho (Johns Hopkins) and NVIDIA.
1. 6.963 CUDA@MIT IAP09
Supercomputing on your desktop:
Programming the next generation of cheap
and massively parallel hardware using CUDA
Lecture 03
Nicolas Pinto (MIT)
CUDA - Basics #2
Tuesday, January 13, 2009
2. During this course, we’ll try to use existing material “adapted for 6.963” ;-)
4. 6.963 CUDA@MIT IAP09
Language
Compilation
API
Threading Model
Memory Model
5. 6.963 CUDA@MIT IAP09
CUDA Language
6. Language
• CUDA defines a language that is similar to C/C++
• Allows programmers to easily move existing code to CUDA
– Lessens learning curve
Spring 2008, Johns Hopkins University. Copyright © Matthew Bolitho 2008
7. Language
• CUDA defines a language that is similar to C/C++
• How is CUDA C/C++ different from standard C/C++?
8. Language
• CUDA defines a language that is similar to C/C++
• Syntactic extensions:
– Declaration Qualifiers
– Built-in Variables
– Built-in Types
– Execution Configuration
9. Language
• Declspec = declaration specifier / declaration qualifier
• A modifier applied to declarations of:
– Variables
– Functions
• Examples: const, extern, static
10. Language
• CUDA uses the following declaration qualifiers for variables:
– __device__
– __shared__
– __constant__
• Only apply to global variables
11. Language
__device__
• Declares that a global variable is stored on the device
• The data resides in global memory
• Has lifetime of the entire application
• Accessible to all GPU threads
• Accessible to the CPU via API
12. Language
__shared__
• Declares that a global variable is stored on the device
• The data resides in shared memory
• Has lifetime of the thread block
• Accessible to all threads, one copy per thread block
13. Language
__shared__
• If not declared as volatile, reads from different threads are not visible unless a synchronization barrier is used
• Not accessible from CPU
14. Language
__constant__
• Declares that a global variable is stored on the device
• The data resides in constant memory
• Has lifetime of the entire application
• Accessible to all GPU threads (read only)
• Accessible to the CPU via API (read-write)
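The three variable qualifiers can be seen side by side in a small program. This is only a sketch under assumptions: the names d_scale, c_offset, reverse_tiles and run_qualifier_demo are made up for illustration, error checking is kept to a minimum, and it uses the runtime API call cudaMemcpyToSymbol as the "via API" route from the CPU.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__device__   float d_scale = 1.0f;  // global memory, lives for the whole application
__constant__ float c_offset;        // constant memory, read-only inside kernels

// Each block stages its 64 inputs in shared memory, then writes them out reversed.
__global__ void reverse_tiles(const float *in, float *out)
{
    __shared__ float tile[64];                      // one copy per thread block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = in[i] * d_scale + c_offset;
    __syncthreads();                                // barrier: tile is fully written
    out[i] = tile[blockDim.x - 1 - threadIdx.x];
}

// Hypothetical helper: returns 0 when the device result matches the host reference.
int run_qualifier_demo()
{
    const int n = 128, block = 64;
    float h_in[n], h_out[n];
    for (int i = 0; i < n; ++i) h_in[i] = (float)i;

    // The host reaches __device__ and __constant__ variables through the API.
    float scale = 2.0f, offset = 1.0f;
    cudaMemcpyToSymbol(d_scale, &scale, sizeof(float));
    cudaMemcpyToSymbol(c_offset, &offset, sizeof(float));

    float *d_in = NULL, *d_out = NULL;
    cudaMalloc((void **)&d_in, n * sizeof(float));
    cudaMalloc((void **)&d_out, n * sizeof(float));
    cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);
    reverse_tiles<<<n / block, block>>>(d_in, d_out);
    cudaError_t err = cudaMemcpy(h_out, d_out, n * sizeof(float),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    if (err != cudaSuccess) return 1;

    for (int b = 0; b < n / block; ++b)
        for (int t = 0; t < block; ++t)
            if (h_out[b * block + t] !=
                h_in[b * block + block - 1 - t] * scale + offset)
                return 1;
    return 0;
}

int main()
{
    int status = run_qualifier_demo();
    printf("%s\n", status == 0 ? "qualifier demo OK" : "qualifier demo FAILED");
    return status;
}
```

The __syncthreads() call is the synchronization barrier that makes one thread's shared-memory write visible to the other threads of its block.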
15. Language
• CUDA uses the following declspecs for functions:
– __device__
– __host__
– __global__
16. Language
__device__
• Declares that a function is compiled to, and executes on the device
• Callable only from another function on the device
17. Language
__host__
• Declares that a function is compiled to and executes on the host
• Callable only from another function on the host
• Functions without any CUDA declspec are host by default
• Can use __host__ and __device__ together
18. Language
__global__
• Declares that a function is compiled to and executes on the device
• Callable from the host
• Used as the entry point from host to device
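As a rough illustration of the three function declspecs, the sketch below defines one function of each kind. The names square_dev, blend, blend_squares and run_function_demo are hypothetical, and the host side uses the runtime API with almost no error handling.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Device only: callable from other device code.
__device__ float square_dev(float x) { return x * x; }

// Compiled for both host and device.
__host__ __device__ float blend(float a, float b, float t)
{
    return a + t * (b - a);
}

// Kernel: the entry point from host to device.
__global__ void blend_squares(float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = blend(0.0f, square_dev((float)i), 0.5f);
}

// Hypothetical helper: returns 0 when device and host agree.
int run_function_demo()
{
    const int n = 8;
    float h_out[n];
    float *d_out = NULL;
    cudaMalloc((void **)&d_out, n * sizeof(float));
    blend_squares<<<1, n>>>(d_out, n);
    cudaError_t err = cudaMemcpy(h_out, d_out, n * sizeof(float),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    if (err != cudaSuccess) return 1;

    // blend() is also usable on the host, so it doubles as the reference.
    for (int i = 0; i < n; ++i)
        if (h_out[i] != blend(0.0f, (float)(i * i), 0.5f)) return 1;
    return 0;
}

int main()
{
    int status = run_function_demo();
    printf("%s\n", status == 0 ? "function demo OK" : "function demo FAILED");
    return status;
}
```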
19. Language
• CUDA provides a set of built-in vector types:
– char1, uchar1, char2, uchar2, char3, uchar3, char4, uchar4,
– short1, ushort1, short2, ushort2, short3, ushort3, short4, ushort4,
– int1, uint1, int2, uint2, int3, uint3, int4, uint4,
– long1, ulong1, long2, ulong2, long3, ulong3, long4, ulong4,
– float1, float2, float3, float4
20. Language
• Can construct a vector type with special function: make_{typename}(...)
• Can access elements of a vector type with .x, .y, .z, .w
21. Language
• dim3 is a special vector type
– Same as uint3, except can be constructed from a scalar to form a vector: (scalar, 1, 1)
22. Language
• CUDA provides four global, built-in variables
– threadIdx, blockIdx, blockDim, gridDim
– Typed as dim3 or uint3
• Accessible only from device code
• Cannot take address
• Cannot assign value
23. Language
• CUDA provides syntactic sugar to launch the execution of kernels
– Func<<<GridDim, BlockDim>>>(Arguments, ...);
– Func is a __global__ function
• The compiler turns this type of statement into a block of code that configures, and launches the kernel
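The built-in vector types, dim3, the built-in index variables and the <<<...>>> launch syntax can be combined in one short sketch. The kernel name fill_coords, the helper run_launch_demo and the image size are assumptions chosen for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each thread writes its own (x, y) coordinate as a float2.
__global__ void fill_coords(float2 *out, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height)
        out[y * width + x] = make_float2((float)x, (float)y);
}

// Hypothetical helper: returns 0 when every element holds its own coordinate.
int run_launch_demo()
{
    const int width = 40, height = 24;
    static float2 h_out[width * height];
    float2 *d_out = NULL;
    cudaMalloc((void **)&d_out, sizeof(h_out));

    dim3 block(16, 8);                                   // z defaults to 1
    dim3 grid((width + block.x - 1) / block.x,           // round up so the grid
              (height + block.y - 1) / block.y);         // covers the whole image
    fill_coords<<<grid, block>>>(d_out, width, height);  // execution configuration

    cudaError_t err = cudaMemcpy(h_out, d_out, sizeof(h_out),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    if (err != cudaSuccess) return 1;

    for (int y = 0; y < height; ++y)
        for (int x = 0; x < width; ++x)
            if (h_out[y * width + x].x != (float)x ||
                h_out[y * width + x].y != (float)y)
                return 1;
    return 0;
}

int main()
{
    int status = run_launch_demo();
    printf("%s\n", status == 0 ? "launch demo OK" : "launch demo FAILED");
    return status;
}
```

The bounds check inside the kernel matters because the rounded-up grid can contain threads that fall outside the image.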
24. Language
• CUDA defines a language that is similar to C/C++
• Important differences:
– Runtime Library
– Functions
– Classes, Structs, Unions
25. Language
• Code that runs on the device can't use normal C/C++ Runtime Library functions
– No printf, read, malloc, etc
• Most math functions have device equivalent
26. Language
• On a CUDA device, there is no stack
– By default, all function calls are inlined
– Can use __noinline__ to prevent (CUDA 1.1)
– All local variables, function arguments are stored in registers
• No function recursion
• No function pointers
27. Language
• CUDA supports some C++ features for device code. E.g.:
– Template functions
• Classes are supported inside .cu source, but must be host only
• Structs/Unions work on device code as per C
28. Language
Common Runtime Component: Mathematical Functions
• pow, sqrt, cbrt, hypot
• exp, exp2, expm1
• log, log2, log10, log1p
• sin, cos, tan, asin, acos, atan, atan2
• sinh, cosh, tanh, asinh, acosh, atanh
• ceil, floor, trunc, round
• Etc.
– When executed on the host, a given function uses
the C runtime implementation if available
– These functions are only supported for scalar types,
not vector types
© David Kirk/NVIDIA and Wen-mei W. Hwu, Urbana, Illinois, August 2007
29. Language
Device Runtime Component: Mathematical Functions
• Some mathematical functions (e.g. sin(x)) have a less accurate, but faster device-only version (e.g. __sinf(x))
– __powf
– __logf, __log2f, __log10f
– __expf
– __sinf, __cosf, __tanf
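A quick way to get a feel for the accuracy trade-off is to evaluate both versions on the same inputs and compare. The sketch below assumes the single-precision names sinf and __sinf; sin_both and max_sin_gap are hypothetical names, and the input range [0, pi] is an arbitrary choice.

```cuda
#include <cmath>
#include <cstdio>
#include <cuda_runtime.h>

// Evaluates the regular and the fast device-only sine on the same inputs.
__global__ void sin_both(const float *x, float *accurate, float *fast, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        accurate[i] = sinf(x[i]);    // regular math function
        fast[i]     = __sinf(x[i]);  // faster, less accurate intrinsic
    }
}

// Hypothetical helper: largest absolute difference on [0, pi], or -1 on error.
double max_sin_gap()
{
    const int n = 256;
    float h_x[n], h_acc[n], h_fast[n];
    for (int i = 0; i < n; ++i) h_x[i] = 3.14159265f * i / (n - 1);

    float *d_x = NULL, *d_acc = NULL, *d_fast = NULL;
    cudaMalloc((void **)&d_x, sizeof(h_x));
    cudaMalloc((void **)&d_acc, sizeof(h_acc));
    cudaMalloc((void **)&d_fast, sizeof(h_fast));
    cudaMemcpy(d_x, h_x, sizeof(h_x), cudaMemcpyHostToDevice);
    sin_both<<<1, n>>>(d_x, d_acc, d_fast, n);
    cudaError_t e1 = cudaMemcpy(h_acc, d_acc, sizeof(h_acc), cudaMemcpyDeviceToHost);
    cudaError_t e2 = cudaMemcpy(h_fast, d_fast, sizeof(h_fast), cudaMemcpyDeviceToHost);
    cudaFree(d_x);
    cudaFree(d_acc);
    cudaFree(d_fast);
    if (e1 != cudaSuccess || e2 != cudaSuccess) return -1.0;

    double gap = 0.0;
    for (int i = 0; i < n; ++i) {
        double d = fabs((double)h_acc[i] - (double)h_fast[i]);
        if (d > gap) gap = d;
    }
    return gap;
}

int main()
{
    printf("max |sinf - __sinf| on [0, pi]: %g\n", max_sin_gap());
    return 0;
}
```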
30. 6.963 CUDA@MIT IAP09
CUDA Compilation
31. Compilation
• CUDA source files end in .cu
– Contain a mix of device and host code/data
• Compiled by nvcc
– nvcc is really a wrapper around a more complex compilation process
32. Compilation
• Input:
– Normal .c, .cpp source files
– CUDA .cu source code files
• Output:
– Object/executable code for host
– .cubin executable code for the device
33. Compilation
• For .c and .cpp files, nvcc invokes the native C/C++ compiler for the system (eg: gcc/cl)
• For .cu files, it is a little more complicated
34. Compilation
[Figure: nvcc compilation flow for a .cu file. Tools: cudafe, cpp, nvopencc, ptxas, linker. Intermediate files: .c, .stub.c, .gpu.c, .ptx, .cubin, .o]
35. Compilation
• To see the steps performed by nvcc, use the --dryrun and --keep command line options
36. Compilation
• How is a .cubin linked with the rest of the program?
• Can be:
– Loaded as a file at runtime
– Embedded in data segment
– Embedded as a resource
37. Compilation
• Programming is fun, until...
– The program crashes
– It produces the wrong result
• But, there are many debugging techniques
– Debugging software (eg: gdb, Visual Studio)
– printf
39. Emu
• By using a compiler flag, you can emulate by running all code on the host
– Compiler Flag: --device-emulation
• Good for most debugging: can use gdb/printf
• Not a true emulation:
– Race Conditions, Memory model differences, etc
40. Emu
Device Emulation Mode Pitfalls
• Emulated device threads execute sequentially,
so simultaneous accesses of the same memory
location by multiple threads could produce
different results.
• Dereferencing device pointers on the host or host
pointers on the device can produce correct
results in device emulation mode, but will
generate an error in device execution mode
41. Emu
Floating Point
• Results of floating-point computations will slightly
differ because of:
– Different compiler outputs, instruction sets
– Use of extended precision for intermediate results
• There are various options to force strict single precision on
the host
42. Toolkit
CUDA Toolkit
[Figure: CUDA Toolkit stack. Application Software (Industry Standard C Language); Libraries: cuFFT, cuBLAS, cuDPP; CUDA Compiler; CUDA Tools; GPU: card, system + Multicore CPU, 4 cores]
M02: High Performance Computing with CUDA
43. Toolkit
CUDA Many-core + Multi-core support
[Figure: a C CUDA Application is compiled either by NVCC to Many-core PTX code, which the PTX to Target Compiler turns into many-core code, or by NVCC --multicore to Multi-core CPU C code, which gcc and MSVC turn into multi-core code]
44. Toolkit
CUDA Compiler: nvcc
Any source file containing CUDA language extensions (.cu)
must be compiled with nvcc
NVCC is a compiler driver
Works by invoking all the necessary tools and compilers like
cudacc, g++, cl, ...
NVCC can output:
Either C code (CPU Code)
That must then be compiled with the rest of the application using another tool
Or PTX or object code directly
An executable with CUDA code requires:
The CUDA core library (cuda)
The CUDA runtime library (cudart)
45. Toolkit
CUDA Compiler: nvcc
Important flags:
-arch sm_13            Enable double precision (on compatible hardware)
-G                     Enable debug for device code
--ptxas-options=-v     Show register and memory usage
--maxrregcount <N>     Limit the number of registers
-use_fast_math         Use fast math library
46. Toolkit
Compiling CUDA for Multi-Core
Using the “--multicore” compile switch with the NVCC compiler generates C code for multi-core CPU
Performance scales linearly with more cores
Control number of cores with environment variable CUDA_NROF_CORES=n
[Figure: C/C++ CUDA Application -> NVCC --multicore -> Multicore CPU C Code -> gcc / MSVC -> Multicore Optimized Application]
47. Toolkit
GPU Tools
Profiler
Available now for all supported OSs
Command-line or GUI
Sampling signals on GPU for:
Memory access parameters
Execution (serialization, divergence)
Debugger
Runs on the GPU
Emulation mode
Compile and execute in emulation on CPU
Allows CPU-style debugging in GPU source
48. 6.963 CUDA@MIT IAP09
CUDA API
49. API
• The CUDA API consists of three parts:
– The host API
– The device API
– The common API
50. API
• The CUDA host API provides functions for:
– Device management
– Memory management
– Stream management
– Event management
– Texture management
– OpenGL/DirectX interoperability
51. API
• The host API is exposed as two different layers
– The low level Device API (prefix: cu)
– The high level Runtime API (prefix: cuda)
• Some things can be done through both APIs, others are specialized
– Can be mixed together (with care)
52. API
• All GPU computing is performed on a device
• To allocate memory, run a program, etc on the hardware, we need a device context
• Device contexts are bound 1:1 with host threads (just like OpenGL)
– So, each host thread may have at most one device context
– And, each device context is accessible from only one host thread
53. API
• All device API calls return an error/success code of type: CUresult
– All runtime API calls return an error/success code of type cudaError_t
• An integer value with zero = no error
• cudaGetLastError, cudaGetErrorString
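One common way to use these return codes with the runtime API is a small checking macro. The sketch below is an assumption about how one might do it, not something the slides prescribe: CUDA_CHECK, noop and run_error_demo are hypothetical names, and cudaDeviceSynchronize is the current runtime name for waiting on the device.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical convenience macro: abort with a readable message on any error.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,        \
                    cudaGetErrorString(err_));                        \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

__global__ void noop() {}

// Hypothetical helper: returns 0 if the invalid call was reported as an error.
int run_error_demo()
{
    // Deliberately invalid: zero (cudaSuccess) means no error, anything else is one.
    cudaError_t bad = cudaSetDevice(-1);
    printf("cudaSetDevice(-1): %s\n", cudaGetErrorString(bad));
    cudaGetLastError();                   // reads and clears the recorded error

    CUDA_CHECK(cudaSetDevice(0));
    noop<<<1, 1>>>();
    CUDA_CHECK(cudaGetLastError());       // catches launch errors
    CUDA_CHECK(cudaDeviceSynchronize());  // catches errors raised during execution
    return bad == cudaSuccess ? 1 : 0;
}

int main()
{
    return run_error_demo();
}
```

Kernel launches return nothing themselves, which is why the launch is followed by cudaGetLastError.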
54. API
• Runtime API calls automatically initialize
• Device API calls must call cuInit
55. API
• The first (optional) step is to enumerate the available devices
– cuDeviceGetCount
– cuDeviceGet
– cuDeviceGetName
– cuDeviceTotalMem
– cuDeviceGetAttribute
56. API
• Once we choose a device with cuDeviceGet we get a device handle of type CUdevice
• Can now create a context with cuCtxCreate
57. API
• Runtime API provides a simplified interface for creating a context:
– cudaGetDeviceCount
– cudaSetDevice
• And the useful:
– cudaChooseDevice
58. API
Device Management
CPU can query and select GPU devices
cudaGetDeviceCount( int* count )
cudaSetDevice( int device )
cudaGetDevice( int *current_device )
cudaGetDeviceProperties( cudaDeviceProp* prop,
int device )
cudaChooseDevice( int *device, cudaDeviceProp* prop )
Multi-GPU setup:
device 0 is used by default
one CPU thread can control one GPU
multiple CPU threads can control the same GPU
– calls are serialized by the driver
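The query functions listed above can be exercised with a short enumeration loop. This is a minimal sketch; list_devices is a hypothetical helper name, and the fields printed are just a few of those available in cudaDeviceProp.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: prints every device and returns how many were found,
// or -1 if a query failed.
int list_devices()
{
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess) return -1;

    for (int d = 0; d < count; ++d) {
        cudaDeviceProp prop;
        if (cudaGetDeviceProperties(&prop, d) != cudaSuccess) return -1;
        printf("device %d: %s, compute %d.%d, %zu MB global memory, "
               "%d multiprocessors\n",
               d, prop.name, prop.major, prop.minor,
               prop.totalGlobalMem >> 20, prop.multiProcessorCount);
    }

    if (count > 0) cudaSetDevice(0);  // device 0 is also what is used by default
    return count;
}

int main()
{
    return list_devices() >= 0 ? 0 : 1;
}
```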
59. API
• Once we have a context (CUcontext), can allocate memory, call a GPU function etc.
– Context is implicitly associated with creating thread
• To synchronize all threads (CPU host with GPU threads) call cuCtxSynchronize
– Waits for all GPU tasks to finish
60. API
• Allocate/Free memory:
– cuMemAlloc, cuMemFree
• Initialize memory:
– cuMemset
• Copy memory:
– cuMemcpyHtoD, cuMemcpyDtoH, cuMemcpyDtoD
61. API
• When allocating memory for the host, can use malloc / new / mmap
– Or use cuMemAllocHost, cuMemFreeHost
• These functions allocate host memory that is page-locked
• Performance improved for copy to/from page-locked host memory
62. API
• Allocate/Free memory:
– cudaMalloc, cudaFree
• Initialize memory:
– cudaMemset
• Copy memory:
– cudaMemcpy
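The runtime memory calls fit together as a simple round trip: allocate, copy in, copy between device buffers, clear, copy out, free. The sketch below needs no kernel; run_memory_demo and the buffer names are hypothetical.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: returns 0 when both copies come back as expected.
int run_memory_demo()
{
    const int n = 256;
    const size_t bytes = n * sizeof(int);
    int h_src[n], h_copy[n], h_zero[n];
    for (int i = 0; i < n; ++i) h_src[i] = i;

    int *d_a = NULL, *d_b = NULL;
    if (cudaMalloc((void **)&d_a, bytes) != cudaSuccess) return 1;
    if (cudaMalloc((void **)&d_b, bytes) != cudaSuccess) { cudaFree(d_a); return 1; }

    cudaMemcpy(d_a, h_src, bytes, cudaMemcpyHostToDevice);   // host -> device
    cudaMemcpy(d_b, d_a, bytes, cudaMemcpyDeviceToDevice);   // device -> device
    cudaMemset(d_a, 0, bytes);                               // clear the first buffer
    cudaError_t e1 = cudaMemcpy(h_copy, d_b, bytes, cudaMemcpyDeviceToHost);
    cudaError_t e2 = cudaMemcpy(h_zero, d_a, bytes, cudaMemcpyDeviceToHost);

    cudaFree(d_a);
    cudaFree(d_b);
    if (e1 != cudaSuccess || e2 != cudaSuccess) return 1;

    for (int i = 0; i < n; ++i)
        if (h_copy[i] != i || h_zero[i] != 0) return 1;
    return 0;
}

int main()
{
    int status = run_memory_demo();
    printf("%s\n", status == 0 ? "memory demo OK" : "memory demo FAILED");
    return status;
}
```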
63. API
• cuMemAlloc, etc allocate linear memory
• Can also allocate array memory (2D)
• Arrays are created with a specific width and height and element type
– Memory layout is optimized (e.g. packing) by runtime
• cuArrayCreate, cuArrayDestroy
• cuMemcpyDtoA, cuMemcpyHtoA, etc
64. API
• A module is a blob of GPU code/data along with some type information
– .cubin files
• A module is created by loading a cubin with cuModuleLoad or cuModuleLoadData
• Module can be unloaded with cuModuleUnload
65. API
• Loading a module also copies it to the device
• Can then get the address of functions and global variables:
– cuModuleGetFunction
– cuModuleGetGlobal
– cuModuleGetTexRef
66. API
• Once a module is loaded, and we have a function pointer, we can call a function
• We must setup the execution environment first
67. API
• Execution environment includes:
– Thread Block Size
– Shared Memory Size
– Function Parameters
– Grid Size
68. API
• Thread Block Size:
– cuFuncSetBlockShape
• Shared Memory Size:
– cuFuncSetSharedSize
• Function Parameters:
– cuParamSetSize, cuParamSeti, cuParamSetf, cuParamSetv
69. API
• Grid Size is set at the same time as the function invocation:
– cuLaunchGrid
70. API
• Recall the <<<...>>> function invocation syntax
• The compiler generates all the device API calls needed to set up the execution environment
71. API
• A stream is a sequence of operations that occur in order. E.g.
1. Copy data from host to device
2. Execute device function
3. Copy data from device to host
72. API
• A stream is a sequence of operations that occur in order
• Different streams can be used to manage concurrency. E.g.
– Overlapping memory copy from one stream with the function execution from another
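The copy / execute / copy pattern can be split across two streams so that, on hardware that supports it, the copy in one stream may overlap the kernel in the other. The sketch uses the runtime API rather than the device API, and page-locked host memory because asynchronous copies need it; add_one and run_stream_demo are hypothetical names, and error checks are omitted for brevity.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void add_one(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

// Hypothetical helper: returns 0 when every element was incremented exactly once.
int run_stream_demo()
{
    const int n = 1 << 16, half = n / 2;
    float *h_data = NULL, *d_data = NULL;
    cudaMallocHost((void **)&h_data, n * sizeof(float));  // page-locked host memory
    cudaMalloc((void **)&d_data, n * sizeof(float));
    for (int i = 0; i < n; ++i) h_data[i] = (float)i;

    cudaStream_t streams[2];
    for (int s = 0; s < 2; ++s) cudaStreamCreate(&streams[s]);

    // Each stream handles one half: copy in, run the kernel, copy out, in order.
    for (int s = 0; s < 2; ++s) {
        int offset = s * half;
        cudaMemcpyAsync(d_data + offset, h_data + offset, half * sizeof(float),
                        cudaMemcpyHostToDevice, streams[s]);
        add_one<<<(half + 255) / 256, 256, 0, streams[s]>>>(d_data + offset, half);
        cudaMemcpyAsync(h_data + offset, d_data + offset, half * sizeof(float),
                        cudaMemcpyDeviceToHost, streams[s]);
    }

    cudaError_t err = cudaSuccess;
    for (int s = 0; s < 2; ++s) {
        cudaError_t e = cudaStreamSynchronize(streams[s]);  // wait for this stream
        if (e != cudaSuccess) err = e;
        cudaStreamDestroy(streams[s]);
    }

    int status = (err == cudaSuccess) ? 0 : 1;
    for (int i = 0; i < n && status == 0; ++i)
        if (h_data[i] != (float)(i + 1)) status = 1;

    cudaFree(d_data);
    cudaFreeHost(h_data);
    return status;
}

int main()
{
    int status = run_stream_demo();
    printf("%s\n", status == 0 ? "stream demo OK" : "stream demo FAILED");
    return status;
}
```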
73. API
• Events are a way of determining the progress of a stream
– Events provide a "marker" in a stream at a specific position
• A holder of an event handle can:
– Wait for an event to occur
– Measure the time that occurred between two events
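Both uses of an event handle, waiting and timing, show up when measuring how long a kernel takes. The sketch below uses the runtime API event calls; scale_all and time_kernel_ms are hypothetical names, and the measured time depends entirely on the hardware.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void scale_all(float *data, int n, float k)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= k;
}

// Hypothetical helper: elapsed kernel time in milliseconds, or -1 on error.
float time_kernel_ms()
{
    const int n = 1 << 20;
    float *d_data = NULL;
    if (cudaMalloc((void **)&d_data, n * sizeof(float)) != cudaSuccess) return -1.0f;
    cudaMemset(d_data, 0, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);                   // marker before the kernel
    scale_all<<<(n + 255) / 256, 256>>>(d_data, n, 2.0f);
    cudaEventRecord(stop, 0);                    // marker after the kernel
    cudaError_t err = cudaEventSynchronize(stop);  // wait for the event to occur

    float ms = -1.0f;
    if (err == cudaSuccess)
        err = cudaEventElapsedTime(&ms, start, stop);  // time between the two events

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_data);
    return err == cudaSuccess ? ms : -1.0f;
}

int main()
{
    printf("kernel took %f ms\n", time_kernel_ms());
    return 0;
}
```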
74. 6.963 CUDA@MIT IAP09
CUDA Execution and Threading Model
76. Threading
CUDA Uses Extensive Multithreading
• CUDA threads express fine-grained data parallelism
– Map threads to GPU threads or CPU vector elements
– Virtualize the processors
– You must rethink your algorithms to be aggressively parallel
• CUDA thread blocks express coarse-grained parallelism
– Map blocks to GPU thread arrays or CPU threads
– Scale transparently to any number of processors
• GPUs execute thousands of lightweight threads
– One DX10 graphics thread computes one pixel fragment
– One CUDA thread computes one result (or several results)
– Provide hardware multithreading & zero-overhead scheduling
77. Threading
CUDA Programming Model
Parallel code (kernel) is launched and executed on a
device by many threads
Threads are grouped into thread blocks
Parallel code is written for a thread
Each thread is free to execute a unique code path
Built-in thread and block ID variables
78. Threading
Thread Hierarchy
Threads launched for a parallel section are
partitioned into thread blocks
Grid = all blocks for a given launch
Thread block is a group of threads that can:
Synchronize their execution
Communicate via shared memory
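A per-block sum is a compact example of threads in a block synchronizing and communicating via shared memory. The sketch assumes a block size of 128 and an input length that is a multiple of it; BLOCK, block_sum and run_block_sum_demo are hypothetical names.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define BLOCK 128  // threads per block (assumed power of two)

// Each block adds up its BLOCK inputs and writes one partial sum.
__global__ void block_sum(const int *in, int *partial)
{
    __shared__ int buf[BLOCK];                 // shared by the threads of one block
    int t = threadIdx.x;
    buf[t] = in[blockIdx.x * BLOCK + t];
    __syncthreads();

    // Tree reduction: halve the number of active threads at every step.
    for (int stride = BLOCK / 2; stride > 0; stride /= 2) {
        if (t < stride) buf[t] += buf[t + stride];
        __syncthreads();                       // all threads reach this barrier
    }
    if (t == 0) partial[blockIdx.x] = buf[0];
}

// Hypothetical helper: returns 0 when the device total matches the host total.
int run_block_sum_demo()
{
    const int n = 1024, blocks = n / BLOCK;
    int h_in[n], h_partial[blocks], expected = 0;
    for (int i = 0; i < n; ++i) { h_in[i] = i % 7; expected += h_in[i]; }

    int *d_in = NULL, *d_partial = NULL;
    cudaMalloc((void **)&d_in, sizeof(h_in));
    cudaMalloc((void **)&d_partial, sizeof(h_partial));
    cudaMemcpy(d_in, h_in, sizeof(h_in), cudaMemcpyHostToDevice);
    block_sum<<<blocks, BLOCK>>>(d_in, d_partial);
    cudaError_t err = cudaMemcpy(h_partial, d_partial, sizeof(h_partial),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_partial);
    if (err != cudaSuccess) return 1;

    // Blocks are independent, so the partial sums are combined on the host.
    int total = 0;
    for (int b = 0; b < blocks; ++b) total += h_partial[b];
    return total == expected ? 0 : 1;
}

int main()
{
    int status = run_block_sum_demo();
    printf("%s\n", status == 0 ? "block sum OK" : "block sum FAILED");
    return status;
}
```

Combining the partial sums on the host sidesteps any need for blocks to synchronize with each other.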
79. Threading
IDs and Dimensions
Threads:
3D IDs, unique within a block
Blocks:
2D IDs, unique within a grid
Dimensions set at launch time
Can be unique for each section
Built-in variables:
threadIdx, blockIdx
blockDim, gridDim
[Figure: Device running Grid 1, a 3 x 2 arrangement of Blocks (0, 0) to (2, 1); Block (1, 1) is expanded into a 5 x 3 arrangement of Threads (0, 0) to (4, 2)]
81. Threading
Blocks must be independent
Any possible interleaving of blocks should be valid
presumed to run to completion without pre-emption
can run in any order
can run concurrently OR sequentially
Blocks may coordinate but not synchronize
shared queue pointer: OK
shared lock: BAD … can easily deadlock
Independence requirement gives scalability
82. Threading
Hardware Multithreading
Hardware allocates resources to blocks
blocks need: thread slots, registers, shared memory
blocks don’t run until resources are available
Hardware schedules threads
threads have their own registers
any thread not waiting for something can run
context switching is free – every cycle
Hardware relies on threads to hide latency
i.e., parallelism is necessary for performance
[Figure: SM block diagram with MT IU, SP, and Shared Memory]
83. Threading
SIMT Thread Execution
Groups of 32 threads formed into warps
always executing same instruction
shared instruction fetch/dispatch
some become inactive when code path diverges
hardware automatically handles divergence
Warps are the primitive unit of scheduling
SIMT execution is an implementation choice
sharing control logic leaves more space for ALUs
largely invisible to programmer
must understand for performance, not correctness
[Figure: SM block diagram with MT IU, SP, and Shared Memory]