6.963 - IAP09 - CUDA@MIT

                           Supercomputing on your desktop:
         Programming the next generation of cheap
        and massively parallel hardware using CUDA

                                                             Lecture 07
                                                             Nicolas Pinto (MIT)




                                                                  #2
                                 CUDA              -   Advanced
Friday, January 23, 2009
During this course, we'll try to “adapt for 6.963” and use existing material ;-)
Friday, January 23, 2009
Today
                           yey!!




Friday, January 23, 2009
Wanna Play with
                           The Big Guys?




Friday, January 23, 2009
Here are the keys
                to High-Performance in CUDA




Friday, January 23, 2009
Warning!
               To optimize or not to optimize



                      Hoare said (and Knuth restated)



                           “Premature optimization is the root of all evil.”




                                                                               slide by Johan Seland
                                                      Applied Mathematics                        23/53



Friday, January 23, 2009
Warning!
               To optimize or not to optimize



                      Hoare said (and Knuth restated)
                             “We should forget about small efficiencies, say about
                             97% of the time:
                             Premature optimization is the root of all evil.”


                                                        ⇓
                           3% of the time we really should worry about small efficiencies
                                              (Every 33rd codeline)




                                                                                   slide by Johan Seland
                                                       Applied Mathematics                           23/53



Friday, January 23, 2009
6.963 - IAP09 - CUDA@MIT




                         Strategy
           Memory Optimizations
          Execution Optimizations



Friday, January 23, 2009
6.963 - IAP09 - CUDA@MIT



                                                   CUDA
                           Performance Strategies




Friday, January 23, 2009
Strategy
               Optimization goals



                           We should strive to reach GPU performance
                           We must know the GPU performance
                               Vendor specifications
                                Synthetic benchmarks
                           Choose a performance metric
                               Memory bandwidth or GFLOPS?
                           Use clock() to measure
                           Experiment and profile!




                                                                               slide by Johan Seland
                                                      Applied Mathematics                        25/53



Friday, January 23, 2009
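A minimal sketch of the "choose a metric, measure, profile" advice above, timing a kernel with CUDA events and reporting an effective bandwidth; myKernel, the array size and the launch configuration are illustrative placeholders, not from the lecture:

    #include <cstdio>

    // Hedged sketch: time one kernel launch and convert to GB/s.
    float time_kernel(float *d_data, int n)
    {
        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        cudaEventRecord(start, 0);
        myKernel<<<(n + 255) / 256, 256>>>(d_data, n);   // placeholder kernel
        cudaEventRecord(stop, 0);
        cudaEventSynchronize(stop);                      // wait for the kernel to finish

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        // one read + one write of a float per element
        printf("%f ms, %f GB/s\n", ms, 2.0f * n * sizeof(float) / ms / 1.0e6f);

        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        return ms;
    }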
Threading
                    Programming Model

                             A kernel is executed as a
                             grid of thread blocks

                             A thread block is a batch
                             of threads that can
                             cooperate with each
                             other by:
                                       Sharing data through
                                       shared memory
                                       Synchronizing their
                                       execution

                             Threads from different
                             blocks cannot cooperate

                             (Figure: the host launches Kernel 1 on Grid 1 and Kernel 2 on
                             Grid 2; each grid is a 2D array of thread blocks, and each
                             block, e.g. Block (1,1), is a 2D array of threads)

                                                                                             3
                    © NVIDIA Corporation 2006




Friday, January 23, 2009
Memory
           Data Movement in a CUDA Program

           Host Memory
            Device Memory
             [Shared Memory]
               COMPUTATION
             [Shared Memory]
            Device Memory
           Host Memory




           © NVIDIA Corporation 2008             10


Friday, January 23, 2009
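A minimal sketch of the data-movement chain on the slide above (host memory to device memory, computation, and back); the scale kernel, n and the 256-thread blocks are illustrative assumptions:

    #include <cstdlib>

    __global__ void scale(float *data, float f)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        data[i] *= f;
    }

    // Hedged sketch; n is assumed to be a multiple of 256, error checks omitted.
    void run(int n)
    {
        size_t bytes = n * sizeof(float);
        float *h_data = (float*)malloc(bytes);
        float *d_data;
        cudaMalloc((void**)&d_data, bytes);

        cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);   // host memory -> device memory
        scale<<<n / 256, 256>>>(d_data, 2.0f);                       // computation (optionally staged through shared memory)
        cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);   // device memory -> host memory

        cudaFree(d_data);
        free(h_data);
    }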
Perf
                    Optimize Algorithms for the GPU

                            Maximize independent parallelism

                            Maximize arithmetic intensity (math/bandwidth)

                            Sometimes it's better to recompute than to cache
                               GPU spends its transistors on ALUs, not memory

                            Do more computation on the GPU to avoid costly
                            data transfers
                               Even low-parallelism computations can sometimes be
                               faster than transferring back and forth to host

              39


Friday, January 23, 2009
Perf
                    Optimize Memory Coherence

                            Coalesced vs. non-coalesced = order of magnitude
                               Global/Local device memory

                            Optimize for spatial locality in cached texture
                            memory

                            In shared memory, avoid high-degree bank conflicts

              40


Friday, January 23, 2009
Perf
                    Take Advantage of Shared Memory

                            Hundreds of times faster than global memory
                            Threads can cooperate via shared memory

                            Use one / a few threads to load / compute data
                            shared by all threads

                            Use it to avoid non-coalesced access
                               Stage loads and stores in shared memory to re-order
                               non-coalesceable addressing
                               Matrix transpose example later

              41


Friday, January 23, 2009
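A minimal sketch of the "load once, share with all threads of the block" idea on the slide above; the 3-point average, TILE size and names are illustrative, not from the lecture:

    #define TILE 256   // launch with TILE threads per block

    // Hedged sketch: each thread issues one global read into shared memory,
    // a couple of threads load the halo, and then every thread of the block
    // reuses its neighbours' values from shared memory.
    __global__ void blur3(const float *in, float *out, int n)
    {
        __shared__ float s[TILE + 2];                  // tile plus one halo element per side
        int g = blockIdx.x * TILE + threadIdx.x;       // global index

        s[threadIdx.x + 1] = (g < n) ? in[g] : 0.0f;   // one global read per thread
        if (threadIdx.x == 0)
            s[0]        = (g > 0)     ? in[g - 1] : 0.0f;
        if (threadIdx.x == TILE - 1)
            s[TILE + 1] = (g + 1 < n) ? in[g + 1] : 0.0f;
        __syncthreads();                               // data is now shared by all threads

        if (g < n)
            out[g] = (s[threadIdx.x] + s[threadIdx.x + 1] + s[threadIdx.x + 2]) / 3.0f;
    }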
Perf
                    Use Parallelism Efficiently

                            Partition your computation to keep the GPU
                            multiprocessors equally busy
                               Many threads, many thread blocks

                            Keep resource usage low enough to support
                            multiple active thread blocks per multiprocessor
                               Registers, shared memory

              42


Friday, January 23, 2009
6.963 - IAP09 - CUDA@MIT


                                   Memory
                              Optimizations



Friday, January 23, 2009
Memory
                    Memory optimizations

                            Optimizing memory transfers
                            Coalescing global memory accesses
                            Using shared memory effectively

              44


Friday, January 23, 2009
Memory
                    Data Transfers

                            Device memory to host memory bandwidth much
                            lower than device memory to device bandwidth
                               4 GB/s peak (PCI-e x16) vs. 80 GB/s peak (Quadro FX 5600)
                               8 GB/s for PCI-e 2.0

                            Minimize transfers
                               Intermediate data structures can be allocated, operated
                               on, and deallocated without ever copying them to host
                               memory

                            Group transfers
                               One large transfer much better than many small ones

              45


Friday, January 23, 2009
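A minimal sketch of the "group transfers" point above: pack many small pieces into one host staging buffer and copy once instead of issuing one cudaMemcpy per piece; the piece layout and names are illustrative assumptions:

    #include <string.h>

    // Hedged sketch; h_pieces/h_staging/pieceLen/numPieces are placeholders.
    void upload_pieces(float *d_buf, float **h_pieces, float *h_staging,
                       int numPieces, int pieceLen)
    {
        // Slow: one small cudaMemcpy per piece
        // for (int i = 0; i < numPieces; ++i)
        //     cudaMemcpy(d_buf + i * pieceLen, h_pieces[i],
        //                pieceLen * sizeof(float), cudaMemcpyHostToDevice);

        // Better: pack on the host, then do one large transfer
        for (int i = 0; i < numPieces; ++i)
            memcpy(h_staging + i * pieceLen, h_pieces[i], pieceLen * sizeof(float));
        cudaMemcpy(d_buf, h_staging, (size_t)numPieces * pieceLen * sizeof(float),
                   cudaMemcpyHostToDevice);
    }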
Memory
                    Page-Locked Memory Transfers

                            cudaMallocHost() allows allocation of page-locked
                            host memory
                            Enables highest cudaMemcpy performance
                               3.2 GB/s common on PCI-express (x16)
                               ~4 GB/s measured on nForce 680i motherboards
                               (overclocked PCI-e)

                            See the "bandwidthTest" CUDA SDK sample

                            Use with caution
                               Allocating too much page-locked memory can reduce
                               overall system performance
                               Test your systems and apps to learn their limits

              46


Friday, January 23, 2009
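A minimal sketch of page-locked host memory as described above, using cudaMallocHost()/cudaFreeHost(); d_data and bytes are assumed to exist and error checking is omitted:

    // Hedged sketch: pinned host allocation for fast cudaMemcpy.
    void copy_pinned(float *d_data, size_t bytes)
    {
        float *h_pinned;
        cudaMallocHost((void**)&h_pinned, bytes);       // instead of malloc()
        // ... fill h_pinned ...
        cudaMemcpy(d_data, h_pinned, bytes, cudaMemcpyHostToDevice);
        cudaFreeHost(h_pinned);                         // matching free for pinned memory
    }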
gmem
                    Global Memory Reads/Writes

                            Highest latency instructions: 400-600 clock cycles

                            Likely to be a performance bottleneck

                            Optimizations can greatly increase performance
                               Coalescing: up to 10x speedup
                               Latency hiding: up to 2.5x speedup

              47


Friday, January 23, 2009
gmem
               Accessing global memory




                            4 cycles to issue a memory fetch
                           but 400-600 cycles of latency
                               The equivalent of 100 MADs
                           Likely to be a performance bottleneck
                           Order of magnitude speedups possible
                               Coalesce memory access
                           Use shared memory to re-order non-coalesced addressing




                                                                               slide by Johan Seland
                                                        Applied Mathematics                      32/53



Friday, January 23, 2009
gmem
                    Coalescing

                            A coordinated read by a half-warp (16 threads)
                            A contiguous region of global memory:
                               64 bytes - each thread reads a word: int, float, ...
                               128 bytes - each thread reads a double-word: int2, float2, ...
                               256 bytes - each thread reads a quad-word: int4, float4, ...
                            Additional restrictions on G8x/G9x architecture:
                               Starting address for a region must be a multiple of region
                               size
                               The kth thread in a half-warp must access the kth element in a
                               block being read
                            Exception: not all threads must be participating
                               Predicated access, divergence within a half-warp

              48


Friday, January 23, 2009
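A minimal sketch of the coalescing rule above: thread k of a half-warp should touch the kth consecutive word of an aligned segment. The two copy kernels are illustrative, not from the lecture:

    // Hedged sketch: coalesced vs. strided (uncoalesced) global access.
    __global__ void copy_coalesced(float *out, const float *in)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // consecutive threads -> consecutive addresses
        out[i] = in[i];                                  // coalesced
    }

    __global__ void copy_strided(float *out, const float *in, int stride)
    {
        int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;  // neighbouring threads are 'stride' words apart
        out[i] = in[i];                                  // uncoalesced on G8x when stride > 1
    }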
gmem
                    Coalesced Access: Reading floats

                            (Figure: threads t0..t15 of a half-warp reading consecutive
                            float addresses 128, 132, 136, ..., 184, 188)

                                 All threads participate

                                 Some threads do not participate

              49


Friday, January 23, 2009
gmem
                    Uncoalesced Access: Reading floats

                            (Figure: threads t0..t15 reading float addresses starting at 128)

                                 Permuted Access by Threads

                                 Misaligned Starting Address (not a multiple of 64)

              50


Friday, January 23, 2009
gmem
                    Coalescing: Timing Results

                            Experiment on G80:
                               Kernel: read a float, increment, write back
                               3M floats (12MB)
                               Times averaged over 10K runs
                            12K blocks x 256 threads:
                                 356µs - coalesced
                                 357µs - coalesced, some threads don't participate
                               3,494µs - permuted/misaligned thread access

              51


Friday, January 23, 2009
gmem
                    Coalescing:
                    Structures of size ≠ 4, 8, or 16 bytes
                            Use a Structure of Arrays (SoA) instead of Array of Structures
                            (AoS)

                            If SoA is not viable:
                                   Force structure alignment: __align(X), where X = 4, 8, or 16
                                   Use SMEM to achieve coalescing


                               x        y      z     Point structure


                               x        y      z     x      y      z     x      y      z          AoS


                               x        x      x     y      y      y     z      z      z          SoA

              58


Friday, January 23, 2009
gmem
                    Coalescing: Summary

                            Coalescing greatly improves throughput

                            Critical to memory-bound kernels

                            Reading structures of size other than 4, 8, or 16
                            bytes will break coalescing:
                               Prefer Structures of Arrays over AoS
                               If SoA is not viable, read/write through SMEM

                            Additional resources:
                               Aligned Types SDK Sample

              59


Friday, January 23, 2009
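A minimal sketch of the AoS-to-SoA and alignment advice above for a 3-float "point"; the struct names and kernel are illustrative assumptions:

    // Hedged sketch: SoA layout gives coalesced access; if AoS is unavoidable,
    // pad and align the element to 16 bytes.
    struct PointAoS { float x, y, z; };        // 12-byte element: breaks coalescing on G8x

    struct PointsSoA {                         // one contiguous array per field
        float *x;
        float *y;
        float *z;
    };

    __global__ void shift_x(PointsSoA p, float dx, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            p.x[i] += dx;                      // consecutive threads read consecutive floats: coalesced
    }

    // AoS fallback: 16-byte aligned element (x, y, z plus one float of padding)
    struct __align__(16) PointAoS16 { float x, y, z, pad; };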
smem
                    Parallel Memory Architecture

                            In a parallel machine, many threads access memory
                               Therefore, memory is divided into banks
                               Essential to achieve high bandwidth

                            Each bank can service one address per cycle
                               A memory can service as many simultaneous
                               accesses as it has banks

                            Multiple simultaneous accesses to a bank
                            result in a bank conflict
                               Conflicting accesses are serialized

                            (Figure: shared memory banks 0 through 15)

              64


Friday, January 23, 2009
smem
                    Bank Addressing Examples
                            No Bank Conflicts                 No Bank Conflicts
                               Linear addressing                 Random 1:1 permutation
                               stride == 1

                            (Figure: threads 0-15 each map to a distinct bank 0-15)

              65


Friday, January 23, 2009
smem
                    Bank Addressing Examples
                            2-way Bank Conflicts              8-way Bank Conflicts
                                Linear addressing                Linear addressing
                                stride == 2                      stride == 8

                            (Figure: threads 0-15 map onto banks with 2-way and 8-way conflicts)

              66


Friday, January 23, 2009
smem
                    How addresses map to banks on G80

                            Bandwidth of each bank is 32 bits per 2 clock cycles
                            Successive 32-bit words are assigned to successive
                            banks
                            G80 has 16 banks
                               So bank = address % 16
                               Same as the size of a half-warp
                                  No bank conflicts between different half-warps, only within a single half-warp

              67


Friday, January 23, 2009
smem
                    Shared memory bank conflicts

                            Shared memory is as fast as registers if there are
                            no bank conflicts

                            The fast case:
                               If all threads of a half-warp access different banks, there is
                               no bank conflict
                               If all threads of a half-warp read the identical address,
                               there is no bank conflict (broadcast)
                            The slow case:
                               Bank Conflict: multiple threads in the same half-warp
                               access the same bank
                               Must serialize the accesses
                               Cost = max # of simultaneous accesses to a single bank

              68


Friday, January 23, 2009
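A minimal sketch of a 16-way bank conflict and the usual padding fix; the demo kernel (launch it with a single 16-thread block) is illustrative, not from the lecture:

    // Hedged sketch: column access of a 16x16 shared tile hits one bank 16
    // times; padding the row length to 17 spreads the column over all banks.
    __global__ void bank_demo(float *out)
    {
        __shared__ float tile[16][16];     // tile[r][c] -> bank (r*16 + c) % 16 = c
        __shared__ float padded[16][17];   // row stride 17: column c of successive rows hits successive banks

        for (int c = 0; c < 16; ++c) {     // fill both tiles
            tile[threadIdx.x][c]   = (float)c;
            padded[threadIdx.x][c] = (float)c;
        }
        __syncthreads();

        float slow = tile[threadIdx.x][0];    // all 16 threads read bank 0: 16-way conflict, serialized
        float fast = padded[threadIdx.x][0];  // addresses 0, 17, 34, ... fall in banks 0, 1, 2, ...: conflict-free
        out[threadIdx.x] = slow + fast;
    }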
Strategy
               Use the right kind of memory


                           Constant memory:
                               Quite small, ≈ 20K
                               As fast as register access if all threads in a warp access the
                               same location
                           Texture memory:
                               Spatially cached
                               Optimized for 2D locality
                               Neighboring threads should read neighboring addresses
                               No need to think about coalescing
                           Constraint:
                               These memories can only be updated from the CPU




                                                                                       slide by Johan Seland
                                                        Applied Mathematics                              31/53



Friday, January 23, 2009
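A minimal sketch of constant memory as described above: a small read-only table, updated only from the host with cudaMemcpyToSymbol, and fastest when all threads of a warp read the same location. The coefficient table and kernel are illustrative assumptions:

    // Hedged sketch of __constant__ memory usage.
    __constant__ float c_coeff[16];

    __global__ void apply(float *data, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            data[i] *= c_coeff[0];         // every thread reads the same constant location
    }

    // host side (sketch):
    //   float h_coeff[16] = { ... };
    //   cudaMemcpyToSymbol(c_coeff, h_coeff, sizeof(h_coeff));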
Strategy
               Memory optimizations roundup



                           CUDA memory handling is complex
                               And I have not covered all topics...
                           Using memory correctly can lead to huge speedups
                                At least CUDA exposes the memory hierarchy, unlike CPUs
                            Get your algorithm up and running first, then optimize
                            Use shared memory to let threads cooperate
                            Be wary of “data ownership”
                                A thread does not have to read/write the data it calculates




                                                                                    slide by Johan Seland
                                                        Applied Mathematics                           41/53



Friday, January 23, 2009
Conflicts,
                           Coalescing, Warps...
                           I hate growing up.




Friday, January 23, 2009
Example

                     Optimization Example: Matrix Transpose




Friday, January 23, 2009
Example
                    Matrix Transpose

                            SDK Sample ("transpose")
                            Illustrates:
                               Coalescing
                               Avoiding SMEM bank conflicts
                               Speedups for even small matrices


                                                          1   5   9    13
                                1   2    3    4

                                                          2   6   10   14
                                5   6    7    8

                                                          3   7   11   15
                                9   10   11   12

                                                          4   8   12   16
                               13   14   15   16

              70


Friday, January 23, 2009
Example
                    Uncoalesced Transpose

                          __global__ void transpose_naive(float *odata, float *idata, int width, int height)
                          {
                             unsigned int xIndex = blockDim.x * blockIdx.x + threadIdx.x;
                             unsigned int yIndex = blockDim.y * blockIdx.y + threadIdx.y;

                             if (xIndex < width && yIndex < height)
                             {
                                unsigned int index_in  = xIndex + width * yIndex;
                                unsigned int index_out = yIndex + height * xIndex;
                                odata[index_out] = idata[index_in];
                             }
                          }

              71


Friday, January 23, 2009
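A minimal sketch of how a transpose kernel like the one above might be launched: one 16x16 thread block per 16x16 tile of the matrix. BLOCK_DIM = 16 and the assumption that width and height are multiples of 16 are illustrative:

    #define BLOCK_DIM 16

    // Hedged sketch: launch configuration for the transpose kernels.
    void launch_transpose(float *d_odata, float *d_idata, int width, int height)
    {
        dim3 threads(BLOCK_DIM, BLOCK_DIM);
        dim3 grid(width / BLOCK_DIM, height / BLOCK_DIM);
        transpose_naive<<<grid, threads>>>(d_odata, d_idata, width, height);
    }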
Example
                    Uncoalesced Transpose

                            Reads input from GMEM               Write output to GMEM
                            (Figure: a 16x16 tile of elements 0,0 ... 15,15 is read row by
                            row and written column by column)

                   Stride = 1, coalesced                   Stride = 16, uncoalesced

              72


Friday, January 23, 2009
Example
                    Coalesced Transpose

                            Assumption: matrix is partitioned into square tiles
                            Threadblock (bx, by):
                               Read the (bx, by) input tile, store into SMEM
                               Write the SMEM data to (by, bx) output tile
                                   Transpose the indexing into SMEM

                            Thread (tx, ty):
                               Reads element (tx, ty) from input tile
                               Writes element (tx, ty) into output tile
                            Coalescing is achieved if:
                               Block/tile dimensions are multiples of 16

              73


Friday, January 23, 2009
Example
                    Coalesced Transpose
                             Reads from GMEM               Writes to SMEM

                             Reads from SMEM               Writes to GMEM

                            (Figure: a 16x16 tile is read row-wise from GMEM into SMEM,
                            then read column-wise from SMEM and written row-wise to GMEM)

              74


Friday, January 23, 2009
Example
                    SMEM Optimization

                            Reads from SMEM
                               Threads read SMEM with stride = 16
                                  Bank conflicts

                            Solution
                               Allocate an "extra" column
                               Read stride = 17
                               Threads read from consecutive banks

              75


Friday, January 23, 2009
Example
                    Coalesced Transpose
                              __global__ void transpose(float *odata, float *idata, int width, int height)
                              {
                                  __shared__ float block[(BLOCK_DIM+1)*BLOCK_DIM];

                                  unsigned int xBlock = blockDim.x * blockIdx.x;
                                  unsigned int yBlock = blockDim.y * blockIdx.y;
                                  unsigned int xIndex = xBlock + threadIdx.x;
                                  unsigned int yIndex = yBlock + threadIdx.y;
                                  unsigned int index_out, index_transpose;

                                  if (xIndex < width && yIndex < height)
                                  {
                                      unsigned int index_in = width * yIndex + xIndex;
                                      unsigned int index_block = threadIdx.y * (BLOCK_DIM+1) + threadIdx.x;
                                      block[index_block] = idata[index_in];
                                      index_transpose = threadIdx.x * (BLOCK_DIM+1) + threadIdx.y;
                                      index_out = height * (xBlock + threadIdx.y) + yBlock + threadIdx.x;
                                  }
                                  __syncthreads();

                                  if (xIndex < width && yIndex < height)
                                      odata[index_out] = block[index_transpose];
                              }
              76


Friday, January 23, 2009
Example
               Coalesced transpose: Source code

                      __global__ void
                       transpose( float *out, float *in, int width, int height ) {                 Allocate shared memory.
                        __shared__ float block[BLOCK_DIM*BLOCK_DIM];
                                                                                                   Set up indexing
                           unsigned int xBlock = blockDim.x * blockIdx.x;
                           unsigned int yBlock = blockDim.y * blockIdx.y;

                           unsigned int xIndex = xBlock + threadIdx.x;
                           unsigned int yIndex = yBlock + threadIdx.y;

                           unsigned int index_out, index_transpose;                                Check that we are within
                                                                                                   domain, calculate more
                           if ( xIndex < width && yIndex < height ) {                              indices
                             unsigned int index_in = width * yIndex + xIndex;
                             unsigned int index_block = threadIdx.y * BLOCK_DIM + threadIdx.x;
                                                                                                   Write to shared memory.
                             block[index_block] = in[index_in];
                                                                                                   Calculate output indices.
                             index_transpose = threadIdx.x * BLOCK_DIM + threadIdx.y;
                             index_out = height * (xBlock + threadIdx.y) + yBlock + threadIdx.x;
                           }                                                                       Synchronize.
                            __syncthreads();                                                        NB: outside if-clause

                                                                                                   Write to global mem.
                           if ( xIndex < width && yIndex < height ) {
                                                                                                   Different index
                             out[index_out] = block[index_transpose];
                           }
                      }




                                                                                                                 slide by Johan Seland
                                                                         Applied Mathematics                                       39/53



Friday, January 23, 2009
Example
               Transpose timings



                      Was it worth the trouble?
                             Grid Size     Coalesced   Non-coalesced     Speedup
                             128 × 128     0.011 ms        0.022 ms       2.0×
                             512 × 512      0.07 ms         0.33 ms       4.5×
                            1024 × 1024     0.30 ms         1.92 ms       6.4×
                            1024 × 2048     0.79 ms          6.6 ms       8.4×

                            For me, this is a clear yes.




                                                                               slide by Johan Seland
                                                   Applied Mathematics                           40/53



Friday, January 23, 2009
6.963 - IAP09 - CUDA@MIT


                                  Execution
                              Optimizations



Friday, January 23, 2009
Exec
               Know the arithmetic cost of operations

                           4 clock cycles:
                                Floating point: add, multiply, fused multiply-add
                                Integer add, bitwise operations, compare, min, max
                           16 clock cycles:
                                 reciprocal, reciprocal square root, log(x), 32-bit integer
                                 multiplication
                           32 clock cycles:
                                 sin(x),      cos(x) and      exp(x)
                           36 clock cycles:
                                Floating point division (24-bit version in 20 cycles)
                           Particularly costly:
                                Integer division, modulo
                                Remedy: Replace with shifting whenever possible
                           Double precision (when available) will perform at half the
                           speed

                                                                                          slide by Johan Seland
                                                         Applied Mathematics                                28/53



Friday, January 23, 2009
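As a concrete instance of the last remedy, a division or modulo by a power of two can be replaced by a shift and a mask. A small sketch (assuming the divisor is known at compile time to be a power of two):

    // Costly on the device: integer division and modulo.
    unsigned int row = i / 16;
    unsigned int col = i % 16;

    // Cheap equivalents for unsigned i, since 16 == 1 << 4:
    unsigned int row_fast = i >> 4;          // i / 16
    unsigned int col_fast = i & 15;          // i % 16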
xec
                                                                      E
                    Occupancy

                            Thread instructions are executed sequentially, so
                            executing other warps is the only way to hide
                            latencies and keep the hardware busy

                            Occupancy = Number of warps running
                            concurrently on a multiprocessor divided by
                            maximum number of warps that can run
                            concurrently

                            Limited by resource usage:
                               Registers
                               Shared memory


             79


Friday, January 23, 2009
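A worked example of this definition, for a hypothetical kernel, using the per-SM limits quoted elsewhere in these slides (8192 registers, 16 KB shared memory) together with the G80 maximum of 768 threads = 24 warps per multiprocessor:

    Hypothetical kernel: 20 registers/thread, 256 threads/block, 4 KB shared memory/block

       Registers:      8192 / (20 * 256)   ->  1 block fits   (the limiter here)
       Shared memory:  16 KB / 4 KB        ->  4 blocks fit
       Threads:        768 / 256           ->  3 blocks fit

       Active blocks per SM  = min(1, 4, 3) = 1
       Active warps          = 256 / 32     = 8
       Occupancy             = 8 / 24       ≈ 33%

This back-of-the-envelope version ignores register allocation granularity; the CUDA Occupancy Calculator, which appears a few slides later, accounts for it.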
xec
                                                                              E
                    Grid/Block Size Heuristics

                            # of blocks > # of multiprocessors
                               So all multiprocessors have at least one block to execute


                            # of blocks / # of multiprocessors > 2
                               Multiple blocks can run concurrently in a multiprocessor
                               Blocks that aren't waiting at a __syncthreads() keep the
                               hardware busy
                               Subject to resource availability – registers, shared memory


                            # of blocks > 100 to scale to future devices
                               Blocks executed in pipeline fashion
                               1000 blocks per grid will scale across multiple generations



             80


Friday, January 23, 2009
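A minimal host-side sketch of these heuristics for a 1D problem (the kernel name, data pointer and problem size are placeholders):

    const unsigned int N = 1 << 22;               // problem size
    const unsigned int threadsPerBlock = 256;

    // Round up so every element is covered; for N = 4M this gives 16384 blocks,
    // far more than the number of multiprocessors and well above 100.
    dim3 block(threadsPerBlock);
    dim3 grid((N + threadsPerBlock - 1) / threadsPerBlock);

    myKernel<<<grid, block>>>(d_data, N);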
xec
                                                                                    E
                    Register Dependency

                            Read-after-write register dependency
                               Instruction's result can be read ~22 cycles later
                               Scenarios:        CUDA:              PTX:


                                          x = y + 5;         add.f32   $f3, $f1, $f2
                                          z = x + 3;         add.f32   $f5, $f3, $f4

                                          s_data[0] += 3;    ld.shared.f32  $f3, [$r31+0]
                                                             add.f32        $f3, $f3, $f4


                            To completely hide the latency:
                               Run at least 192 threads (6 warps) per multiprocessor
                                  At least 25% occupancy
                               Threads do not have to belong to the same thread block

             81


Friday, January 23, 2009
xec
                                                                                E
                    Register Pressure

                            Hide latency by using more threads per SM
                            Limiting factors:
                               Number of registers per kernel
                                   8192 per SM, partitioned among concurrent threads
                               Amount of shared memory
                                   16KB per SM, partitioned among concurrent threadblocks
                            Check .cubin file for # registers / kernel
                            Use –maxrregcount=N flag to NVCC
                               N = desired maximum registers / kernel
                               At some point "spilling" into LMEM may occur
                                   Reduces performance – LMEM is slow
                                   Check .cubin file for LMEM usage



             82


Friday, January 23, 2009
xec
                                                                               E
                    Determining resource usage
                            Use the --ptxas-options=-v option to nvcc
                            Or, compile the kernel code with the -cubin flag to
                            determine register usage.
                            Open the .cubin file with a text editor and look for
                            the "code" section.
                              architecture {sm_10}
                              abiversion {0}
                              modname {cubin}
                              code {
                                                                  per thread local memory
                                  name = BlackScholesGPU
                                  lmem = 0
                                  smem = 68                   per thread block shared memory
                                  reg = 20
                                  bar = 0
                                                                    per thread registers
                                  bincode {
                                       0xa0004205 0x04200780 0x40024c09 0x00200780
                                       …

             83


Friday, January 23, 2009
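For example, the same information can be obtained on the command line (the -cubin, --ptxas-options and --maxrregcount flags are standard nvcc options; output formatting varies between toolkit versions):

    nvcc --ptxas-options=-v -c kernel.cu      # print lmem/smem/register usage per kernel
    nvcc -cubin kernel.cu                     # emit a .cubin file to inspect as above
    nvcc --maxrregcount=32 -c kernel.cu       # cap registers per kernel (may spill to LMEM)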
xec
                                               E
                    CUDA Occupancy Calculator




             84


Friday, January 23, 2009
xec
                                                                                 E
                    Optimizing threads per block
                            Choose threads per block as a multiple of warp size
                               Avoid wasting computation on under-populated warps
                            More threads per block == better memory latency
                            hiding
                            But, more threads per block == fewer registers per
                            thread
                               Kernel invocations can fail if too many registers are used
                            Heuristics
                               Minimum: 64 threads per block
                                  Only if multiple concurrent blocks
                               192 or 256 threads a better choice
                                  Usually still enough regs to compile and invoke successfully
                               This all depends on your computation, so experiment!


             85


Friday, January 23, 2009
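The closing advice ("experiment!") can be made literal with a small sweep over candidate block sizes; a host-side sketch (kernel, data and N are placeholders):

    unsigned int candidates[] = { 64, 128, 192, 256, 384, 512 };

    for (int i = 0; i < 6; ++i) {
        unsigned int threads = candidates[i];
        dim3 grid((N + threads - 1) / threads);

        // Time each launch with cudaEvent_t, as in the earlier timing sketch,
        // and keep the fastest configuration.
        myKernel<<<grid, threads>>>(d_data, N);
    }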
xec
                                                                               E
                    Occupancy != Performance

                            Increasing occupancy does not necessarily
                            increase performance

                                                     BUT…

                            Low-occupancy multiprocessors cannot adequately
                            hide latency on memory-bound kernels
                               (It all comes down to arithmetic intensity and available
                               parallelism)




             86


Friday, January 23, 2009
xec
                                                                          E
                    Parameterize Your Application

                            Parameterization helps adaptation to different GPUs

                            GPUs vary in many ways
                               # of multiprocessors
                               Memory bandwidth
                               Shared memory size
                               Register file size
                               Threads per block

                            You can even make apps self-tuning (like FFTW and
                            ATLAS)
                               "Experiment" mode discovers and saves optimal
                               configuration



             87


Friday, January 23, 2009
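Most of the quantities listed above can be queried at runtime through cudaGetDeviceProperties, which is the natural starting point for this kind of self-tuning. A short sketch:

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);                   // properties of device 0

    // A few fields a self-tuning application might key its choices on:
    int    sms          = prop.multiProcessorCount;      // # of multiprocessors
    size_t smemPerBlock = prop.sharedMemPerBlock;        // shared memory size
    int    regsPerBlock = prop.regsPerBlock;             // register file size
    int    maxThreads   = prop.maxThreadsPerBlock;       // threads per block limit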
xec
                                                                                 E
               Loop unrolling



                           Sometimes we know some kernel parameters at compile time:
                               # of loop iterations
                               Degrees of polynomials
                               Number of data elements
                           If we could “tell” this to the compiler, it can unroll loops and
                           optimize register usage
                           We need to be generic
                               Avoid code duplication, sizes unknown at compile time
                           Templates to rescue
                               The same trick can be used for regular C++ sources




                                                                                    slide by Johan Seland
                                                       Applied Mathematics                            43/53



Friday, January 23, 2009
xec
                                                                                                     E
               Example: de Casteljau algorithm


                      A standard algorithm for evaluating polynomials in Bernstein form
                                               f(x) = b^d_00

                   Recursively defined:

                           b^k_{i,j} = x · b^{k-1}_{i+1,j} + (1 − x) · b^{k-1}_{i,j+1}

                           b^0_{i,j} are coefficients

                   [Figure: recursion triangle with f(x) = b^d_00 at the root, branching
                   with weights x and 1 − x to b^{d−1}_10 and b^{d−1}_01, and on down to
                   b^{d−2}_20, b^{d−2}_11 and b^{d−2}_02.]




                                                                                                       slide by Johan Seland
                                                            Applied Mathematics                                          44/53



Friday, January 23, 2009
xec
                                                                                                   E
               Implementation

                      The de Casteljau algorithm is usually implemented as nested
                      for-loops
                            Coefficients are overwritten for each iteration

                  float deCasteljau( float *c, float x, int d )
                  {
                      for ( uint i = 1; i <= d; ++i ) {
                          for ( uint j = 0; j <= d - i; ++j )
                              c[j] = (1.0f - x) * c[j] + x * c[j+1];
                      }

                      return c[0];
                  }

                  [Figure: the same recursion triangle as on the previous slide, now over
                  the working array c: f(x) = c^d_00 at the root, blending with weights x
                  and 1 − x down to c^{d−1}_10, c^{d−1}_01 and c^{d−2}_20, c^{d−2}_11,
                  c^{d−2}_02.]



                                                                                                       slide by Johan Seland
                                                              Applied Mathematics                                        45/53



Friday, January 23, 2009
xec
                                                                                                      E
               Template loop unrolling
                            We make d a template parameter

                            template<int d>
                            float deCasteljau( float *c, float x ) {
                                for ( uint i = 1; i <= d; ++i ) {
                                    for ( uint j = 0; j <= d - i; ++j )
                                        c[j] = (1.0f - x) * c[j] + x * c[j+1];
                                }
                                return c[0];
                            }


                            Kernel is called as

                            switch ( d ) {
                            case 1:
                                deCasteljau<1><<<dimGrid, dimBlock>>>( c, x ); break;
                            case 2:
                                deCasteljau<2><<<dimGrid, dimBlock>>>( c, x ); break;
                            .
                            .
                            case MAXD:
                                deCasteljau<MAXD><<<dimGrid, dimBlock>>>( c, x ); break;
                            }

                                                                                                      slide by Johan Seland
                                                                Applied Mathematics                                     46/53



Friday, January 23, 2009
xec
                                                                                E
               Results




                           For the de Castelaju algorithm we see a relatively small
                           speedup
                               ≈ 1.2× (20%...)
                           Very easy to implement
                           Can lead to long compile times
                      Conclusion:
                           Probably worth it near end of development cycle




                                                                                  slide by Johan Seland
                                                      Applied Mathematics                           47/53



Friday, January 23, 2009
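A related alternative, once the trip count is a compile-time constant, is to request the unrolling directly with #pragma unroll. A sketch (the function name is hypothetical, and this is not the variant that was benchmarked above):

    template<int d>
    __device__ float deCasteljauUnrolled( float *c, float x )
    {
        #pragma unroll
        for ( int i = 1; i <= d; ++i ) {
            for ( int j = 0; j <= d - i; ++j )
                c[j] = (1.0f - x) * c[j] + x * c[j+1];
        }
        return c[0];
    }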
xec
                                                                            E
                    Conclusion
                            Understand CUDA performance characteristics
                               Memory coalescing
                               Divergent branching
                               Bank conflicts
                               Latency hiding
                            Use peak performance metrics to guide optimization
                            Understand parallel algorithm complexity theory
                            Know how to identify type of bottleneck
                               e.g. memory, core computation, or instruction overhead
                            Optimize your algorithm, then unroll loops
                            Use template parameters to generate optimal code



             88


Friday, January 23, 2009
ing
                                                                             ofil
                                                                           Pr
                    The CUDA Visual Profiler

                            Helps measure and find potential performance
                            problems
                               GPU and CPU timing for all kernel invocations and
                               memcpys
                               Time stamps


                            Access to hardware performance counters




             61


Friday, January 23, 2009
ing
                                                                                            ofil
                                                                                          Pr
                    Signals
                            Events are tracked with hardware counters on signals in the chip:

                               timestamp

                               gld_incoherent
                                                          Global memory loads/stores are coalesced
                               gld_coherent
                                                          (coherent) or non-coalesced (incoherent)
                               gst_incoherent
                               gst_coherent

                               local_load
                                                           Local loads/stores
                               local_store

                                                          Total branches and divergent branches
                               branch
                               divergent_branch            taken by threads

                               instructions : instruction count

                               warp_serialize : thread warps that serialize on address conflicts to
                               shared or constant memory

                               cta_launched : executed thread blocks


             62


Friday, January 23, 2009
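On that hardware generation these signals were typically recorded with the command-line profiler, which reads the counters to collect from a small configuration file. A sketch, assuming the CUDA_PROFILE / CUDA_PROFILE_CONFIG environment variables of the CUDA 2.x command-line profiler (only about four counters can be active at a time):

    export CUDA_PROFILE=1                     # log to ./cuda_profile.log
    export CUDA_PROFILE_CONFIG=profiler.cfg   # file listing the signals below

    # profiler.cfg
    gld_incoherent
    gld_coherent
    divergent_branch
    warp_serialize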
ing
                                                                              ofil
                                                                            Pr
                    Interpreting profiler counters

                            Values represent events within a thread warp

                            Only targets one multiprocessor
                               Values will not correspond to the total number of warps
                               launched for a particular kernel.
                               Launch enough thread blocks to ensure that the target
                               multiprocessor is given a consistent percentage of the total
                               work.

                            Values are best used to identify relative performance
                            differences between unoptimized and optimized code
                               In other words, try to reduce the magnitudes of
                               gld/gst_incoherent, divergent_branch, and warp_serialize


             63


Friday, January 23, 2009
ple
                                                                                                  xam
                                                                                              E
                            Performance for 4M element reduction

                                                          Time (2^22 ints)   Bandwidth        Step      Cumulative
                                                                                              Speedup   Speedup
                            Kernel 1:
                            interleaved addressing            8.054 ms        2.083 GB/s
                            with divergent branching

                            Kernel 2:
                            interleaved addressing            3.456 ms        4.854 GB/s      2.33x      2.33x
                            with bank conflicts

                            Kernel 3:                         1.722 ms        9.741 GB/s      2.01x      4.68x
                            sequential addressing

                            Kernel 4:                         0.965 ms       17.377 GB/s      1.78x      8.34x
                            first add during global load

                            Kernel 5:                         0.536 ms       31.289 GB/s      1.8x      15.01x
                            unroll last warp

                            Kernel 6:                         0.381 ms       43.996 GB/s      1.41x     21.16x
                            completely unrolled

                            Kernel 7:                         0.268 ms       62.671 GB/s      1.42x     30.04x
                            multiple elements per thread


                                                   Kernel 7 on 32M elements: 72 GB/s!
                                                                                                         84




Friday, January 23, 2009
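To illustrate one of the steps in that table, "first add during global load" (kernel 4) has each thread sum two input elements while loading into shared memory instead of copying just one. A minimal sketch of the idea, not the exact kernel behind the numbers:

    __global__ void reduce_first_add( int *g_out, const int *g_in, unsigned int n )
    {
        extern __shared__ int sdata[];

        unsigned int tid = threadIdx.x;
        unsigned int i   = blockIdx.x * (blockDim.x * 2) + threadIdx.x;

        // First add of the reduction happens during the global load.
        sdata[tid] = (i < n ? g_in[i] : 0) +
                     (i + blockDim.x < n ? g_in[i + blockDim.x] : 0);
        __syncthreads();

        // Tree-based reduction in shared memory, sequential addressing.
        for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
            if (tid < s)
                sdata[tid] += sdata[tid + s];
            __syncthreads();
        }

        if (tid == 0)
            g_out[blockIdx.x] = sdata[0];
    }

Launched over n / (2 * blockDim.x) blocks with blockDim.x * sizeof(int) bytes of dynamic shared memory, so half as many blocks are needed as for the plain version.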
         Build your own!




Friday, January 23, 2009
Friday, January 23, 2009
                           Thank you!

                                          slide by David Kirk
                                          © 2008 NVIDIA Corporation


Friday, January 23, 2009
Back Pocket Slides




                             slide by David Cox


Friday, January 23, 2009
Friday, January 23, 2009
6.963
                                      IT /
                                A@M
                             CUD
                           9
         IAP0




                                                   Misc



Friday, January 23, 2009
Tesla C1060 Computing Processor
                                                                     Processor            1x Tesla T10P

                                                                      Core GHz              1.33 GHz

                                                                                            Full ATX:
                                                                    Form factor       4.736” (H) x 10.5” (L)
                                                                                          Dual slot wide
                                                                      On-board
                                                                                              4 GB
                                                                      memory
                                                                     System I/O           PCIe x16 gen2
                                                                                       512-bit, 800MHz DDR
                                                                    Memory I/O
                                                                                     102 GB/s peak bandwidth

                                                                   Display outputs            None

                                                                   Typical power             160 W


                                                                                                       19
                       M02: High Performance Computing with CUDA




Friday, January 23, 2009
Tesla S1070 1U System
                                                                     Processors         4 x Tesla T10P

                                                                      Core GHz             1.5 GHz

                                                                                       1U for an EIA 19”
                                                                     Form factor
                                                                                          4-post rack
                                                                   Total 1U system
                                                                                     16 GB (4.0GB per GPU)
                                                                   memory
                                                                     System I/O           2 PCIe x16
                                                                                     512-bit, 800MHz GDDR
                                                                   Memory I/O per
                                                                                         102 GB/s peak
                                                                   processor
                                                                                          bandwidth

                                                                   Display outputs           None

                                                                   Typical power            700 W

                                                                     Chassis            1.73” H × 17.5” W ×
                                                                     dimensions         28.5” D

                                                                                                    20
                       M02: High Performance Computing with CUDA




Friday, January 23, 2009
Double Precision Floating Point
                                                 NVIDIA GPU                    SSE2                          Cell SPE

                    Precision                    IEEE 754                      IEEE 754                      IEEE 754

                    Rounding modes for FADD      All 4 IEEE, round to          All 4 IEEE, round to          Round to
                    and FMUL                     nearest, zero, inf, -inf      nearest, zero, inf, -inf      zero/truncate only

                    Denormal handling            Full speed                    Supported, costs 1000’s       Flush to zero
                                                                               of cycles

                    NaN support                  Yes                           Yes                           No

                    Overflow and Infinity        Yes                           Yes                           No infinity,
                    support                                                                                  clamps to max norm

                    Flags                        No                            Yes                           Some

                    FMA                          Yes                           No                            Yes

                    Square root                  Software with low-latency     Hardware                      Software only
                                                 FMA-based convergence

                    Division                     Software with low-latency     Hardware                      Software only
                                                 FMA-based convergence

                    Reciprocal estimate          24 bit                        12 bit                        12 bit
                    accuracy

                    Reciprocal sqrt estimate     23 bit                        12 bit                        12 bit
                    accuracy

                    log2(x) and 2^x estimates    23 bit                        No                            No
                    accuracy
                                                                                                                 18
                       M02: High Performance Computing with CUDA




Friday, January 23, 2009

Weitere ähnliche Inhalte

Mehr von npinto

[Harvard CS264] 16 - Managing Dynamic Parallelism on GPUs: A Case Study of Hi...
[Harvard CS264] 16 - Managing Dynamic Parallelism on GPUs: A Case Study of Hi...[Harvard CS264] 16 - Managing Dynamic Parallelism on GPUs: A Case Study of Hi...
[Harvard CS264] 16 - Managing Dynamic Parallelism on GPUs: A Case Study of Hi...npinto
 
[Harvard CS264] 15a - The Onset of Parallelism, Changes in Computer Architect...
[Harvard CS264] 15a - The Onset of Parallelism, Changes in Computer Architect...[Harvard CS264] 15a - The Onset of Parallelism, Changes in Computer Architect...
[Harvard CS264] 15a - The Onset of Parallelism, Changes in Computer Architect...npinto
 
[Harvard CS264] 15a - Jacket: Visual Computing (James Malcolm, Accelereyes)
[Harvard CS264] 15a - Jacket: Visual Computing (James Malcolm, Accelereyes)[Harvard CS264] 15a - Jacket: Visual Computing (James Malcolm, Accelereyes)
[Harvard CS264] 15a - Jacket: Visual Computing (James Malcolm, Accelereyes)npinto
 
[Harvard CS264] 14 - Dynamic Compilation for Massively Parallel Processors (G...
[Harvard CS264] 14 - Dynamic Compilation for Massively Parallel Processors (G...[Harvard CS264] 14 - Dynamic Compilation for Massively Parallel Processors (G...
[Harvard CS264] 14 - Dynamic Compilation for Massively Parallel Processors (G...npinto
 
[Harvard CS264] 13 - The R-Stream High-Level Program Transformation Tool / Pr...
[Harvard CS264] 13 - The R-Stream High-Level Program Transformation Tool / Pr...[Harvard CS264] 13 - The R-Stream High-Level Program Transformation Tool / Pr...
[Harvard CS264] 13 - The R-Stream High-Level Program Transformation Tool / Pr...npinto
 
[Harvard CS264] 12 - Irregular Parallelism on the GPU: Algorithms and Data St...
[Harvard CS264] 12 - Irregular Parallelism on the GPU: Algorithms and Data St...[Harvard CS264] 12 - Irregular Parallelism on the GPU: Algorithms and Data St...
[Harvard CS264] 12 - Irregular Parallelism on the GPU: Algorithms and Data St...npinto
 
[Harvard CS264] 11b - Analysis-Driven Performance Optimization with CUDA (Cli...
[Harvard CS264] 11b - Analysis-Driven Performance Optimization with CUDA (Cli...[Harvard CS264] 11b - Analysis-Driven Performance Optimization with CUDA (Cli...
[Harvard CS264] 11b - Analysis-Driven Performance Optimization with CUDA (Cli...npinto
 
[Harvard CS264] 11a - Programming the Memory Hierarchy with Sequoia (Mike Bau...
[Harvard CS264] 11a - Programming the Memory Hierarchy with Sequoia (Mike Bau...[Harvard CS264] 11a - Programming the Memory Hierarchy with Sequoia (Mike Bau...
[Harvard CS264] 11a - Programming the Memory Hierarchy with Sequoia (Mike Bau...npinto
 
[Harvard CS264] 10b - cl.oquence: High-Level Language Abstractions for Low-Le...
[Harvard CS264] 10b - cl.oquence: High-Level Language Abstractions for Low-Le...[Harvard CS264] 10b - cl.oquence: High-Level Language Abstractions for Low-Le...
[Harvard CS264] 10b - cl.oquence: High-Level Language Abstractions for Low-Le...npinto
 
[Harvard CS264] 10a - Easy, Effective, Efficient: GPU Programming in Python w...
[Harvard CS264] 10a - Easy, Effective, Efficient: GPU Programming in Python w...[Harvard CS264] 10a - Easy, Effective, Efficient: GPU Programming in Python w...
[Harvard CS264] 10a - Easy, Effective, Efficient: GPU Programming in Python w...npinto
 
[Harvard CS264] 09 - Machine Learning on Big Data: Lessons Learned from Googl...
[Harvard CS264] 09 - Machine Learning on Big Data: Lessons Learned from Googl...[Harvard CS264] 09 - Machine Learning on Big Data: Lessons Learned from Googl...
[Harvard CS264] 09 - Machine Learning on Big Data: Lessons Learned from Googl...npinto
 
[Harvard CS264] 08a - Cloud Computing, Amazon EC2, MIT StarCluster (Justin Ri...
[Harvard CS264] 08a - Cloud Computing, Amazon EC2, MIT StarCluster (Justin Ri...[Harvard CS264] 08a - Cloud Computing, Amazon EC2, MIT StarCluster (Justin Ri...
[Harvard CS264] 08a - Cloud Computing, Amazon EC2, MIT StarCluster (Justin Ri...npinto
 
[Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)
[Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)[Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)
[Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)npinto
 
[Harvard CS264] 07 - GPU Cluster Programming (MPI & ZeroMQ)
[Harvard CS264] 07 - GPU Cluster Programming (MPI & ZeroMQ)[Harvard CS264] 07 - GPU Cluster Programming (MPI & ZeroMQ)
[Harvard CS264] 07 - GPU Cluster Programming (MPI & ZeroMQ)npinto
 
[Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Aut...
[Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Aut...[Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Aut...
[Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Aut...npinto
 
[Harvard CS264] 05 - Advanced-level CUDA Programming
[Harvard CS264] 05 - Advanced-level CUDA Programming[Harvard CS264] 05 - Advanced-level CUDA Programming
[Harvard CS264] 05 - Advanced-level CUDA Programmingnpinto
 
[Harvard CS264] 04 - Intermediate-level CUDA Programming
[Harvard CS264] 04 - Intermediate-level CUDA Programming[Harvard CS264] 04 - Intermediate-level CUDA Programming
[Harvard CS264] 04 - Intermediate-level CUDA Programmingnpinto
 
[Harvard CS264] 03 - Introduction to GPU Computing, CUDA Basics
[Harvard CS264] 03 - Introduction to GPU Computing, CUDA Basics[Harvard CS264] 03 - Introduction to GPU Computing, CUDA Basics
[Harvard CS264] 03 - Introduction to GPU Computing, CUDA Basicsnpinto
 
[Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns
[Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns[Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns
[Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patternsnpinto
 
[Harvard CS264] 01 - Introduction
[Harvard CS264] 01 - Introduction[Harvard CS264] 01 - Introduction
[Harvard CS264] 01 - Introductionnpinto
 

Mehr von npinto (20)

[Harvard CS264] 16 - Managing Dynamic Parallelism on GPUs: A Case Study of Hi...
[Harvard CS264] 16 - Managing Dynamic Parallelism on GPUs: A Case Study of Hi...[Harvard CS264] 16 - Managing Dynamic Parallelism on GPUs: A Case Study of Hi...
[Harvard CS264] 16 - Managing Dynamic Parallelism on GPUs: A Case Study of Hi...
 
[Harvard CS264] 15a - The Onset of Parallelism, Changes in Computer Architect...
[Harvard CS264] 15a - The Onset of Parallelism, Changes in Computer Architect...[Harvard CS264] 15a - The Onset of Parallelism, Changes in Computer Architect...
[Harvard CS264] 15a - The Onset of Parallelism, Changes in Computer Architect...
 
[Harvard CS264] 15a - Jacket: Visual Computing (James Malcolm, Accelereyes)
[Harvard CS264] 15a - Jacket: Visual Computing (James Malcolm, Accelereyes)[Harvard CS264] 15a - Jacket: Visual Computing (James Malcolm, Accelereyes)
[Harvard CS264] 15a - Jacket: Visual Computing (James Malcolm, Accelereyes)
 
[Harvard CS264] 14 - Dynamic Compilation for Massively Parallel Processors (G...
[Harvard CS264] 14 - Dynamic Compilation for Massively Parallel Processors (G...[Harvard CS264] 14 - Dynamic Compilation for Massively Parallel Processors (G...
[Harvard CS264] 14 - Dynamic Compilation for Massively Parallel Processors (G...
 
[Harvard CS264] 13 - The R-Stream High-Level Program Transformation Tool / Pr...
[Harvard CS264] 13 - The R-Stream High-Level Program Transformation Tool / Pr...[Harvard CS264] 13 - The R-Stream High-Level Program Transformation Tool / Pr...
[Harvard CS264] 13 - The R-Stream High-Level Program Transformation Tool / Pr...
 
[Harvard CS264] 12 - Irregular Parallelism on the GPU: Algorithms and Data St...
[Harvard CS264] 12 - Irregular Parallelism on the GPU: Algorithms and Data St...[Harvard CS264] 12 - Irregular Parallelism on the GPU: Algorithms and Data St...
[Harvard CS264] 12 - Irregular Parallelism on the GPU: Algorithms and Data St...
 
[Harvard CS264] 11b - Analysis-Driven Performance Optimization with CUDA (Cli...
[Harvard CS264] 11b - Analysis-Driven Performance Optimization with CUDA (Cli...[Harvard CS264] 11b - Analysis-Driven Performance Optimization with CUDA (Cli...
[Harvard CS264] 11b - Analysis-Driven Performance Optimization with CUDA (Cli...
 
[Harvard CS264] 11a - Programming the Memory Hierarchy with Sequoia (Mike Bau...
[Harvard CS264] 11a - Programming the Memory Hierarchy with Sequoia (Mike Bau...[Harvard CS264] 11a - Programming the Memory Hierarchy with Sequoia (Mike Bau...
[Harvard CS264] 11a - Programming the Memory Hierarchy with Sequoia (Mike Bau...
 
[Harvard CS264] 10b - cl.oquence: High-Level Language Abstractions for Low-Le...
[Harvard CS264] 10b - cl.oquence: High-Level Language Abstractions for Low-Le...[Harvard CS264] 10b - cl.oquence: High-Level Language Abstractions for Low-Le...
[Harvard CS264] 10b - cl.oquence: High-Level Language Abstractions for Low-Le...
 
[Harvard CS264] 10a - Easy, Effective, Efficient: GPU Programming in Python w...
[Harvard CS264] 10a - Easy, Effective, Efficient: GPU Programming in Python w...[Harvard CS264] 10a - Easy, Effective, Efficient: GPU Programming in Python w...
[Harvard CS264] 10a - Easy, Effective, Efficient: GPU Programming in Python w...
 
[Harvard CS264] 09 - Machine Learning on Big Data: Lessons Learned from Googl...
[Harvard CS264] 09 - Machine Learning on Big Data: Lessons Learned from Googl...[Harvard CS264] 09 - Machine Learning on Big Data: Lessons Learned from Googl...
[Harvard CS264] 09 - Machine Learning on Big Data: Lessons Learned from Googl...
 
[Harvard CS264] 08a - Cloud Computing, Amazon EC2, MIT StarCluster (Justin Ri...
[Harvard CS264] 08a - Cloud Computing, Amazon EC2, MIT StarCluster (Justin Ri...[Harvard CS264] 08a - Cloud Computing, Amazon EC2, MIT StarCluster (Justin Ri...
[Harvard CS264] 08a - Cloud Computing, Amazon EC2, MIT StarCluster (Justin Ri...
 
[Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)
[Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)[Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)
[Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)
 
[Harvard CS264] 07 - GPU Cluster Programming (MPI & ZeroMQ)
[Harvard CS264] 07 - GPU Cluster Programming (MPI & ZeroMQ)[Harvard CS264] 07 - GPU Cluster Programming (MPI & ZeroMQ)
[Harvard CS264] 07 - GPU Cluster Programming (MPI & ZeroMQ)
 
[Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Aut...
[Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Aut...[Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Aut...
[Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Aut...
 
[Harvard CS264] 05 - Advanced-level CUDA Programming
[Harvard CS264] 05 - Advanced-level CUDA Programming[Harvard CS264] 05 - Advanced-level CUDA Programming
[Harvard CS264] 05 - Advanced-level CUDA Programming
 
[Harvard CS264] 04 - Intermediate-level CUDA Programming
[Harvard CS264] 04 - Intermediate-level CUDA Programming[Harvard CS264] 04 - Intermediate-level CUDA Programming
[Harvard CS264] 04 - Intermediate-level CUDA Programming
 
[Harvard CS264] 03 - Introduction to GPU Computing, CUDA Basics
[Harvard CS264] 03 - Introduction to GPU Computing, CUDA Basics[Harvard CS264] 03 - Introduction to GPU Computing, CUDA Basics
[Harvard CS264] 03 - Introduction to GPU Computing, CUDA Basics
 
[Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns
[Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns[Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns
[Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns
 
[Harvard CS264] 01 - Introduction
[Harvard CS264] 01 - Introduction[Harvard CS264] 01 - Introduction
[Harvard CS264] 01 - Introduction
 

Kürzlich hochgeladen

Sulphonamides, mechanisms and their uses
Sulphonamides, mechanisms and their usesSulphonamides, mechanisms and their uses
Sulphonamides, mechanisms and their usesVijayaLaxmi84
 
Grade Three -ELLNA-REVIEWER-ENGLISH.pptx
Grade Three -ELLNA-REVIEWER-ENGLISH.pptxGrade Three -ELLNA-REVIEWER-ENGLISH.pptx
Grade Three -ELLNA-REVIEWER-ENGLISH.pptxkarenfajardo43
 
Shark introduction Morphology and its behaviour characteristics
Shark introduction Morphology and its behaviour characteristicsShark introduction Morphology and its behaviour characteristics
Shark introduction Morphology and its behaviour characteristicsArubSultan
 
Blowin' in the Wind of Caste_ Bob Dylan's Song as a Catalyst for Social Justi...
Blowin' in the Wind of Caste_ Bob Dylan's Song as a Catalyst for Social Justi...Blowin' in the Wind of Caste_ Bob Dylan's Song as a Catalyst for Social Justi...
Blowin' in the Wind of Caste_ Bob Dylan's Song as a Catalyst for Social Justi...DhatriParmar
 
BIOCHEMISTRY-CARBOHYDRATE METABOLISM CHAPTER 2.pptx
BIOCHEMISTRY-CARBOHYDRATE METABOLISM CHAPTER 2.pptxBIOCHEMISTRY-CARBOHYDRATE METABOLISM CHAPTER 2.pptx
BIOCHEMISTRY-CARBOHYDRATE METABOLISM CHAPTER 2.pptxSayali Powar
 
Mythology Quiz-4th April 2024, Quiz Club NITW
Mythology Quiz-4th April 2024, Quiz Club NITWMythology Quiz-4th April 2024, Quiz Club NITW
Mythology Quiz-4th April 2024, Quiz Club NITWQuiz Club NITW
 
Decoding the Tweet _ Practical Criticism in the Age of Hashtag.pptx
Decoding the Tweet _ Practical Criticism in the Age of Hashtag.pptxDecoding the Tweet _ Practical Criticism in the Age of Hashtag.pptx
Decoding the Tweet _ Practical Criticism in the Age of Hashtag.pptxDhatriParmar
 
An Overview of the Calendar App in Odoo 17 ERP
An Overview of the Calendar App in Odoo 17 ERPAn Overview of the Calendar App in Odoo 17 ERP
An Overview of the Calendar App in Odoo 17 ERPCeline George
 
How to Manage Buy 3 Get 1 Free in Odoo 17
How to Manage Buy 3 Get 1 Free in Odoo 17How to Manage Buy 3 Get 1 Free in Odoo 17
How to Manage Buy 3 Get 1 Free in Odoo 17Celine George
 
DiskStorage_BasicFileStructuresandHashing.pdf
DiskStorage_BasicFileStructuresandHashing.pdfDiskStorage_BasicFileStructuresandHashing.pdf
DiskStorage_BasicFileStructuresandHashing.pdfChristalin Nelson
 
CLASSIFICATION OF ANTI - CANCER DRUGS.pptx
CLASSIFICATION OF ANTI - CANCER DRUGS.pptxCLASSIFICATION OF ANTI - CANCER DRUGS.pptx
CLASSIFICATION OF ANTI - CANCER DRUGS.pptxAnupam32727
 
The role of Geography in climate education: science and active citizenship
The role of Geography in climate education: science and active citizenshipThe role of Geography in climate education: science and active citizenship
The role of Geography in climate education: science and active citizenshipKarl Donert
 
4.11.24 Poverty and Inequality in America.pptx
4.11.24 Poverty and Inequality in America.pptx4.11.24 Poverty and Inequality in America.pptx
4.11.24 Poverty and Inequality in America.pptxmary850239
 
Unraveling Hypertext_ Analyzing Postmodern Elements in Literature.pptx
Unraveling Hypertext_ Analyzing  Postmodern Elements in  Literature.pptxUnraveling Hypertext_ Analyzing  Postmodern Elements in  Literature.pptx
Unraveling Hypertext_ Analyzing Postmodern Elements in Literature.pptxDhatriParmar
 
Objectives n learning outcoms - MD 20240404.pptx
Objectives n learning outcoms - MD 20240404.pptxObjectives n learning outcoms - MD 20240404.pptx
Objectives n learning outcoms - MD 20240404.pptxMadhavi Dharankar
 

Kürzlich hochgeladen (20)

Introduction to Research ,Need for research, Need for design of Experiments, ...
Introduction to Research ,Need for research, Need for design of Experiments, ...Introduction to Research ,Need for research, Need for design of Experiments, ...
Introduction to Research ,Need for research, Need for design of Experiments, ...
 
Sulphonamides, mechanisms and their uses
Sulphonamides, mechanisms and their usesSulphonamides, mechanisms and their uses
Sulphonamides, mechanisms and their uses
 
Spearman's correlation,Formula,Advantages,
Spearman's correlation,Formula,Advantages,Spearman's correlation,Formula,Advantages,
Spearman's correlation,Formula,Advantages,
 
Grade Three -ELLNA-REVIEWER-ENGLISH.pptx
Grade Three -ELLNA-REVIEWER-ENGLISH.pptxGrade Three -ELLNA-REVIEWER-ENGLISH.pptx
Grade Three -ELLNA-REVIEWER-ENGLISH.pptx
 
Shark introduction Morphology and its behaviour characteristics
Shark introduction Morphology and its behaviour characteristicsShark introduction Morphology and its behaviour characteristics
Shark introduction Morphology and its behaviour characteristics
 
Blowin' in the Wind of Caste_ Bob Dylan's Song as a Catalyst for Social Justi...
Blowin' in the Wind of Caste_ Bob Dylan's Song as a Catalyst for Social Justi...Blowin' in the Wind of Caste_ Bob Dylan's Song as a Catalyst for Social Justi...
Blowin' in the Wind of Caste_ Bob Dylan's Song as a Catalyst for Social Justi...
 
BIOCHEMISTRY-CARBOHYDRATE METABOLISM CHAPTER 2.pptx
BIOCHEMISTRY-CARBOHYDRATE METABOLISM CHAPTER 2.pptxBIOCHEMISTRY-CARBOHYDRATE METABOLISM CHAPTER 2.pptx
BIOCHEMISTRY-CARBOHYDRATE METABOLISM CHAPTER 2.pptx
 
Mythology Quiz-4th April 2024, Quiz Club NITW
Mythology Quiz-4th April 2024, Quiz Club NITWMythology Quiz-4th April 2024, Quiz Club NITW
Mythology Quiz-4th April 2024, Quiz Club NITW
 
Chi-Square Test Non Parametric Test Categorical Variable
Chi-Square Test Non Parametric Test Categorical VariableChi-Square Test Non Parametric Test Categorical Variable
Chi-Square Test Non Parametric Test Categorical Variable
 
Faculty Profile prashantha K EEE dept Sri Sairam college of Engineering
Faculty Profile prashantha K EEE dept Sri Sairam college of EngineeringFaculty Profile prashantha K EEE dept Sri Sairam college of Engineering
Faculty Profile prashantha K EEE dept Sri Sairam college of Engineering
 
Decoding the Tweet _ Practical Criticism in the Age of Hashtag.pptx
Decoding the Tweet _ Practical Criticism in the Age of Hashtag.pptxDecoding the Tweet _ Practical Criticism in the Age of Hashtag.pptx
Decoding the Tweet _ Practical Criticism in the Age of Hashtag.pptx
 
An Overview of the Calendar App in Odoo 17 ERP
An Overview of the Calendar App in Odoo 17 ERPAn Overview of the Calendar App in Odoo 17 ERP
An Overview of the Calendar App in Odoo 17 ERP
 
How to Manage Buy 3 Get 1 Free in Odoo 17
How to Manage Buy 3 Get 1 Free in Odoo 17How to Manage Buy 3 Get 1 Free in Odoo 17
How to Manage Buy 3 Get 1 Free in Odoo 17
 
DiskStorage_BasicFileStructuresandHashing.pdf
DiskStorage_BasicFileStructuresandHashing.pdfDiskStorage_BasicFileStructuresandHashing.pdf
DiskStorage_BasicFileStructuresandHashing.pdf
 
CLASSIFICATION OF ANTI - CANCER DRUGS.pptx
CLASSIFICATION OF ANTI - CANCER DRUGS.pptxCLASSIFICATION OF ANTI - CANCER DRUGS.pptx
CLASSIFICATION OF ANTI - CANCER DRUGS.pptx
 
Paradigm shift in nursing research by RS MEHTA
Paradigm shift in nursing research by RS MEHTAParadigm shift in nursing research by RS MEHTA
Paradigm shift in nursing research by RS MEHTA
 
The role of Geography in climate education: science and active citizenship
The role of Geography in climate education: science and active citizenshipThe role of Geography in climate education: science and active citizenship
The role of Geography in climate education: science and active citizenship
 
4.11.24 Poverty and Inequality in America.pptx
4.11.24 Poverty and Inequality in America.pptx4.11.24 Poverty and Inequality in America.pptx
4.11.24 Poverty and Inequality in America.pptx
 
Unraveling Hypertext_ Analyzing Postmodern Elements in Literature.pptx
Unraveling Hypertext_ Analyzing  Postmodern Elements in  Literature.pptxUnraveling Hypertext_ Analyzing  Postmodern Elements in  Literature.pptx
Unraveling Hypertext_ Analyzing Postmodern Elements in Literature.pptx
 
Objectives n learning outcoms - MD 20240404.pptx
Objectives n learning outcoms - MD 20240404.pptxObjectives n learning outcoms - MD 20240404.pptx
Objectives n learning outcoms - MD 20240404.pptx
 

IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)

  • 1. 6.963 IT / A@M CUD 9 IAP0 Supercomputing on your desktop: Programming the next generation of cheap and massively parallel hardware using CUDA Lecture 07 Nicolas Pinto (MIT) #2 CUDA - Advanced Friday, January 23, 2009
  • 2. During this course, 3 6 for 6.9 ed adapt we’ll try to “ ” and use existing material ;-) Friday, January 23, 2009
  • 3. Today yey!! Friday, January 23, 2009
  • 4. Wanna Play with The Big Guys? Friday, January 23, 2009
  • 5. Here are the keys to High-Performance in CUDA Friday, January 23, 2009
  • 6. ng! rni Wa To optimize or not to optimize Hoare said (and Knuth restated) “Premature optimization is the root of all evil.” slide by Johan Seland Applied Mathematics 23/53 Friday, January 23, 2009
  • 7. ng! rni Wa To optimize or not to optimize Hoare said (and Knuth restated) “We should forget about small efficiencies, say about 97% of the time: Premature optimization is the root of all evil.” ⇓ 3% of the time we really should worry about small efficiencies (Every 33rd codeline) slide by Johan Seland Applied Mathematics 23/53 Friday, January 23, 2009
  • 8. 6.963 IT / A@M CUD 9 IAP0 Strategy Memory Optimizations Execution Optimizations Friday, January 23, 2009
  • 9. 6.963 IT / A@M CUD 9 IAP0 CUDA Performance Strategies Friday, January 23, 2009
  • 10. egy rat St Optimization goals We should strive to reach GPU performance We must know the GPU performance Vendor specifications Syntetic benchmarks Choose a performance metric Memory bandwidth or GFLOPS? Use clock() to measure Experiment and profile! slide by Johan Seland Applied Mathematics 25/53 Friday, January 23, 2009
  • 11. ing ead hr T Programming Model Host Device A kernel is executed as a Grid 1 grid of thread blocks Block Block Block Kernel A thread block is a batch (0, 0) (1, 0) (2, 0) 1 of threads that can Block Block Block cooperate with each (0, 1) (1, 1) (2, 1) other by: Grid 2 Sharing data through shared memory Kernel 2 Synchronizing their execution Block (1, 1) Threads from different Thread Thread Thread Thread Thread (0, 0) (1, 0) (2, 0) (3, 0) (4, 0) blocks cannot cooperate Thread Thread Thread Thread Thread (0, 1) (1, 1) (2, 1) (3, 1) (4, 1) Thread Thread Thread Thread Thread (0, 2) (1, 2) (2, 2) (3, 2) (4, 2) 3 © NVIDIA Corporation 2006 Friday, January 23, 2009
  • 12. mory Me Data Movement in a CUDA Program Host Memory Device Memory [Shared Memory] COMPUTATION [Shared Memory] Device Memory Host Memory © NVIDIA Corporation 2008 10 Friday, January 23, 2009
  • 13. erf P !quot;#$%$&'()*+,-$#.%/(0,-(#.'(123 456$%$&'($78'quot;'78'7#(quot;5-5**'*$/% 456$%$&'(5-$#.%'#$9($7#'7/$#:(;%5#.<=578>$8#.? @,%'#$%'/($#A/(='##'-(#,(-'9,%quot;B#'(#.57(#,(959.' 123(/quot;'78/($#/(#-57/$/#,-/(,7()C3/D(7,#(%'%,-: E,(%,-'(9,%quot;B#5#$,7(,7(#.'(123(#,(5F,$8(9,/#*:( 85#5(#-57/0'-/ GF'7(*,>(quot;5-5**'*$/%(9,%quot;B#5#$,7/(957(/,%'#$%'/(='( 05/#'-(#.57(#-57/0'--$7+(=59H(578(0,-#.(#,(.,/# 39 Friday, January 23, 2009
  • 14. erf P !quot;#$%$&'()'%*+,(-*.'+'/0' -*12'30'4(536(7*/80*12'30'4(9(*+4'+(*:(%1;/$#<4' =2*>12?@*012(4'5$0'(%'%*+,( !quot;#$%$&'(:*+(3quot;1#$12(2*012$#,($/(010.'4(#'A#<+'( %'%*+, B/(3.1+'4(%'%*+,C(15*$4(.$;.84';+''(>1/D(0*/:2$0#3 40 Friday, January 23, 2009
  • 15. erf P !quot;#$%&'(quot;)*quot;+$%,-%./quot;0$'%1$2,03 45)'0$'6%,-%*72$6%-quot;6*$0%*/quot;)%+8,9quot;8%2$2,03 !/0$quot;'6%:quot;)%:,,;$0quot;*$%(7quot;%6/quot;0$'%2$2,03 <6$%,)$%=%quot;%-$>%*/0$quot;'6%*,%8,quot;'%=%:,2;5*$%'quot;*quot;% 6/quot;0$'%93%quot;88%*/0$quot;'6 <6$%7*%*,%quot;(,7'%),)?:,quot;8$6:$'%quot;::$66 .*quot;+$%8,quot;'6%quot;)'%6*,0$6%7)%6/quot;0$'%2$2,03%*,%0$?,0'$0%),)? :,quot;8$6:$quot;98$%quot;''0$667)+ 1quot;*07@%*0quot;)6;,6$%$@quot;2;8$%8quot;*$0 41 Friday, January 23, 2009
  • 16. erf P !quot;#$%&'&((#()quot;*$+,,)-)#./(0 %&'/)/)1.$012'$-1*32/&/)1.$/1$4##3$/5#$6%!$ *2(/)3'1-#quot;quot;1'quot;$#72&((0$82quot;0 9&.0$/5'#&:quot;;$*&.0$/5'#&:$8(1-4quot; <##3$'#quot;12'-#$2quot;&=#$(1>$#.12=5$/1$quot;2331'/$ *2(/)3(#$&-/)?#$/5'#&:$8(1-4quot;$3#'$*2(/)3'1-#quot;quot;1' @#=)quot;/#'quot;;$quot;5&'#:$*#*1'0 42 Friday, January 23, 2009
  • 18. 6.963 IT / A@M CUD 9 IAP0 Memory Optimizations Friday, January 23, 2009
  • 19. ory em M !quot;#$%&'$()*#*+,)*$-. /()*#*+*-0'#quot;#$%&')%,-.1quot;%. 2$,3quot;.4*-0'03$5,3'#quot;#$%&',44quot;..quot;. 6.*-0'.7,%quot;8'#quot;#$%&'quot;11quot;4)*9quot;3& 44 Friday, January 23, 2009
  • 20. ory em M !quot;#quot;$%&quot;'()*&( !*+,-*$.*./&0$#/$1/(#$.*./&0$2quot;'34,3#1$.5-1$ 6/4*&$#1quot;'$3*+,-*$.*./&0$#/$3*+,-*$2quot;'34,3#1 789:($;*quot;<$=>?@A*$BCDE$+(F$GH$89:($;*quot;<$=I5quot;3&/$JK$LDHHE G89:($)/&$>?@A*$MFH N,',.,O*$#&quot;'()*&( @'#*&.*3,quot;#*$3quot;#quot;$(#&5-#5&*($-quot;'$2*$quot;66/-quot;#*3P$/;*&quot;#*3$ /'P$quot;'3$3*quot;66/-quot;#*3$4,#1/5#$*+*&$-/;0,'Q$#1*.$#/$1/(#$ .*./&0 8&/5;$#&quot;'()*&( R'*$6quot;&Q*$#&quot;'()*&$.5-1$2*##*&$#1quot;'$.quot;'0$(.quot;66$/'*( 45 Friday, January 23, 2009
  • 21. ory em M !quot;#$%&'()$*+,$-'./+0.quot;123$.2 (4*quot;,quot;55'(6'2789+quot;55':2+quot;55'(quot;7;'1+'3+<quot;#$%5'()$*+ ='27+-$-'./ >1quot;?5$2+=;#=$27+(4*quot;,$-(</+<$.3'.-quot;1($ @AB+CDE2F+('--'1+'1+!GH%$I<.$22+8IJK9 LM+CDE2+-$quot;24.$*+'1+1N'.($+KOP;+-'7=$.?'quot;.*2+ 8'Q$.(5'()$*+!GH%$9 R$$+7=$+S?quot;1*:;*7=0$27T GUVW+RVX+2quot;-<5$ U2$+:;7=+(quot;47;'1 W55'(quot;7;1#+7''+-4(=+<quot;#$%5'()$*+-$-'./+(quot;1+.$*4($+ 'Q$.quot;55+2/27$-+<$.3'.-quot;1($ 0$27+/'4.+2/27$-2+quot;1*+quot;<<2+7'+5$quot;.1+7=$;.+5;-;72 46 Friday, January 23, 2009
  • 22. em gm !quot;#$%quot;&'()#*+&,(%-./0*12(. 3145(.2&quot;%2(67+&16.2*8721#6.9&:;;<=;;&7quot;#7>&7+7quot;(. ?1>(quot;+&2#&$(&@(*A#*)%67(&$#22quot;(6(7> B@21)1C%21#6.&7%6&4*(%2quot;+&167*(%.(&@(*A#*)%67( D#%quot;(.71649&8@&2#&E;F&.@((-8@ ?%2(67+&51-1649&8@&2#&GHIF&.@((-8@ 47 Friday, January 23, 2009
  • 23. em gm Accessing global memory 4 cycles to issue on memory fetch but 400-600 cycles of latency The equivalent of 100 MADs Likely to be a performance bottleneck Order of magnitude speedups possible Coalesce memory access Use shared memory to re-order non-coalesced addressing slide by Johan Seland Applied Mathematics 32/53 Friday, January 23, 2009
  • 24. em gm !quot;#$%&'()* +,'quot;quot;-.()#/%.,-%#.,01,#,2#$345#-6,789 /2-%#.&: +,'quot;)/(*;quot;;&,-%*(quot;),quot;3,*$quot;0#$,<%<quot;-1= 9> 01/%&,4 %#'2,/2-%#.,-%#.&,#,5quot;-.=,()/?,3$quot;#/?,@ 8AB 01/%&,4 %#'2,/2-%#.,-%#.&,#,.quot;;0$%45quot;-.=,()/A?,3$quot;#/A?,@ AC9 01/%&,D %#'2,/2-%#.,-%#.&,#,E;#.45quot;-.=,()/>?,3$quot;#/>?,@ +..(/(quot;)#$,-%&/-('/(quot;)&,quot;),FBGHFIG,#-'2(/%'/;-%= J/#-/()*,#..-%&&,3quot;-,#,-%*(quot;),<;&/,0%,#,<;$/(6$%,quot;3,-%*(quot;), &(K% L2%,k/2 /2-%#.,(),#,2#$345#-6,<;&/,#''%&&,/2% k/2 %$%<%)/,(),#, 0$quot;'M,0%()*,-%#. NO'%6/(quot;)=,)quot;/,#$$,/2-%#.&,<;&/,0%,6#-/('(6#/()* P-%.('#/%.,#''%&&?,.(Q%-*%)'%,5(/2(),#,2#$35#-6 48 Friday, January 23, 2009
• 25. gmem: Coalesced Access: Reading Floats
[Figure: two half-warp access patterns to consecutive float addresses - "All threads participate" and "Some threads do not participate" - both coalesced]
49 Friday, January 23, 2009
• 26. gmem: Uncoalesced Access: Reading Floats
[Figure: two half-warp access patterns - "Permuted access by threads" and "Misaligned starting address (not a multiple of 64)" - both uncoalesced]
50 Friday, January 23, 2009
• 27. gmem: Coalescing: Timing Results
- Experiment on G80:
  - Kernel: read a float, increment, write back
  - 3M floats (12 MB)
  - Times averaged over 10K runs
- 12K blocks x 256 threads:
  - 356 µs: coalesced
  - 357 µs: coalesced, some threads don't participate
  - 3,494 µs: permuted/misaligned thread access
51 Friday, January 23, 2009
• 28. gmem: Coalescing: Structures of Size != 4, 8, or 16 Bytes
- Use a Structure of Arrays (SoA) instead of an Array of Structures (AoS)
- If SoA is not viable:
  - Force structure alignment: __align__(X), where X = 4, 8, or 16
  - Use SMEM to achieve coalescing
[Figure: a 3-float Point structure (x y z) laid out as AoS (x y z x y z x y z ...) vs. SoA (x x x ... y y y ... z z z ...)]
58 Friday, January 23, 2009
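A small illustrative sketch (hypothetical Point data, not taken from the slides) of the AoS and SoA layouts in the diagram above. With the AoS layout, consecutive threads read x fields 12 bytes apart, which breaks coalescing on G80-class hardware; with the SoA layout they read consecutive floats.

struct PointAoS { float x, y, z; };   // 12-byte elements, interleaved in memory

struct PointsSoA {                    // three separate contiguous arrays
    float *x;
    float *y;
    float *z;
};

__global__ void scale_x_aos(PointAoS *p, float s, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p[i].x *= s;           // strided access: uncoalesced
}

__global__ void scale_x_soa(PointsSoA p, float s, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p.x[i] *= s;           // contiguous access: coalesced
}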
• 29. gmem: Coalescing: Summary
- Coalescing greatly improves throughput
- Critical to memory-bound kernels
- Reading structures of size other than 4, 8, or 16 bytes will break coalescing:
  - Prefer Structures of Arrays over AoS
  - If SoA is not viable, read/write through SMEM
- Additional resources: Aligned Types SDK Sample
59 Friday, January 23, 2009
• 30. smem: Parallel Memory Architecture
- In a parallel machine, many threads access memory
  - Therefore, memory is divided into banks
  - Essential to achieve high bandwidth
- Each bank can service one address per cycle
  - A memory can service as many simultaneous accesses as it has banks
- Multiple simultaneous accesses to a bank result in a bank conflict
  - Conflicting accesses are serialized
[Figure: shared memory drawn as Bank 0 through Bank 15]
64 Friday, January 23, 2009
• 31. smem: Bank Addressing Examples
- No bank conflicts: linear addressing, stride == 1
- No bank conflicts: random 1:1 permutation
[Figure: Thread 0..15 mapped onto Bank 0..15 in both cases]
65 Friday, January 23, 2009
• 32. smem: Bank Addressing Examples
- 2-way bank conflicts: linear addressing, stride == 2
- 8-way bank conflicts: linear addressing, stride == 8
[Figure: Thread 0..15 mapped onto 16 banks, showing the x2 and x8 conflicts]
66 Friday, January 23, 2009
• 33. smem: How addresses map to banks on G80
- Bandwidth of each bank is 32 bits per 2 clock cycles
- Successive 32-bit words are assigned to successive banks
- G80 has 16 banks
  - So bank = address % 16
  - Same as the size of a half-warp
  - No bank conflicts between different half-warps, only within a single half-warp
67 Friday, January 23, 2009
• 34. smem: Shared memory bank conflicts
- Shared memory is as fast as registers if there are no bank conflicts
- The fast case:
  - If all threads of a half-warp access different banks, there is no bank conflict
  - If all threads of a half-warp read the identical address, there is no bank conflict (broadcast)
- The slow case:
  - Bank conflict: multiple threads in the same half-warp access the same bank
  - Must serialize the accesses
  - Cost = max # of simultaneous accesses to a single bank
68 Friday, January 23, 2009
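As a sketch of the padding trick used later in the transpose example (the names and the 16x16 tile size are assumptions, not the slides' code): giving a shared-memory tile one extra column changes the row stride from 16 to 17, so the 16 threads of a half-warp that walk down a column land in 16 different banks instead of the same one.

#define TILE_DIM 16

__global__ void column_read_demo(const float *in, float *out, int width)
{
    __shared__ float tile[TILE_DIM][TILE_DIM + 1];   // "+1" pad removes the 16-way bank conflict

    int x = blockIdx.x * TILE_DIM + threadIdx.x;
    int y = blockIdx.y * TILE_DIM + threadIdx.y;

    tile[threadIdx.y][threadIdx.x] = in[y * width + x];   // coalesced, row-wise load
    __syncthreads();

    // Column-wise read back: with a row stride of 17 this is conflict-free;
    // with the unpadded stride of 16, every thread of a half-warp would hit the same bank.
    out[y * width + x] = tile[threadIdx.x][threadIdx.y];
}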
  • 35. egy rat St Use the right kind of memory Constant memory: Quite small, ≈ 20K As fast as register access if all threads in a warp access the same location Texture memory: Spatially cached Optimized for 2D locality Neighboring threads should read neighboring addresses No need to think about coalescing Constraint: These memories can only be updated from the CPU slide by Johan Seland Applied Mathematics 31/53 Friday, January 23, 2009
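A small sketch of the constant-memory case mentioned above (the coefficient table and kernel are hypothetical): constants are declared at file scope, filled from the host with cudaMemcpyToSymbol, and read at register speed when every thread of a warp reads the same element, as in this Horner-style polynomial evaluation.

__constant__ float c_poly[8];         // hypothetical polynomial coefficients

__global__ void eval_poly(const float *x, float *y, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float acc = 0.0f;
        for (int k = 0; k < 8; ++k)
            acc = acc * x[i] + c_poly[k];   // all threads read the same c_poly[k]: broadcast
        y[i] = acc;
    }
}

// Host side (sketch):
//   float h_poly[8] = { /* ... */ };
//   cudaMemcpyToSymbol(c_poly, h_poly, sizeof(h_poly));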
  • 36. egy rat St Memory optimizations roundup CUDA memory handling is complex And I have not covered all topics... Using memory correctly can lead to huge speedups At least CUDA expose the memory hierarchy, unlike CPUs Get your algorithm up an running first, then optimize Use shared memory to let threads cooperate Be wary of “data ownership” A thread does not have to read/write the data it calculate slide by Johan Seland Applied Mathematics 41/53 Friday, January 23, 2009
  • 37. Conflicts, Coalescing, Warps... I hate growing up. Friday, January 23, 2009
• 38. Optimization Example: Matrix Transpose
Friday, January 23, 2009
• 39. Example: Matrix Transpose
- SDK Sample ("transpose")
- Illustrates:
  - Coalescing
  - Avoiding SMEM bank conflicts
- Speedups for even small matrices
[Figure: a 4x4 matrix (1 2 3 4 / 5 6 7 8 / 9 10 11 12 / 13 14 15 16) and its transpose]
70 Friday, January 23, 2009
• 40. Example: Uncoalesced transpose
__global__ void transpose_naive(float *odata, float *idata, int width, int height)
{
    unsigned int xIndex = blockDim.x * blockIdx.x + threadIdx.x;
    unsigned int yIndex = blockDim.y * blockIdx.y + threadIdx.y;
    if (xIndex < width && yIndex < height) {
        unsigned int index_in  = xIndex + width * yIndex;
        unsigned int index_out = yIndex + height * xIndex;
        odata[index_out] = idata[index_in];
    }
}
71 Friday, January 23, 2009
• 41. Example: Uncoalesced transpose
- Reads input from GMEM: stride = 1, coalesced
- Writes output to GMEM: stride = 16, uncoalesced
[Figure: element coordinates (0,0)...(15,15) for the input read and the transposed output write]
72 Friday, January 23, 2009
• 42. Example: Coalesced transpose
- Assumption: matrix is partitioned into square tiles
- Threadblock (bx, by):
  - Reads the (bx, by) input tile, stores into SMEM
  - Writes the SMEM data to the (by, bx) output tile
  - Transposes the indexing into SMEM
- Thread (tx, ty):
  - Reads element (tx, ty) from input tile
  - Writes element (tx, ty) into output tile
- Coalescing is achieved if:
  - Block/tile dimensions are multiples of 16
73 Friday, January 23, 2009
• 43. Example: Coalesced transpose
[Figure: per-tile data flow - reads from GMEM, writes to SMEM, reads from SMEM with transposed indexing, writes to GMEM]
74 Friday, January 23, 2009
• 44. Example: SMEM Optimization
- Threads read SMEM with stride = 16: bank conflicts
- Solution: allocate an "extra" column
  - Read stride = 17
  - Threads read from consecutive banks
75 Friday, January 23, 2009
• 46. Example: Coalesced transpose
__global__ void transpose(float *odata, float *idata, int width, int height)
{
    __shared__ float block[(BLOCK_DIM + 1) * BLOCK_DIM];

    unsigned int xBlock = blockDim.x * blockIdx.x;
    unsigned int yBlock = blockDim.y * blockIdx.y;
    unsigned int xIndex = xBlock + threadIdx.x;
    unsigned int yIndex = yBlock + threadIdx.y;
    unsigned int index_out, index_transpose;

    if (xIndex < width && yIndex < height) {
        unsigned int index_in    = width * yIndex + xIndex;
        unsigned int index_block = threadIdx.y * (BLOCK_DIM + 1) + threadIdx.x;
        block[index_block] = idata[index_in];
        index_transpose = threadIdx.x * (BLOCK_DIM + 1) + threadIdx.y;
        index_out       = height * (xBlock + threadIdx.y) + yBlock + threadIdx.x;
    }
    __syncthreads();

    if (xIndex < width && yIndex < height)
        odata[index_out] = block[index_transpose];
}
76 Friday, January 23, 2009
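For completeness, a possible host-side launch for the kernel above; this is a sketch, not slide code, and it assumes BLOCK_DIM = 16 and that width and height are multiples of 16, as the coalescing argument requires.

#define BLOCK_DIM 16

void launch_transpose(float *d_odata, float *d_idata, int width, int height)
{
    dim3 threads(BLOCK_DIM, BLOCK_DIM);                  // 256 threads per block
    dim3 grid(width / BLOCK_DIM, height / BLOCK_DIM);    // one block per 16x16 tile
    transpose<<<grid, threads>>>(d_odata, d_idata, width, height);
}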
• 54. Coalesced transpose: Source code
__global__ void transpose( float *out, float *in, int width, int height )
{
    // Allocate shared memory.
    __shared__ float block[BLOCK_DIM*BLOCK_DIM];

    // Set up indexing.
    unsigned int xBlock = blockDim.x * blockIdx.x;
    unsigned int yBlock = blockDim.y * blockIdx.y;
    unsigned int xIndex = xBlock + threadIdx.x;
    unsigned int yIndex = yBlock + threadIdx.y;
    unsigned int index_out, index_transpose;

    // Check that we are within the domain, calculate more indices.
    if ( xIndex < width && yIndex < height ) {
        unsigned int index_in    = width * yIndex + xIndex;
        unsigned int index_block = threadIdx.y * BLOCK_DIM + threadIdx.x;
        // Write to shared memory.
        block[index_block] = in[index_in];
        // Calculate output indices.
        index_transpose = threadIdx.x * BLOCK_DIM + threadIdx.y;
        index_out       = height * (xBlock + threadIdx.y) + yBlock + threadIdx.x;
    }

    // Synchronize. NB: outside the if-clause.
    __syncthreads();

    // Write to global memory, with a different index.
    if ( xIndex < width && yIndex < height ) {
        out[index_out] = block[index_transpose];
    }
}
slide by Johan Seland Applied Mathematics 39/53 Friday, January 23, 2009
  • 55. ple xa m E Transpose timings Was it worth the trouble? Grid Size Coalesced Non-coalesced Speedup 128 × 128 0.011 ms 0.022 ms 2.0× 512 × 512 0.07 ms 0.33 ms 4.5× 1024 × 1024 0.30 ms 1.92 ms 6.4× 1024 × 2048 0.79 ms 6.6 ms 8.4× For me, this is a clear yes. slide by Johan Seland Applied Mathematics 40/53 Friday, January 23, 2009
  • 57. 6.963 IT / A@M CUD 9 IAP0 Execution Optimizations Friday, January 23, 2009
  • 58. xec E Know the arithmetic cost of operations 4 clock cycles: Floating point: add, multiply, fused multiply-add Integer add, bitwise operations, compare, min, max 16 clock cycles: log(x), 32-bit integer reciprocal, reciprocal square root, multiplication 32 clock cycles: sin(x), cos(x) and exp(x) 36 clock cycles: Floating point division (24-bit version in 20 cycles) Particularly costly: Integer division, modulo Remedy: Replace with shifting whenever possible Double precision (when available) will perform at half the speed slide by Johan Seland Applied Mathematics 28/53 Friday, January 23, 2009
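A short sketch of the remedies listed above (illustrative device functions, not slide code): powers-of-two divisions and modulos replaced by shifts and masks, and the cheaper intrinsic variants of the transcendentals. The nvcc option -use_fast_math applies the intrinsic substitution globally, at reduced accuracy.

__device__ int index_math(int i)
{
    int q = i >> 4;      // i / 16 for non-negative i
    int r = i & 15;      // i % 16 for non-negative i
    return q * 16 + r;
}

__device__ float cheap_transcendentals(float x)
{
    // __sinf/__expf use the special function units: faster, less accurate than sinf/expf.
    return __sinf(x) + __expf(x);
}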
• 59. Exec: Occupancy
- Thread instructions are executed sequentially, so executing other warps is the only way to hide latencies and keep the hardware busy
- Occupancy = number of warps running concurrently on a multiprocessor divided by the maximum number of warps that can run concurrently
- Limited by resource usage:
  - Registers
  - Shared memory
79 Friday, January 23, 2009
• 60. Exec: Grid/Block Size Heuristics
- # of blocks > # of multiprocessors
  - So all multiprocessors have at least one block to execute
- # of blocks / # of multiprocessors > 2
  - Multiple blocks can run concurrently in a multiprocessor
  - Blocks that aren't waiting at a __syncthreads() keep the hardware busy
  - Subject to resource availability: registers, shared memory
- # of blocks > 100 to scale to future devices
  - Blocks executed in pipeline fashion
  - 1000 blocks per grid will scale across multiple generations
80 Friday, January 23, 2009
• 61. Exec: Register Dependency
- Read-after-write register dependency
  - Instruction's result can be read ~22 cycles later
  - Scenarios (CUDA / PTX):
      x = y + 5;          add.f32        $f3, $f1, $f2
      z = x + 3;          add.f32        $f5, $f3, $f4
      s_data[0] += 3;     ld.shared.f32  $f3, [$r31+0]
                          add.f32        $f3, $f3, $f4
- To completely hide the latency:
  - Run at least 192 threads (6 warps) per multiprocessor
  - At least 25% occupancy
  - Threads do not have to belong to the same thread block
81 Friday, January 23, 2009
• 62. Exec: Register Pressure
- Hide latency by using more threads per SM
- Limiting factors:
  - Number of registers per kernel
    - 8192 per SM, partitioned among concurrent threads
  - Amount of shared memory
    - 16 KB per SM, partitioned among concurrent threadblocks
- Check the .cubin file for # registers / kernel
- Use the -maxrregcount=N flag to NVCC
  - N = desired maximum registers / kernel
  - At some point "spilling" into LMEM may occur
  - Reduces performance: LMEM is slow
  - Check the .cubin file for LMEM usage
82 Friday, January 23, 2009
• 63. Exec: Determining resource usage
- Use the "--ptxas-options=-v" option to nvcc
- Or: compile the kernel code with the -cubin flag to determine register usage
- Open the .cubin file with a text editor and look for the "code" section:
    architecture {sm_10}
    abiversion {0}
    modname {cubin}
    code {
        name = BlackScholesGPU
        lmem = 0        <- per thread local memory
        smem = 68       <- per thread block shared memory
        reg = 20        <- per thread registers
        bar = 0
        bincode {
            0xa0004205 0x04200780 0x40024c09 0x00200780
            ...
83 Friday, January 23, 2009
• 64. Exec: CUDA Occupancy Calculator
84 Friday, January 23, 2009
• 65. Exec: Optimizing threads per block
- Choose threads per block as a multiple of warp size
  - Avoid wasting computation on under-populated warps
- More threads per block == better memory latency hiding
- But, more threads per block == fewer registers per thread
  - Kernel invocations can fail if too many registers are used
- Heuristics
  - Minimum: 64 threads per block
    - Only if multiple concurrent blocks
  - 192 or 256 threads a better choice
    - Usually still enough regs to compile and invoke successfully
  - This all depends on your computation, so experiment!
85 Friday, January 23, 2009
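A tiny sketch of these heuristics in host code (the helper is hypothetical): pick a block size that is a multiple of the 32-thread warp and launch enough blocks to cover n elements.

void choose_launch_config(int n, dim3 *grid, dim3 *block)
{
    const int threadsPerBlock = 256;                              // multiple of warp size
    int numBlocks = (n + threadsPerBlock - 1) / threadsPerBlock;  // ceiling division
    *block = dim3(threadsPerBlock, 1, 1);
    *grid  = dim3(numBlocks, 1, 1);
}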
• 66. Exec: Occupancy != Performance
- Increasing occupancy does not necessarily increase performance
- BUT…
- Low-occupancy multiprocessors cannot adequately hide latency on memory-bound kernels
- (It all comes down to arithmetic intensity and available parallelism)
86 Friday, January 23, 2009
• 67. Exec: Parameterize Your Application
- Parameterization helps adaptation to different GPUs
- GPUs vary in many ways:
  - # of multiprocessors
  - Memory bandwidth
  - Shared memory size
  - Register file size
  - Threads per block
- You can even make apps self-tuning (like FFTW and ATLAS)
  - "Experiment" mode discovers and saves optimal configuration
87 Friday, January 23, 2009
  • 68. xec E Loop unrolling Sometimes we know some kernel parameters at compile time: # of loop iterations Degrees of polynomials Number of data elements If we could “tell” this to the compiler, it can unroll loops and optimize register usage We need to be generic Avoid code duplication, sizes unknown at compile time Templates to rescue The same trick can be used for regular C++ sources slide by Johan Seland Applied Mathematics 43/53 Friday, January 23, 2009
• 69. Exec: Example: de Casteljau algorithm
A standard algorithm for evaluating polynomials in Bernstein form.
Recursively defined:
    f(x) = b_{0,0}^d
    b_{i,j}^k = x * b_{i+1,j}^{k-1} + (1 - x) * b_{i,j+1}^{k-1}
    b_{i,j}^0 are the coefficients
[Figure: the recursion triangle, combining b_{1,0}^{d-1} and b_{0,1}^{d-1} with weights x and 1-x, down to b_{2,0}^{d-2}, b_{1,1}^{d-2}, b_{0,2}^{d-2}]
slide by Johan Seland Applied Mathematics 44/53 Friday, January 23, 2009
• 70. Exec: Implementation
The de Casteljau algorithm is usually implemented as nested for-loops.
Coefficients are overwritten for each iteration.
float deCasteljau( float *c, float x, int d )
{
    for ( uint i = 1; i <= d; ++i ) {
        for ( uint j = 0; j <= d-i; ++j )
            c[j] = (1.0f - x) * c[j] + x * c[j+1];
    }
    return c[0];
}
[Figure: the same recursion triangle, now labelled with the overwritten coefficients c_{i,j}]
slide by Johan Seland Applied Mathematics 45/53 Friday, January 23, 2009
• 71. Exec: Template loop unrolling
We make d a template parameter:
template<int d>
float deCasteljau( float *c, float x )
{
    for ( uint i = 1; i <= d; ++i ) {
        for ( uint j = 0; j <= d-i; ++j )
            c[j] = (1.0f - x) * c[j] + x * c[j+1];
    }
    return c[0];
}
The kernel is called as:
switch ( d ) {
    case 1: deCasteljau<1><<<dimGrid, dimBlock>>>( c, x ); break;
    case 2: deCasteljau<2><<<dimGrid, dimBlock>>>( c, x ); break;
    ...
    case MAXD: deCasteljau<MAXD><<<dimGrid, dimBlock>>>( c, x ); break;
}
slide by Johan Seland Applied Mathematics 46/53 Friday, January 23, 2009
  • 72. xec E Results For the de Castelaju algorithm we see a relatively small speedup ≈ 1.2× (20%...) Very easy to implement Can lead to long compile times Conclusion: Probably worth it near end of development cycle slide by Johan Seland Applied Mathematics 47/53 Friday, January 23, 2009
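A related note, not covered on the slides: when the trip count is already a compile-time constant inside device code, CUDA's #pragma unroll hint gives a similar effect without the host-side switch; a sketch with the degree fixed at 8.

__device__ float deCasteljau_fixed(float *c, float x)
{
    #pragma unroll
    for (int i = 1; i <= 8; ++i)
        for (int j = 0; j <= 8 - i; ++j)
            c[j] = (1.0f - x) * c[j] + x * c[j + 1];
    return c[0];
}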
• 73. Exec: Conclusion
- Understand CUDA performance characteristics
  - Memory coalescing
  - Divergent branching
  - Bank conflicts
  - Latency hiding
- Use peak performance metrics to guide optimization
- Understand parallel algorithm complexity theory
- Know how to identify type of bottleneck
  - e.g. memory, core computation, or instruction overhead
- Optimize your algorithm, then unroll loops
- Use template parameters to generate optimal code
88 Friday, January 23, 2009
• 74. Profiling: The CUDA Visual Profiler
- Helps measure and find potential performance problems
- GPU and CPU timing for all kernel invocations and memcpys
- Time stamps
- Access to hardware performance counters
61 Friday, January 23, 2009
• 75. Profiling: Signals
- Events are tracked with hardware counters on signals in the chip:
  - gld_incoherent / gld_coherent: global memory loads are coalesced (coherent) or non-coalesced (incoherent)
  - gst_incoherent / gst_coherent: the same for global memory stores
  - local_load / local_store: local loads/stores
  - branch / divergent_branch: total branches and divergent branches taken by threads
  - instructions: instruction count
  - warp_serialize: thread warps that serialize on address conflicts to shared or constant memory
  - cta_launched: executed thread blocks
62 Friday, January 23, 2009
• 76. Profiling: Interpreting profiler counters
- Values represent events within a thread warp
- Only targets one multiprocessor
  - Values will not correspond to the total number of warps launched for a particular kernel
  - Launch enough thread blocks to ensure that the target multiprocessor is given a consistent percentage of the total work
- Values are best used to identify relative performance differences between unoptimized and optimized code
  - In other words, try to reduce the magnitudes of gld/gst_incoherent, divergent_branch, and warp_serialize
63 Friday, January 23, 2009
• 77. Example: Performance for 4M element reduction
(columns: time for 2^22 ints, bandwidth, step speedup, cumulative speedup)
- Kernel 1: interleaved addressing with divergent branching - 8.054 ms, 2.083 GB/s
- Kernel 2: interleaved addressing with bank conflicts - 3.456 ms, 4.854 GB/s, 2.33x step, 2.33x cumulative
- Kernel 3: sequential addressing - 1.722 ms, 9.741 GB/s, 2.01x step, 4.68x cumulative
- Kernel 4: first add during global load - 0.965 ms, 17.377 GB/s, 1.78x step, 8.34x cumulative
- Kernel 5: unroll last warp - 0.536 ms, 31.289 GB/s, 1.8x step, 15.01x cumulative
- Kernel 6: completely unrolled - 0.381 ms, 43.996 GB/s, 1.41x step, 21.16x cumulative
- Kernel 7: multiple elements per thread - 0.268 ms, 62.671 GB/s, 1.42x step, 30.04x cumulative
Kernel 7 on 32M elements: 72 GB/s!
84 Friday, January 23, 2009
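For reference, a sketch in the spirit of the "sequential addressing" step (kernel 3 in the table); this is not the original code, just an illustration of the pattern being timed. Each block reduces blockDim.x elements in shared memory, with consecutive threads adding elements a halving stride apart, so the shared-memory accesses are conflict-free.

__global__ void reduce_sequential(const float *in, float *out, int n)
{
    extern __shared__ float sdata[];                  // blockDim.x floats, sized at launch
    unsigned int tid = threadIdx.x;
    unsigned int i   = blockIdx.x * blockDim.x + threadIdx.x;

    sdata[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();

    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    if (tid == 0) out[blockIdx.x] = sdata[0];         // one partial sum per block
}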
• 78. Build your own!
Friday, January 23, 2009
• 80. Thank you!
slide by David Kirk, © 2008 NVIDIA Corporation
Friday, January 23, 2009
  • 81. Back Pocket Slides slide by David Cox Friday, January 23, 2009
  • 83. 6.963 IT / A@M CUD 9 IAP0 Misc Friday, January 23, 2009
• 84. Tesla C1060 Computing Processor
- Processor: 1x Tesla T10P
- Core clock: 1.33 GHz
- Form factor: full ATX, 4.736" (H) x 10.5" (L), dual slot wide
- On-board memory: 4 GB
- System I/O: PCIe x16 gen2
- Memory I/O: 512-bit, 800 MHz DDR; 102 GB/s peak bandwidth
- Display outputs: none
- Typical power: 160 W
19 M02: High Performance Computing with CUDA Friday, January 23, 2009
• 85. Tesla S1070 1U System
- Processors: 4 x Tesla T10P
- Core clock: 1.5 GHz
- Form factor: 1U for an EIA 19" 4-post rack
- Total 1U system memory: 16 GB (4.0 GB per GPU)
- System I/O: 2x PCIe x16
- Memory I/O per processor: 512-bit, 800 MHz GDDR; 102 GB/s peak bandwidth
- Display outputs: none
- Typical power: 700 W
- Chassis dimensions: 1.73" H x 17.5" W x 28.5" D
20 M02: High Performance Computing with CUDA Friday, January 23, 2009
• 86. Double Precision Floating Point (NVIDIA GPU / SSE2 / Cell SPE)
- Precision: IEEE 754 / IEEE 754 / IEEE 754
- Rounding modes for FADD and FMUL: all 4 IEEE (nearest, zero, inf, -inf) / all 4 IEEE (nearest, zero, inf, -inf) / round to zero (truncate) only
- Denormal handling: full speed / supported, costs 1000's of cycles / flush to zero
- NaN support: yes / yes / no
- Overflow and infinity support: yes / yes / no infinity, clamps to max norm
- Flags: no / yes / some
- FMA: yes / no / yes
- Square root: software with low-latency FMA-based convergence / hardware / software only
- Division: software with low-latency FMA-based convergence / hardware / software only
- Reciprocal estimate accuracy: 24 bit / 12 bit / 12 bit
- Reciprocal sqrt estimate accuracy: 23 bit / 12 bit / 12 bit
- log2(x) and 2^x estimates accuracy: 23 bit / no / no
18 M02: High Performance Computing with CUDA Friday, January 23, 2009