CUDA Tricks and
Computational Physics
                   Kipton Barros
                 Boston University

In collaboration with R. Babich, R. Brower, M. Clark,
                 C. Rebbi, J. Ellowitz
High energy physics
huge computational needs
   Large Hadron Collider, CERN




                      27 km
A request:

       Please question/comment
         freely during the talk

A disclaimer:
       I’m not a high energy physicist
View of the CMS detector at the end of 2007 (Maximilien Brice, © CERN)
15 Petabytes to be processed annually
View of the Computer Center during the installation of servers. (Maximilien Brice; Claudia Marcelloni, © CERN)
The “Standard Model” of Particle Physics
I’ll discuss Quantum ChromoDynamics

Although it’s “standard”, these equations are hard to solve
Big questions:
           why do quarks appear in groups?
           what was the physics of the big bang?
Quantum
ChromoDynamics
The theory of nuclear
     interactions
          (quarks bound by “gluons”)

Extremely difficult:
     Must work at the level of fields, not particles
     Calculation is quantum mechanical
Lattice QCD:
    Solving Quantum Chromodynamics by Computer




               Discretize space and time
       (place the quarks and gluons on a 4D lattice)
Spacetime = 3+1 dimensions
       32^4 ∼ 10^6 lattice sites

     Quarks live on sites           (24 floats each)

     Gluons live on links           (18 floats each)

Total system size:
     4 bytes/float × 32^4 lattice sites × (24 + 4 × 18) floats ∼ 384 MB
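The size arithmetic above can be checked in a few lines (a plain C++ sketch; the constants are the ones from the slide):

```cpp
#include <cassert>
#include <cstddef>

// Total storage for one L^4 lattice: 24 floats per quark site plus
// 4 links per site x 18 floats per link, at 4 bytes per float.
std::size_t lattice_bytes(std::size_t L) {
    std::size_t sites = L * L * L * L;           // 32^4 ~ 10^6 sites
    std::size_t floats_per_site = 24 + 4 * 18;   // quark + 4 gluon links
    return 4 * sites * floats_per_site;          // bytes
}
```

For L = 32 this gives 402,653,184 bytes, i.e. exactly 384 MB.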
Lattice QCD:
 Inner loop requires repeatedly solving a linear equation

     D_W x = b        (x, b: quark fields; D_W depends on the gluon links)

 D_W is a sparse matrix
 with only nearest-neighbor
          couplings

 D_W needs to be fast!
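What "sparse, with only nearest-neighbor couplings" means can be sketched with a toy 1D operator (a plain C++ stand-in with one value per site, periodic boundaries, and a hypothetical coupling of 1/2; the real D_W acts on 24-float sites through 18-float links on a 4D lattice):

```cpp
#include <cassert>
#include <vector>

// Toy stand-in for the structure of D_W: each output site depends only
// on the site itself and its two nearest neighbors, so the matrix has
// O(1) nonzeros per row -- never stored densely.
std::vector<double> apply_nn(const std::vector<double>& x) {
    int n = (int)x.size();
    std::vector<double> y(n);
    for (int i = 0; i < n; ++i) {
        int fwd = (i + 1) % n, bwd = (i + n - 1) % n;
        y[i] = x[i] - 0.5 * (x[fwd] + x[bwd]);   // hypothetical couplings
    }
    return y;
}
```

An iterative solver for D_W x = b only ever needs this "apply the operator" step, which is why making it fast is the whole game.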
Operation of DW

           1 output quark site
                (24 floats)

           2x4 input quark sites
               (24x8 floats)

           2x4 input gluon links
               (18x8 floats)

1.4 kB of local storage required per quark update?
CUDA Parallelization:
   Must process many quark updates simultaneously
   Odd/even sites processed separately
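The odd/even split is the standard checkerboard decomposition, which a small C++ sketch makes concrete:

```cpp
#include <cassert>

// Checkerboard (odd/even) decomposition: a 4D site is "even" or "odd"
// according to the parity of its coordinate sum. Because D_W couples
// only nearest neighbors, every neighbor of an even site is odd, so
// the two sublattices can be updated in separate, independent passes.
int parity(int x, int y, int z, int t) {
    return (x + y + z + t) & 1;   // 0 = even, 1 = odd
}
```

Moving one step along any axis flips the parity, which is exactly why no two sites updated in the same pass ever read each other.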
Programming Model

    A kernel is executed as a grid of thread blocks

    A thread block is a batch of threads that can cooperate with
    each other by:

        Sharing data through shared memory

        Synchronizing their execution

    Threads from different blocks cannot cooperate

[Diagram: the host launches Kernel 1 and Kernel 2, each as a grid of
 thread blocks; a block is itself a 2D array of threads]

© NVIDIA Corporation 2006

Friday, January 23, 2009
DW parallelization:
  Each thread processes 1 site
 No communication required between threads!
 All threads in warp execute same code
The update proceeds neighbor by neighbor:

Step 1: Read neighbor site
Step 2: Read neighbor link
Step 3: Accumulate into the output site
Step 4: Read neighbor site
Step 5: Read neighbor link
Step 6: Accumulate into the output site
        ... (and so on for all 8 neighbors)
Occupancy

    Thread instructions are executed sequentially, so
    executing other warps is the only way to hide
    latencies and keep the hardware busy

    Occupancy = Number of warps running
    concurrently on a multiprocessor divided by
    maximum number of warps that can run
    concurrently

    Limited by resource usage:
        Registers
        Shared memory
Optimizing threads per block

    Choose threads per block as a multiple of warp size
        Avoid wasting computation on under-populated warps

    More threads per block == better memory latency hiding

    But, more threads per block == fewer registers per thread
        Kernel invocations can fail if too many registers are used

    Heuristics
        Minimum: 64 threads per block
            Only if multiple concurrent blocks
        192 or 256 threads a better choice
            Usually still enough regs to compile and invoke successfully
        This all depends on your computation, so experiment!
Reminder -- each multiprocessor has:

        16 kb shared memory
        16 k registers
        1024 active threads (max)


 High occupancy (roughly 25% or so) needed
   for maximum performance
DW: does it fit onto the GPU?

     Each thread requires 1.4 kb → 0.2 kb
        of fast local memory

                                   24 → 12 floats   (input quark site)

                                   18 floats        (gluon link)

                                   24 floats        (output accumulator)

     MP has 16 kb shared mem

     Threads/MP = 16 / 0.2 = 80 → 64
                                (multiple of 64 only)

     MP occupancy = 64/1024 = 6%
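The occupancy arithmetic on this slide, done in exact integer bytes (a C++ sketch; 0.2 kb is taken as 200 bytes):

```cpp
#include <cassert>

// 16 kb of shared memory at ~200 bytes per thread would allow 81
// threads, but threads come in multiples of 64, so only 64 fit.
int threads_per_mp() {
    int t = 16 * 1024 / 200;   // = 81 threads' worth of shared memory
    return (t / 64) * 64;      // round down to a multiple of 64 -> 64
}

// Occupancy relative to the 1024-thread maximum per multiprocessor.
int occupancy_percent() {
    return threads_per_mp() * 100 / 1024;   // 64/1024 -> 6%
}
```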
6% occupancy
                        sounds pretty
                            bad!

(photo: Andreas Kuehn / Getty)
How can we get better occupancy?


Reminder -- each multiprocessor has:

        16 kb shared memory
        16 k registers = 64 kb of memory
        1024 active threads (max)

Each thread requires 0.2 kb of fast local memory --
holding it in registers (4x the capacity of shared
memory) allows occupancy > 25%
Registers as data
      (possible because no inter-thread communication)

Instead of shared memory, each thread's working set is
kept in registers, allocated as individually named variables

        Registers can't be indexed. All loops must be
               EXPLICITLY expanded
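A minimal sketch of the transformation (plain C++, with 3 named values standing in for the 24-float quark site; variable names are illustrative):

```cpp
#include <cassert>

// "Registers as data": an array indexed inside a loop may be spilled
// to slow local memory, so each value lives in an individually named
// scalar and the loop body is written out explicitly.
double sum_unrolled(double a0, double a1, double a2) {
    double acc = 0.0;
    acc += a0;   // loop iteration 0, expanded by hand
    acc += a1;   // loop iteration 1
    acc += a2;   // loop iteration 2
    return acc;
}
```

Doing this for 24-component sites by hand is impractical, which is why the next slide's ~1000 lines of kernel code are generated automatically.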
Code sample




(approx. 1000 LOC automatically generated)
Performance Results:

              44 Gigabytes/sec      (Tesla C870)

              82 Gigabytes/sec      (GTX 280)
                   (90 Gflops/s)

             (completely bandwidth limited)

For comparison:
     twice as fast as Cell impl.      (arXiv:0804.3654)

     20 times faster than CPU implementations
GB/s vs Occupancy

[Charts: effective bandwidth at occupancies of 0%, 8%, 17%, ≥25% on
 the Tesla C870 (up to 45 GB/s) and 0%, 6%, 13%, ≥19% on the GTX 280
 (up to 85 GB/s)]

                Surprise! Very robust to low occupancy
Device memory is the bottleneck
       Coalesced memory accesses crucial
               Data reordering

Quark fields laid out one quark at a time:

   q1_1, q1_2, ... q1_24   q2_1, q2_2, ... q2_24   q3_1, q3_2, ... q3_24   ...

are reordered one component at a time, so consecutive threads
read consecutive words:

   q1_1 q2_1 q3_1 ...      q1_2 q2_2 q3_2 ...
   (thread 0, thread 1, thread 2, ...)

Memory coalescing: store even/odd lattices separately
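The reordering above is an array-of-structures to structure-of-arrays transpose, sketched here in plain C++ (function name and the small sizes are illustrative):

```cpp
#include <cassert>
#include <vector>

// From "all ncomp floats of quark 0, then all of quark 1, ..." to
// "component 0 of every quark, then component 1 of every quark, ...",
// so threads 0, 1, 2, ... read consecutive words at each step.
std::vector<float> aos_to_soa(const std::vector<float>& aos,
                              int nquarks, int ncomp) {
    std::vector<float> soa(aos.size());
    for (int q = 0; q < nquarks; ++q)
        for (int c = 0; c < ncomp; ++c)
            soa[c * nquarks + q] = aos[q * ncomp + c];
    return soa;
}
```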
When memory access isn’t perfectly coalesced
               Sometimes float4 arrays can hide latency

                       Each thread loads one float4; this global
                       memory read corresponds to a single CUDA
                       instruction

                       In case of a coalesce miss, at least
                       4x the data is transferred
When memory access isn’t perfectly coalesced

               Binding to textures can help

                       A texture fetch likewise corresponds
                       to a single CUDA instruction

This makes use of the texture cache and can reduce the penalty
              for nearly coalesced accesses
Regarding textures, there are two kinds of memory:

   Linear array

          Can be modified in kernel
          Can only be bound to 1D texture


   “CUDA array”

          Can’t be modified in kernel
          Gets reordered for 2D, 3D locality
          Allows various hardware features
When a CUDA array is bound to a 2D texture, it is
       probably reordered to something like a Z-curve

                                    This gives 2D locality

(Z-curve illustration: Wikipedia)
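The Z-curve is Morton order: interleave the bits of the two coordinates. A plain C++ sketch:

```cpp
#include <cassert>
#include <cstdint>

// Morton (Z-order) index: bit b of x goes to bit 2b, bit b of y goes
// to bit 2b+1. Nearby (x, y) pairs land near each other in the 1D
// ordering, which is the 2D locality the texture cache wants.
uint32_t morton2d(uint16_t x, uint16_t y) {
    uint32_t z = 0;
    for (int b = 0; b < 16; ++b) {
        z |= (uint32_t)((x >> b) & 1) << (2 * b);
        z |= (uint32_t)((y >> b) & 1) << (2 * b + 1);
    }
    return z;
}
```

For example, the 2x2 block (0,0), (1,0), (0,1), (1,1) maps to the consecutive indices 0, 1, 2, 3 (hence the "Z" shape).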
Warnings:
      The effectiveness of float4 and textures depends
           on the CUDA hardware and driver (!)

       Certain “magic” access patterns are many
                times faster than others

       Testing appears to be necessary
Memory bandwidth test
   Simple kernel

         Memory access completely coalesced
         Should be optimal

           Bandwidth:    54 Gigabytes / sec
                   (GTX 280, 140 GB/s theoretical!)
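The kernel itself is not reproduced here, but effective-bandwidth figures like the 54 GB/s above come from arithmetic of this shape (a C++ sketch for a copy kernel, where every element is read once and written once):

```cpp
#include <cassert>

// Effective bandwidth of a float copy kernel: bytes moved per second,
// counting both the read and the write of each 4-byte element.
double effective_gb_per_s(long long n_floats, double seconds) {
    double bytes = 2.0 * 4.0 * (double)n_floats;   // read + write
    return bytes / seconds / 1e9;
}
```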
So why are NVIDIA samples so fast?

                           NVIDIA actually uses a modified
                           access pattern (next slides):


54 Gigabytes / sec        →        102 Gigabytes / sec


              (GTX 280, 140 GB/s theoretical)
Naive access pattern

  [Figure: each block walks sequentially through its own contiguous
   chunk of memory; at every step, the blocks’ accesses are far apart]

Modified access pattern              (much more efficient)

  [Figure: the blocks’ chunks are interleaved; at every step, all
   blocks access adjacent regions of memory]
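The index arithmetic behind the two patterns can be sketched as follows (plain C++; function names are illustrative, and "block" stands for a thread block making S sequential steps):

```cpp
#include <cassert>

// Naive: block b owns a contiguous chunk of S elements and walks
// through it, so at a given step the blocks touch addresses S apart.
int naive_index(int block, int step, int steps_per_block) {
    return block * steps_per_block + step;
}

// Modified: the chunks are interleaved, so at a given step the blocks
// touch adjacent addresses.
int modified_index(int block, int step, int num_blocks) {
    return step * num_blocks + block;
}
```

In the modified pattern, blocks b and b+1 are always exactly one element apart at the same step, which keeps each step's traffic confined to neighboring regions of device memory.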
CUDA Compiler

  CUDA C code  →  PTX code  →  CUDA machine code
                      (LOTS of optimization in this last step)

           Use the unofficial CUDA disassembler to
                view the CUDA machine code
CUDA Disassembler           (decuda)

Compile and save the cubin file for foo.cu

Disassemble it to view the machine code
Look how CUDA
implements integer
     division!
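What the disassembly reveals: the hardware has no integer-divide instruction, so division by a constant is compiled into a multiply by a "magic number" plus a shift. The exact instruction sequence decuda shows is not reproduced here; the standard transformation for unsigned division by 3 looks like this (plain C++ sketch):

```cpp
#include <cassert>
#include <cstdint>

// x / 3 via multiply-and-shift: magic = ceil(2^33 / 3), then
// (x * magic) >> 33. Correct for every 32-bit unsigned x.
uint32_t div3(uint32_t x) {
    const uint64_t magic = 0xAAAAAAABull;
    return (uint32_t)(((uint64_t)x * magic) >> 33);
}
```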
CUDA provides fast (but imperfect)
   trigonometry in hardware!
The compiler is very aggressive in optimization. It will
   group memory loads together to minimize latency

(snippet from LQCD)




Notice: each thread reads 20 floats!

 
[Harvard CS264] 15a - Jacket: Visual Computing (James Malcolm, Accelereyes)
[Harvard CS264] 15a - Jacket: Visual Computing (James Malcolm, Accelereyes)[Harvard CS264] 15a - Jacket: Visual Computing (James Malcolm, Accelereyes)
[Harvard CS264] 15a - Jacket: Visual Computing (James Malcolm, Accelereyes)
 
[Harvard CS264] 14 - Dynamic Compilation for Massively Parallel Processors (G...
[Harvard CS264] 14 - Dynamic Compilation for Massively Parallel Processors (G...[Harvard CS264] 14 - Dynamic Compilation for Massively Parallel Processors (G...
[Harvard CS264] 14 - Dynamic Compilation for Massively Parallel Processors (G...
 
[Harvard CS264] 13 - The R-Stream High-Level Program Transformation Tool / Pr...
[Harvard CS264] 13 - The R-Stream High-Level Program Transformation Tool / Pr...[Harvard CS264] 13 - The R-Stream High-Level Program Transformation Tool / Pr...
[Harvard CS264] 13 - The R-Stream High-Level Program Transformation Tool / Pr...
 
[Harvard CS264] 11b - Analysis-Driven Performance Optimization with CUDA (Cli...
[Harvard CS264] 11b - Analysis-Driven Performance Optimization with CUDA (Cli...[Harvard CS264] 11b - Analysis-Driven Performance Optimization with CUDA (Cli...
[Harvard CS264] 11b - Analysis-Driven Performance Optimization with CUDA (Cli...
 
[Harvard CS264] 11a - Programming the Memory Hierarchy with Sequoia (Mike Bau...
[Harvard CS264] 11a - Programming the Memory Hierarchy with Sequoia (Mike Bau...[Harvard CS264] 11a - Programming the Memory Hierarchy with Sequoia (Mike Bau...
[Harvard CS264] 11a - Programming the Memory Hierarchy with Sequoia (Mike Bau...
 
[Harvard CS264] 10b - cl.oquence: High-Level Language Abstractions for Low-Le...
[Harvard CS264] 10b - cl.oquence: High-Level Language Abstractions for Low-Le...[Harvard CS264] 10b - cl.oquence: High-Level Language Abstractions for Low-Le...
[Harvard CS264] 10b - cl.oquence: High-Level Language Abstractions for Low-Le...
 
[Harvard CS264] 10a - Easy, Effective, Efficient: GPU Programming in Python w...
[Harvard CS264] 10a - Easy, Effective, Efficient: GPU Programming in Python w...[Harvard CS264] 10a - Easy, Effective, Efficient: GPU Programming in Python w...
[Harvard CS264] 10a - Easy, Effective, Efficient: GPU Programming in Python w...
 
[Harvard CS264] 09 - Machine Learning on Big Data: Lessons Learned from Googl...
[Harvard CS264] 09 - Machine Learning on Big Data: Lessons Learned from Googl...[Harvard CS264] 09 - Machine Learning on Big Data: Lessons Learned from Googl...
[Harvard CS264] 09 - Machine Learning on Big Data: Lessons Learned from Googl...
 
[Harvard CS264] 08a - Cloud Computing, Amazon EC2, MIT StarCluster (Justin Ri...
[Harvard CS264] 08a - Cloud Computing, Amazon EC2, MIT StarCluster (Justin Ri...[Harvard CS264] 08a - Cloud Computing, Amazon EC2, MIT StarCluster (Justin Ri...
[Harvard CS264] 08a - Cloud Computing, Amazon EC2, MIT StarCluster (Justin Ri...
 
[Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)
[Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)[Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)
[Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)
 
[Harvard CS264] 07 - GPU Cluster Programming (MPI & ZeroMQ)
[Harvard CS264] 07 - GPU Cluster Programming (MPI & ZeroMQ)[Harvard CS264] 07 - GPU Cluster Programming (MPI & ZeroMQ)
[Harvard CS264] 07 - GPU Cluster Programming (MPI & ZeroMQ)
 
[Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Aut...
[Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Aut...[Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Aut...
[Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Aut...
 
[Harvard CS264] 05 - Advanced-level CUDA Programming
[Harvard CS264] 05 - Advanced-level CUDA Programming[Harvard CS264] 05 - Advanced-level CUDA Programming
[Harvard CS264] 05 - Advanced-level CUDA Programming
 
[Harvard CS264] 04 - Intermediate-level CUDA Programming
[Harvard CS264] 04 - Intermediate-level CUDA Programming[Harvard CS264] 04 - Intermediate-level CUDA Programming
[Harvard CS264] 04 - Intermediate-level CUDA Programming
 
[Harvard CS264] 03 - Introduction to GPU Computing, CUDA Basics
[Harvard CS264] 03 - Introduction to GPU Computing, CUDA Basics[Harvard CS264] 03 - Introduction to GPU Computing, CUDA Basics
[Harvard CS264] 03 - Introduction to GPU Computing, CUDA Basics
 
[Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns
[Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns[Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns
[Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns
 

Kürzlich hochgeladen

Accessible Digital Futures project (20/03/2024)
Accessible Digital Futures project (20/03/2024)Accessible Digital Futures project (20/03/2024)
Accessible Digital Futures project (20/03/2024)Jisc
 
Application orientated numerical on hev.ppt
Application orientated numerical on hev.pptApplication orientated numerical on hev.ppt
Application orientated numerical on hev.pptRamjanShidvankar
 
Micro-Scholarship, What it is, How can it help me.pdf
Micro-Scholarship, What it is, How can it help me.pdfMicro-Scholarship, What it is, How can it help me.pdf
Micro-Scholarship, What it is, How can it help me.pdfPoh-Sun Goh
 
SOC 101 Demonstration of Learning Presentation
SOC 101 Demonstration of Learning PresentationSOC 101 Demonstration of Learning Presentation
SOC 101 Demonstration of Learning Presentationcamerronhm
 
Kodo Millet PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...
Kodo Millet  PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...Kodo Millet  PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...
Kodo Millet PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...pradhanghanshyam7136
 
ComPTIA Overview | Comptia Security+ Book SY0-701
ComPTIA Overview | Comptia Security+ Book SY0-701ComPTIA Overview | Comptia Security+ Book SY0-701
ComPTIA Overview | Comptia Security+ Book SY0-701bronxfugly43
 
microwave assisted reaction. General introduction
microwave assisted reaction. General introductionmicrowave assisted reaction. General introduction
microwave assisted reaction. General introductionMaksud Ahmed
 
Unit-V; Pricing (Pharma Marketing Management).pptx
Unit-V; Pricing (Pharma Marketing Management).pptxUnit-V; Pricing (Pharma Marketing Management).pptx
Unit-V; Pricing (Pharma Marketing Management).pptxVishalSingh1417
 
Making communications land - Are they received and understood as intended? we...
Making communications land - Are they received and understood as intended? we...Making communications land - Are they received and understood as intended? we...
Making communications land - Are they received and understood as intended? we...Association for Project Management
 
Introduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The BasicsIntroduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The BasicsTechSoup
 
ICT Role in 21st Century Education & its Challenges.pptx
ICT Role in 21st Century Education & its Challenges.pptxICT Role in 21st Century Education & its Challenges.pptx
ICT Role in 21st Century Education & its Challenges.pptxAreebaZafar22
 
Activity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfActivity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfciinovamais
 
1029-Danh muc Sach Giao Khoa khoi 6.pdf
1029-Danh muc Sach Giao Khoa khoi  6.pdf1029-Danh muc Sach Giao Khoa khoi  6.pdf
1029-Danh muc Sach Giao Khoa khoi 6.pdfQucHHunhnh
 
Python Notes for mca i year students osmania university.docx
Python Notes for mca i year students osmania university.docxPython Notes for mca i year students osmania university.docx
Python Notes for mca i year students osmania university.docxRamakrishna Reddy Bijjam
 
Basic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptxBasic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptxDenish Jangid
 
On National Teacher Day, meet the 2024-25 Kenan Fellows
On National Teacher Day, meet the 2024-25 Kenan FellowsOn National Teacher Day, meet the 2024-25 Kenan Fellows
On National Teacher Day, meet the 2024-25 Kenan FellowsMebane Rash
 
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...ZurliaSoop
 

Kürzlich hochgeladen (20)

Accessible Digital Futures project (20/03/2024)
Accessible Digital Futures project (20/03/2024)Accessible Digital Futures project (20/03/2024)
Accessible Digital Futures project (20/03/2024)
 
Application orientated numerical on hev.ppt
Application orientated numerical on hev.pptApplication orientated numerical on hev.ppt
Application orientated numerical on hev.ppt
 
Micro-Scholarship, What it is, How can it help me.pdf
Micro-Scholarship, What it is, How can it help me.pdfMicro-Scholarship, What it is, How can it help me.pdf
Micro-Scholarship, What it is, How can it help me.pdf
 
SOC 101 Demonstration of Learning Presentation
SOC 101 Demonstration of Learning PresentationSOC 101 Demonstration of Learning Presentation
SOC 101 Demonstration of Learning Presentation
 
Kodo Millet PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...
Kodo Millet  PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...Kodo Millet  PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...
Kodo Millet PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...
 
ComPTIA Overview | Comptia Security+ Book SY0-701
ComPTIA Overview | Comptia Security+ Book SY0-701ComPTIA Overview | Comptia Security+ Book SY0-701
ComPTIA Overview | Comptia Security+ Book SY0-701
 
microwave assisted reaction. General introduction
microwave assisted reaction. General introductionmicrowave assisted reaction. General introduction
microwave assisted reaction. General introduction
 
Unit-V; Pricing (Pharma Marketing Management).pptx
Unit-V; Pricing (Pharma Marketing Management).pptxUnit-V; Pricing (Pharma Marketing Management).pptx
Unit-V; Pricing (Pharma Marketing Management).pptx
 
Making communications land - Are they received and understood as intended? we...
Making communications land - Are they received and understood as intended? we...Making communications land - Are they received and understood as intended? we...
Making communications land - Are they received and understood as intended? we...
 
Asian American Pacific Islander Month DDSD 2024.pptx
Asian American Pacific Islander Month DDSD 2024.pptxAsian American Pacific Islander Month DDSD 2024.pptx
Asian American Pacific Islander Month DDSD 2024.pptx
 
Introduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The BasicsIntroduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The Basics
 
ICT Role in 21st Century Education & its Challenges.pptx
ICT Role in 21st Century Education & its Challenges.pptxICT Role in 21st Century Education & its Challenges.pptx
ICT Role in 21st Century Education & its Challenges.pptx
 
Activity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfActivity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdf
 
1029-Danh muc Sach Giao Khoa khoi 6.pdf
1029-Danh muc Sach Giao Khoa khoi  6.pdf1029-Danh muc Sach Giao Khoa khoi  6.pdf
1029-Danh muc Sach Giao Khoa khoi 6.pdf
 
Python Notes for mca i year students osmania university.docx
Python Notes for mca i year students osmania university.docxPython Notes for mca i year students osmania university.docx
Python Notes for mca i year students osmania university.docx
 
Mehran University Newsletter Vol-X, Issue-I, 2024
Mehran University Newsletter Vol-X, Issue-I, 2024Mehran University Newsletter Vol-X, Issue-I, 2024
Mehran University Newsletter Vol-X, Issue-I, 2024
 
Spatium Project Simulation student brief
Spatium Project Simulation student briefSpatium Project Simulation student brief
Spatium Project Simulation student brief
 
Basic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptxBasic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptx
 
On National Teacher Day, meet the 2024-25 Kenan Fellows
On National Teacher Day, meet the 2024-25 Kenan FellowsOn National Teacher Day, meet the 2024-25 Kenan Fellows
On National Teacher Day, meet the 2024-25 Kenan Fellows
 
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
 

IAP09 CUDA@MIT 6.963 - Guest Lecture: CUDA Tricks and High-Performance Computational Physics (Kipton Barros, BU)

  • 1. CUDA Tricks and Computational Physics Kipton Barros Boston University In collaboration with R. Babich, R. Brower, M. Clark, C. Rebbi, J. Ellowitz
  • 2. High energy physics huge computational needs Large Hadron Collider, CERN 27 km
  • 3. A request: Please question/comment freely during the talk A disclaimer: I’m not a high energy physicist
  • 4. View of the CMS detector at the end of 2007 (Maximilien Brice, © CERN) .
  • 5. 15 Petabytes to be processed annually View of the Computer Center during the installation of ser vers. (Maximilien Brice; Claudia Marcelloni, © CERN)
  • 6. The “Standard Model” of Particle Physics
  • 7. I’ll discuss Quantum ChromoDynamics Although it’s “standard”, these equations are hard to solve Big questions: why do quarks appear in groups? physics during big bang?
  • 8.
  • 9. Quantum ChromoDynamics The theory of nuclear interactions (bound by “gluons”) Extremely difficult: Must work at the level of fields, not particles Calculation is quantum mechanical
  • 10. Lattice QCD: Solving Quantum Chromodynamics by Computer Discretize space and time (place the quarks and gluons on a 4D lattice)
  • 11. Spacetime = 3+1 dimensions 32 ∼ 10 4 6 lattice sites Quarks live on sites (24 floats each) Gluons live on links (18 floats each) lattice sites 4 × 324 × (24 + 4 × 18) ∼ 384MB Total system size gluons float bytes quarks
  • 12. Lattice QCD: Inner loop requires repeatedly solving linear equation quarks gluons DW is a sparse matrix with only nearest neighbor couplings DW needs to be fast!
  • 13. DW Operation of 1 output quark site (24 floats)
  • 14. DW Operation of 1 output quark site (24 floats) 2x4 input quark sites (24x8 floats)
  • 15. DW Operation of 1 output quark site (24 floats) 2x4 input quark sites (24x8 floats) 2x4 input gluon links (18x8 floats)
  • 16. DW Operation of 1 output quark site (24 floats) 2x4 input quark sites (24x8 floats) 2x4 input gluon links (18x8 floats) 1.4 kB of local storage required per quark update?
  • 17. Cuda Parallelization: Must process many quark updates simultaneously Odd/even sites processed separately
  • 18. Threading Programming Model (© NVIDIA Corporation 2006; Friday, January 23, 2009): A kernel is executed as a grid of thread blocks. A thread block is a batch of threads that can cooperate with each other by: sharing data through shared memory; synchronizing their execution. Threads from different blocks cannot cooperate. [Host/device diagram of Grid 1, Grid 2 and their thread blocks omitted]
  • 19. DW parallelization: Each thread processes 1 site. No communication required between threads! All threads in a warp execute the same code
  • 20. Step 1: Read neighbor site
  • 21. Step 1: Read neighbor site Step 2: Read neighbor link
  • 22. Step 1: Read neighbor site Step 2: Read neighbor link Step 3: Accumulate into
  • 23. Step 4: Read neighbor site Step 1: Read neighbor site Step 2: Read neighbor link Step 3: Accumulate into
  • 24. Step 4: Read neighbor site Step 1: Read neighbor site Step 5: Read neighbor link Step 2: Read neighbor link Step 3: Accumulate into
  • 25. Step 4: Read neighbor site Step 1: Read neighbor site Step 5: Read neighbor link Step 2: Read neighbor link Step 6: Accumulate into Step 3: Accumulate into
  • 26. Occupancy: Thread instructions are executed sequentially, so executing other warps is the only way to hide latencies and keep the hardware busy. Occupancy = number of warps running concurrently on a multiprocessor divided by the maximum number of warps that can run concurrently. Limited by resource usage: registers, shared memory
  • 27. Optimizing threads per block: Choose threads per block as a multiple of warp size; avoid wasting computation on under-populated warps. More threads per block == better memory latency hiding; but more threads per block == fewer registers per thread, and kernel invocations can fail if too many registers are used. Heuristics: minimum 64 threads per block (only if multiple concurrent blocks); 192 or 256 threads a better choice (usually still enough regs to compile and invoke successfully). This all depends on your computation, so experiment!
  • 28. Reminder -- each multiprocessor has: 16 kB shared memory, 16k registers, 1024 active threads (max). High occupancy (roughly 25% or so) needed for maximum performance
  • 29. DW: does it fit onto the GPU? Each thread requires 1.4 kB → 0.2 kB of fast local memory (24 → 12 floats, 18 floats, 24 floats)
  • 30. DW: does it fit onto the GPU? Each thread requires 0.2 kB of fast local memory; an MP has 16 kB shared mem, so Threads/MP = 16 / 0.2 = 80
  • 31. DW: does it fit onto the GPU? Threads/MP = 80 → 64 (multiple of 64 only)
  • 32. DW: does it fit onto the GPU? MP occupancy = 64/1024 = 6%
  • 33. 6% occupancy sounds pretty bad! Andreas Kuehn / Getty
  • 34. How can we get better occupancy? Reminder -- each multiprocessor has: 16 kb shared memory 16 k registers 1024 active threads (max) Each thread requires 0.2 kb of fast local memory
  • 35. How can we get better occupancy? Reminder -- each multiprocessor has: 16 kb shared memory Occupancy > 25% 16 k registers = 64 kb memory 1024 active threads Each thread requires 0.2 kb of fast local memory
  • 36. Registers as data (possible because no inter-thread communication) Instead of shared memory Registers are allocated as
  • 37. Registers as data Can’t be indexed. All loops must be EXPLICITLY expanded
  • 38. Code sample (approx. 1000 LOC automatically generated)
  • 39. Performance Results: 44 Gigabytes/sec (Tesla C870), 82 Gigabytes/sec (GTX 280) (90 Gflops/s; completely bandwidth limited). For comparison: twice as fast as the Cell impl. (arXiv:0804.3654), 20 times faster than CPU implementations
  • 40. GB/s vs Occupancy (Tesla C870 and GTX 280): bandwidth stays essentially flat from ≥ 25% (C870) / ≥ 19% (GTX 280) occupancy down toward 0%. Surprise! Very robust to low occupancy [bar charts omitted]
  • 41. Device memory is the bottleneck: coalesced memory accesses are crucial. Data reordering: instead of storing Quark 1 (q1_1, q1_2, ... q1_24), Quark 2 (q2_1, q2_2, ... q2_24), Quark 3 (q3_1, q3_2, ... q3_24), ... contiguously, interleave the components as q1_1 q2_1 q3_1 ... q1_2 q2_2 q3_2 ... so that thread 0, thread 1, thread 2, ... read consecutive words
  • 42. Memory coalescing: store even/odd lattices separately
  • 43. When memory access isn’t perfectly coalesced: sometimes float4 arrays can hide latency. A float4 global memory read corresponds to a single CUDA instruction; in case of a coalesce miss, at least 4x the data is transferred (thread 0, thread 1, thread 2)
  • 44. When memory access isn’t perfectly coalesced: binding to textures can help. A texture fetch also corresponds to a single CUDA instruction; it makes use of the texture cache and can reduce the penalty for nearly coalesced accesses
  • 45. Regarding textures, there are two kinds of memory: Linear array -- can be modified in kernel, can only be bound to 1D texture. “Cuda array” -- can’t be modified in kernel, gets reordered for 2D, 3D locality, allows various hardware features
  • 46. When a CUDA array is bound to a 2D texture, it is probably reordered to something like a Z-curve. This gives 2D locality (Wikipedia image)
  • 47. Warnings: The effectiveness of float4, textures, depends on the CUDA hardware and driver (!) Certain “magic” access patterns are many times faster than others Testing appears to be necessary
  • 48. Memory bandwidth test Simple kernel Memory access completely coalesced Should be optimal
  • 49. Memory bandwidth test Simple kernel Memory access completely coalesced Bandwidth: 54 Gigabytes / sec (GTX 280, 140 GB/s theoretical!)
  • 50. So why are NVIDIA samples so fast? Where our simple kernel gets 54 Gigabytes / sec, NVIDIA actually achieves 102 Gigabytes / sec (GTX 280, 140 GB/s theoretical)
  • 51. Naive access pattern Step 1 ... ... Block 1 Block 2 Step 2 ... ... Block 1 Block 2
  • 52. Modified access pattern (much more efficient) Step 1 ... ... Block 1 Block 2 ... Step 2 ... Block 1 Block 2
  • 53. CUDA Compiler (LOTS of optimization here) CUDA PTX CUDA machine C code code code Use unofficial CUDA disassembler to view CUDA machine code CUDA disassembly
  • 54. CUDA Disassembler (decuda) foo.cu Compile and save cubin file Disassemble
  • 55. Look how CUDA implements integer division!
  • 56. CUDA provides fast (but imperfect) trigonometry in hardware!
  • 57. The compiler is very aggressive in optimization. It will group memory loads together to minimize latency (snippet from LQCD) Notice: each thread reads 20 floats!