GPGPU
Performance & Tools I
Outline
1.   Introduction

2.   Threads

3.   Physical Memory

4.   Logical Memory

5.   Efficient GPU Programming

6.   Some Examples

7.   CUDA Programming

8.   CUDA Tools Introduction

9.   CUDA Debugger

10. CUDA Visual Profiler

NOTE: A lot of this serves as a recap of what was covered so far.
REMEMBER: Repetition is the key to remembering things.
But first…
• Do you believe that there can be a school without exams?

• Do you believe that a 9 year old kid in a South Indian village can
  understand how DNA works?

• Do you believe that schools and universities
  should be changed entirely?



• http://www.ted.com/talks/sugata_mitra_build_a_school_in_the_
  cloud.html
  • Fixing education is a task that requires everyone’s attention…
Most importantly…
• Do you believe that we can learn, driven entirely by
  motivation?
  • If your answer is “NO”, then try to…

  • … Get a new perspective on life…
           …leave your comfort zone!
                             Break through your limits! (突破自己!)
Introduction
Why are we here?
    CPU vs. GPU
Combining strengths:
CPU + GPU
 • Can’t we just build a new device that combines the two?

 • Short answer: Some new devices are just that!
   • AMD Fusion
   • Intel MIC (Xeon Phi)



 • Long answer:
   • Take 楊佳玲’s Advanced Computer Architecture class!
Writing Code
Performance vs. Design
• Programmers have two contradictory goals:
  1.     Good Performance (FAST!)
  2.     Good Design (bug-resilient, extensible, easy to use etc…)


• Rule of thumb: Fast code is not pretty

• Example:
  •    Mathematical description –   1 line
  •    Algorithm Pseudocode –       10 lines
  •    Algorithm Code –             20 lines
  •    Optimized Algorithm Code –   50 lines
Writing Code
Common Fallacies
1.    “GPU Programs are always faster than their CPU counterpart”
     • Only if: 1. The problem allows it and 2. you invest a lot of time

2.    “I don’t need a profiler”
     • A profiler helps you analyze performance and find bottlenecks.
     • If you don’t care for performance, do NOT use the GPU.

3.    “I don’t need a debugger”
     • Yes you do.
     • Adding tons of printf's makes things a lot more difficult (and takes longer)
     • (Plus, people are lazy)

4.    “I can write bug-free code”
     • No, you can’t – No one can.
Writing Code
A Tale of Two Address Spaces…
• Never forget – In the current architecture:
  • The CPU and each GPU have their own address space and code

• We CANNOT access host pointers from device or vice versa

• We CANNOT call host code from the device or vice versa

• We CANNOT access device pointers or call code from different
  devices
          HOST                                        DEVICE

   [Diagram: CPU - bus - memory on the host side, linked over PCIe to
    GPU - bus - memory on the device side]
Threads &
Parallel Programming
Why do we need multithreading?
• First and foremost: Speed!
  • There are some other reasons, but not today…

• Real-life example:
  • Ship 10k containers from 台北 to 香港
  • Question: Do you use 1 very fast ship, or 4 slow ships?

• Program example:
  • Add a scalar to 10k numbers
  • Question: Do you use 1 very fast processor, or 4 slow processors?


• The real issue: Single-unit speed never scales!
    There is no very fast ship or very fast processor
Why do we hate multithreading?
 • Multithreading adds whole new dimensions of complications
   to programming
   • … Communication
   • … Synchronization
   • (… Dead-locks – But generally not on the GPU)



 • Plus, debugging is a lot more complicated
How many Threads?

• [Diagram: kitchen analogy - four threads (T1-T4) sharing one kitchen]
GPU Threads
Recap
Physical Memory
How our computer works
Memory Hierarchy
Smaller is faster!

[Figure: memory hierarchy pyramid - registers & shared memory at the top]
Processor vs. Memory Speed
• Memory latency keeps getting worse!




  • http://seven-degrees-of-freedom.blogspot.tw/2009/10/latency-
    elephant.html
Logical Memory
How we see memory in our programs
Working with Memory
What is Memory logically?
• Let’s define: Memory = 1D array of bytes
                  0   1      2   3   4     5   6   7     8      9


• An object is a set of 1 or more bytes with a special meaning
  • If the bytes are contiguous, the object is a struct

• Examples of structs:
  •   byte
  •   int
  •   float
  •   pointer !?!
  •   sequence of structs:           int               float*       short


• A pointer is a struct that represents a memory address
  • Basically it’s same as a 1D array index!
Working with Memory
Structs vs. Arrays
• A chunk of contiguous memory is either an array or a struct
   • Array: 1 or more of the same element:
   • Struct: 1 or more of (possibly different) elements:
      • Determined at compile time

• Don’t make silly assumptions about structs!
   • Compiler might change alignment
   • Compiler might reorder elements

• GPU pointers must be word-aligned (4 bytes)

• If the object is only a single element, it can be said to be both:
   • A one-element struct
   • A one-element array
    But don’t overthink it…
Working with Memory
Multi-dimensional Arrays
 • Arrays are often multi-dimensional!
   •   …a line      (1D)
   •   …a rectangle (2D)
   •   …a box       (3D)
   •   … and so on


 • But address space is only 1D!

 • We have to map higher dimensional space into 1D…
   • C and CUDA C have no built-in multi-dimensional indexing for
     dynamically sized arrays
   • We need to compute indices ourselves
Working with Memory
Row-Major Indexing
• Row-major: element (x, y) of an array with width w lives at
  1D index y * w + x

  [Diagram: 2D array with w=5, showing how column x and row y
   map to a 1D offset]
Working with Memory
Summary
Efficient GPU
Programming
Must Read!
• If you want to understand the GPU and write fast programs, read these:


  • CUDA C Programming Guide

  • CUDA Best Practices Guide


• All important CUDA documentation is right here:
  • http://docs.nvidia.com/cuda/index.html

• OpenCL documentation:
  • http://developer.amd.com/resources/heterogeneous-computing/opencl-
    zone/
  • http://www.nvidia.com/content/cudazone/download/OpenCL/NVIDIA_Open
    CL_ProgrammingGuide.pdf
Can Read!
Some More Optimization Slides
• The power of ILP:
  • http://www.cs.berkeley.edu/~volkov/volkov10-GTC.pdf



• Some tips and tricks:
  • http://www.nvidia.com/content/cudazone/download/Advanced_
    CUDA_Training_NVISION08.pdf
ILP Magic
• The GPU facilitates both TLP and ILP
  • Thread-level parallelism
  • Instruction-level parallelism

• ILP means: We can execute multiple instructions at the same time

• Thread does not stall on memory access
  • It only stalls on RAW (Read-After-Write) dependencies:
  a = A[i];                // no stall
  b = B[i];                // no stall
  // …
  c = a * b;               // stall

• Threads can execute multiple arithmetic instructions in parallel
  a = k1 + c * d; // no stall
  b = k2 + f * g; // no stall
Warps occupying a SM
(SM=Streaming Multiprocessor)
• Using the previous example:

  a = A[i];              // no stall
  b = B[i];              // no stall
  // …
  c = a * b;             // stall

  [Diagram: SM scheduler with warps (warp4, warp5, warp6, warp8)
   queued and running]


• What happens on a stall?
  • The current warp is placed in the I/O queue and another can run on
    the SM
  • That is why we want as many threads (warps) per SM as possible
  • Also need multiple blocks
     • E.g. Geforce 660 can have 2048 threads/SM but only 1024 threads/block
TLP vs. ILP
What is good Occupancy?
• [Figure: occupancy example - with too few warps and too little ILP,
  only 50% processor utilization!]
Registers + Shared Memory vs.
Working Set Size
• Shared Memory + Registers must hold current working set of
  all active warps on a SM
  • In other words: Shared Memory + Registers must hold all (or most
    of the) data that all of the threads currently and most often need

• More threads = better TLP = fewer actual stalls

• More threads = less space for working set
  • Fewer registers/thread & less shared memory/thread

• If shared memory + registers are too small for the working set, we
  must use an out-of-core method
  • For example: External merge sort
  • http://en.wikipedia.org/wiki/External_sorting
Memory Coalescing and
Bank Conflicts
• VERY big bottleneck!



• See the professor’s slides



• Also, see the   Must Read! section
OOP vs. DOP
• Array-of-Struct vs. Struct-of-Array (AoS vs. SoA)

• You probably all have heard of Object-Oriented Programming
  • Idealistic OOP is slow
  • OOP groups data (and code) into logical chunks (structs)
  • OOP generally ignores temporal locality of data

• Good performance requires: Data-Oriented Programming
  • http://research.scee.net/files/presentations/gcapaustralia09/Pitf
    alls_of_Object_Oriented_Programming_GCAP_09.pdf

• Bundle data together that is accessed at about the same time!
  • I.e. group data in a way that maximizes temporal locality
Streams – Pipelining
memcpy vs. computation
           Why? Because:
           memcpy between host and device is a huge bottleneck!
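A hedged sketch of what stream pipelining typically looks like in CUDA C (error checking omitted; `kernel`, `h_in`, `h_out`, `d_in`, `d_out`, and `CHUNK` are illustrative placeholders, and the host buffers are assumed page-locked via cudaMallocHost so async copies actually overlap):

```cuda
// Split the work into chunks; each chunk's copy-in, compute, and
// copy-out go to its own stream, so chunk 1's memcpy can overlap
// chunk 0's kernel execution.
cudaStream_t streams[2];
for (int s = 0; s < 2; ++s) cudaStreamCreate(&streams[s]);

for (int s = 0; s < 2; ++s) {
    int off = s * CHUNK;
    cudaMemcpyAsync(d_in + off, h_in + off, CHUNK * sizeof(float),
                    cudaMemcpyHostToDevice, streams[s]);
    kernel<<<CHUNK / 256, 256, 0, streams[s]>>>(d_in + off, d_out + off);
    cudaMemcpyAsync(h_out + off, d_out + off, CHUNK * sizeof(float),
                    cudaMemcpyDeviceToHost, streams[s]);
}
cudaDeviceSynchronize();  // wait for both pipelines to drain
```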
Look beyond the code
E.g.
        int a = …, wA = …;
        int tx = threadIdx.x, ty = threadIdx.y;
        __shared__ int A[128];
        As[ty][tx] = A[a + wA * ty + tx];



• Which resources does the line of code use?
  • Several registers and constant cache
       • Variables and constants
       • Intermediate results


  • Memory (shared or global)
       • Reads from    A     (shared)
       • Writes to     As    (maybe global)
Where to get the numbers?
• For actual NVIDIA device properties, check CUDA programming
  guide Appendix F, Table 10
  • (The appendix lists a lot of info complementary to device query)
  • Note: Every device has a max Compute Capability (CC) version
      • The CC version of your device decides which features it supports
  • More info can be found in each CC section (all in Appendix F)
      • E.g. # warp schedulers (2 for CC 2.x; 4 for CC 3.x)
         • Dual-issue since CC 2.1


• For comparison of device stats consider NVIDIA
  • http://en.wikipedia.org/wiki/GeForce_600_Series#Products
  • etc…

• E.g. Memory latency (from section 5.2.3 of the Progr. Guide)
  • “400 to 800 clock cycles for devices of compute capability 1.x and 2.x
    and about 200 to 400 clock cycles for devices of compute capability
    3.x”
Other Tuning Tips
• The most important contributor to performance is the algorithm!

• Block size always a multiple of Warp size (32 on NVIDIA, 64 on AMD)!

• There is a lot more…
  • Page-lock Host Memory
  • Etc…

• Read all the references mentioned in this talk and you’ll get it.
Writing the Code…
• Do not ask the TA via email to help you with the code!

• Use the forum instead
  • Other people probably have similar questions!


• The TA (this guy) will answer all forum posts to his best judgment

• Other students can also help!

• Just one rule: Do not share your actual code!
Some Examples
Example 1
    Scalar-Vector Multiplication
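The example figure did not survive the export, so here is a minimal sketch of what a scalar-vector multiplication kernel typically looks like in CUDA C (names are illustrative, not the original example code; `d_in`/`d_out` are assumed device buffers):

```cuda
// Each thread scales one element: out[i] = s * in[i].
__global__ void scalarVecMul(float *out, const float *in, float s, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                          // guard: the grid may overshoot n
        out[i] = s * in[i];
}

// Launch: block size a multiple of the warp size (e.g. 256),
// grid size rounded up so every element gets a thread.
int n = 10000;
int block = 256;
int grid = (n + block - 1) / block;
scalarVecMul<<<grid, block>>>(d_out, d_in, 2.0f, n);
```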
Example 2
A typical CUDA kernel…
Shared memory declarations

Repeat:
     Copy some input to shared memory (shm)

     __syncthreads();

     Use shm data for actual computation

     __syncthreads();

Write to global memory
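The skeleton above, written out as a generic CUDA C sketch (hedged: `TILE`, `input`, and `output` are placeholders, the block is assumed to have TILE threads, and n a multiple of TILE):

```cuda
#define TILE 128

__global__ void typicalKernel(const float *input, float *output, int n) {
    __shared__ float shm[TILE];          // shared memory declarations

    float acc = 0.0f;
    for (int base = 0; base < n; base += TILE) {      // Repeat:
        // Copy some input to shared memory (one element per thread)
        shm[threadIdx.x] = input[base + threadIdx.x];
        __syncthreads();                 // all of shm is now ready

        // Use shm data for the actual computation
        for (int j = 0; j < TILE; ++j)
            acc += shm[j];
        __syncthreads();                 // done reading; safe to overwrite
    }

    // Write to global memory
    output[blockIdx.x * blockDim.x + threadIdx.x] = acc;
}
```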
Example 3
 Median Filter
• No code (sorry!), but here are some hints…

• Use shared memory!
  • The code skeleton looks like Example 2
  • Remember: All threads in a block can access the same shared memory
• Use 2D blocks!
  • To get increased shared memory data re-use
• Each thread computes one output pixel!

• Use the debugger!
• Use the profiler!

• Some more hints are in the homework description…
Many More Examples…
• Check out the NVIDIA CUDA and AMD APP SDK samples

• Some of them come with documents, explaining:
  • The parallel algorithm (and how it was developed)
  • Exactly how much speed up was gained from each optimization step

• CUDA 5 samples with docs:
  •   simpleMultiCopy
  •   Mandelbrot
  •   Eigenvalue
  •   recursiveGaussian
  •   sobelFilter
  •   smokeParticles
  •   BlackScholes
  •   …and many more…
CUDA Tools
Documentation

• Online Documentation for NSIGHT 3
  • http://http.developer.nvidia.com/NsightVisualStudio/3.0/Documentatio
    n/UserGuide/HTML/Nsight_Visual_Studio_Edition_User_Guide.htm




• Again: Read the documents from the    Must read! section
CUDA Debugger
VS 2010 & NSIGHT
Works with Eclipse and VS 2010
(no VS 2012 support yet)
NSIGHT 3 and 2.2
  Setup
• Get NSIGHT 3.0:
  • Go to: https://developer.nvidia.com/nvidia-nsight-visual-studio-edition
  • Register (Create an account)
  • Login
     • https://developer.nvidia.com/rdp/nsight-visual-studio-edition-early-access
  • Download NSIGHT 3
     • Works for CUDA 5
     • Also has an OpenGL debugger and more


• Alternative: Get NSIGHT 2.2
  • No login required
  • Only works for CUDA 4
CUDA Debugger
Some References
• http://http.developer.nvidia.com/NsightVisualStudio/3.0/Doc
  umentation/UserGuide/HTML/Content/Debugging_CUDA_Ap
  plication.htm

• https://www.youtube.com/watch?v=FLQuqXhlx40
  • A bit outdated, but still very useful


• etc…
Visual Studio 2010 & NSIGHT
• System Info
Visual Studio 2010 & NSIGHT
1. Enable Debugging
  • NOTE: CPU and GPU debugging are entirely separated at this point
  • You must set everything explicitly for GPU
  • When GPU debug mode is enabled, GPU kernels will run a lot slower!
Visual Studio 2010 & NSIGHT
2. Set breakpoint in code:




3. Start CUDA Debugger
  • DO NOT USE THE VS DEBUGGER (F5) for CUDA debugging
Visual Studio 2010 & NSIGHT
4. Step through the code
  • Step Into (F11)
  • Step Over (F10)
  • Step Out (Shift + F11)



5. Open the corresponding windows
Visual Studio 2010 & NSIGHT
6. Inspect everything…
Visual Studio 2010 & NSIGHT
Conditions                    Remember?
• Right-Click on breakpoint



• Result:
Visual Studio 2010 & NSIGHT
• Move between warps
Visual Studio 2010 & NSIGHT
• Select a specific thread
Visual Studio 2010 & NSIGHT
• Inspect Thread and Warp State




  • Lists state information of all Threads. E.g.:
     • Id, Block, Warp, File, Line, PC (Program Counter), etc…
     • Barrier information (is warp currently waiting for sync?)
     • Active Mask
        • Which threads of the thread’s warp are currently running
        • One bit per thread
        • Prof. Chen will cover warp divergence later in the class
Visual Studio 2010 & NSIGHT
• Inspect Memory
  • Can use Drag   & Drop!




                                 Why is
                                 1 == 00 00 80 3f?

                             Floating Point representation!
CUDA Profilers
Understand your program’s performance profiles!
Comprehensive References
• Great Overview:
  • http://people.maths.ox.ac.uk/gilesm/cuda/lecs/NV_Profiling_low
    res.pdf



• http://developer.download.nvidia.com/GTC/PDF/GTC2012/Pr
  esentationPDF/S0419B-GTC2012-Profiling-Profiling-Tools.pdf
NVIDIA Visual Profiler
TODO…
• Great Tool!


• Chance for bonus points:
• Put together a comprehensive and easily understandable
  tutorial!

• We will cast a vote!
• The best tutorial gets bonus points!
nvprof
TODO
• Text-based profiler
  • For everyone without a GUI

• Maybe also bonus points?

• We will post more details on the forum…
GTC – More about the GPU
• NVIDIA’s annual GPU Technology Conference hosts many talks
  available online

• This year’s GTC is in progress RIGHT NOW!
  • http://www.gputechconf.com/page/sessions.html


• Of course it’s a big advertisement campaign for NVIDIA
  • But it also has a lot of interesting stuff!
The End
Any Questions?
Update (1)
1. Compiler Options
nvcc (the NVIDIA CUDA Compiler) has a lot of options to play with.
I recommend dumping nvcc's help into a file and consulting it every time
before you start writing code:
nvcc --help > nvcchelp.txt

2. Compute Capability 1.3
The grading system is quite old, so the CUDA version it runs is probably
different from the one most of you have at home. If your code passes at
home but the grader (批改娘) does not let you pass, here is a good
workaround: compile with "-arch=sm_13" to generate the same machine code
as the grading system:
nvcc -arch=sm_13

3. Register Pressure & Register Usage
This Stack Overflow post discusses nvcc and register usage:
http://stackoverflow.com/questions/9723431/tracking-down-cuda-kernel-register-usage
If you pass -Xptxas="-v" to nvcc, it will tell you exactly how many
registers each thread uses.

(My Chinese is quite poor. Corrections are welcome.)
Update (2)
• Occupancy Calculator!
  • http://developer.download.nvidia.com/compute/cuda/CUDA_Oc
    cupancy_calculator.xls

High-Performance Computing with C++
 
Modern Java Concurrency (OSCON 2012)
Modern Java Concurrency (OSCON 2012)Modern Java Concurrency (OSCON 2012)
Modern Java Concurrency (OSCON 2012)
 
Putting Compilers to Work
Putting Compilers to WorkPutting Compilers to Work
Putting Compilers to Work
 
CA presentation of multicore processor
CA presentation of multicore processorCA presentation of multicore processor
CA presentation of multicore processor
 
Interactions complicate debugging
Interactions complicate debuggingInteractions complicate debugging
Interactions complicate debugging
 
Austin Python Learners Meetup - Everything you need to know about programming...
Austin Python Learners Meetup - Everything you need to know about programming...Austin Python Learners Meetup - Everything you need to know about programming...
Austin Python Learners Meetup - Everything you need to know about programming...
 
Introduction to multicore .ppt
Introduction to multicore .pptIntroduction to multicore .ppt
Introduction to multicore .ppt
 
High Performance With Java
High Performance With JavaHigh Performance With Java
High Performance With Java
 
EuroMPI 2013 presentation: McMPI
EuroMPI 2013 presentation: McMPIEuroMPI 2013 presentation: McMPI
EuroMPI 2013 presentation: McMPI
 

Gpgpu intro

     •    Only if:
          1. The problem allows it and
          2. you invest a lot of time
2.    “I don’t need a profiler”
     •    A profiler helps you analyze performance and find bottlenecks.
     •    If you don’t care about performance, do NOT use the GPU.
3.    “I don’t need a debugger”
     •    Yes, you do.
     •    Adding tons of printf’s makes things a lot more difficult (and longer)
     •    (Plus, people are lazy)
4.    “I can write bug-free code”
     •    No, you can’t – no one can.
Writing Code
A Tale of Two Address Spaces…
• Never forget – In the current architecture:
  • The CPU and each GPU all have their own address space and code
  • We CANNOT access host pointers from the device, or vice versa
  • We CANNOT call host code from the device, or vice versa
  • We CANNOT access device pointers or call code from different devices

(Diagram: Host Memory ↔ CPU ↔ PCIe bus ↔ GPU ↔ Device Memory)
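A minimal host-side sketch of what this separation means in practice (the variable names are my own; the runtime calls are the standard CUDA ones):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const int N = 4;
    float h_data[N] = {1, 2, 3, 4};   // lives in HOST memory

    float* d_data = NULL;             // will point into DEVICE memory
    cudaMalloc(&d_data, N * sizeof(float));

    // d_data is NOT dereferenceable on the host (h_data[0] is fine, d_data[0] is not!)
    // Data must be moved across the PCIe bus explicitly:
    cudaMemcpy(d_data, h_data, N * sizeof(float), cudaMemcpyHostToDevice);
    // ... launch kernels that work on d_data ...
    cudaMemcpy(h_data, d_data, N * sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(d_data);
    return 0;
}
```

Dereferencing d_data on the host – or h_data inside a kernel – is undefined behavior; the two pointers index two different address spaces.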
Why do we need multithreading?
• First and foremost: Speed!
  • There are some other reasons, but not today…
• Real-life example:
  • Ship 10k containers from 台北 to 香港
  • Question: Do you use 1 very fast ship, or 4 slow ships?
• Program example:
  • Add a scalar to 10k numbers
  • Question: Do you use 1 very fast processor, or 4 slow processors?
• The real issue: Single-unit speed never scales!
  There is no very fast ship or very fast processor
Why do we hate multithreading?
• Multithreading adds whole new dimensions of complication to programming
  • … Communication
  • … Synchronization
  • (… Deadlocks – but generally not on the GPU)
• Plus, debugging is a lot more complicated
How many Threads?
(Diagram: the kitchen analogy – cooks T1–T4 sharing one kitchen)
Physical Memory
How our computer works
Memory Hierarchy
Smaller is faster!
(Diagram: the memory hierarchy, including shared memory)
Processor vs. Memory Speed
• Memory latency keeps getting worse!
• http://seven-degrees-of-freedom.blogspot.tw/2009/10/latency-elephant.html
Logical Memory
How we see memory in our programs
Working with Memory
What is Memory logically?
• Let’s define: Memory = 1D array of bytes
  0 1 2 3 4 5 6 7 8 9
• An object is a set of 1 or more bytes with a special meaning
• If the bytes are contiguous, the object is a struct
• Examples of structs:
  • byte
  • int
  • float
  • pointer !?!
  • sequence of structs: int float* short
• A pointer is a struct that represents a memory address
  • Basically it’s the same as a 1D array index!
Working with Memory
Structs vs. Arrays
• A chunk of contiguous memory is either an array or a struct
  • Array: 1 or more of the same element
  • Struct: 1 or more (possibly different) elements
  • Determined at compile-time
• Don’t make silly assumptions about structs!
  • The compiler might change alignment
  • The compiler might reorder elements
  • GPU pointers must be word (4-byte) aligned
• If the object is only a single element, it can be said to be both:
  • A one-element struct
  • A one-element array
  But don’t overthink it…
Working with Memory
Multi-dimensional Arrays
• Arrays are often multi-dimensional!
  • …a line (1D)
  • …a rectangle (2D)
  • …a box (3D)
  • …and so on
• But the address space is only 1D!
  • We have to map higher-dimensional space into 1D…
• C and CUDA-C do not allow for multi-dimensional array indices
  • We need to compute indices ourselves
Working with Memory
Row-Major Indexing
(Diagram: a 2D array of width w=5 and height h=…, indexed by x and y)
Must Read!
• If you want to understand the GPU and write fast programs, read these:
  • CUDA C Programming Guide
  • CUDA Best Practices Guide
• All important CUDA documentation is right here:
  • http://docs.nvidia.com/cuda/index.html
• OpenCL documentation:
  • http://developer.amd.com/resources/heterogeneous-computing/opencl-zone/
  • http://www.nvidia.com/content/cudazone/download/OpenCL/NVIDIA_OpenCL_ProgrammingGuide.pdf
Can Read!
Some More Optimization Slides
• The power of ILP:
  • http://www.cs.berkeley.edu/~volkov/volkov10-GTC.pdf
• Some tips and tricks:
  • http://www.nvidia.com/content/cudazone/download/Advanced_CUDA_Training_NVISION08.pdf
ILP Magic
• The GPU facilitates both TLP and ILP
  • Thread-level parallelism
  • Instruction-level parallelism
• ILP means: We can execute multiple instructions at the same time
• A thread does not stall on a memory access
  • It only stalls on RAW (Read-After-Write) dependencies:
      a = A[i];   // no stall
      b = B[i];   // no stall
      // …
      c = a * b;  // stall
• Threads can execute multiple arithmetic instructions in parallel
      a = k1 + c * d;  // no stall
      b = k2 + f * g;  // no stall
Warps occupying a SM
(SM = Streaming Multiprocessor)
• Using the previous example:
      a = A[i];   // no stall
      b = B[i];   // no stall
      // …
      c = a * b;  // stall
  (Diagram: SM scheduler with warps 4, 5, 6 and 8 queued)
• What happens on a stall?
  • The current warp is placed in the I/O queue and another can run on the SM
  • That is why we want as many threads (warps) per SM as possible
• We also need multiple blocks
  • E.g. a GeForce 660 can have 2048 threads/SM but only 1024 threads/block
TLP vs. ILP
What is good Occupancy?
• Ex.: Only 50% processor utilization!
Registers + Shared Memory vs. Working Set Size
• Shared Memory + Registers must hold the current working set of all active warps on a SM
  • In other words: Shared Memory + Registers must hold all (or most of) the data that all of the threads currently and most often need
• More threads = better TLP = fewer actual stalls
• More threads = less space for the working set
  • Fewer registers/thread & less shared memory/thread
• If Shm + Registers are too small for the working set, we must use an out-of-core method
  • For example: External merge sort
  • http://en.wikipedia.org/wiki/External_sorting
Memory Coalescing and Bank Conflicts
• A VERY big bottleneck!
• See the professor’s slides
• Also, see the Must Read! section
OOP vs. DOP
• Array-of-Structs vs. Struct-of-Arrays (AoS vs. SoA)
• You have probably all heard of Object-Oriented Programming
  • Idealistic OOP is slow
  • OOP groups data (and code) into logical chunks (structs)
  • OOP generally ignores temporal locality of data
• Good performance requires: Data-Oriented Programming
  • http://research.scee.net/files/presentations/gcapaustralia09/Pitfalls_of_Object_Oriented_Programming_GCAP_09.pdf
  • Bundle data together that is accessed at about the same time!
  • I.e. group data in a way that maximizes temporal locality
Streams – Pipelining memcpy vs. computation
• Why? Because: memcpy between host and device is a huge bottleneck!
Look beyond the code
• E.g.:
      int a = …, wA = …;
      int tx = threadIdx.x, ty = threadIdx.y;
      __shared__ int A[128];
      As[ty][tx] = A[a + wA * ty + tx];
• Which resources does the last line of code use?
  • Several registers and constant cache
    • Variables and constants
    • Intermediate results
  • Memory (shared or global)
    • Reads from A (shared)
    • Writes to As (maybe global)
Where to get the numbers?
• For actual NVIDIA device properties, check the CUDA Programming Guide, Appendix F, Table 10
  • (The appendix lists a lot of info complementary to device query)
• Note: Every device has a max Compute Capability (CC) version
  • The CC version of your device decides which features it supports
  • More info can be found in each CC section (all in Appendix F)
  • E.g. # warp schedulers (2 for CC 2.x; 4 for CC 3.x)
  • Dual-issue since CC 2.1
• For comparison of device stats, consider:
  • http://en.wikipedia.org/wiki/GeForce_600_Series#Products
  • etc…
• E.g. memory latency (from section 5.2.3 of the Progr. Guide):
  • “400 to 800 clock cycles for devices of compute capability 1.x and 2.x and about 200 to 400 clock cycles for devices of compute capability 3.x”
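Complementary to the tables in the appendix, the CUDA runtime can report many of these numbers for your own card via cudaGetDeviceProperties (a small sketch; which fields to print is my choice):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int i = 0; i < count; ++i) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        printf("Device %d: %s (CC %d.%d)\n", i, prop.name, prop.major, prop.minor);
        printf("  max threads/block: %d, max threads/SM: %d, SMs: %d\n",
               prop.maxThreadsPerBlock, prop.maxThreadsPerMultiProcessor,
               prop.multiProcessorCount);
        printf("  shared mem/block: %zu bytes, registers/block: %d\n",
               prop.sharedMemPerBlock, prop.regsPerBlock);
    }
    return 0;
}
```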
Other Tuning Tips
• The most important contributor to performance is the algorithm!
• Block size should always be a multiple of the warp size (32 on NVIDIA, 64 on AMD)!
• There is a lot more…
  • Page-locked Host Memory
  • Etc…
• Read all the references mentioned in this talk and you’ll get it.
Writing the Code…
• Do not ask the TA via email to help you with the code!
• Use the forum instead
  • Other people probably have similar questions!
• The TA (this guy) will answer all forum posts to his best judgment
  • Other students can also help!
• Just one rule: Do not share your actual code!
Example 1
Scalar-Vector Multiplication
• Why?
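The slides don’t include the kernel itself, but a scalar-vector multiply might be sketched like this (names and launch configuration are my own):

```cuda
#include <cuda_runtime.h>

// Each thread scales exactly one element: data-parallel, no communication.
__global__ void scaleKernel(float* v, float s, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)              // guard: the last block may be partially full
        v[i] *= s;
}

int main() {
    const int n = 1 << 20;
    float* d_v = NULL;
    cudaMalloc(&d_v, n * sizeof(float));
    // ... copy input into d_v ...

    int block = 256;                     // a multiple of the warp size (32)
    int grid  = (n + block - 1) / block; // ceil(n / block) blocks cover all n
    scaleKernel<<<grid, block>>>(d_v, 2.0f, n);
    cudaDeviceSynchronize();

    // ... copy the result back ...
    cudaFree(d_v);
    return 0;
}
```

This is the “4 slow processors” answer from the earlier slide taken to its extreme: one (cheap) thread per element.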
Example 2
A typical CUDA kernel…
      Shared memory declarations
      Repeat:
          Copy some input to shared memory (shm)
          __syncthreads();
          Use shm data for actual computation
          __syncthreads();
      Write to global memory
Example 3
Median Filter
• No code (sorry!), but here are some hints…
• Use shared memory!
  • The code skeleton looks like Example 2
  • Remember: All threads in a block can access the same shared memory
• Use 2D blocks!
  • To get increased shared-memory data re-use
• Each thread computes one output pixel!
• Use the debugger!
• Use the profiler!
• Some more hints are in the homework description…
Many More Examples…
• Check out the NVIDIA CUDA and AMD APP SDK samples
• Some of them come with documents explaining:
  • The parallel algorithm (and how it was developed)
  • Exactly how much speed-up was gained from each optimization step
• CUDA 5 samples with docs:
  • simpleMultiCopy
  • Mandelbrot
  • Eigenvalue
  • recursiveGaussian
  • sobelFilter
  • smokeParticles
  • BlackScholes
  • …and many more…
Documentation
• Online documentation for NSIGHT 3:
  • http://http.developer.nvidia.com/NsightVisualStudio/3.0/Documentation/UserGuide/HTML/Nsight_Visual_Studio_Edition_User_Guide.htm
• Again: Read the documents from the Must Read! section
CUDA Debugger
VS 2010 & NSIGHT
• Works with Eclipse and VS 2010 (no VS 2012 support yet)
NSIGHT 3 and 2.2 Setup
• Get NSIGHT 3.0:
  • Go to: https://developer.nvidia.com/nvidia-nsight-visual-studio-edition
  • Register (create an account)
  • Login
  • https://developer.nvidia.com/rdp/nsight-visual-studio-edition-early-access
  • Download NSIGHT 3
  • Works for CUDA 5
  • Also has an OpenGL debugger and more
• Alternative: Get NSIGHT 2.2
  • No login required
  • Only works for CUDA 4
CUDA Debugger
Some References
• http://http.developer.nvidia.com/NsightVisualStudio/3.0/Documentation/UserGuide/HTML/Content/Debugging_CUDA_Application.htm
• https://www.youtube.com/watch?v=FLQuqXhlx40
  • A bit outdated, but still very useful
• etc…
Visual Studio 2010 & NSIGHT
• System Info
Visual Studio 2010 & NSIGHT
1. Enable Debugging
• NOTE: CPU and GPU debugging are entirely separate at this point
  • You must set everything explicitly for the GPU
• When GPU debug mode is enabled, GPU kernels will run a lot slower!
Visual Studio 2010 & NSIGHT
2. Set a breakpoint in the code
3. Start the CUDA Debugger
• DO NOT USE THE VS DEBUGGER (F5) for CUDA debugging
Visual Studio 2010 & NSIGHT
4. Step through the code
  • Step Into (F11)
  • Step Over (F10)
  • Step Out (Shift + F11)
5. Open the corresponding windows
Visual Studio 2010 & NSIGHT
6. Inspect everything…
Visual Studio 2010 & NSIGHT
Conditions – Remember?
• Right-click on the breakpoint
• Result:
Visual Studio 2010 & NSIGHT
• Move between warps
Visual Studio 2010 & NSIGHT
• Select a specific thread
Visual Studio 2010 & NSIGHT
• Inspect Thread and Warp State
  • Lists state information of all threads, e.g.:
    • Id, Block, Warp, File, Line, PC (Program Counter), etc…
    • Barrier information (is the warp currently waiting for sync?)
    • Active Mask
      • Which threads of the thread’s warp are currently running
      • One bit per thread
  • Prof. Chen will cover warp divergence later in the class
Visual Studio 2010 & NSIGHT
• Inspect Memory
  • Can use Drag & Drop!
• Why is 1 == 00 00 80 3f? Floating-point representation!
  • (1.0f is 0x3f800000; the bytes appear reversed because x86 is little-endian)
CUDA Profilers
Understand your program’s performance profile!
Comprehensive References
• Great overview:
  • http://people.maths.ox.ac.uk/gilesm/cuda/lecs/NV_Profiling_lowres.pdf
• http://developer.download.nvidia.com/GTC/PDF/GTC2012/PresentationPDF/S0419B-GTC2012-Profiling-Profiling-Tools.pdf
NVIDIA Visual Profiler
TODO…
• Great tool!
• Chance for bonus points:
  • Put together a comprehensive and easily understandable tutorial!
  • We will cast a vote!
  • The best tutorial gets bonus points!
nvprof
TODO
• Text-based profiler
  • For everyone without a GUI
• Maybe also bonus points?
  • We will post more details on the forum…
GTC – More about the GPU
• NVIDIA’s annual GPU Technology Conference hosts many talks available online
• This year’s GTC is in progress RIGHT NOW!
  • http://www.gputechconf.com/page/sessions.html
• Of course it’s a big advertisement campaign for NVIDIA
  • But it also has a lot of interesting stuff!
Update (1)
1. Compiler Options
   nvcc (the NVIDIA CUDA Compiler) has many options worth playing with.
   I recommend printing nvcc’s help to a file and consulting it before you start writing code:
       nvcc --help > nvcchelp.txt
2. Compute Capability 1.3
   The test system is quite old, so the CUDA version it runs is probably different from the one most of you have at home.
   If your code passes at home but the grading system does not let you pass, here is a good fix: compile with "-arch=sm_13" to generate the same machine code the test system runs:
       nvcc -arch=sm_13
3. Register Pressure & Register Usage
   This Stack Overflow post discusses nvcc and register usage:
   http://stackoverflow.com/questions/9723431/tracking-down-cuda-kernel-register-usage
   If you pass -Xptxas="-v" to nvcc, it will report how many registers each thread actually uses.
(My Chinese is not very good – corrections are welcome.)
Update (2)
• Occupancy Calculator!
• http://developer.download.nvidia.com/compute/cuda/CUDA_Occupancy_calculator.xls