SlideShare ist ein Scribd-Unternehmen logo
1 von 26
Downloaden Sie, um offline zu lesen
Out-of-Core Programming with NVIDIA’s CUDA

               Gene Cooperman
       High Performance Computing Lab
  College of Computer and Information Science
             Northeastern University
          Boston, Massachusetts 02115
                      USA
               gene@ccs.neu.edu
Pencil and Paper Calculation


• GeForce 8800:
   – 16 CPU chips/Streaming Multiprocessors (SMs),
     8 Cores per chip : 128 cores
   – Aggregate bandwidth to off-chip global memory: 86.4 GB/s (optimal)
   – Average bandwidth to global memory per core: 0.67 GB/s
• Motherboard
   – 4 CPU cores
   – About 10 GB/s bandwidth to main RAM
   – Average bandwidth to RAM per core: 2.5 GB/s
Keeping Pipe to Memory Flowing

• Thread block: threads on a single chip
• Thread block organized into warps
• Warp of 32 threads required (minimize overhead of switching thread blocks)
• Highest bandwidth when all SMs executing same code
Memory-Bound Computations

• So, how much data can we keep in the SMs before it overflows?
• 16 KB/SM −→ 256 KB total cache
• Any computation with an active working set of more than 256 KB risks being memory
  bound.
Memory Bandwidth in Numbers




(Thanks to Kapil Arya and Viral Gupta; Illustrative for trends, only)
X-Axis: number of thread blocks
Y-Axis: bandwidth (MB/s)
Different curves: number of threads per thread block.
Is Life Any Better Back on the Motherboard?

 • Up to 10 GB/s bandwidth to motherboard (perhaps five times slower than NVIDIA in
   practice)
 • Four cores competing for bandwidth
 • Cache of at least 1 MB, and possibly much more (e.g L3 cache)
 • Conclusion: Less pressure on memory, but similar order of magnitude
Is Life Any Better between CPU and Disk?

 • Between 0.05 GB and 0.1 GB bandwidth to disk
 • Four cores competing for bandwidth
 • Cache consists of 4 GB or more of RAM
 • Conclusion: huge pressure on memory (but RAM as cache is large)
Our Solution

 • Disk is the New RAM
 • Bandwidth of Disk: ˜ 100 MB/s
 • Bandwidth of 50 Disks: 50 × 100 MB/s = 5 GB/s
 • Bandwidth of RAM: approximately 5 GB/s


 • Conclusion:
   1. CLAIM: A computer cluster of 50 quad-core nodes, each with 500 GB of mostly
      idle disk space, is a good approximation to a shared memory computer with 200
      CPU cores and a single subsystem with 25 TB of shared memory.
      (The arguments also work for a SAN with multiple access nodes, but we consider
      local disks for simplicity.)
   2. The disks of a cluster can serve as if they were RAM.
   3. The traditional RAM can then serve as if it were cache.
Our Solution

 • Disk is the New RAM
 • Bandwidth of Disk: ˜ 100 MB/s
 • Bandwidth of 50 Disks: 50 × 100 MB/s = 5 GB/s
 • Bandwidth of RAM: approximately 5 GB/s


 • Conclusion:
   1. CLAIM: A computer cluster of 50 quad-core nodes, each with 500 GB of mostly
      idle disk space, is a good approximation to a shared memory computer with 200
      CPU cores and a single subsystem with 25 TB of shared memory.
      (The arguments also work for a SAN with multiple access nodes, but we consider
      local disks for simplicity.)
   2. The disks of a cluster can serve as if they were RAM.
   3. The traditional RAM can then serve as if it were cache.
What About Disk Latency?

• Unfortunately, putting 50 disks on it, doesn’t speed up the latency.
• So, re-organize the data structures and low-level algorithms.
• Our group has five years of case histories applying this computational algebra — but
  each case requires months of development and debugging.
• We’re now developing both higher level abstractions for run-time libraries, and a
  language extension that will make future development much faster.
Applications Benefiting from Disk-Based Parallel Computation

      Discipline                               Example Application
 1.   Verification                       Symbolic Computation using BDDs
 2.   Verification                             Explicit State Verification
 3.   Comp. Group Theory        Search and Enumeration in Mathematical Structures
 4.   Coding Theory                            Search for New Codes
 5.   Security                           Exhaustive Search for Passwords
 6.   Semantic Web             RDF query language; OWL Web Ontology Language
 7,   Artificial Intelligence                           Planning
 8.   Proteomics                    Protein folding via a kinetic network model
 9.   Operations Research                        Branch and Bound
10.   Operations Research      Integer Programming (applic. of Branch-and-Bound)
11.   Economics                               Dynamic Programming
12.   Numerical Analysis       ATLAS, PHiPAC, FFTW, and other adaptive software
13.   Engineering                                   Sensor Data
14.   A.I. Search                                  Rubik’s Cube
Central Claim

Suppose one had a single computer with 10 terabytes of RAM and 200 CPU cores. Does
that satisfy your need for computers with more RAM?



CLAIM: A computer cluster of 32 quad-core nodes, each with a 500 GB local disk, is
a good approximation of the above computer. (The arguments also work for a SAN with
multiple access nodes, but we discuss local disks for simplicity.)
When is a cluster like a 10 TB shared memory computer?

  • Assume 200 GB/node of free disk space
  • Assume 50 nodes,
  • The bandwidth of 50 disks is 50 × 100MB/s = 5GB/s.
  • The bandwidth of a single RAM subsystem is about 5GB/s.

CLAIM: You probably have the 10 TB of temporary disk space lying idle on your own
recent-model computer cluster. You just didn’t know it.
(Or were you just not telling other people about the space, so you could use if for yourself?)




The economics of disks are such that one saves very little by buying less than 500 GB
disk per node. It’s common to buy the 500 GB disk, and reserve the extra space for
expansion.
When is a cluster NOT like a 10 TB shared memory computer?

1. We require a parallel program. (We must access the local disks of many cluster nodes
   in parallel.)
2. The latency problem of disk.
3. Can the network keep up with the disk?
When is a cluster NOT like a 10 TB shared memory computer?

. . . and why doesn’t it matter for our purposes?

 • ANSWER 1: We’ve used this architecture, and it works for us.
 • We’ve developed solutions for a series of algorithmically simple computational kernels
   from computational algebra — especially mathematical group theory. All of the
   following computations completed in less than one cluster-week on a cluster of 60 nodes
   or less.
     – Construction of Thompson Sporadic Simple Group (2003)
       2 gigabytes (temporary space), 1.4 × 108 states, 4 bytes per state
     – Construction of Baby Monster Sporadic Simple Group (2006)
       6 terabytes (temporary space), 1.4 × 1010 states, 12 bytes per state
     – Condensation of Fi23 Sporadic Simple Group (2007)
       400 GB (temporary space) 1.2 × 1010 states, 30 bytes per state
       (larger condensation for J4 now in progress)
     – Rubik’s Cube: 26 Moves Suffice to Solve Rubik’s Cube (2007)
       7 terabytes (temporary space), 1012 states, 6 bytes per state
     – In progress: coset enumeration (pointer-chasing: similar to algorithm for converting
       NFA to DFA (finite automata)).
When is a cluster NOT like a 10 TB shared memory computer?

1. We require a parallel program.
2. The latency problem of disk.
3. Can the network keep up with the disk?
When is a cluster NOT like a 10 TB shared memory computer?

. . . and why doesn’t it matter for our purposes?

 1. We require a parallel program. (We must access the local disks of many nodes in
    parallel.)
     • Our bet (still to be proved): Any sequential algorithm that already creates gigabytes
       of RAM-based data should have a way to create that data in parallel.
 2. The latency problem of disk. Solutions exist:
   (a) For duplicates on frontier in state space search: Delayed Duplicate Detection
       implies waiting until many nodes of the next frontier (and duplicates from previous
       iterations) have been discovered. Then remove duplicates.
   (b) For hash tables, wait until there are millions of hash queries. Then sort on the hash
       index, and scan the disk to resolve queries.
   (c) For pointer-chasing, wait until millions of pointers are available for chasing. Then
       sort and scan the disk to dereference pointers.
   (d) For tracing strings, with each string being a lookup, wait until millions of strings are
       available. Then ....
 3. Can the network keep up with the disk?
When is a cluster NOT like a 10 TB shared memory computer?

. . . and why doesn’t it matter for our purposes?

 1. We require a parallel program. (We must access the local disks of many nodes in
    parallel.)
 2. The latency problem of disk.
 3. Can the network keep up with the disk?
    (In our experience to date, the network does keep up. Here are some reasons why it
    seems to just work.)
     • The point-to-point bandwidth of Gigabit Ethernet is about 100 MB/s. The bandwidth
       of disk is about 100 MB/s. As long as the aggregate bandwidth of network can keep
       up, everything is fine.
     • Researchers already face the issue of aggregate network bandwidth in RAM-based
       programs. The disk is slower than RAM. So, probably traditional parallel programs
       can cope.
Applications from Computational Group Theory (2003–2007)

                    Space         State     Total
     Group          Size          Size    Storage
                    1.17 × 1010 100 bytes
     Fischer Fi23                           1 TB
     “Baby Monster” 1.35 × 1010 548 bytes   7 TB
                    1.31 × 1011 64 bytes
     Janko J4                               8 TB

                 (joint with Eric Robinson)
History of Rubik’s Cube


 • Invented in late 1970s in Hungary.
 • In 1982, in Cubik Math, Singmaster and Frey conjectured:
      No one knows how many moves would be needed for “God’s Algorithm”
      assuming he always used the fewest moves required to restore the cube. It
      has been proven that some patterns must exist that require at least seventeen
      moves to restore but no one knows what those patterns may be. Experienced
      group theorists have conjectured that the smallest number of moves which would
      be sufficient to restore any scrambled pattern — that is, the number of moves
      required for “God’s Algorithm” — is probably in the low twenties.
 • Current Best Guess: 20 moves suffice
    – States needing 20 moves are known
History of Rubik’s Cube (cont.)


 • Invented in late 1970s in Hungary.
 • 1982: “God’s Number” (number of moves needed) was known by authors of conjecture
   to be between 17 and 52.
 • 1990: C., Finkelstein, and Sarawagi showed 11 moves suffice for Rubik’s 2 × 2 × 2 cube
   (corner cubies only)
 • 1995: Reid showed 29 moves suffice (lower bound of 20 already known)
 • 2006: Radu showed 27 moves suffice
 • 2007 Kunkle and C. showed 26 moves suffice
 • 2008 Rockiki showed 22 moves suffice (using idle resources at Sony Pictures)
Large-Memory Apps: Experience in N.U. Course

(mixed undergrads and grads)

 1. Chaitin’s Algorithm
 2. Fast Permutation Multiplication
 3. Kernighan-Lin Partitioning Algorithm
 4. Large matrix-matrix Multiplication
 5. Voronoi Diagrams
 6. Cellular Automata
 7. GAA* Search
 8. Static Performance Evaluation for Memory Bound Computing

Others:
[BFS using External Sort] BFS using External Sort
[BFS using Segments & Hash Array] BFS using Segments & Hash Array
[Fast Permutation Multiplication] Fast Permutation Multiplication
[Kernighan-Lin Partitioning Algorithm] Kernighan-Lin Partitioning Algorithm
[Large matrix-matrix Multiplication] Large matrix-matrix Multiplication
Example: Rubik’s Cube: Sorting Delayed Duplicate Detection

1. Breadth-first search: storing new frontier (open list) on disk
2. Use Bucket Sorting to sort and eliminate duplicate states from the new
   frontier
   (The bucket size is chosen to fit in RAM (the new cache).
3. Storing the new frontier requires 6 terabytes of disk space (and we would
   use more if we had it). Saving a large new frontier on disk prior to sorting
   delays duplicate detection, but makes the routine more efficient due to
   economies of scale.
Rubik’s Cube: Two-Bit trick

1. The final representation of the state space (1.4 × 1012 states) could use only 2 bits per
   state. (We use 4 bits per state for convenience.)
2. We used mathematical group theory to derive a highly dense, perfect hash function (no
   collisions) for the states of |cube|/|S|.
3. Our hash function represents symmetrized cosets (the union of all symmetric states of
   |cube|/|S| under the symmetries of the cube).
4. Each hash slot need only store the level in the search tree modulo 3. This allows
   the algorithm to distinguish states from the current frontier, the next frontier, and the
   previous frontier (current level; current level plus one; and current level minus one).
   This is all that is needed.
Space-Time Tradeoffs using Additional Disk

  • Use even more disk space in order to speed up the algorithm.




“A Comparative Analysis of Parallel Disk-Based Methods for Enumerating Implicit Graphs”, Eric Robinson,
Daniel Kunkle and Gene Cooperman, Proc. of 2007 International Workshop on Parallel Symbolic and
Algebraic Computation (PASCO ’07), ACM Press, 2007, pp. 78–87
LONGER-TERM GOAL: Mini-Language Extension

Well-understood building blocks already exist: external sorting, B-trees, Bloom filters,
Delayed Duplicate Detection, Distributed Hash Trees (DHT), and some still more exotic
algorithms.


GOAL: Provide language extensions for common data structures and algorithms (including
breadth-first search) that invoke a run-time library. Design the language to bias the
programmer toward efficient use of disk.


ROOMY LANGUAGE:
New Parallel Disk-Based Language, Roomy, in development by Daniel Kunkle.
Implementation: Run-time C library with #define and typedef for nicer syntax.
Language appears to be sequential; back-end based on cluster with local disks; or cluster
with SAN; or single computer using RAM (for simpler development and debugging)
Expected availability: mid-2009

Weitere ähnliche Inhalte

Was ist angesagt?

Introduction to CUDA
Introduction to CUDAIntroduction to CUDA
Introduction to CUDARaymond Tay
 
Introduction to parallel computing using CUDA
Introduction to parallel computing using CUDAIntroduction to parallel computing using CUDA
Introduction to parallel computing using CUDAMartin Peniak
 
Kato Mivule: An Overview of CUDA for High Performance Computing
Kato Mivule: An Overview of CUDA for High Performance ComputingKato Mivule: An Overview of CUDA for High Performance Computing
Kato Mivule: An Overview of CUDA for High Performance ComputingKato Mivule
 
A beginner’s guide to programming GPUs with CUDA
A beginner’s guide to programming GPUs with CUDAA beginner’s guide to programming GPUs with CUDA
A beginner’s guide to programming GPUs with CUDAPiyush Mittal
 
Intro to GPGPU Programming with Cuda
Intro to GPGPU Programming with CudaIntro to GPGPU Programming with Cuda
Intro to GPGPU Programming with CudaRob Gillen
 
Nvidia cuda tutorial_no_nda_apr08
Nvidia cuda tutorial_no_nda_apr08Nvidia cuda tutorial_no_nda_apr08
Nvidia cuda tutorial_no_nda_apr08Angela Mendoza M.
 
Introduction to CUDA C: NVIDIA : Notes
Introduction to CUDA C: NVIDIA : NotesIntroduction to CUDA C: NVIDIA : Notes
Introduction to CUDA C: NVIDIA : NotesSubhajit Sahu
 
IAP09 CUDA@MIT 6.963 - Guest Lecture: CUDA Tricks and High-Performance Comput...
IAP09 CUDA@MIT 6.963 - Guest Lecture: CUDA Tricks and High-Performance Comput...IAP09 CUDA@MIT 6.963 - Guest Lecture: CUDA Tricks and High-Performance Comput...
IAP09 CUDA@MIT 6.963 - Guest Lecture: CUDA Tricks and High-Performance Comput...npinto
 
Node.js at Joyent: Engineering for Production
Node.js at Joyent: Engineering for ProductionNode.js at Joyent: Engineering for Production
Node.js at Joyent: Engineering for Productionjclulow
 
DaStor/Cassandra report for CDR solution
DaStor/Cassandra report for CDR solutionDaStor/Cassandra report for CDR solution
DaStor/Cassandra report for CDR solutionSchubert Zhang
 
Java and the machine - Martijn Verburg and Kirk Pepperdine
Java and the machine - Martijn Verburg and Kirk PepperdineJava and the machine - Martijn Verburg and Kirk Pepperdine
Java and the machine - Martijn Verburg and Kirk PepperdineJAX London
 
Ca บทที่สี่
Ca บทที่สี่Ca บทที่สี่
Ca บทที่สี่atit604
 
2012 Fall OpenStack Bare-metal Speaker Session
2012 Fall OpenStack Bare-metal Speaker Session2012 Fall OpenStack Bare-metal Speaker Session
2012 Fall OpenStack Bare-metal Speaker SessionMikyung Kang
 
Improving the Performance of the qcow2 Format (KVM Forum 2017)
Improving the Performance of the qcow2 Format (KVM Forum 2017)Improving the Performance of the qcow2 Format (KVM Forum 2017)
Improving the Performance of the qcow2 Format (KVM Forum 2017)Igalia
 
GPU Computing with Ruby
GPU Computing with RubyGPU Computing with Ruby
GPU Computing with RubyShin Yee Chung
 
NVidia CUDA for Bruteforce Attacks - DefCamp 2012
NVidia CUDA for Bruteforce Attacks - DefCamp 2012NVidia CUDA for Bruteforce Attacks - DefCamp 2012
NVidia CUDA for Bruteforce Attacks - DefCamp 2012DefCamp
 

Was ist angesagt? (20)

Introduction to CUDA
Introduction to CUDAIntroduction to CUDA
Introduction to CUDA
 
CUDA
CUDACUDA
CUDA
 
Introduction to parallel computing using CUDA
Introduction to parallel computing using CUDAIntroduction to parallel computing using CUDA
Introduction to parallel computing using CUDA
 
Kato Mivule: An Overview of CUDA for High Performance Computing
Kato Mivule: An Overview of CUDA for High Performance ComputingKato Mivule: An Overview of CUDA for High Performance Computing
Kato Mivule: An Overview of CUDA for High Performance Computing
 
Cuda
CudaCuda
Cuda
 
A beginner’s guide to programming GPUs with CUDA
A beginner’s guide to programming GPUs with CUDAA beginner’s guide to programming GPUs with CUDA
A beginner’s guide to programming GPUs with CUDA
 
Intro to GPGPU Programming with Cuda
Intro to GPGPU Programming with CudaIntro to GPGPU Programming with Cuda
Intro to GPGPU Programming with Cuda
 
Nvidia cuda tutorial_no_nda_apr08
Nvidia cuda tutorial_no_nda_apr08Nvidia cuda tutorial_no_nda_apr08
Nvidia cuda tutorial_no_nda_apr08
 
Introduction to CUDA C: NVIDIA : Notes
Introduction to CUDA C: NVIDIA : NotesIntroduction to CUDA C: NVIDIA : Notes
Introduction to CUDA C: NVIDIA : Notes
 
Cuda Architecture
Cuda ArchitectureCuda Architecture
Cuda Architecture
 
IAP09 CUDA@MIT 6.963 - Guest Lecture: CUDA Tricks and High-Performance Comput...
IAP09 CUDA@MIT 6.963 - Guest Lecture: CUDA Tricks and High-Performance Comput...IAP09 CUDA@MIT 6.963 - Guest Lecture: CUDA Tricks and High-Performance Comput...
IAP09 CUDA@MIT 6.963 - Guest Lecture: CUDA Tricks and High-Performance Comput...
 
Cuda
CudaCuda
Cuda
 
Node.js at Joyent: Engineering for Production
Node.js at Joyent: Engineering for ProductionNode.js at Joyent: Engineering for Production
Node.js at Joyent: Engineering for Production
 
DaStor/Cassandra report for CDR solution
DaStor/Cassandra report for CDR solutionDaStor/Cassandra report for CDR solution
DaStor/Cassandra report for CDR solution
 
Java and the machine - Martijn Verburg and Kirk Pepperdine
Java and the machine - Martijn Verburg and Kirk PepperdineJava and the machine - Martijn Verburg and Kirk Pepperdine
Java and the machine - Martijn Verburg and Kirk Pepperdine
 
Ca บทที่สี่
Ca บทที่สี่Ca บทที่สี่
Ca บทที่สี่
 
2012 Fall OpenStack Bare-metal Speaker Session
2012 Fall OpenStack Bare-metal Speaker Session2012 Fall OpenStack Bare-metal Speaker Session
2012 Fall OpenStack Bare-metal Speaker Session
 
Improving the Performance of the qcow2 Format (KVM Forum 2017)
Improving the Performance of the qcow2 Format (KVM Forum 2017)Improving the Performance of the qcow2 Format (KVM Forum 2017)
Improving the Performance of the qcow2 Format (KVM Forum 2017)
 
GPU Computing with Ruby
GPU Computing with RubyGPU Computing with Ruby
GPU Computing with Ruby
 
NVidia CUDA for Bruteforce Attacks - DefCamp 2012
NVidia CUDA for Bruteforce Attacks - DefCamp 2012NVidia CUDA for Bruteforce Attacks - DefCamp 2012
NVidia CUDA for Bruteforce Attacks - DefCamp 2012
 

Ähnlich wie IAP09 CUDA@MIT 6.963 - Guest Lecture: Out-of-Core Programming with NVIDIA's CUDA (Gene Cooperman, NEU)

Accelerating hbase with nvme and bucket cache
Accelerating hbase with nvme and bucket cacheAccelerating hbase with nvme and bucket cache
Accelerating hbase with nvme and bucket cacheDavid Grier
 
Accelerating HBase with NVMe and Bucket Cache
Accelerating HBase with NVMe and Bucket CacheAccelerating HBase with NVMe and Bucket Cache
Accelerating HBase with NVMe and Bucket CacheNicolas Poggi
 
Storage: Alternate Futures
Storage: Alternate FuturesStorage: Alternate Futures
Storage: Alternate Futures小新 制造
 
Trip down the GPU lane with Machine Learning
Trip down the GPU lane with Machine LearningTrip down the GPU lane with Machine Learning
Trip down the GPU lane with Machine LearningRenaldas Zioma
 
A Paradigm Shift: The Increasing Dominance of Memory-Oriented Solutions for H...
A Paradigm Shift: The Increasing Dominance of Memory-Oriented Solutions for H...A Paradigm Shift: The Increasing Dominance of Memory-Oriented Solutions for H...
A Paradigm Shift: The Increasing Dominance of Memory-Oriented Solutions for H...Ben Stopford
 
Elasticsearch Arcihtecture & What's New in Version 5
Elasticsearch Arcihtecture & What's New in Version 5Elasticsearch Arcihtecture & What's New in Version 5
Elasticsearch Arcihtecture & What's New in Version 5Burak TUNGUT
 
Responding rapidly when you have 100+ GB data sets in Java
Responding rapidly when you have 100+ GB data sets in JavaResponding rapidly when you have 100+ GB data sets in Java
Responding rapidly when you have 100+ GB data sets in JavaPeter Lawrey
 
SUE 2018 - Migrating a 130TB Cluster from Elasticsearch 2 to 5 in 20 Hours Wi...
SUE 2018 - Migrating a 130TB Cluster from Elasticsearch 2 to 5 in 20 Hours Wi...SUE 2018 - Migrating a 130TB Cluster from Elasticsearch 2 to 5 in 20 Hours Wi...
SUE 2018 - Migrating a 130TB Cluster from Elasticsearch 2 to 5 in 20 Hours Wi...Fred de Villamil
 
Optimizing MongoDB: Lessons Learned at Localytics
Optimizing MongoDB: Lessons Learned at LocalyticsOptimizing MongoDB: Lessons Learned at Localytics
Optimizing MongoDB: Lessons Learned at Localyticsandrew311
 
SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit ...
SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit ...SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit ...
SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit ...Chester Chen
 
IMCSummit 2015 - Day 2 IT Business Track - 4 Myths about In-Memory Databases ...
IMCSummit 2015 - Day 2 IT Business Track - 4 Myths about In-Memory Databases ...IMCSummit 2015 - Day 2 IT Business Track - 4 Myths about In-Memory Databases ...
IMCSummit 2015 - Day 2 IT Business Track - 4 Myths about In-Memory Databases ...In-Memory Computing Summit
 
Apache Spark on Supercomputers: A Tale of the Storage Hierarchy with Costin I...
Apache Spark on Supercomputers: A Tale of the Storage Hierarchy with Costin I...Apache Spark on Supercomputers: A Tale of the Storage Hierarchy with Costin I...
Apache Spark on Supercomputers: A Tale of the Storage Hierarchy with Costin I...Databricks
 
Apache Spark on Supercomputers: A Tale of the Storage Hierarchy with Costin I...
Apache Spark on Supercomputers: A Tale of the Storage Hierarchy with Costin I...Apache Spark on Supercomputers: A Tale of the Storage Hierarchy with Costin I...
Apache Spark on Supercomputers: A Tale of the Storage Hierarchy with Costin I...Databricks
 
Your 1st Ceph cluster
Your 1st Ceph clusterYour 1st Ceph cluster
Your 1st Ceph clusterMirantis
 

Ähnlich wie IAP09 CUDA@MIT 6.963 - Guest Lecture: Out-of-Core Programming with NVIDIA's CUDA (Gene Cooperman, NEU) (20)

Accelerating hbase with nvme and bucket cache
Accelerating hbase with nvme and bucket cacheAccelerating hbase with nvme and bucket cache
Accelerating hbase with nvme and bucket cache
 
Accelerating HBase with NVMe and Bucket Cache
Accelerating HBase with NVMe and Bucket CacheAccelerating HBase with NVMe and Bucket Cache
Accelerating HBase with NVMe and Bucket Cache
 
Storage: Alternate Futures
Storage: Alternate FuturesStorage: Alternate Futures
Storage: Alternate Futures
 
Trip down the GPU lane with Machine Learning
Trip down the GPU lane with Machine LearningTrip down the GPU lane with Machine Learning
Trip down the GPU lane with Machine Learning
 
A Paradigm Shift: The Increasing Dominance of Memory-Oriented Solutions for H...
A Paradigm Shift: The Increasing Dominance of Memory-Oriented Solutions for H...A Paradigm Shift: The Increasing Dominance of Memory-Oriented Solutions for H...
A Paradigm Shift: The Increasing Dominance of Memory-Oriented Solutions for H...
 
Elasticsearch Arcihtecture & What's New in Version 5
Elasticsearch Arcihtecture & What's New in Version 5Elasticsearch Arcihtecture & What's New in Version 5
Elasticsearch Arcihtecture & What's New in Version 5
 
Responding rapidly when you have 100+ GB data sets in Java
Responding rapidly when you have 100+ GB data sets in JavaResponding rapidly when you have 100+ GB data sets in Java
Responding rapidly when you have 100+ GB data sets in Java
 
CLFS 2010
CLFS 2010CLFS 2010
CLFS 2010
 
SUE 2018 - Migrating a 130TB Cluster from Elasticsearch 2 to 5 in 20 Hours Wi...
SUE 2018 - Migrating a 130TB Cluster from Elasticsearch 2 to 5 in 20 Hours Wi...SUE 2018 - Migrating a 130TB Cluster from Elasticsearch 2 to 5 in 20 Hours Wi...
SUE 2018 - Migrating a 130TB Cluster from Elasticsearch 2 to 5 in 20 Hours Wi...
 
Optimizing MongoDB: Lessons Learned at Localytics
Optimizing MongoDB: Lessons Learned at LocalyticsOptimizing MongoDB: Lessons Learned at Localytics
Optimizing MongoDB: Lessons Learned at Localytics
 
SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit ...
SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit ...SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit ...
SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit ...
 
Mysql talk
Mysql talkMysql talk
Mysql talk
 
The Smug Mug Tale
The Smug Mug TaleThe Smug Mug Tale
The Smug Mug Tale
 
Workshop actualización SVG CESGA 2012
Workshop actualización SVG CESGA 2012 Workshop actualización SVG CESGA 2012
Workshop actualización SVG CESGA 2012
 
IMCSummit 2015 - Day 2 IT Business Track - 4 Myths about In-Memory Databases ...
IMCSummit 2015 - Day 2 IT Business Track - 4 Myths about In-Memory Databases ...IMCSummit 2015 - Day 2 IT Business Track - 4 Myths about In-Memory Databases ...
IMCSummit 2015 - Day 2 IT Business Track - 4 Myths about In-Memory Databases ...
 
Apache Spark on Supercomputers: A Tale of the Storage Hierarchy with Costin I...
Apache Spark on Supercomputers: A Tale of the Storage Hierarchy with Costin I...Apache Spark on Supercomputers: A Tale of the Storage Hierarchy with Costin I...
Apache Spark on Supercomputers: A Tale of the Storage Hierarchy with Costin I...
 
Apache Spark on Supercomputers: A Tale of the Storage Hierarchy with Costin I...
Apache Spark on Supercomputers: A Tale of the Storage Hierarchy with Costin I...Apache Spark on Supercomputers: A Tale of the Storage Hierarchy with Costin I...
Apache Spark on Supercomputers: A Tale of the Storage Hierarchy with Costin I...
 
Shignled disk
Shignled diskShignled disk
Shignled disk
 
Performance
PerformancePerformance
Performance
 
Your 1st Ceph cluster
Your 1st Ceph clusterYour 1st Ceph cluster
Your 1st Ceph cluster
 

Mehr von npinto

"AI" for Blockchain Security (Case Study: Cosmos)
"AI" for Blockchain Security (Case Study: Cosmos)"AI" for Blockchain Security (Case Study: Cosmos)
"AI" for Blockchain Security (Case Study: Cosmos)npinto
 
High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 201...
High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 201...High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 201...
High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 201...npinto
 
[Harvard CS264] 16 - Managing Dynamic Parallelism on GPUs: A Case Study of Hi...
[Harvard CS264] 16 - Managing Dynamic Parallelism on GPUs: A Case Study of Hi...[Harvard CS264] 16 - Managing Dynamic Parallelism on GPUs: A Case Study of Hi...
[Harvard CS264] 16 - Managing Dynamic Parallelism on GPUs: A Case Study of Hi...npinto
 
[Harvard CS264] 15a - The Onset of Parallelism, Changes in Computer Architect...
[Harvard CS264] 15a - The Onset of Parallelism, Changes in Computer Architect...[Harvard CS264] 15a - The Onset of Parallelism, Changes in Computer Architect...
[Harvard CS264] 15a - The Onset of Parallelism, Changes in Computer Architect...npinto
 
[Harvard CS264] 15a - Jacket: Visual Computing (James Malcolm, Accelereyes)
[Harvard CS264] 15a - Jacket: Visual Computing (James Malcolm, Accelereyes)[Harvard CS264] 15a - Jacket: Visual Computing (James Malcolm, Accelereyes)
[Harvard CS264] 15a - Jacket: Visual Computing (James Malcolm, Accelereyes)npinto
 
[Harvard CS264] 14 - Dynamic Compilation for Massively Parallel Processors (G...
[Harvard CS264] 14 - Dynamic Compilation for Massively Parallel Processors (G...[Harvard CS264] 14 - Dynamic Compilation for Massively Parallel Processors (G...
[Harvard CS264] 14 - Dynamic Compilation for Massively Parallel Processors (G...npinto
 
[Harvard CS264] 13 - The R-Stream High-Level Program Transformation Tool / Pr...
[Harvard CS264] 13 - The R-Stream High-Level Program Transformation Tool / Pr...[Harvard CS264] 13 - The R-Stream High-Level Program Transformation Tool / Pr...
[Harvard CS264] 13 - The R-Stream High-Level Program Transformation Tool / Pr...npinto
 
[Harvard CS264] 12 - Irregular Parallelism on the GPU: Algorithms and Data St...
[Harvard CS264] 12 - Irregular Parallelism on the GPU: Algorithms and Data St...[Harvard CS264] 12 - Irregular Parallelism on the GPU: Algorithms and Data St...
[Harvard CS264] 12 - Irregular Parallelism on the GPU: Algorithms and Data St...npinto
 
[Harvard CS264] 11b - Analysis-Driven Performance Optimization with CUDA (Cli...
[Harvard CS264] 11b - Analysis-Driven Performance Optimization with CUDA (Cli...[Harvard CS264] 11b - Analysis-Driven Performance Optimization with CUDA (Cli...
[Harvard CS264] 11b - Analysis-Driven Performance Optimization with CUDA (Cli...npinto
 
[Harvard CS264] 11a - Programming the Memory Hierarchy with Sequoia (Mike Bau...
[Harvard CS264] 11a - Programming the Memory Hierarchy with Sequoia (Mike Bau...[Harvard CS264] 11a - Programming the Memory Hierarchy with Sequoia (Mike Bau...
[Harvard CS264] 11a - Programming the Memory Hierarchy with Sequoia (Mike Bau...npinto
 
[Harvard CS264] 10b - cl.oquence: High-Level Language Abstractions for Low-Le...
[Harvard CS264] 10b - cl.oquence: High-Level Language Abstractions for Low-Le...[Harvard CS264] 10b - cl.oquence: High-Level Language Abstractions for Low-Le...
[Harvard CS264] 10b - cl.oquence: High-Level Language Abstractions for Low-Le...npinto
 
[Harvard CS264] 10a - Easy, Effective, Efficient: GPU Programming in Python w...
[Harvard CS264] 10a - Easy, Effective, Efficient: GPU Programming in Python w...[Harvard CS264] 10a - Easy, Effective, Efficient: GPU Programming in Python w...
[Harvard CS264] 10a - Easy, Effective, Efficient: GPU Programming in Python w...npinto
 
[Harvard CS264] 09 - Machine Learning on Big Data: Lessons Learned from Googl...
[Harvard CS264] 09 - Machine Learning on Big Data: Lessons Learned from Googl...[Harvard CS264] 09 - Machine Learning on Big Data: Lessons Learned from Googl...
[Harvard CS264] 09 - Machine Learning on Big Data: Lessons Learned from Googl...npinto
 
[Harvard CS264] 08a - Cloud Computing, Amazon EC2, MIT StarCluster (Justin Ri...
[Harvard CS264] 08a - Cloud Computing, Amazon EC2, MIT StarCluster (Justin Ri...[Harvard CS264] 08a - Cloud Computing, Amazon EC2, MIT StarCluster (Justin Ri...
[Harvard CS264] 08a - Cloud Computing, Amazon EC2, MIT StarCluster (Justin Ri...npinto
 
[Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)
[Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)[Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)
[Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)npinto
 
[Harvard CS264] 07 - GPU Cluster Programming (MPI & ZeroMQ)
[Harvard CS264] 07 - GPU Cluster Programming (MPI & ZeroMQ)[Harvard CS264] 07 - GPU Cluster Programming (MPI & ZeroMQ)
[Harvard CS264] 07 - GPU Cluster Programming (MPI & ZeroMQ)npinto
 
[Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Aut...
[Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Aut...[Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Aut...
[Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Aut...npinto
 
[Harvard CS264] 05 - Advanced-level CUDA Programming
[Harvard CS264] 05 - Advanced-level CUDA Programming[Harvard CS264] 05 - Advanced-level CUDA Programming
[Harvard CS264] 05 - Advanced-level CUDA Programmingnpinto
 
[Harvard CS264] 04 - Intermediate-level CUDA Programming
[Harvard CS264] 04 - Intermediate-level CUDA Programming[Harvard CS264] 04 - Intermediate-level CUDA Programming
[Harvard CS264] 04 - Intermediate-level CUDA Programmingnpinto
 
[Harvard CS264] 03 - Introduction to GPU Computing, CUDA Basics
[Harvard CS264] 03 - Introduction to GPU Computing, CUDA Basics[Harvard CS264] 03 - Introduction to GPU Computing, CUDA Basics
[Harvard CS264] 03 - Introduction to GPU Computing, CUDA Basicsnpinto
 

Mehr von npinto (20)

"AI" for Blockchain Security (Case Study: Cosmos)
"AI" for Blockchain Security (Case Study: Cosmos)"AI" for Blockchain Security (Case Study: Cosmos)
"AI" for Blockchain Security (Case Study: Cosmos)
 
High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 201...
High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 201...High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 201...
High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 201...
 
[Harvard CS264] 16 - Managing Dynamic Parallelism on GPUs: A Case Study of Hi...
[Harvard CS264] 16 - Managing Dynamic Parallelism on GPUs: A Case Study of Hi...[Harvard CS264] 16 - Managing Dynamic Parallelism on GPUs: A Case Study of Hi...
[Harvard CS264] 16 - Managing Dynamic Parallelism on GPUs: A Case Study of Hi...
 
[Harvard CS264] 15a - The Onset of Parallelism, Changes in Computer Architect...
[Harvard CS264] 15a - The Onset of Parallelism, Changes in Computer Architect...[Harvard CS264] 15a - The Onset of Parallelism, Changes in Computer Architect...
[Harvard CS264] 15a - The Onset of Parallelism, Changes in Computer Architect...
 
[Harvard CS264] 15a - Jacket: Visual Computing (James Malcolm, Accelereyes)
[Harvard CS264] 15a - Jacket: Visual Computing (James Malcolm, Accelereyes)[Harvard CS264] 15a - Jacket: Visual Computing (James Malcolm, Accelereyes)
[Harvard CS264] 15a - Jacket: Visual Computing (James Malcolm, Accelereyes)
 
[Harvard CS264] 14 - Dynamic Compilation for Massively Parallel Processors (G...
[Harvard CS264] 14 - Dynamic Compilation for Massively Parallel Processors (G...[Harvard CS264] 14 - Dynamic Compilation for Massively Parallel Processors (G...
[Harvard CS264] 14 - Dynamic Compilation for Massively Parallel Processors (G...
 
[Harvard CS264] 13 - The R-Stream High-Level Program Transformation Tool / Pr...
[Harvard CS264] 13 - The R-Stream High-Level Program Transformation Tool / Pr...[Harvard CS264] 13 - The R-Stream High-Level Program Transformation Tool / Pr...
[Harvard CS264] 13 - The R-Stream High-Level Program Transformation Tool / Pr...
 
[Harvard CS264] 12 - Irregular Parallelism on the GPU: Algorithms and Data St...
[Harvard CS264] 12 - Irregular Parallelism on the GPU: Algorithms and Data St...[Harvard CS264] 12 - Irregular Parallelism on the GPU: Algorithms and Data St...
[Harvard CS264] 12 - Irregular Parallelism on the GPU: Algorithms and Data St...
 
[Harvard CS264] 11b - Analysis-Driven Performance Optimization with CUDA (Cli...
[Harvard CS264] 11b - Analysis-Driven Performance Optimization with CUDA (Cli...[Harvard CS264] 11b - Analysis-Driven Performance Optimization with CUDA (Cli...
[Harvard CS264] 11b - Analysis-Driven Performance Optimization with CUDA (Cli...
 
[Harvard CS264] 11a - Programming the Memory Hierarchy with Sequoia (Mike Bau...
[Harvard CS264] 11a - Programming the Memory Hierarchy with Sequoia (Mike Bau...[Harvard CS264] 11a - Programming the Memory Hierarchy with Sequoia (Mike Bau...
[Harvard CS264] 11a - Programming the Memory Hierarchy with Sequoia (Mike Bau...
 
[Harvard CS264] 10b - cl.oquence: High-Level Language Abstractions for Low-Le...
[Harvard CS264] 10b - cl.oquence: High-Level Language Abstractions for Low-Le...[Harvard CS264] 10b - cl.oquence: High-Level Language Abstractions for Low-Le...
[Harvard CS264] 10b - cl.oquence: High-Level Language Abstractions for Low-Le...
 
[Harvard CS264] 10a - Easy, Effective, Efficient: GPU Programming in Python w...
[Harvard CS264] 10a - Easy, Effective, Efficient: GPU Programming in Python w...[Harvard CS264] 10a - Easy, Effective, Efficient: GPU Programming in Python w...
[Harvard CS264] 10a - Easy, Effective, Efficient: GPU Programming in Python w...
 
[Harvard CS264] 09 - Machine Learning on Big Data: Lessons Learned from Googl...
[Harvard CS264] 09 - Machine Learning on Big Data: Lessons Learned from Googl...[Harvard CS264] 09 - Machine Learning on Big Data: Lessons Learned from Googl...
[Harvard CS264] 09 - Machine Learning on Big Data: Lessons Learned from Googl...
 
[Harvard CS264] 08a - Cloud Computing, Amazon EC2, MIT StarCluster (Justin Ri...
[Harvard CS264] 08a - Cloud Computing, Amazon EC2, MIT StarCluster (Justin Ri...[Harvard CS264] 08a - Cloud Computing, Amazon EC2, MIT StarCluster (Justin Ri...
[Harvard CS264] 08a - Cloud Computing, Amazon EC2, MIT StarCluster (Justin Ri...
 
[Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)
[Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)[Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)
[Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)
 
[Harvard CS264] 07 - GPU Cluster Programming (MPI & ZeroMQ)
[Harvard CS264] 07 - GPU Cluster Programming (MPI & ZeroMQ)[Harvard CS264] 07 - GPU Cluster Programming (MPI & ZeroMQ)
[Harvard CS264] 07 - GPU Cluster Programming (MPI & ZeroMQ)
 
[Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Aut...
[Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Aut...[Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Aut...
[Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Aut...
 
[Harvard CS264] 05 - Advanced-level CUDA Programming
[Harvard CS264] 05 - Advanced-level CUDA Programming[Harvard CS264] 05 - Advanced-level CUDA Programming
[Harvard CS264] 05 - Advanced-level CUDA Programming
 
[Harvard CS264] 04 - Intermediate-level CUDA Programming
[Harvard CS264] 04 - Intermediate-level CUDA Programming[Harvard CS264] 04 - Intermediate-level CUDA Programming
[Harvard CS264] 04 - Intermediate-level CUDA Programming
 
[Harvard CS264] 03 - Introduction to GPU Computing, CUDA Basics
[Harvard CS264] 03 - Introduction to GPU Computing, CUDA Basics[Harvard CS264] 03 - Introduction to GPU Computing, CUDA Basics
[Harvard CS264] 03 - Introduction to GPU Computing, CUDA Basics
 

Kürzlich hochgeladen

Disha NEET Physics Guide for classes 11 and 12.pdf
Disha NEET Physics Guide for classes 11 and 12.pdfDisha NEET Physics Guide for classes 11 and 12.pdf
Disha NEET Physics Guide for classes 11 and 12.pdfchloefrazer622
 
General AI for Medical Educators April 2024
General AI for Medical Educators April 2024General AI for Medical Educators April 2024
General AI for Medical Educators April 2024Janet Corral
 
Paris 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityParis 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityGeoBlogs
 
Holdier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdfHoldier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdfagholdier
 
Web & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdfWeb & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdfJayanti Pande
 
Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104misteraugie
 
microwave assisted reaction. General introduction
microwave assisted reaction. General introductionmicrowave assisted reaction. General introduction
microwave assisted reaction. General introductionMaksud Ahmed
 
Z Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot GraphZ Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot GraphThiyagu K
 
Arihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdfArihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdfchloefrazer622
 
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...fonyou31
 
The basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxThe basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxheathfieldcps1
 
Interactive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communicationInteractive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communicationnomboosow
 
APM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across SectorsAPM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across SectorsAssociation for Project Management
 
fourth grading exam for kindergarten in writing
fourth grading exam for kindergarten in writingfourth grading exam for kindergarten in writing
fourth grading exam for kindergarten in writingTeacherCyreneCayanan
 
Grant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingGrant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingTechSoup
 
Student login on Anyboli platform.helpin
Student login on Anyboli platform.helpinStudent login on Anyboli platform.helpin
Student login on Anyboli platform.helpinRaunakKeshri1
 
Sports & Fitness Value Added Course FY..
Sports & Fitness Value Added Course FY..Sports & Fitness Value Added Course FY..
Sports & Fitness Value Added Course FY..Disha Kariya
 
Activity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfActivity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfciinovamais
 
Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17Celine George
 

Kürzlich hochgeladen (20)

Disha NEET Physics Guide for classes 11 and 12.pdf
Disha NEET Physics Guide for classes 11 and 12.pdfDisha NEET Physics Guide for classes 11 and 12.pdf
Disha NEET Physics Guide for classes 11 and 12.pdf
 
General AI for Medical Educators April 2024
General AI for Medical Educators April 2024General AI for Medical Educators April 2024
General AI for Medical Educators April 2024
 
Paris 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityParis 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activity
 
Holdier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdfHoldier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdf
 
Web & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdfWeb & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdf
 
Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104
 
microwave assisted reaction. General introduction
microwave assisted reaction. General introductionmicrowave assisted reaction. General introduction
microwave assisted reaction. General introduction
 
Z Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot GraphZ Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot Graph
 
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptxINDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
 
Arihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdfArihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdf
 
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...
 
The basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxThe basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptx
 
Interactive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communicationInteractive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communication
 
APM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across SectorsAPM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across Sectors
 
fourth grading exam for kindergarten in writing
fourth grading exam for kindergarten in writingfourth grading exam for kindergarten in writing
fourth grading exam for kindergarten in writing
 
Grant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingGrant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy Consulting
 
Student login on Anyboli platform.helpin
Student login on Anyboli platform.helpinStudent login on Anyboli platform.helpin
Student login on Anyboli platform.helpin
 
Sports & Fitness Value Added Course FY..
Sports & Fitness Value Added Course FY..Sports & Fitness Value Added Course FY..
Sports & Fitness Value Added Course FY..
 
Activity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfActivity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdf
 
Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17
 

IAP09 CUDA@MIT 6.963 - Guest Lecture: Out-of-Core Programming with NVIDIA's CUDA (Gene Cooperman, NEU)

  • 1. Out-of-Core Programming with NVIDIA’s CUDA Gene Cooperman High Performance Computing Lab College of Computer and Information Science Northeastern University Boston, Massachusetts 02115 USA gene@ccs.neu.edu
  • 2. Pencil and Paper Calculation • GeForce 8800: – 16 CPU chips/Streaming Multiprocessors (SMs), 8 Cores per chip : 128 cores – Aggregate bandwidth to off-chip global memory: 86.4 GB/s (optimal) – Average bandwidth to global memory per core: 0.67 GB/s • Motherboard – 4 CPU cores – About 10 GB/s bandwidth to main RAM – Average bandwidth to RAM per core: 2.5 GB/s
  • 3. Keeping Pipe to Memory Flowing • Thread block: threads on a single chip • Thread block organized into warps • Warp of 32 threads required (minimize overhead of switching thread blocks) • Highest bandwidth when all SMs executing same code
  • 4. Memory-Bound Computations • So, how much data can we keep in the SMs before it overflows? • 16 KB/SM −→ 256 KB total cache • Any computation with an active working set of more than 256 KB risks being memory bound.
  • 5. Memory Bandwidth in Numbers (Thanks to Kapil Arya and Viral Gupta; Illustrative for trends, only) X-Axis: number of thread blocks Y-Axis: bandwidth (MB/s) Different curves: number of threads per thread block.
  • 6. Is Life Any Better Back on the Motherboard? • Up to 10 GB/s bandwidth to motherboard (perhaps five times slower than NVIDIA in practice) • Four cores competing for bandwidth • Cache of at least 1 MB, and possibly much more (e.g L3 cache) • Conclusion: Less pressure on memory, but similar order of magnitude
  • 7. Is Life Any Better between CPU and Disk? • Between 0.05 GB and 0.1 GB bandwidth to disk • Four cores competing for bandwidth • Cache consists of 4 GB or more of RAM • Conclusion: huge pressure on memory (but RAM as cache is large)
  • 8. Our Solution • Disk is the New RAM • Bandwidth of Disk: ˜ 100 MB/s • Bandwidth of 50 Disks: 50 × 100 MB/s = 5 GB/s • Bandwidth of RAM: approximately 5 GB/s • Conclusion: 1. CLAIM: A computer cluster of 50 quad-core nodes, each with 500 GB of mostly idle disk space, is a good approximation to a shared memory computer with 200 CPU cores and a single subsystem with 25 TB of shared memory. (The arguments also work for a SAN with multiple access nodes, but we consider local disks for simplicity.) 2. The disks of a cluster can serve as if they were RAM. 3. The traditional RAM can then serve as if it were cache.
  • 9. Our Solution • Disk is the New RAM • Bandwidth of Disk: ˜ 100 MB/s • Bandwidth of 50 Disks: 50 × 100 MB/s = 5 GB/s • Bandwidth of RAM: approximately 5 GB/s • Conclusion: 1. CLAIM: A computer cluster of 50 quad-core nodes, each with 500 GB of mostly idle disk space, is a good approximation to a shared memory computer with 200 CPU cores and a single subsystem with 25 TB of shared memory. (The arguments also work for a SAN with multiple access nodes, but we consider local disks for simplicity.) 2. The disks of a cluster can serve as if they were RAM. 3. The traditional RAM can then serve as if it were cache.
  • 10. What About Disk Latency? • Unfortunately, putting 50 disks on it, doesn’t speed up the latency. • So, re-organize the data structures and low-level algorithms. • Our group has five years of case histories applying this computational algebra — but each case requires months of development and debugging. • We’re now developing both higher level abstractions for run-time libraries, and a language extension that will make future development much faster.
  • 11. Applications Benefiting from Disk-Based Parallel Computation Discipline Example Application 1. Verification Symbolic Computation using BDDs 2. Verification Explicit State Verification 3. Comp. Group Theory Search and Enumeration in Mathematical Structures 4. Coding Theory Search for New Codes 5. Security Exhaustive Search for Passwords 6. Semantic Web RDF query language; OWL Web Ontology Language 7, Artificial Intelligence Planning 8. Proteomics Protein folding via a kinetic network model 9. Operations Research Branch and Bound 10. Operations Research Integer Programming (applic. of Branch-and-Bound) 11. Economics Dynamic Programming 12. Numerical Analysis ATLAS, PHiPAC, FFTW, and other adaptive software 13. Engineering Sensor Data 14. A.I. Search Rubik’s Cube
  • 12. Central Claim Suppose one had a single computer with 10 terabytes of RAM and 200 CPU cores. Does that satisfy your need for computers with more RAM? CLAIM: A computer cluster of 32 quad-core nodes, each with a 500 GB local disk, is a good approximation of the above computer. (The arguments also work for a SAN with multiple access nodes, but we discuss local disks for simplicity.)
  • 13. When is a cluster like a 10 TB shared memory computer? • Assume 200 GB/node of free disk space • Assume 50 nodes, • The bandwidth of 50 disks is 50 × 100MB/s = 5GB/s. • The bandwidth of a single RAM subsystem is about 5GB/s. CLAIM: You probably have the 10 TB of temporary disk space lying idle on your own recent-model computer cluster. You just didn’t know it. (Or were you just not telling other people about the space, so you could use if for yourself?) The economics of disks are such that one saves very little by buying less than 500 GB disk per node. It’s common to buy the 500 GB disk, and reserve the extra space for expansion.
  • 14. When is a cluster NOT like a 10 TB shared memory computer? 1. We require a parallel program. (We must access the local disks of many cluster nodes in parallel.) 2. The latency problem of disk. 3. Can the network keep up with the disk?
  • 15. When is a cluster NOT like a 10 TB shared memory computer? . . . and why doesn’t it matter for our purposes? • ANSWER 1: We’ve used this architecture, and it works for us. • We’ve developed solutions for a series of algorithmically simple computational kernels from computational algebra — especially mathematical group theory. All of the following computations completed in less than one cluster-week on a cluster of 60 nodes or less. – Construction of Thompson Sporadic Simple Group (2003) 2 gigabytes (temporary space), 1.4 × 108 states, 4 bytes per state – Construction of Baby Monster Sporadic Simple Group (2006) 6 terabytes (temporary space), 1.4 × 1010 states, 12 bytes per state – Condensation of Fi23 Sporadic Simple Group (2007) 400 GB (temporary space) 1.2 × 1010 states, 30 bytes per state (larger condensation for J4 now in progress) – Rubik’s Cube: 26 Moves Suffice to Solve Rubik’s Cube (2007) 7 terabytes (temporary space), 1012 states, 6 bytes per state – In progress: coset enumeration (pointer-chasing: similar to algorithm for converting NFA to DFA (finite automata)).
  • 16. When is a cluster NOT like a 10 TB shared memory computer? 1. We require a parallel program. 2. The latency problem of disk. 3. Can the network keep up with the disk?
  • 17. When is a cluster NOT like a 10 TB shared memory computer? . . . and why doesn’t it matter for our purposes? 1. We require a parallel program. (We must access the local disks of many nodes in parallel.) • Our bet (still to be proved): Any sequential algorithm that already creates gigabytes of RAM-based data should have a way to create that data in parallel. 2. The latency problem of disk. Solutions exist: (a) For duplicates on frontier in state space search: Delayed Duplicate Detection implies waiting until many nodes of the next frontier (and duplicates from previous iterations) have been discovered. Then remove duplicates. (b) For hash tables, wait until there are millions of hash queries. Then sort on the hash index, and scan the disk to resolve queries. (c) For pointer-chasing, wait until millions of pointers are available for chasing. Then sort and scan the disk to dereference pointers. (d) For tracing strings, with each string being a lookup, wait until millions of strings are available. Then .... 3. Can the network keep up with the disk?
  • 18. When is a cluster NOT like a 10 TB shared memory computer? . . . and why doesn’t it matter for our purposes? 1. We require a parallel program. (We must access the local disks of many nodes in parallel.) 2. The latency problem of disk. 3. Can the network keep up with the disk? (In our experience to date, the network does keep up. Here are some reasons why it seems to just work.) • The point-to-point bandwidth of Gigabit Ethernet is about 100 MB/s. The bandwidth of disk is about 100 MB/s. As long as the aggregate bandwidth of network can keep up, everything is fine. • Researchers already face the issue of aggregate network bandwidth in RAM-based programs. The disk is slower than RAM. So, probably traditional parallel programs can cope.
  • 19. Applications from Computational Group Theory (2003–2007) Space State Total Group Size Size Storage 1.17 × 1010 100 bytes Fischer Fi23 1 TB “Baby Monster” 1.35 × 1010 548 bytes 7 TB 1.31 × 1011 64 bytes Janko J4 8 TB (joint with Eric Robinson)
  • 20. History of Rubik’s Cube • Invented in late 1970s in Hungary. • In 1982, in Cubik Math, Singmaster and Frey conjectured: No one knows how many moves would be needed for “God’s Algorithm” assuming he always used the fewest moves required to restore the cube. It has been proven that some patterns must exist that require at least seventeen moves to restore but no one knows what those patterns may be. Experienced group theorists have conjectured that the smallest number of moves which would be sufficient to restore any scrambled pattern — that is, the number of moves required for “God’s Algorithm” — is probably in the low twenties. • Current Best Guess: 20 moves suffice – States needing 20 moves are known
  • 21. History of Rubik’s Cube (cont.) • Invented in late 1970s in Hungary. • 1982: “God’s Number” (number of moves needed) was known by authors of conjecture to be between 17 and 52. • 1990: C., Finkelstein, and Sarawagi showed 11 moves suffice for Rubik’s 2 × 2 × 2 cube (corner cubies only) • 1995: Reid showed 29 moves suffice (lower bound of 20 already known) • 2006: Radu showed 27 moves suffice • 2007 Kunkle and C. showed 26 moves suffice • 2008 Rockiki showed 22 moves suffice (using idle resources at Sony Pictures)
  • 22. Large-Memory Apps: Experience in N.U. Course (mixed undergrads and grads) 1. Chaitin’s Algorithm 2. Fast Permutation Multiplication 3. Kernighan-Lin Partitioning Algorithm 4. Large matrix-matrix Multiplication 5. Voronoi Diagrams 6. Cellular Automata 7. GAA* Search 8. Static Performance Evaluation for Memory Bound Computing Others: [BFS using External Sort] BFS using External Sort [BFS using Segments & Hash Array] BFS using Segments & Hash Array [Fast Permutation Multiplication] Fast Permutation Multiplication [Kernighan-Lin Partitioning Algorithm] Kernighan-Lin Partitioning Algorithm [Large matrix-matrix Multiplication] Large matrix-matrix Multiplication
  • 23. Example: Rubik’s Cube: Sorting Delayed Duplicate Detection 1. Breadth-first search: storing new frontier (open list) on disk 2. Use Bucket Sorting to sort and eliminate duplicate states from the new frontier (The bucket size is chosen to fit in RAM (the new cache). 3. Storing the new frontier requires 6 terabytes of disk space (and we would use more if we had it). Saving a large new frontier on disk prior to sorting delays duplicate detection, but makes the routine more efficient due to economies of scale.
  • 24. Rubik’s Cube: Two-Bit trick 1. The final representation of the state space (1.4 × 1012 states) could use only 2 bits per state. (We use 4 bits per state for convenience.) 2. We used mathematical group theory to derive a highly dense, perfect hash function (no collisions) for the states of |cube|/|S|. 3. Our hash function represents symmetrized cosets (the union of all symmetric states of |cube|/|S| under the symmetries of the cube). 4. Each hash slot need only store the level in the search tree modulo 3. This allows the algorithm to distinguish states from the current frontier, the next frontier, and the previous frontier (current level; current level plus one; and current level minus one). This is all that is needed.
  • 25. Space-Time Tradeoffs using Additional Disk • Use even more disk space in order to speed up the algorithm. “A Comparative Analysis of Parallel Disk-Based Methods for Enumerating Implicit Graphs”, Eric Robinson, Daniel Kunkle and Gene Cooperman, Proc. of 2007 International Workshop on Parallel Symbolic and Algebraic Computation (PASCO ’07), ACM Press, 2007, pp. 78–87
  • 26. LONGER-TERM GOAL: Mini-Language Extension Well-understood building blocks already exist: external sorting, B-trees, Bloom filters, Delayed Duplicate Detection, Distributed Hash Trees (DHT), and some still more exotic algorithms. GOAL: Provide language extensions for common data structures and algorithms (including breadth-first search) that invoke a run-time library. Design the language to bias the programmer toward efficient use of disk. ROOMY LANGUAGE: New Parallel Disk-Based Language, Roomy, in development by Daniel Kunkle. Implementation: Run-time C library with #define and typedef for nicer syntax. Language appears to be sequential; back-end based on cluster with local disks; or cluster with SAN; or single computer using RAM (for simpler development and debugging) Expected availability: mid-2009