Lec13 multidevice
Taras Zakharchenko
1.
Programming Multiple Devices
Perhaad Mistry & Dana Schaa, Northeastern University Computer Architecture Research Lab, with Benedict R. Gaster, AMD © 2011
2.
Instructor Notes
This lecture describes the different ways to work with multiple devices in OpenCL (i.e., within a single context and using multiple contexts) and the tradeoffs associated with each approach.
The lecture concludes with a brief discussion of the heterogeneous load-balancing issues that arise when working with multiple devices.
3.
Approaches to Multiple Devices
Single context, multiple devices: the standard way to work with multiple devices in OpenCL.
Multiple contexts, multiple devices: computing on a cluster, across multiple systems, etc.
Considerations for CPU-GPU heterogeneous computing.
4.
Single Context, Multiple Devices
Nomenclature:
"clEnqueue*" describes any of the clEnqueue commands (i.e., those that interact with a device), e.g. clEnqueueNDRangeKernel(), clEnqueueReadImage().
"clEnqueueRead*" and "clEnqueueWrite*" describe reading from or writing to either buffers or images, e.g. clEnqueueReadBuffer(), clEnqueueWriteImage().
5.
Single Context, Multiple Devices
Specific devices are associated with a context by passing a list of the desired devices to clCreateContext().
The call clCreateContextFromType() takes a device type (or combination of types) as a parameter and creates a context with all devices of that type.
6.
Single Context, Multiple Devices
When multiple devices are part of the same context, most OpenCL objects are shared: memory objects, programs, kernels, etc.
One command queue must exist per device; the command queue is how OpenCL specifies the target device.
Any clEnqueue* function takes a command queue as an argument.
7.
Single Context, Multiple Devices
While memory objects are common to a context, they must be explicitly written to a device before being used.
Whether or not the same object can be valid on multiple devices is vendor-specific.
OpenCL does not assume that data can be transferred directly between devices, so commands only exist to move data from host to device or from device to host.
Copying from one device to another requires an intermediate transfer to the host.
Figure (one context, two devices), steps of a device-to-device copy:
0) The object starts on device 0.
1) clEnqueueRead*(cq0, ...) copies the object to the host.
2) The object is now valid on the host.
3) clEnqueueWrite*(cq1, ...) copies the object to device 1.
4) The object ends up on device 1.
Two PCIe data transfers are required.
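The cost of the staged copy above can be estimated with simple arithmetic. The following is a minimal sketch, not part of the original slides: it assumes the two transfers run one after the other at the same sustained bus bandwidth and ignores latency and driver overhead. The function name and parameters are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Rough estimate (an assumption-laden model, not a measurement) of the
   seconds needed to move `bytes` from device 0 to device 1 when every
   byte crosses the bus twice: device 0 -> host, then host -> device 1. */
double staged_copy_seconds(size_t bytes, double bus_bytes_per_second) {
    assert(bus_bytes_per_second > 0.0);
    double one_hop = (double)bytes / bus_bytes_per_second;
    return 2.0 * one_hop;  /* read to host + write to second device */
}
```

For example, under these assumptions 1 GiB staged over a bus that sustains 8 GiB/s is estimated at 0.25 seconds, twice the cost of a single host-device transfer.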
8.
Single Context, Multiple Devices
The behavior of a memory object written to multiple devices is vendor-specific.
OpenCL does not define whether a copy of the object is made or whether the object remains valid once written to a device.
We can imagine that a CPU would operate on a memory object in place, while a GPU would make a copy (so the original would still be valid until it is explicitly overwritten).
Fusion GPUs from AMD could potentially operate on data in place as well.
Currently AMD/NVIDIA implementations allow an object to be copied to multiple devices (even if the object will be written to).
When data is read back, separate host pointers must be supplied or one set of results will be clobbered.
Figure (one context, two devices): when writing data to a GPU, a copy is made, so multiple writes are valid, e.g. clEnqueueWrite*(cq0, ...) followed by clEnqueueWrite*(cq1, ...).
9.
Single Context, Multiple Devices
Just like writing a multi-threaded CPU program, we have two choices for designing multi-GPU programs:
Redundantly copy all data and index using global offsets.
Split the data into subsets and index into the subset.
Figure: with redundant copies, GPU 0 and GPU 1 each hold all of A and the threads are numbered 0 to 7 across both devices; with split data, GPU 0 holds A0 and GPU 1 holds A1, and the threads are numbered 0 to 3 on each device.
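The two indexing schemes can be compared on the host. The following is a minimal sketch, not taken from the slides: plain C loops stand in for the threads of each device, the "kernel" simply scales each element, and all function names are hypothetical. It assumes the element count divides evenly among the devices.

```c
#include <assert.h>
#include <stddef.h>

/* Strategy 1: every device sees the full array A and indexes it with a
   global offset. `offset` is the first element owned by this device. */
void scale_with_global_offset(const float *full_a, float *full_out,
                              size_t offset, size_t count, float factor) {
    for (size_t local_id = 0; local_id < count; ++local_id) {
        size_t global_id = offset + local_id;  /* e.g. 4..7 on device 1 */
        full_out[global_id] = factor * full_a[global_id];
    }
}

/* Strategy 2: each device receives only its own slice (A0 or A1) and
   indexes that slice from zero. */
void scale_subset(const float *sub_a, float *sub_out,
                  size_t count, float factor) {
    for (size_t local_id = 0; local_id < count; ++local_id) {
        sub_out[local_id] = factor * sub_a[local_id];  /* 0..3 on each */
    }
}

/* Runs both strategies over `num_devices` equal shares of `n` elements.
   Returns 1 when the two outputs are identical, 0 otherwise. */
int strategies_agree(const float *a, float *out1, float *out2,
                     size_t n, size_t num_devices, float factor) {
    assert(num_devices > 0 && n % num_devices == 0);
    size_t chunk = n / num_devices;
    for (size_t d = 0; d < num_devices; ++d) {
        scale_with_global_offset(a, out1, d * chunk, chunk, factor);
        scale_subset(a + d * chunk, out2 + d * chunk, chunk, factor);
    }
    for (size_t i = 0; i < n; ++i) {
        if (out1[i] != out2[i]) return 0;
    }
    return 1;
}
```

Both strategies compute the same result; they differ in how much data each device must hold and in whether the kernel needs to know its position in the full array.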
10.
Single Context, Multiple Devices
OpenCL provides mechanisms to help with both multi-device techniques.
clEnqueueNDRangeKernel() optionally takes offsets that are used when computing the global ID of a thread.
Note that for this technique to work, any objects that are written to will have to be synchronized manually.
Sub-buffers were introduced in OpenCL 1.1 to allow a buffer to be split into multiple objects.
This allows reading from and writing to offsets within a buffer, which avoids manually splitting and recombining data.
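Either mechanism needs a per-device starting position and size. The following is a minimal sketch of one way to compute them when the element count does not divide evenly; the struct and function names are hypothetical and are not part of the OpenCL API. The resulting origin could serve as the global work offset passed to clEnqueueNDRangeKernel() or, converted to bytes, as the origin of a sub-buffer region. Real sub-buffer origins are additionally subject to the device's base address alignment requirement, which this sketch does not handle.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for an {origin, size} region description.
   Both fields count elements here, not bytes. */
struct share {
    size_t origin;  /* first element owned by the device */
    size_t count;   /* number of elements owned by the device */
};

/* Splits n elements across num_devices as evenly as possible. When n is
   not a multiple of num_devices, the first (n % num_devices) devices
   each receive one extra element. */
struct share device_share(size_t n, size_t num_devices, size_t device) {
    assert(num_devices > 0 && device < num_devices);
    size_t base = n / num_devices;
    size_t extra = n % num_devices;
    struct share s;
    s.count = base + (device < extra ? 1 : 0);
    /* Devices before this one contribute `base` elements each, plus one
       extra element for each of them that received a remainder item. */
    s.origin = device * base + (device < extra ? device : extra);
    return s;
}
```

With 10 elements and 3 devices, this gives shares of 4, 3, and 3 elements starting at elements 0, 4, and 7.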
11.
Single Context, Multiple Devices
OpenCL events are used to synchronize execution on different devices within a context.
Each clEnqueue* function generates an event that identifies the operation.
Each clEnqueue* function also takes an optional list of events that must complete before that operation can occur.
clEnqueueWaitForEvents() is the specific call to wait for a list of events to complete.
Events are also used for profiling and were covered in more detail in Lecture 11.
12.
Multiple Contexts, Multiple Devices
An alternative approach is to create a redundant OpenCL context (with associated objects) per device.
This may be an easier way to split data (depending on the algorithm).
There is no need to worry about coding for a variable number of devices.
CPU-based synchronization primitives (such as locks, barriers, etc.) can be used.
Communication uses host-based libraries.
13.
Multiple Contexts, Multiple Devices
Follows the SPMD model more closely.
This is CUDA/C's runtime-API approach to multi-device code.
No code is required to explicitly move data between a variable number of devices.
Using functions such as scatter/gather, broadcast, etc. may be easier than creating sub-buffers for a variable number of devices.
Supports distributed programming.
If a distributed framework such as MPI is used for communication, programs can be run on multi-device machines or in distributed environments.
14.
Multiple Contexts, Multiple Devices
In addition to the PCI-Express transfers required to move data between host and device, extra memory and network communication may be required.
Host libraries (e.g., pthreads, MPI) must be used for synchronization and communication.
15.
Heterogeneous Computing
Targeting heterogeneous devices (e.g., CPUs and GPUs at the same time) requires awareness of their different performance characteristics for an application.
To generalize: [comparison table not preserved in the transcript; its footnote reads "*otherwise application wouldn't use OpenCL"]
16.
Heterogeneous Computing
Factors to consider:
Scheduling overhead: what is the startup time of each device?
Location of data: which device is the data currently resident on? Data must be transferred across the PCI-Express bus.
Granularity of workloads: how should the problem be divided? What is the ratio of startup time to actual work?
Execution performance relative to other devices: how should the work be distributed?
17.
Heterogeneous Computing
The granularity of scheduling units must be weighed.
Workload sizes that are too large may execute slowly on a device, stalling overall completion.
Workload sizes that are too small may be dominated by startup overhead.
Approach to load-balancing #1:
Begin by scheduling small workload sizes.
Profile execution times on each device.
Extrapolate the execution profiles to larger workload sizes.
Schedule with larger workload sizes to avoid unnecessary overhead.
Approach to load-balancing #2:
If one device is much faster than anything else in the system, just run on that device.
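The profile-and-extrapolate idea in approach #1 can be sketched for two devices. This is a minimal illustration under a strong assumption that is not stated in the slides: each device's execution time is modeled as linear in the workload size, time(n) = startup + per_item * n. All names are hypothetical, and the timings would in practice come from event profiling.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed linear cost model for one device:
   time(n) = startup + per_item * n. */
struct cost_model {
    double startup;
    double per_item;
};

/* Fits the model from two profiled runs with small sizes n_a < n_b. */
struct cost_model fit_cost_model(size_t n_a, double t_a,
                                 size_t n_b, double t_b) {
    assert(n_b > n_a);
    struct cost_model m;
    m.per_item = (t_b - t_a) / (double)(n_b - n_a);
    m.startup = t_a - m.per_item * (double)n_a;
    return m;
}

/* Extrapolates the fitted model to a workload of n items. */
double predict_time(struct cost_model m, size_t n) {
    return m.startup + m.per_item * (double)n;
}

/* Returns how many of `total` items to give device 0 so that both
   devices are predicted to finish at the same time. Solves
   s0 + c0*n0 = s1 + c1*(total - n0) for n0 and clamps the result,
   so a much faster device ends up with all the work (approach #2). */
size_t balanced_share(struct cost_model d0, struct cost_model d1,
                      size_t total) {
    assert(d0.per_item > 0.0 && d1.per_item > 0.0);
    double n0 = (d1.startup - d0.startup + d1.per_item * (double)total)
                / (d0.per_item + d1.per_item);
    if (n0 <= 0.0) return 0;
    if (n0 >= (double)total) return total;
    return (size_t)(n0 + 0.5);  /* round to the nearest whole item */
}
```

When the solution falls outside the valid range, the clamp assigns everything to one device, which matches the case where one device is much faster than the other.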
18.
Summary
There are different approaches to multi-device programming.
Single context, multiple devices: can only communicate with devices recognized by one vendor; code must be written for a general number of devices.
Multiple contexts, multiple devices: more like distributed programming; code can be written for a single device (or multiple devices), with explicit movement of data between contexts.
Editor's notes
If CPUs were being used and operating on data in-place, the commands to transfer data would likely not actually do any copying.
If the contexts were on the same physical machine, pthreads could be used to communicate and synchronize. In a distributed setting, MPI could be used.