Suche senden
Hochladen
Lec13 multidevice
•
Als PPTX, PDF herunterladen
•
2 gefällt mir
•
591 views
Taras Zakharchenko
Folgen
Technologie
Melden
Teilen
Melden
Teilen
1 von 18
Jetzt herunterladen
Empfohlen
Lec07 threading hw
Lec07 threading hw
Taras Zakharchenko
Lec09 nbody-optimization
Lec09 nbody-optimization
Taras Zakharchenko
Lec11 timing
Lec11 timing
Taras Zakharchenko
Lec08 optimizations
Lec08 optimizations
Taras Zakharchenko
Lec05 buffers basic_examples
Lec05 buffers basic_examples
Taras Zakharchenko
Lec02 03 opencl_intro
Lec02 03 opencl_intro
Taras Zakharchenko
Lec06 memory
Lec06 memory
Taras Zakharchenko
Solving Endgames in Large Imperfect-Information Games such as Poker
Solving Endgames in Large Imperfect-Information Games such as Poker
Karel Ha
Empfohlen
Lec07 threading hw
Lec07 threading hw
Taras Zakharchenko
Lec09 nbody-optimization
Lec09 nbody-optimization
Taras Zakharchenko
Lec11 timing
Lec11 timing
Taras Zakharchenko
Lec08 optimizations
Lec08 optimizations
Taras Zakharchenko
Lec05 buffers basic_examples
Lec05 buffers basic_examples
Taras Zakharchenko
Lec02 03 opencl_intro
Lec02 03 opencl_intro
Taras Zakharchenko
Lec06 memory
Lec06 memory
Taras Zakharchenko
Solving Endgames in Large Imperfect-Information Games such as Poker
Solving Endgames in Large Imperfect-Information Games such as Poker
Karel Ha
Distributed implementation of a lstm on spark and tensorflow
Distributed implementation of a lstm on spark and tensorflow
Emanuel Di Nardo
A Random Forest using a Multi-valued Decision Diagram on an FPGa
A Random Forest using a Multi-valued Decision Diagram on an FPGa
Hiroki Nakahara
Parallel computation
Parallel computation
Jayanti Prasad Ph.D.
FCN-Based 6D Robotic Grasping for Arbitrary Placed Objects
FCN-Based 6D Robotic Grasping for Arbitrary Placed Objects
Kusano Hitoshi
An Introduction to TensorFlow architecture
An Introduction to TensorFlow architecture
Mani Goswami
FPT17: An object detector based on multiscale sliding window search using a f...
FPT17: An object detector based on multiscale sliding window search using a f...
Hiroki Nakahara
TENSORFLOW: ARCHITECTURE AND USE CASE - NASA SPACE APPS CHALLENGE by Gema Par...
TENSORFLOW: ARCHITECTURE AND USE CASE - NASA SPACE APPS CHALLENGE by Gema Par...
Big Data Spain
Task and Data Parallelism
Task and Data Parallelism
Sasha Goldshtein
Example uses of gpu compute models
Example uses of gpu compute models
Pedram Mazloom
Seattle Scalability Meetup 6-26-13
Seattle Scalability Meetup 6-26-13
specialk29
CUDA and Caffe for deep learning
CUDA and Caffe for deep learning
Amgad Muhammad
Experiences of numerical simulations on a PC cluster
Experiences of numerical simulations on a PC cluster
Antti Vanne
Comp7404 ai group_project_15apr2018_v2.1
Comp7404 ai group_project_15apr2018_v2.1
paul0001
Adam Sitnik "State of the .NET Performance"
Adam Sitnik "State of the .NET Performance"
Yulia Tsisyk
Green scheduling
Green scheduling
Vincenzo De Maio
A Tale of Data Pattern Discovery in Parallel
A Tale of Data Pattern Discovery in Parallel
Jenny Liu
Salt Identification Challenge
Salt Identification Challenge
kenluck2001
Exploring Gpgpu Workloads
Exploring Gpgpu Workloads
Unai Lopez-Novoa
Performance and predictability
Performance and predictability
RichardWarburton
TensorFlow and Keras: An Overview
TensorFlow and Keras: An Overview
Poo Kuan Hoong
Seastar at Linux Foundation Collaboration Summit
Seastar at Linux Foundation Collaboration Summit
Don Marti
Parallel and Distributed Computing Chapter 8
Parallel and Distributed Computing Chapter 8
AbdullahMunir32
Weitere ähnliche Inhalte
Was ist angesagt?
Distributed implementation of a lstm on spark and tensorflow
Distributed implementation of a lstm on spark and tensorflow
Emanuel Di Nardo
A Random Forest using a Multi-valued Decision Diagram on an FPGa
A Random Forest using a Multi-valued Decision Diagram on an FPGa
Hiroki Nakahara
Parallel computation
Parallel computation
Jayanti Prasad Ph.D.
FCN-Based 6D Robotic Grasping for Arbitrary Placed Objects
FCN-Based 6D Robotic Grasping for Arbitrary Placed Objects
Kusano Hitoshi
An Introduction to TensorFlow architecture
An Introduction to TensorFlow architecture
Mani Goswami
FPT17: An object detector based on multiscale sliding window search using a f...
FPT17: An object detector based on multiscale sliding window search using a f...
Hiroki Nakahara
TENSORFLOW: ARCHITECTURE AND USE CASE - NASA SPACE APPS CHALLENGE by Gema Par...
TENSORFLOW: ARCHITECTURE AND USE CASE - NASA SPACE APPS CHALLENGE by Gema Par...
Big Data Spain
Task and Data Parallelism
Task and Data Parallelism
Sasha Goldshtein
Example uses of gpu compute models
Example uses of gpu compute models
Pedram Mazloom
Seattle Scalability Meetup 6-26-13
Seattle Scalability Meetup 6-26-13
specialk29
CUDA and Caffe for deep learning
CUDA and Caffe for deep learning
Amgad Muhammad
Experiences of numerical simulations on a PC cluster
Experiences of numerical simulations on a PC cluster
Antti Vanne
Comp7404 ai group_project_15apr2018_v2.1
Comp7404 ai group_project_15apr2018_v2.1
paul0001
Adam Sitnik "State of the .NET Performance"
Adam Sitnik "State of the .NET Performance"
Yulia Tsisyk
Green scheduling
Green scheduling
Vincenzo De Maio
A Tale of Data Pattern Discovery in Parallel
A Tale of Data Pattern Discovery in Parallel
Jenny Liu
Salt Identification Challenge
Salt Identification Challenge
kenluck2001
Exploring Gpgpu Workloads
Exploring Gpgpu Workloads
Unai Lopez-Novoa
Performance and predictability
Performance and predictability
RichardWarburton
TensorFlow and Keras: An Overview
TensorFlow and Keras: An Overview
Poo Kuan Hoong
Was ist angesagt?
(20)
Distributed implementation of a lstm on spark and tensorflow
Distributed implementation of a lstm on spark and tensorflow
A Random Forest using a Multi-valued Decision Diagram on an FPGa
A Random Forest using a Multi-valued Decision Diagram on an FPGa
Parallel computation
Parallel computation
FCN-Based 6D Robotic Grasping for Arbitrary Placed Objects
FCN-Based 6D Robotic Grasping for Arbitrary Placed Objects
An Introduction to TensorFlow architecture
An Introduction to TensorFlow architecture
FPT17: An object detector based on multiscale sliding window search using a f...
FPT17: An object detector based on multiscale sliding window search using a f...
TENSORFLOW: ARCHITECTURE AND USE CASE - NASA SPACE APPS CHALLENGE by Gema Par...
TENSORFLOW: ARCHITECTURE AND USE CASE - NASA SPACE APPS CHALLENGE by Gema Par...
Task and Data Parallelism
Task and Data Parallelism
Example uses of gpu compute models
Example uses of gpu compute models
Seattle Scalability Meetup 6-26-13
Seattle Scalability Meetup 6-26-13
CUDA and Caffe for deep learning
CUDA and Caffe for deep learning
Experiences of numerical simulations on a PC cluster
Experiences of numerical simulations on a PC cluster
Comp7404 ai group_project_15apr2018_v2.1
Comp7404 ai group_project_15apr2018_v2.1
Adam Sitnik "State of the .NET Performance"
Adam Sitnik "State of the .NET Performance"
Green scheduling
Green scheduling
A Tale of Data Pattern Discovery in Parallel
A Tale of Data Pattern Discovery in Parallel
Salt Identification Challenge
Salt Identification Challenge
Exploring Gpgpu Workloads
Exploring Gpgpu Workloads
Performance and predictability
Performance and predictability
TensorFlow and Keras: An Overview
TensorFlow and Keras: An Overview
Ähnlich wie Lec13 multidevice
Seastar at Linux Foundation Collaboration Summit
Seastar at Linux Foundation Collaboration Summit
Don Marti
Parallel and Distributed Computing Chapter 8
Parallel and Distributed Computing Chapter 8
AbdullahMunir32
Survey of open source cloud architectures
Survey of open source cloud architectures
abhinav vedanbhatla
At the Crossroads of HPC and Cloud Computing with Openstack
At the Crossroads of HPC and Cloud Computing with Openstack
Ryan Aydelott
Clone cloud
Clone cloud
Bhagavathi Dhass
Programmable Exascale Supercomputer
Programmable Exascale Supercomputer
Sagar Dolas
djypllh5r1gjbaekxgwv-signature-cc6692615bbc55079760b9b0c6636bc58ec509cd0446cb...
djypllh5r1gjbaekxgwv-signature-cc6692615bbc55079760b9b0c6636bc58ec509cd0446cb...
Dr. Thippeswamy S.
Systems Support for Many Task Computing
Systems Support for Many Task Computing
Eric Van Hensbergen
Engineer Engineering Software
Engineer Engineering Software
Yung-Yu Chen
Exascale Capabl
Exascale Capabl
Sagar Dolas
Lec+3-Introduction-to-Distributed-Systems.pdf
Lec+3-Introduction-to-Distributed-Systems.pdf
samaghorab
ZCloud Consensus on Hardware for Distributed Systems
ZCloud Consensus on Hardware for Distributed Systems
Gokhan Boranalp
Computer_Clustering_Technologies
Computer_Clustering_Technologies
Manish Chopra
Parallel computing persentation
Parallel computing persentation
VIKAS SINGH BHADOURIA
Openstack_administration
Openstack_administration
Ashish Sharma
Cluster computer
Cluster computer
Ashraful Hoda
Underlying principles of parallel and distributed computing
Underlying principles of parallel and distributed computing
GOVERNMENT COLLEGE OF ENGINEERING,TIRUNELVELI
parallel programming models
parallel programming models
Swetha S
Clustering by AKASHMSHAH
Clustering by AKASHMSHAH
Akash M Shah
Microsoft Dryad
Microsoft Dryad
Colin Clark
Ähnlich wie Lec13 multidevice
(20)
Seastar at Linux Foundation Collaboration Summit
Seastar at Linux Foundation Collaboration Summit
Parallel and Distributed Computing Chapter 8
Parallel and Distributed Computing Chapter 8
Survey of open source cloud architectures
Survey of open source cloud architectures
At the Crossroads of HPC and Cloud Computing with Openstack
At the Crossroads of HPC and Cloud Computing with Openstack
Clone cloud
Clone cloud
Programmable Exascale Supercomputer
Programmable Exascale Supercomputer
djypllh5r1gjbaekxgwv-signature-cc6692615bbc55079760b9b0c6636bc58ec509cd0446cb...
djypllh5r1gjbaekxgwv-signature-cc6692615bbc55079760b9b0c6636bc58ec509cd0446cb...
Systems Support for Many Task Computing
Systems Support for Many Task Computing
Engineer Engineering Software
Engineer Engineering Software
Exascale Capabl
Exascale Capabl
Lec+3-Introduction-to-Distributed-Systems.pdf
Lec+3-Introduction-to-Distributed-Systems.pdf
ZCloud Consensus on Hardware for Distributed Systems
ZCloud Consensus on Hardware for Distributed Systems
Computer_Clustering_Technologies
Computer_Clustering_Technologies
Parallel computing persentation
Parallel computing persentation
Openstack_administration
Openstack_administration
Cluster computer
Cluster computer
Underlying principles of parallel and distributed computing
Underlying principles of parallel and distributed computing
parallel programming models
parallel programming models
Clustering by AKASHMSHAH
Clustering by AKASHMSHAH
Microsoft Dryad
Microsoft Dryad
Kürzlich hochgeladen
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level piece
charlottematthew16
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
Florian Wilhelm
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
Fwdays
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
Stephanie Beckett
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
Fwdays
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
ScyllaDB
How to write a Business Continuity Plan
How to write a Business Continuity Plan
Databarracks
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
Slibray Presentation
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
RankYa
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
Miki Katsuragi
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
BookNet Canada
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
Rizwan Syed
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
comworks
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
Dilum Bandara
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
Sri Ambati
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Precisely
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
Kalema Edgar
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
Manik S Magar
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
Scott Keck-Warren
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
Lars Bell
Kürzlich hochgeladen
(20)
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level piece
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
How to write a Business Continuity Plan
How to write a Business Continuity Plan
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
Lec13 multidevice
1.
Programming Multiple Devices
Perhaad Mistry & Dana Schaa, Northeastern University Computer Architecture Research Lab, with Benedict R. Gaster, AMD © 2011
2.
Instructor Notes This
lecture describes the different ways to work with multiple devices in OpenCL (i.e., within a single context and using multiple contexts), and the tradeoffs associated with each approach The lecture concludes with a quick discussion of heterogeneous load-balancing issues when working with multiple devices 2 Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011
3.
Approaches to Multiple
Devices Single context, multiple devices Standard way to work with multiple devices in OpenCL Multiple contexts, multiple devices Computing on a cluster, multiple systems, etc. Considerations for CPU-GPU heterogeneous computing 3 Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011
4.
Single Context, Multiple
Devices Nomenclature: “clEnqueue*” is used to describe any of the clEnqueue commands (i.e., those that interact with a device) E.g. clEnqueueNDRangeKernel(), clEnqueueReadImage() “clEnqueueRead*” and “clEnqueueWrite*” are used to describe reading/writing to either buffers or images E.g. clEnqueueReadBuffer(), clEnqueueWriteImage() 4 Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011
5.
Single Context, Multiple
Devices Associating specific devices with a context is done by passing a list of the desired devices to clCreateContext() The call clCreateContextFromType() takes a device type (or combination of types) as a parameter and creates a context with all devices of that type: 5 Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011
6.
Single Context, Multiple
Devices When multiple devices are part of the same context, most OpenCL objects are shared Memory objects, programs, kernels, etc. One command queue must exist per device and is supplied in OpenCL when the target GPU needs to be specified Any clEnqueue* function takes a command queue as an argument Context 6 Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011
7.
Single Context, Multiple
Devices While memory objects are common to a context, they must be explicitly written to a device before being used Whether or not the same object can be valid on multiple devices is vendor specific OpenCL does not assume that data can be transferred directly between devices, so commands only exists to move from a host to device, or device to host Copying from one device to another requires an intermediate transfer to the host Context 2) Object now valid on host 1) clEnqueueRead*(cq0, ...) copies object to host 3) clEnqueueWrite*(cq1, ...) copies object to device 1 0) Object starts on device 0 4) Object ends up on device 1 TWO PCIe DATA TRANSFERS ARE REQUIRED 7 Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011
8.
Single Context, Multiple
Devices The behavior of a memory object written to multiple devices is vendor-specific OpenCL does not define if a copy of the object is made or whether the object remains valid once written to a device We can imagine that a CPU would operate on a memory object in-place, while a GPU would make a copy (so the original would still be valid until it is explicitly written over) Fusion GPUs from AMD could potentially operate on data in-place as well Currently AMD/NVIDIA implementations allow an object to be copied to multiple devices (even if the object will be written to) When data is read back, separate host pointers must be supplied or one set of results will be clobbered Context When writing data to a GPU, a copy is made, so multiple writes are valid clEnqueueWrite*(cq0, ...) clEnqueueWrite*(cq1, ...) 8 Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011
9.
Single Context, Multiple
Devices Just like writing a multi-threaded CPU program, we have two choices for designing multi-GPU programs Redundantly copy all data and index using global offsets Split the data into subsets and index into the subset GPU 1 GPU 0 A A Threads 0 1 2 3 4 5 6 7 GPU 1 GPU 0 A0 A1 Threads 0 1 2 3 0 1 2 3 9 Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011
10.
Single Context, Multiple
Devices OpenCL provides mechanisms to help with both multi-device techniques clEnqueueNDRangeKernel() optionally takes offsets that are used when computing the global ID of a thread Note that for this technique to work, any objects that are written to will have to be synchronized manually SubBufferswere introduced in OpenCL 1.1 to allow a buffer to be split into multiple objects This allows reading/writing to offsets within a buffer to avoid manually splitting and recombining data 10 Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011
11.
Single Context, Multiple
Devices OpenCLevents are used to synchronize execution on different devices within a context Each clEnqueue* function generates an event that identifies the operation Each clEnqueue* function also takes an optional list of events that must complete before that operation should occur clEnqueueWaitForEvents() is the specific call to wait for a list of events to complete Events are also used for profiling and were covered in more detail in Lecture 11 11 Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011
12.
Multiple Contexts, Multiple
Devices An alternative approach is to create a redundant OpenCL context (with associated objects) per device Perhaps is an easier way to split data (based on the algorithm) Would not have to worry about coding for a variable number of devices Could use CPU-based synchronization primitives (such as locks, barriers, etc.) Communicate using host-based libraries Context Context 12 Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011
13.
Follows SPMD model
more closely CUDA/C’s runtime-API approach to multi-device code No code required to consider explicitly moving data between a variable number of devices Using functions such as scatter/gather, broadcast, etc. may be easier than creating subbuffers, etc. for a variable number of devices Supports distributed programming If a distributed framework such as MPI is used for communication, programs can be ran on multi-device machines or in distributed environments Multiple Contexts, Multiple Devices 13 Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011
14.
In addition to
PCI-Express transfers required to move data between host and device, extra memory and network communication may be required Host libraries (e.g., pthreads, MPI) must be used for synchronization and communication Multiple Contexts, Multiple Devices 14 Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011
15.
Heterogeneous Computing Targeting
heterogeneous devices (e.g., CPUs and GPUs at the same time) requires awareness of their different performance characteristics for an application To generalize: Context *otherwise application wouldn’t use OpenCL 15 Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011
16.
Heterogeneous Computing Factors
to consider Scheduling overhead What is the startup time of each device? Location of data Which device is the data currently resident on? Data must be transferred across the PCI-Express bus Granularity of workloads How should the problem be divided? What is the ratio of startup time to actual work Execution performance relative to other devices How should the work be distributed? 16 Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011
17.
Heterogeneous Computing Granularity
of scheduling units must be weighed Workload sizes that are too large may execute slowly on a device, stalling overall completion Workload sizes that are too small may be dominated by startup overhead Approach to load-balancing #1: Begin scheduling small workload sizes Profile execution times on each device Extrapolate execution profiles for larger workload sizes Schedule with larger workload sizes to avoid unnecessary overhead Approach to load-balancing #2: If one device is much faster than anything else in the system, just run on that device 17 Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011
18.
Summary There are
different approaches to multi-device programming Single context, multiple devices Can only communicate with devices recognized by one vendor Code must be written for a general number of devices Multiple contexts, multiple devices More like distributed programming Code can be written for a single device (or multiple devices), with explicit movement of data between contexts 18 Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011
Hinweis der Redaktion
If CPUs were being used and operating on data in-place, the commands to transfer data would likely not actually do any copying.
If the contexts were on the same physical machine, pthreads could be used to communicate and synchronize. In a distributed setting, MPI could be used.
Jetzt herunterladen