SlideShare ist ein Scribd-Unternehmen logo
1 von 30
Downloaden Sie, um offline zu lesen
Evgeny Fiksman, Sergey Vinogradov and Michael Voss
Intel Corporation
November 2016
Intel® Threading Building Blocks (Intel® TBB)
Celebrating it’s 10 year anniversary in 2016!
A widely used C++ template library for parallel programming
What
Parallel algorithms and data structures
Threads and synchronization primitives
Scalable memory allocation and task scheduling
Benefits
Is a library-only solution that does not depend on special compiler support
Is both a commercial product and an open-source project
Supports C++, Windows*, Linux*, OS X*, Android* and other OSes
Commercial support for Intel® AtomTM, CoreTM, Xeon® processors and for Intel® Xeon PhiTM coprocessors
http://threadingbuildingblocks.org http://software.intel.com/intel-tbb
3
Applications often contain three levels of
parallelism
Task Parallelism /
Message Passing
fork-join
SIMD SIMD SIMD
fork-join
SIMD SIMD SIMD
4
Intel® Threading Building Blocks
threadingbuildingblocks.org
Generic Parallel
Algorithms
Efficient scalable
way to exploit the
power of multi-core
without having to
start from scratch.
Concurrent Containers
Concurrent access, and a scalable alternative to
serial containers with external locking
Task Scheduler
Sophisticated work scheduling engine that
empowers parallel algorithms and flow
graph
Threads
OS API
wrappers
Miscellaneous
Thread-safe
timers and
exception classes
Memory Allocation
Scalable memory manager and false-sharing free allocators
Synchronization Primitives
Atomic operations, a variety of mutexes with different
properties, condition variables
Flow Graph
A set of classes to
express parallelism
as a graph of
compute
dependencies
and/or data flow
Parallel algorithms and data structures
Threads and synchronization
Memory allocation and task scheduling
Thread Local Storage
Unlimited number of
thread-local variables
5
Mandelbrot Speedup
Intel® Threading Building Blocks (Intel® TBB)
parallel_for( 0, max_row,
[&](int i) {
for (int j = 0; j < max_col; j++)
p[i][j]=mandel(Complex(scale(i),scale(j)),depth);
}
);
int mandel(Complex c, int max_count) {
int count = 0; Complex z = 0;
for (int i = 0; i < max_count; i++) {
if (abs(z) >= 2.0) break;
z = z*z + c; count++;
}
return count;
}
Parallel algorithm
Use C++ lambda functions to define function object in-line
Task is a function object
6
Intel Threading Building Blocks flow graph
Efficient implementation of dependency graph and data flow algorithms
Design for shared memory application
Enables developers to exploit parallelism at higher levels
graph g;
continue_node< continue_msg > h( g,
[]( const continue_msg & ) {
cout << “Hello “;
} );
continue_node< continue_msg > w( g,
[]( const continue_msg & ) {
cout << “Worldn“;
} );
make_edge( h, w );
h.try_put(continue_msg());
g.wait_for_all();
Hello World
7
Intel TBB Flow Graph node types:
Functional
f() f() f(x) f(x)
source_node continue_node function_node multifunction_node
Buffering
buffer_node queue_node priority_queue_node sequencer_node
1 023
Split /
Join
queueing join reserving join tag matching join split_node indexer_node
Other
broadcast_node write_once_node overwrite_node limiter_node
8
An example feature detection algorithm
buffer
get_next_image
preprocess
detect_with_A
detect_with_B
make_decision
Can express pipelining, task parallelism and data parallelism
9
Heterogeneous support in Intel® TBB
Intel TBB as a coordination layer for heterogeneity that provides flexibility,
retains optimization opportunities and composes with existing models
Intel TBB as a composability layer for library implementations
• One threading engine underneath all CPU-side work
Intel TBB flow graph as a coordination layer
• Be the glue that connects hetero HW and SW together
• Expose parallelism between blocks; simplify integration
+
Intel® Threading Building Blocks
OpenVX*
OpenCL*
COI/SCIF
DirectCompute*
Vulkan*
….
FPGAs, integrated and discrete GPUs, co-processors, etc…
1
Feature Description Diagram
async_node<Input,Output Basic building block. Enables async
communication from a single/isolated
node to an async activity. User
responsible for managing
communication. Graph runs on host.
async_msg<T>
Available as preview feature
Basic building block. Enables async
communication with chaining across
graph nodes. User responsible for
managing communication. Graph runs
on the host.
Support for Heterogeneous Programming in Intel TBB
So far all support is within the flow graph API
11
async_node example
• Allows the data flow graph to offload data to any asynchronous
activity and receive the data back to continue execution on the CPU
async_node makes coordinating with
any model easier and efficient
12
igpu
Feature Description Diagram
streaming_node
Available as preview feature
Higher level abstraction for
streaming models; e.g. OpenCL,
Direct X Compute, GFX, etc.... Users
provide Factory that describes
buffers, kernels, ranges, device
selection, etc… Uses async_msg so
supports chaining. Graph runs on
the host.
opencl_node
Available as preview feature
A specialization of streaming_node
for OpenCL. User provides OpenCL
program and kernel and runtime
handles initialization, buffer
management, communications, etc..
Graph runs on host.
Support for Heterogeneous Programming in Intel TBB
So far all support is within the flow graph API
13
Proof-of-concept: distributor_node
NOTE: async_node and composite_node are released features; distributor_node is a proof-of-concept
14
An example application: STAC-A2*
The STAC-A2 Benchmark suite is the industry standard for testing technology stacks used for compute-intensive analytic
workloads involved in pricing and risk management.
STAC-A2 is a set of specifications
 For Market-Risk Analysis, proxy for real life risk analytic and computationally intensive workloads
 Customers define the specifications
 Vendors implement the code
 Intel first published the benchmark results in Supercomputing’12
– http://www.stacresearch.com/SC12_submission_stac.pdf
– http://sc12.supercomputing.org/schedule/event_detail.php?evid=wksp138
STAC-A2 evaluates the Greeks For American-style options
 Monte Carlo based Heston Model with Stochastic Volatility
 Greeks describe the sensitivity of price of options to changes in parameters of the underlying market
– Compute 7 types of Greeks, ex: Theta – sensitivity to the passage of time, Rho – sensitivity for the interest rate
* “STAC” and all STAC names are trademarks or registered trademarks of the Securities Technology Analysis Center LLC.
15
STAC-A2 Implementation Overview
• Implemented with:
• Intel TBB flow graph for task distribution
• Intel TBB parallel algorithms for for-join constructs
• Intel Compiler & OpenMP 4.0 for vectorization
• Intel® Math Kernel Library (Intel® MKL) for RND generation and Matrix operations
• Uses asynchronous support in flow graph to implement “Distributor Node” and offload to the Intel Xeon Phi
coprocessor - heterogeneity
• Using a token-based approach for dynamic load balancing between the main CPU and coprocessors
Application/STAC-A2
TBB Flow Graph
Distributor Node
Communication
Infrastructure
(Intel® MPSS)
TBB
Scheduler
TBB Flow Graph
TBB Scheduler
2x Intel Xeon E5 2697 v3
Intel Xeon Phi 7120P
Intel Xeon Phi 7120P
16
Task Parallelism /
Message Passing
fork-join
SIMD SIMD SIMD
fork-join
SIMD SIMD SIMD
Intel TBB flow graph design of STAC-A2
… … …
Token Pool
Start
Node Greek
Task 1
Greek
Task N-1
Greek
Task N
Greek
Result
Collector
Greek
# Tokens < # Tasks
5 Assets -> N ~ 170
Join
Join
o
o
o
Pricer
Distributor Distributor
RNG
17
The Fork-Join & SIMD layers
for (unsigned int i = 0; i < nPaths; ++i){
double mV[nTimeSteps];
double mY[nTimeSteps];
…
for (unsigned int t = 0; t < nTimeSteps; ++t){
double currState = mY[t] ;
….
double logSpotPrice = func(currState, …);
mY[t+1] = logSpotPrice * A[t];
mV[t+1] = logSpotPrice * B[t] + C[t] * mV[t];
price[i][t] = logSpotPrice*D[t] +E[t] * mV[t];
}
}
Same code runs on Intel Xeon and Intel Xeon Phi
Fork-Join,Composable
withFlowGraph
for (unsigned i = 0; i < nPaths; ++i)
{
double mV[nTimeSteps];
double mY[nTimeSteps];
…..
for (unsigned int t = 0; t < nTimeSteps; ++t){
double currState = mY[t] ; // Backward dependency
….
double logSpotPrice = func(currState, …);
mY[t+1] = logSpotPrice * A[t];
mV[t+1] = logSpotPrice * B[t] + C[t] * mV[t];
price[i][t] = logSpotPrice*D[t] +E[t] * mV[t];
}
}
SIMDLayer
18
The Fork-Join & SIMD layers
tbb::parallel_for(blocked_range<int>(0, nPaths, 256),
[&](const blocked_range<int>& r) {
const int block_size = r.size();
double mV[nTimeSteps][block_size];
double mY[nTimeSteps][block_size];
…
for (unsigned int t = 0; t < nTimeSteps; ++t){
for (unsigned p = 0; i < block_size; ++p)
{
double currState = mY[t][p] ;
….
double logSpotPrice = func(currState, …);
mY[t+1][p] = logSpotPrice * A[t];
mV[t+1][p] = logSpotPrice * B[t] + C[t] * mV[t][p];
price[t][r.begin()+p] = logSpotPrice*D[t] +E[t] * mV[t][p];
}
}
}
Same code runs on Intel Xeon and Intel Xeon Phi
Fork-Join,Composable
withFlowGraph
for (unsigned i = 0; i < nPaths; ++i)
{
double mV[nTimeSteps];
double mY[nTimeSteps];
…..
for (unsigned int t = 0; t < nTimeSteps; ++t){
double currState = mY[t] ; // Backward dependency
….
double logSpotPrice = func(currState, …);
mY[t+1] = logSpotPrice * A[t];
mV[t+1] = logSpotPrice * B[t] + C[t] * mV[t];
price[i][t] = logSpotPrice*D[t] +E[t] * mV[t];
}
}
SIMDLayer
19
The Fork-Join & SIMD layers
tbb::parallel_for(blocked_range<int>(0, nPaths, 256),
[&](const blocked_range<int>& r) {
const int block_size = r.size();
double mV[nTimeSteps][block_size];
double mY[nTimeSteps][block_size];
…
for (unsigned int t = 0; t < nTimeSteps; ++t){
#pragma omp simd
for (unsigned p = 0; i < block_size; ++p)
{
double currState = mY[t][p] ;
….
double logSpotPrice = func(currState, …);
mY[t+1][p] = logSpotPrice * A[t];
mV[t+1][p] = logSpotPrice * B[t] + C[t] * mV[t][p];
price[t][r.begin()+p] = logSpotPrice*D[t] +E[t] * mV[t][p];
}
}
}
Same code runs on Intel Xeon and Intel Xeon Phi
Fork-Join,Composable
withFlowGraph
for (unsigned i = 0; i < nPaths; ++i)
{
double mV[nTimeSteps];
double mY[nTimeSteps];
…..
for (unsigned int t = 0; t < nTimeSteps; ++t){
double currState = mY[t] ; // Backward dependency
….
double logSpotPrice = func(currState, …);
mY[t+1] = logSpotPrice * A[t];
mV[t+1] = logSpotPrice * B[t] + C[t] * mV[t];
price[i][t] = logSpotPrice*D[t] +E[t] * mV[t];
}
}
SIMDLayer
20
The Fork-Join & SIMD layers
#pragma offload_attribute(push, target(mic))
tbb::parallel_for(blocked_range<int>(0, nPaths, 256),
[&](const blocked_range<int>& r) {
const int block_size = r.size();
double mV[nTimeSteps][block_size];
double mY[nTimeSteps][block_size];
…
for (unsigned int t = 0; t < nTimeSteps; ++t){
#pragma omp simd
for (unsigned p = 0; i < block_size; ++p)
{
double currState = mY[t][p] ;
….
double logSpotPrice = func(currState, …);
mY[t+1][p] = logSpotPrice * A[t];
mV[t+1][p] = logSpotPrice * B[t] + C[t] * mV[t][p];
price[t][r.begin()+p] = logSpotPrice*D[t] +E[t] * mV[t][p];
}
}
}
#pragma offload_attribute(pop)
Same code runs on Intel Xeon and Intel Xeon Phi
Fork-Join,Composable
withFlowGraph
for (unsigned i = 0; i < nPaths; ++i)
{
double mV[nTimeSteps];
double mY[nTimeSteps];
…..
for (unsigned int t = 0; t < nTimeSteps; ++t){
double currState = mY[t] ; // Backward dependency
….
double logSpotPrice = func(currState, …);
mY[t+1] = logSpotPrice * A[t];
mV[t+1] = logSpotPrice * B[t] + C[t] * mV[t];
price[i][t] = logSpotPrice*D[t] +E[t] * mV[t];
}
}
SIMDLayer
21
Heterogeneous code sample from STAC-A2
#pragma offload_attribute(push, target(mic))
typedef execution_node < tbb::flow::tuple<std::shared_ptr<GreekResults>, device_token_t >, double>
execution_node_theta_t;
…
void CreateGraph(…) {
…
theta_node = std::make_shared<execution_node_theta_t>(_g,
[arena, pWS, randoms](const std::shared_ptr<GreekResults>&, const device_token_t& t) -> double {
double pv = 0.;
std::shared_ptr<ArrayContainer<double>> unCorrRandomNumbers;
randoms->try_get(unCorrRandomNumbers);
const double deltaT = 1.0 / 100.0;
pv = f_scenario_adj<false>(pWS->r, …, pWS->A, unCorrRandomNumbers);
return pv;
}
, true));
…
}
#pragma offload_attribute(pop)
Same code executed on Xeon and Xeon Phi, Enabled by Intel® Compiler
22
STAC A2:
Increments in HW architecture and programmability
Intel Xeon
processor
E5 2697-V2
Intel Xeon
processor
E5 2697-V2
Intel Xeon E5
2697-V2 +
Xeon Phi
Intel Xeon E5
2697-V3
Intel Xeon E5
2697-V3+
Xeon Phi
Intel Xeon E5
2697-V3+
2*Xeon Phi
Intel Xeon
Phi 7220
2013 2014 2014 2014 2014 2015 2016
cores 24 24 24+61 36 36+61 36+122 68
Threads 48 48 48+244 72 72+244 72+488 272
vectors 256 256 256+512 256 256+512 256+2*512 512
Parallelization OpenMP TBB TBB TBB TBB TBB TBB
Vectorization #SIMD OpenMP OpenMP OpenMP OpenMP OpenMP OpenMP
Heterogeneity N/A N/A OpenMP N/A OpenMP TBB N/A
Greek time 4.8 1.0 0.63 0.81 0.53 0.216 0.22
Intel Xeon Phi
7220
Cluster
2016
68 x ?
488 x ?
2*512
TBB
OpenMP
TBB
???
1st Heterogeneous
Implementation
Dynamic Load
Balancing between
3 devices Same user developed code
23
Summary
Developing applications in an environment with distributed/heterogeneous
hardware and fragmented software ecosystem is challenging
 3 levels of parallelism – task , fork-join & SIMD
 Intel TBB flow graph coordination layer allows task distribution & dynamic
load balancing. Same user code base:
– flexibility in mix of Xeon and Xeon Phi, just change tokens
– TBB for fork-join is portable across Xeon and Xeon Phi
– OpenMP 4.0 vectorization is portable across Xeon and Xeon Phi
24
Next Steps
Call For Action
TBB distributed flow graph is still evolving
We are inviting collaborators for: applications & communication layers
evgeny.fiksman@intel.com
sergey.vinogradov@intel.com
michaelj.voss@intel.com
25
Michael Voss
michaelj.voss@intel.com
www.intel.com/hpcdevcon
Special Intel TBB 10th Anniversary issue of
Intel’s The Parallel Universe Magazine
https://software.intel.com/en-us/intel-parallel-universe-magazine
27
Legal Disclaimer & Optimization Notice
INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS”. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO
ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER
AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR
WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY
PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors.
Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software,
operations and functions. Any change to any of those factors may cause the results to vary. You should consult other
information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance
of that product when combined with other products.
Copyright © 2016, Intel Corporation. All rights reserved. Intel, Pentium, Xeon, Xeon Phi, Core, VTune, Cilk, and the Intel logo are
trademarks of Intel Corporation in the U.S. and other countries.
Optimization Notice
Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel
microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the
availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent
optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture
are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the
specific instruction sets covered by this notice.
Notice revision #20110804
28
Use C++ and Intel® Threading Building Blocks (Intel® TBB) for Hardware Programming
Use C++ and Intel® Threading Building Blocks (Intel® TBB) for Hardware Programming

Weitere ähnliche Inhalte

Was ist angesagt?

Fast Insights to Optimized Vectorization and Memory Using Cache-aware Rooflin...
Fast Insights to Optimized Vectorization and Memory Using Cache-aware Rooflin...Fast Insights to Optimized Vectorization and Memory Using Cache-aware Rooflin...
Fast Insights to Optimized Vectorization and Memory Using Cache-aware Rooflin...Intel® Software
 
Serving BERT Models in Production with TorchServe
Serving BERT Models in Production with TorchServeServing BERT Models in Production with TorchServe
Serving BERT Models in Production with TorchServeNidhin Pattaniyil
 
Bring Intelligent Motion Using Reinforcement Learning Engines | SIGGRAPH 2019...
Bring Intelligent Motion Using Reinforcement Learning Engines | SIGGRAPH 2019...Bring Intelligent Motion Using Reinforcement Learning Engines | SIGGRAPH 2019...
Bring Intelligent Motion Using Reinforcement Learning Engines | SIGGRAPH 2019...Intel® Software
 
Large Model support and Distribute deep learning
Large Model support and Distribute deep learningLarge Model support and Distribute deep learning
Large Model support and Distribute deep learningGanesan Narayanasamy
 
BKK16-408B Data Analytics and Machine Learning From Node to Cluster
BKK16-408B Data Analytics and Machine Learning From Node to ClusterBKK16-408B Data Analytics and Machine Learning From Node to Cluster
BKK16-408B Data Analytics and Machine Learning From Node to ClusterLinaro
 
AI is Impacting HPC Everywhere
AI is Impacting HPC EverywhereAI is Impacting HPC Everywhere
AI is Impacting HPC Everywhereinside-BigData.com
 
A CGRA-based Approach for Accelerating Convolutional Neural Networks
A CGRA-based Approachfor Accelerating Convolutional Neural NetworksA CGRA-based Approachfor Accelerating Convolutional Neural Networks
A CGRA-based Approach for Accelerating Convolutional Neural NetworksShinya Takamaeda-Y
 
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...inside-BigData.com
 
Software AI Accelerators: The Next Frontier | Software for AI Optimization Su...
Software AI Accelerators: The Next Frontier | Software for AI Optimization Su...Software AI Accelerators: The Next Frontier | Software for AI Optimization Su...
Software AI Accelerators: The Next Frontier | Software for AI Optimization Su...Intel® Software
 
Advanced spark deep learning
Advanced spark deep learningAdvanced spark deep learning
Advanced spark deep learningAdam Gibson
 
SCFE 2020 OpenCAPI presentation as part of OpenPWOER Tutorial
SCFE 2020 OpenCAPI presentation as part of OpenPWOER TutorialSCFE 2020 OpenCAPI presentation as part of OpenPWOER Tutorial
SCFE 2020 OpenCAPI presentation as part of OpenPWOER TutorialGanesan Narayanasamy
 
OpenCAPI-based Image Analysis Pipeline for 18 GB/s kilohertz-framerate X-ray ...
OpenCAPI-based Image Analysis Pipeline for 18 GB/s kilohertz-framerate X-ray ...OpenCAPI-based Image Analysis Pipeline for 18 GB/s kilohertz-framerate X-ray ...
OpenCAPI-based Image Analysis Pipeline for 18 GB/s kilohertz-framerate X-ray ...Ganesan Narayanasamy
 
CUDA-Python and RAPIDS for blazing fast scientific computing
CUDA-Python and RAPIDS for blazing fast scientific computingCUDA-Python and RAPIDS for blazing fast scientific computing
CUDA-Python and RAPIDS for blazing fast scientific computinginside-BigData.com
 
SFO15-BFO2: Reducing the arm linux kernel size without losing your mind
SFO15-BFO2: Reducing the arm linux kernel size without losing your mindSFO15-BFO2: Reducing the arm linux kernel size without losing your mind
SFO15-BFO2: Reducing the arm linux kernel size without losing your mindLinaro
 
Preparing to program Aurora at Exascale - Early experiences and future direct...
Preparing to program Aurora at Exascale - Early experiences and future direct...Preparing to program Aurora at Exascale - Early experiences and future direct...
Preparing to program Aurora at Exascale - Early experiences and future direct...inside-BigData.com
 
MIT's experience on OpenPOWER/POWER 9 platform
MIT's experience on OpenPOWER/POWER 9 platformMIT's experience on OpenPOWER/POWER 9 platform
MIT's experience on OpenPOWER/POWER 9 platformGanesan Narayanasamy
 
AI OpenPOWER Academia Discussion Group
AI OpenPOWER Academia Discussion Group AI OpenPOWER Academia Discussion Group
AI OpenPOWER Academia Discussion Group Ganesan Narayanasamy
 

Was ist angesagt? (20)

Fast Insights to Optimized Vectorization and Memory Using Cache-aware Rooflin...
Fast Insights to Optimized Vectorization and Memory Using Cache-aware Rooflin...Fast Insights to Optimized Vectorization and Memory Using Cache-aware Rooflin...
Fast Insights to Optimized Vectorization and Memory Using Cache-aware Rooflin...
 
Serving BERT Models in Production with TorchServe
Serving BERT Models in Production with TorchServeServing BERT Models in Production with TorchServe
Serving BERT Models in Production with TorchServe
 
Bring Intelligent Motion Using Reinforcement Learning Engines | SIGGRAPH 2019...
Bring Intelligent Motion Using Reinforcement Learning Engines | SIGGRAPH 2019...Bring Intelligent Motion Using Reinforcement Learning Engines | SIGGRAPH 2019...
Bring Intelligent Motion Using Reinforcement Learning Engines | SIGGRAPH 2019...
 
IBM HPC Transformation with AI
IBM HPC Transformation with AI IBM HPC Transformation with AI
IBM HPC Transformation with AI
 
Large Model support and Distribute deep learning
Large Model support and Distribute deep learningLarge Model support and Distribute deep learning
Large Model support and Distribute deep learning
 
BKK16-408B Data Analytics and Machine Learning From Node to Cluster
BKK16-408B Data Analytics and Machine Learning From Node to ClusterBKK16-408B Data Analytics and Machine Learning From Node to Cluster
BKK16-408B Data Analytics and Machine Learning From Node to Cluster
 
AI is Impacting HPC Everywhere
AI is Impacting HPC EverywhereAI is Impacting HPC Everywhere
AI is Impacting HPC Everywhere
 
A CGRA-based Approach for Accelerating Convolutional Neural Networks
A CGRA-based Approachfor Accelerating Convolutional Neural NetworksA CGRA-based Approachfor Accelerating Convolutional Neural Networks
A CGRA-based Approach for Accelerating Convolutional Neural Networks
 
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...
 
Software AI Accelerators: The Next Frontier | Software for AI Optimization Su...
Software AI Accelerators: The Next Frontier | Software for AI Optimization Su...Software AI Accelerators: The Next Frontier | Software for AI Optimization Su...
Software AI Accelerators: The Next Frontier | Software for AI Optimization Su...
 
Advanced spark deep learning
Advanced spark deep learningAdvanced spark deep learning
Advanced spark deep learning
 
SCFE 2020 OpenCAPI presentation as part of OpenPWOER Tutorial
SCFE 2020 OpenCAPI presentation as part of OpenPWOER TutorialSCFE 2020 OpenCAPI presentation as part of OpenPWOER Tutorial
SCFE 2020 OpenCAPI presentation as part of OpenPWOER Tutorial
 
Overview of HPC Interconnects
Overview of HPC InterconnectsOverview of HPC Interconnects
Overview of HPC Interconnects
 
OpenCAPI-based Image Analysis Pipeline for 18 GB/s kilohertz-framerate X-ray ...
OpenCAPI-based Image Analysis Pipeline for 18 GB/s kilohertz-framerate X-ray ...OpenCAPI-based Image Analysis Pipeline for 18 GB/s kilohertz-framerate X-ray ...
OpenCAPI-based Image Analysis Pipeline for 18 GB/s kilohertz-framerate X-ray ...
 
CUDA-Python and RAPIDS for blazing fast scientific computing
CUDA-Python and RAPIDS for blazing fast scientific computingCUDA-Python and RAPIDS for blazing fast scientific computing
CUDA-Python and RAPIDS for blazing fast scientific computing
 
SFO15-BFO2: Reducing the arm linux kernel size without losing your mind
SFO15-BFO2: Reducing the arm linux kernel size without losing your mindSFO15-BFO2: Reducing the arm linux kernel size without losing your mind
SFO15-BFO2: Reducing the arm linux kernel size without losing your mind
 
Intel Developer Program
Intel Developer ProgramIntel Developer Program
Intel Developer Program
 
Preparing to program Aurora at Exascale - Early experiences and future direct...
Preparing to program Aurora at Exascale - Early experiences and future direct...Preparing to program Aurora at Exascale - Early experiences and future direct...
Preparing to program Aurora at Exascale - Early experiences and future direct...
 
MIT's experience on OpenPOWER/POWER 9 platform
MIT's experience on OpenPOWER/POWER 9 platformMIT's experience on OpenPOWER/POWER 9 platform
MIT's experience on OpenPOWER/POWER 9 platform
 
AI OpenPOWER Academia Discussion Group
AI OpenPOWER Academia Discussion Group AI OpenPOWER Academia Discussion Group
AI OpenPOWER Academia Discussion Group
 

Ähnlich wie Use C++ and Intel® Threading Building Blocks (Intel® TBB) for Hardware Programming

Unmanaged Parallelization via P/Invoke
Unmanaged Parallelization via P/InvokeUnmanaged Parallelization via P/Invoke
Unmanaged Parallelization via P/InvokeDmitri Nesteruk
 
Exploiting GPU's for Columnar DataFrrames by Kiran Lonikar
Exploiting GPU's for Columnar DataFrrames by Kiran LonikarExploiting GPU's for Columnar DataFrrames by Kiran Lonikar
Exploiting GPU's for Columnar DataFrrames by Kiran LonikarSpark Summit
 
Chapter 1SyllabusCatalog Description Computer structu
Chapter 1SyllabusCatalog Description Computer structuChapter 1SyllabusCatalog Description Computer structu
Chapter 1SyllabusCatalog Description Computer structuEstelaJeffery653
 
Accelerating Spark MLlib and DataFrame with Vector Processor “SX-Aurora TSUBASA”
Accelerating Spark MLlib and DataFrame with Vector Processor “SX-Aurora TSUBASA”Accelerating Spark MLlib and DataFrame with Vector Processor “SX-Aurora TSUBASA”
Accelerating Spark MLlib and DataFrame with Vector Processor “SX-Aurora TSUBASA”Databricks
 
Cockatrice: A Hardware Design Environment with Elixir
Cockatrice: A Hardware Design Environment with ElixirCockatrice: A Hardware Design Environment with Elixir
Cockatrice: A Hardware Design Environment with ElixirHideki Takase
 
Build 2016 - B880 - Top 6 Reasons to Move Your C++ Code to Visual Studio 2015
Build 2016 - B880 - Top 6 Reasons to Move Your C++ Code to Visual Studio 2015Build 2016 - B880 - Top 6 Reasons to Move Your C++ Code to Visual Studio 2015
Build 2016 - B880 - Top 6 Reasons to Move Your C++ Code to Visual Studio 2015Windows Developer
 
Track c-High speed transaction-based hw-sw coverification -eve
Track c-High speed transaction-based hw-sw coverification -eveTrack c-High speed transaction-based hw-sw coverification -eve
Track c-High speed transaction-based hw-sw coverification -evechiportal
 
High-speed Database Throughput Using Apache Arrow Flight SQL
High-speed Database Throughput Using Apache Arrow Flight SQLHigh-speed Database Throughput Using Apache Arrow Flight SQL
High-speed Database Throughput Using Apache Arrow Flight SQLScyllaDB
 
Challenges in GPU compilers
Challenges in GPU compilersChallenges in GPU compilers
Challenges in GPU compilersAnastasiaStulova
 
Overview Of Parallel Development - Ericnel
Overview Of Parallel Development -  EricnelOverview Of Parallel Development -  Ericnel
Overview Of Parallel Development - Ericnelukdpe
 
IoT with Ruby/mruby - RubyWorld Conference 2015
IoT with Ruby/mruby - RubyWorld Conference 2015IoT with Ruby/mruby - RubyWorld Conference 2015
IoT with Ruby/mruby - RubyWorld Conference 2015哲也 廣田
 
Performance Optimization of SPH Algorithms for Multi/Many-Core Architectures
Performance Optimization of SPH Algorithms for Multi/Many-Core ArchitecturesPerformance Optimization of SPH Algorithms for Multi/Many-Core Architectures
Performance Optimization of SPH Algorithms for Multi/Many-Core ArchitecturesDr. Fabio Baruffa
 
64bit SMP OS for TILE-Gx many core processor
64bit SMP OS for TILE-Gx many core processor64bit SMP OS for TILE-Gx many core processor
64bit SMP OS for TILE-Gx many core processorToru Nishimura
 
DOUBLE PRECISION FLOATING POINT CORE IN VERILOG
DOUBLE PRECISION FLOATING POINT CORE IN VERILOGDOUBLE PRECISION FLOATING POINT CORE IN VERILOG
DOUBLE PRECISION FLOATING POINT CORE IN VERILOGIJCI JOURNAL
 
Implementing an interface in r to communicate with programmable fabric in a x...
Implementing an interface in r to communicate with programmable fabric in a x...Implementing an interface in r to communicate with programmable fabric in a x...
Implementing an interface in r to communicate with programmable fabric in a x...Vincent Claes
 
Fletcher Framework for Programming FPGA
Fletcher Framework for Programming FPGAFletcher Framework for Programming FPGA
Fletcher Framework for Programming FPGAGanesan Narayanasamy
 
3 Open-Source-SYCL-Intel-Khronos-EVS-Workshop_May19.pdf
3 Open-Source-SYCL-Intel-Khronos-EVS-Workshop_May19.pdf3 Open-Source-SYCL-Intel-Khronos-EVS-Workshop_May19.pdf
3 Open-Source-SYCL-Intel-Khronos-EVS-Workshop_May19.pdfJunZhao68
 
Porting a Streaming Pipeline from Scala to Rust
Porting a Streaming Pipeline from Scala to RustPorting a Streaming Pipeline from Scala to Rust
Porting a Streaming Pipeline from Scala to RustEvan Chan
 

Ähnlich wie Use C++ and Intel® Threading Building Blocks (Intel® TBB) for Hardware Programming (20)

Unmanaged Parallelization via P/Invoke
Unmanaged Parallelization via P/InvokeUnmanaged Parallelization via P/Invoke
Unmanaged Parallelization via P/Invoke
 
Exploiting GPU's for Columnar DataFrrames by Kiran Lonikar
Exploiting GPU's for Columnar DataFrrames by Kiran LonikarExploiting GPU's for Columnar DataFrrames by Kiran Lonikar
Exploiting GPU's for Columnar DataFrrames by Kiran Lonikar
 
Chapter 1SyllabusCatalog Description Computer structu
Chapter 1SyllabusCatalog Description Computer structuChapter 1SyllabusCatalog Description Computer structu
Chapter 1SyllabusCatalog Description Computer structu
 
SNAP MACHINE LEARNING
SNAP MACHINE LEARNINGSNAP MACHINE LEARNING
SNAP MACHINE LEARNING
 
Accelerating Spark MLlib and DataFrame with Vector Processor “SX-Aurora TSUBASA”
Accelerating Spark MLlib and DataFrame with Vector Processor “SX-Aurora TSUBASA”Accelerating Spark MLlib and DataFrame with Vector Processor “SX-Aurora TSUBASA”
Accelerating Spark MLlib and DataFrame with Vector Processor “SX-Aurora TSUBASA”
 
Cockatrice: A Hardware Design Environment with Elixir
Cockatrice: A Hardware Design Environment with ElixirCockatrice: A Hardware Design Environment with Elixir
Cockatrice: A Hardware Design Environment with Elixir
 
Build 2016 - B880 - Top 6 Reasons to Move Your C++ Code to Visual Studio 2015
Build 2016 - B880 - Top 6 Reasons to Move Your C++ Code to Visual Studio 2015Build 2016 - B880 - Top 6 Reasons to Move Your C++ Code to Visual Studio 2015
Build 2016 - B880 - Top 6 Reasons to Move Your C++ Code to Visual Studio 2015
 
Track c-High speed transaction-based hw-sw coverification -eve
Track c-High speed transaction-based hw-sw coverification -eveTrack c-High speed transaction-based hw-sw coverification -eve
Track c-High speed transaction-based hw-sw coverification -eve
 
High-speed Database Throughput Using Apache Arrow Flight SQL
High-speed Database Throughput Using Apache Arrow Flight SQLHigh-speed Database Throughput Using Apache Arrow Flight SQL
High-speed Database Throughput Using Apache Arrow Flight SQL
 
Challenges in GPU compilers
Challenges in GPU compilersChallenges in GPU compilers
Challenges in GPU compilers
 
Overview Of Parallel Development - Ericnel
Overview Of Parallel Development -  EricnelOverview Of Parallel Development -  Ericnel
Overview Of Parallel Development - Ericnel
 
GCF
GCFGCF
GCF
 
IoT with Ruby/mruby - RubyWorld Conference 2015
IoT with Ruby/mruby - RubyWorld Conference 2015IoT with Ruby/mruby - RubyWorld Conference 2015
IoT with Ruby/mruby - RubyWorld Conference 2015
 
Performance Optimization of SPH Algorithms for Multi/Many-Core Architectures
Performance Optimization of SPH Algorithms for Multi/Many-Core ArchitecturesPerformance Optimization of SPH Algorithms for Multi/Many-Core Architectures
Performance Optimization of SPH Algorithms for Multi/Many-Core Architectures
 
64bit SMP OS for TILE-Gx many core processor
64bit SMP OS for TILE-Gx many core processor64bit SMP OS for TILE-Gx many core processor
64bit SMP OS for TILE-Gx many core processor
 
DOUBLE PRECISION FLOATING POINT CORE IN VERILOG
DOUBLE PRECISION FLOATING POINT CORE IN VERILOGDOUBLE PRECISION FLOATING POINT CORE IN VERILOG
DOUBLE PRECISION FLOATING POINT CORE IN VERILOG
 
Implementing an interface in r to communicate with programmable fabric in a x...
Implementing an interface in r to communicate with programmable fabric in a x...Implementing an interface in r to communicate with programmable fabric in a x...
Implementing an interface in r to communicate with programmable fabric in a x...
 
Fletcher Framework for Programming FPGA
Fletcher Framework for Programming FPGAFletcher Framework for Programming FPGA
Fletcher Framework for Programming FPGA
 
3 Open-Source-SYCL-Intel-Khronos-EVS-Workshop_May19.pdf
3 Open-Source-SYCL-Intel-Khronos-EVS-Workshop_May19.pdf3 Open-Source-SYCL-Intel-Khronos-EVS-Workshop_May19.pdf
3 Open-Source-SYCL-Intel-Khronos-EVS-Workshop_May19.pdf
 
Porting a Streaming Pipeline from Scala to Rust
Porting a Streaming Pipeline from Scala to RustPorting a Streaming Pipeline from Scala to Rust
Porting a Streaming Pipeline from Scala to Rust
 

Mehr von Intel® Software

AI for All: Biology is eating the world & AI is eating Biology
AI for All: Biology is eating the world & AI is eating Biology AI for All: Biology is eating the world & AI is eating Biology
AI for All: Biology is eating the world & AI is eating Biology Intel® Software
 
Python Data Science and Machine Learning at Scale with Intel and Anaconda
Python Data Science and Machine Learning at Scale with Intel and AnacondaPython Data Science and Machine Learning at Scale with Intel and Anaconda
Python Data Science and Machine Learning at Scale with Intel and AnacondaIntel® Software
 
Streamline End-to-End AI Pipelines with Intel, Databricks, and OmniSci
Streamline End-to-End AI Pipelines with Intel, Databricks, and OmniSciStreamline End-to-End AI Pipelines with Intel, Databricks, and OmniSci
Streamline End-to-End AI Pipelines with Intel, Databricks, and OmniSciIntel® Software
 
AI for good: Scaling AI in science, healthcare, and more.
AI for good: Scaling AI in science, healthcare, and more.AI for good: Scaling AI in science, healthcare, and more.
AI for good: Scaling AI in science, healthcare, and more.Intel® Software
 
Advanced Techniques to Accelerate Model Tuning | Software for AI Optimization...
Advanced Techniques to Accelerate Model Tuning | Software for AI Optimization...Advanced Techniques to Accelerate Model Tuning | Software for AI Optimization...
Advanced Techniques to Accelerate Model Tuning | Software for AI Optimization...Intel® Software
 
Reducing Deep Learning Integration Costs and Maximizing Compute Efficiency| S...
Reducing Deep Learning Integration Costs and Maximizing Compute Efficiency| S...Reducing Deep Learning Integration Costs and Maximizing Compute Efficiency| S...
Reducing Deep Learning Integration Costs and Maximizing Compute Efficiency| S...Intel® Software
 
AWS & Intel Webinar Series - Accelerating AI Research
AWS & Intel Webinar Series - Accelerating AI ResearchAWS & Intel Webinar Series - Accelerating AI Research
AWS & Intel Webinar Series - Accelerating AI ResearchIntel® Software
 
Intel AIDC Houston Summit - Overview Slides
Intel AIDC Houston Summit - Overview SlidesIntel AIDC Houston Summit - Overview Slides
Intel AIDC Houston Summit - Overview SlidesIntel® Software
 
AIDC NY: BODO AI Presentation - 09.19.2019
AIDC NY: BODO AI Presentation - 09.19.2019AIDC NY: BODO AI Presentation - 09.19.2019
AIDC NY: BODO AI Presentation - 09.19.2019Intel® Software
 
AIDC NY: Applications of Intel AI by QuEST Global - 09.19.2019
AIDC NY: Applications of Intel AI by QuEST Global - 09.19.2019AIDC NY: Applications of Intel AI by QuEST Global - 09.19.2019
AIDC NY: Applications of Intel AI by QuEST Global - 09.19.2019Intel® Software
 
Advanced Single Instruction Multiple Data (SIMD) Programming with Intel® Impl...
Advanced Single Instruction Multiple Data (SIMD) Programming with Intel® Impl...Advanced Single Instruction Multiple Data (SIMD) Programming with Intel® Impl...
Advanced Single Instruction Multiple Data (SIMD) Programming with Intel® Impl...Intel® Software
 
Build a Deep Learning Video Analytics Framework | SIGGRAPH 2019 Technical Ses...
Build a Deep Learning Video Analytics Framework | SIGGRAPH 2019 Technical Ses...Build a Deep Learning Video Analytics Framework | SIGGRAPH 2019 Technical Ses...
Build a Deep Learning Video Analytics Framework | SIGGRAPH 2019 Technical Ses...Intel® Software
 
RenderMan*: The Role of Open Shading Language (OSL) with Intel® Advanced Vect...
RenderMan*: The Role of Open Shading Language (OSL) with Intel® Advanced Vect...RenderMan*: The Role of Open Shading Language (OSL) with Intel® Advanced Vect...
RenderMan*: The Role of Open Shading Language (OSL) with Intel® Advanced Vect...Intel® Software
 
AIDC India - Intel Movidius / Open Vino Slides
AIDC India - Intel Movidius / Open Vino SlidesAIDC India - Intel Movidius / Open Vino Slides
AIDC India - Intel Movidius / Open Vino SlidesIntel® Software
 
AIDC India - AI Vision Slides
AIDC India - AI Vision SlidesAIDC India - AI Vision Slides
AIDC India - AI Vision SlidesIntel® Software
 
Enhance and Accelerate Your AI and Machine Learning Solution | SIGGRAPH 2019 ...
Enhance and Accelerate Your AI and Machine Learning Solution | SIGGRAPH 2019 ...Enhance and Accelerate Your AI and Machine Learning Solution | SIGGRAPH 2019 ...
Enhance and Accelerate Your AI and Machine Learning Solution | SIGGRAPH 2019 ...Intel® Software
 
Intel® Open Image Denoise: Optimized CPU Denoising | SIGGRAPH 2019 Technical ...
Intel® Open Image Denoise: Optimized CPU Denoising | SIGGRAPH 2019 Technical ...Intel® Open Image Denoise: Optimized CPU Denoising | SIGGRAPH 2019 Technical ...
Intel® Open Image Denoise: Optimized CPU Denoising | SIGGRAPH 2019 Technical ...Intel® Software
 
ANYFACE*: Create Film Industry-Quality Facial Rendering & Animation Using Mai...
ANYFACE*: Create Film Industry-Quality Facial Rendering & Animation Using Mai...ANYFACE*: Create Film Industry-Quality Facial Rendering & Animation Using Mai...
ANYFACE*: Create Film Industry-Quality Facial Rendering & Animation Using Mai...Intel® Software
 
Ray Tracing with Intel® Embree and Intel® OSPRay: Use Cases and Updates | SIG...
Ray Tracing with Intel® Embree and Intel® OSPRay: Use Cases and Updates | SIG...Ray Tracing with Intel® Embree and Intel® OSPRay: Use Cases and Updates | SIG...
Ray Tracing with Intel® Embree and Intel® OSPRay: Use Cases and Updates | SIG...Intel® Software
 

Mehr von Intel® Software (20)

AI for All: Biology is eating the world & AI is eating Biology
AI for All: Biology is eating the world & AI is eating Biology AI for All: Biology is eating the world & AI is eating Biology
AI for All: Biology is eating the world & AI is eating Biology
 
Python Data Science and Machine Learning at Scale with Intel and Anaconda
Python Data Science and Machine Learning at Scale with Intel and AnacondaPython Data Science and Machine Learning at Scale with Intel and Anaconda
Python Data Science and Machine Learning at Scale with Intel and Anaconda
 
Streamline End-to-End AI Pipelines with Intel, Databricks, and OmniSci
Streamline End-to-End AI Pipelines with Intel, Databricks, and OmniSciStreamline End-to-End AI Pipelines with Intel, Databricks, and OmniSci
Streamline End-to-End AI Pipelines with Intel, Databricks, and OmniSci
 
AI for good: Scaling AI in science, healthcare, and more.
AI for good: Scaling AI in science, healthcare, and more.AI for good: Scaling AI in science, healthcare, and more.
AI for good: Scaling AI in science, healthcare, and more.
 
Advanced Techniques to Accelerate Model Tuning | Software for AI Optimization...
Advanced Techniques to Accelerate Model Tuning | Software for AI Optimization...Advanced Techniques to Accelerate Model Tuning | Software for AI Optimization...
Advanced Techniques to Accelerate Model Tuning | Software for AI Optimization...
 
Reducing Deep Learning Integration Costs and Maximizing Compute Efficiency| S...
Reducing Deep Learning Integration Costs and Maximizing Compute Efficiency| S...Reducing Deep Learning Integration Costs and Maximizing Compute Efficiency| S...
Reducing Deep Learning Integration Costs and Maximizing Compute Efficiency| S...
 
AWS & Intel Webinar Series - Accelerating AI Research
AWS & Intel Webinar Series - Accelerating AI ResearchAWS & Intel Webinar Series - Accelerating AI Research
AWS & Intel Webinar Series - Accelerating AI Research
 
Intel AIDC Houston Summit - Overview Slides
Intel AIDC Houston Summit - Overview SlidesIntel AIDC Houston Summit - Overview Slides
Intel AIDC Houston Summit - Overview Slides
 
AIDC NY: BODO AI Presentation - 09.19.2019
AIDC NY: BODO AI Presentation - 09.19.2019AIDC NY: BODO AI Presentation - 09.19.2019
AIDC NY: BODO AI Presentation - 09.19.2019
 
AIDC NY: Applications of Intel AI by QuEST Global - 09.19.2019
AIDC NY: Applications of Intel AI by QuEST Global - 09.19.2019AIDC NY: Applications of Intel AI by QuEST Global - 09.19.2019
AIDC NY: Applications of Intel AI by QuEST Global - 09.19.2019
 
Advanced Single Instruction Multiple Data (SIMD) Programming with Intel® Impl...
Advanced Single Instruction Multiple Data (SIMD) Programming with Intel® Impl...Advanced Single Instruction Multiple Data (SIMD) Programming with Intel® Impl...
Advanced Single Instruction Multiple Data (SIMD) Programming with Intel® Impl...
 
Build a Deep Learning Video Analytics Framework | SIGGRAPH 2019 Technical Ses...
Build a Deep Learning Video Analytics Framework | SIGGRAPH 2019 Technical Ses...Build a Deep Learning Video Analytics Framework | SIGGRAPH 2019 Technical Ses...
Build a Deep Learning Video Analytics Framework | SIGGRAPH 2019 Technical Ses...
 
RenderMan*: The Role of Open Shading Language (OSL) with Intel® Advanced Vect...
RenderMan*: The Role of Open Shading Language (OSL) with Intel® Advanced Vect...RenderMan*: The Role of Open Shading Language (OSL) with Intel® Advanced Vect...
RenderMan*: The Role of Open Shading Language (OSL) with Intel® Advanced Vect...
 
AIDC India - AI on IA
AIDC India  - AI on IAAIDC India  - AI on IA
AIDC India - AI on IA
 
AIDC India - Intel Movidius / Open Vino Slides
AIDC India - Intel Movidius / Open Vino SlidesAIDC India - Intel Movidius / Open Vino Slides
AIDC India - Intel Movidius / Open Vino Slides
 
AIDC India - AI Vision Slides
AIDC India - AI Vision SlidesAIDC India - AI Vision Slides
AIDC India - AI Vision Slides
 
Enhance and Accelerate Your AI and Machine Learning Solution | SIGGRAPH 2019 ...
Enhance and Accelerate Your AI and Machine Learning Solution | SIGGRAPH 2019 ...Enhance and Accelerate Your AI and Machine Learning Solution | SIGGRAPH 2019 ...
Enhance and Accelerate Your AI and Machine Learning Solution | SIGGRAPH 2019 ...
 
Intel® Open Image Denoise: Optimized CPU Denoising | SIGGRAPH 2019 Technical ...
Intel® Open Image Denoise: Optimized CPU Denoising | SIGGRAPH 2019 Technical ...Intel® Open Image Denoise: Optimized CPU Denoising | SIGGRAPH 2019 Technical ...
Intel® Open Image Denoise: Optimized CPU Denoising | SIGGRAPH 2019 Technical ...
 
ANYFACE*: Create Film Industry-Quality Facial Rendering & Animation Using Mai...
ANYFACE*: Create Film Industry-Quality Facial Rendering & Animation Using Mai...ANYFACE*: Create Film Industry-Quality Facial Rendering & Animation Using Mai...
ANYFACE*: Create Film Industry-Quality Facial Rendering & Animation Using Mai...
 
Ray Tracing with Intel® Embree and Intel® OSPRay: Use Cases and Updates | SIG...
Ray Tracing with Intel® Embree and Intel® OSPRay: Use Cases and Updates | SIG...Ray Tracing with Intel® Embree and Intel® OSPRay: Use Cases and Updates | SIG...
Ray Tracing with Intel® Embree and Intel® OSPRay: Use Cases and Updates | SIG...
 

Kürzlich hochgeladen

Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsRoshan Dwivedi
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024The Digital Insurer
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Enterprise Knowledge
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 

Kürzlich hochgeladen (20)

Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 

Use C++ and Intel® Threading Building Blocks (Intel® TBB) for Hardware Programming

  • 1.
  • 2. Evgeny Fiksman, Sergey Vinogradov and Michael Voss Intel Corporation November 2016
  • 3. Intel® Threading Building Blocks (Intel® TBB) Celebrating it’s 10 year anniversary in 2016! A widely used C++ template library for parallel programming What Parallel algorithms and data structures Threads and synchronization primitives Scalable memory allocation and task scheduling Benefits Is a library-only solution that does not depend on special compiler support Is both a commercial product and an open-source project Supports C++, Windows*, Linux*, OS X*, Android* and other OSes Commercial support for Intel® AtomTM, CoreTM, Xeon® processors and for Intel® Xeon PhiTM coprocessors http://threadingbuildingblocks.org http://software.intel.com/intel-tbb 3
  • 4. Applications often contain three levels of parallelism Task Parallelism / Message Passing fork-join SIMD SIMD SIMD fork-join SIMD SIMD SIMD 4
  • 5. Intel® Threading Building Blocks threadingbuildingblocks.org Generic Parallel Algorithms Efficient scalable way to exploit the power of multi-core without having to start from scratch. Concurrent Containers Concurrent access, and a scalable alternative to serial containers with external locking Task Scheduler Sophisticated work scheduling engine that empowers parallel algorithms and flow graph Threads OS API wrappers Miscellaneous Thread-safe timers and exception classes Memory Allocation Scalable memory manager and false-sharing free allocators Synchronization Primitives Atomic operations, a variety of mutexes with different properties, condition variables Flow Graph A set of classes to express parallelism as a graph of compute dependencies and/or data flow Parallel algorithms and data structures Threads and synchronization Memory allocation and task scheduling Thread Local Storage Unlimited number of thread-local variables 5
  • 6. Mandelbrot Speedup Intel® Threading Building Blocks (Intel® TBB) parallel_for( 0, max_row, [&](int i) { for (int j = 0; j < max_col; j++) p[i][j]=mandel(Complex(scale(i),scale(j)),depth); } ); int mandel(Complex c, int max_count) { int count = 0; Complex z = 0; for (int i = 0; i < max_count; i++) { if (abs(z) >= 2.0) break; z = z*z + c; count++; } return count; } Parallel algorithm Use C++ lambda functions to define function object in-line Task is a function object 6
  • 7. Intel Threading Building Blocks flow graph Efficient implementation of dependency graph and data flow algorithms Design for shared memory application Enables developers to exploit parallelism at higher levels graph g; continue_node< continue_msg > h( g, []( const continue_msg & ) { cout << “Hello “; } ); continue_node< continue_msg > w( g, []( const continue_msg & ) { cout << “Worldn“; } ); make_edge( h, w ); h.try_put(continue_msg()); g.wait_for_all(); Hello World 7
  • 8. Intel TBB Flow Graph node types: Functional f() f() f(x) f(x) source_node continue_node function_node multifunction_node Buffering buffer_node queue_node priority_queue_node sequencer_node 1 023 Split / Join queueing join reserving join tag matching join split_node indexer_node Other broadcast_node write_once_node overwrite_node limiter_node 8
  • 9. An example feature detection algorithm buffer get_next_image preprocess detect_with_A detect_with_B make_decision Can express pipelining, task parallelism and data parallelism 9
  • 10. Heterogeneous support in Intel® TBB Intel TBB as a coordination layer for heterogeneity that provides flexibility, retains optimization opportunities and composes with existing models Intel TBB as a composability layer for library implementations • One threading engine underneath all CPU-side work Intel TBB flow graph as a coordination layer • Be the glue that connects hetero HW and SW together • Expose parallelism between blocks; simplify integration + Intel® Threading Building Blocks OpenVX* OpenCL* COI/SCIF DirectCompute* Vulkan* …. FPGAs, integrated and discrete GPUs, co-processors, etc… 1
  • 11. Feature Description Diagram async_node<Input,Output Basic building block. Enables async communication from a single/isolated node to an async activity. User responsible for managing communication. Graph runs on host. async_msg<T> Available as preview feature Basic building block. Enables async communication with chaining across graph nodes. User responsible for managing communication. Graph runs on the host. Support for Heterogeneous Programming in Intel TBB So far all support is within the flow graph API 11
  • 12. async_node example • Allows the data flow graph to offload data to any asynchronous activity and receive the data back to continue execution on the CPU async_node makes coordinating with any model easier and efficient 12 igpu
  • 13. Feature Description Diagram streaming_node Available as preview feature Higher level abstraction for streaming models; e.g. OpenCL, Direct X Compute, GFX, etc.... Users provide Factory that describes buffers, kernels, ranges, device selection, etc… Uses async_msg so supports chaining. Graph runs on the host. opencl_node Available as preview feature A specialization of streaming_node for OpenCL. User provides OpenCL program and kernel and runtime handles initialization, buffer management, communications, etc.. Graph runs on host. Support for Heterogeneous Programming in Intel TBB So far all support is within the flow graph API 13
  • 14. Proof-of-concept: distributor_node NOTE: async_node and composite_node are released features; distributor_node is a proof-of-concept 14
  • 15. An example application: STAC-A2* The STAC-A2 Benchmark suite is the industry standard for testing technology stacks used for compute-intensive analytic workloads involved in pricing and risk management. STAC-A2 is a set of specifications  For Market-Risk Analysis, proxy for real life risk analytic and computationally intensive workloads  Customers define the specifications  Vendors implement the code  Intel first published the benchmark results in Supercomputing’12 – http://www.stacresearch.com/SC12_submission_stac.pdf – http://sc12.supercomputing.org/schedule/event_detail.php?evid=wksp138 STAC-A2 evaluates the Greeks For American-style options  Monte Carlo based Heston Model with Stochastic Volatility  Greeks describe the sensitivity of price of options to changes in parameters of the underlying market – Compute 7 types of Greeks, ex: Theta – sensitivity to the passage of time, Rho – sensitivity for the interest rate * “STAC” and all STAC names are trademarks or registered trademarks of the Securities Technology Analysis Center LLC. 15
  • 16. STAC-A2 Implementation Overview • Implemented with: • Intel TBB flow graph for task distribution • Intel TBB parallel algorithms for for-join constructs • Intel Compiler & OpenMP 4.0 for vectorization • Intel® Math Kernel Library (Intel® MKL) for RND generation and Matrix operations • Uses asynchronous support in flow graph to implement “Distributor Node” and offload to the Intel Xeon Phi coprocessor - heterogeneity • Using a token-based approach for dynamic load balancing between the main CPU and coprocessors Application/STAC-A2 TBB Flow Graph Distributor Node Communication Infrastructure (Intel® MPSS) TBB Scheduler TBB Flow Graph TBB Scheduler 2x Intel Xeon E5 2697 v3 Intel Xeon Phi 7120P Intel Xeon Phi 7120P 16 Task Parallelism / Message Passing fork-join SIMD SIMD SIMD fork-join SIMD SIMD SIMD
  • 17. Intel TBB flow graph design of STAC-A2 … … … Token Pool Start Node Greek Task 1 Greek Task N-1 Greek Task N Greek Result Collector Greek # Tokens < # Tasks 5 Assets -> N ~ 170 Join Join o o o Pricer Distributor Distributor RNG 17
  • 18. The Fork-Join & SIMD layers for (unsigned int i = 0; i < nPaths; ++i){ double mV[nTimeSteps]; double mY[nTimeSteps]; … for (unsigned int t = 0; t < nTimeSteps; ++t){ double currState = mY[t] ; …. double logSpotPrice = func(currState, …); mY[t+1] = logSpotPrice * A[t]; mV[t+1] = logSpotPrice * B[t] + C[t] * mV[t]; price[i][t] = logSpotPrice*D[t] +E[t] * mV[t]; } } Same code runs on Intel Xeon and Intel Xeon Phi Fork-Join,Composable withFlowGraph for (unsigned i = 0; i < nPaths; ++i) { double mV[nTimeSteps]; double mY[nTimeSteps]; ….. for (unsigned int t = 0; t < nTimeSteps; ++t){ double currState = mY[t] ; // Backward dependency …. double logSpotPrice = func(currState, …); mY[t+1] = logSpotPrice * A[t]; mV[t+1] = logSpotPrice * B[t] + C[t] * mV[t]; price[i][t] = logSpotPrice*D[t] +E[t] * mV[t]; } } SIMDLayer 18
  • 19. The Fork-Join & SIMD layers tbb::parallel_for(blocked_range<int>(0, nPaths, 256), [&](const blocked_range<int>& r) { const int block_size = r.size(); double mV[nTimeSteps][block_size]; double mY[nTimeSteps][block_size]; … for (unsigned int t = 0; t < nTimeSteps; ++t){ for (unsigned p = 0; i < block_size; ++p) { double currState = mY[t][p] ; …. double logSpotPrice = func(currState, …); mY[t+1][p] = logSpotPrice * A[t]; mV[t+1][p] = logSpotPrice * B[t] + C[t] * mV[t][p]; price[t][r.begin()+p] = logSpotPrice*D[t] +E[t] * mV[t][p]; } } } Same code runs on Intel Xeon and Intel Xeon Phi Fork-Join,Composable withFlowGraph for (unsigned i = 0; i < nPaths; ++i) { double mV[nTimeSteps]; double mY[nTimeSteps]; ….. for (unsigned int t = 0; t < nTimeSteps; ++t){ double currState = mY[t] ; // Backward dependency …. double logSpotPrice = func(currState, …); mY[t+1] = logSpotPrice * A[t]; mV[t+1] = logSpotPrice * B[t] + C[t] * mV[t]; price[i][t] = logSpotPrice*D[t] +E[t] * mV[t]; } } SIMDLayer 19
  • 20. The Fork-Join & SIMD layers tbb::parallel_for(blocked_range<int>(0, nPaths, 256), [&](const blocked_range<int>& r) { const int block_size = r.size(); double mV[nTimeSteps][block_size]; double mY[nTimeSteps][block_size]; … for (unsigned int t = 0; t < nTimeSteps; ++t){ #pragma omp simd for (unsigned p = 0; i < block_size; ++p) { double currState = mY[t][p] ; …. double logSpotPrice = func(currState, …); mY[t+1][p] = logSpotPrice * A[t]; mV[t+1][p] = logSpotPrice * B[t] + C[t] * mV[t][p]; price[t][r.begin()+p] = logSpotPrice*D[t] +E[t] * mV[t][p]; } } } Same code runs on Intel Xeon and Intel Xeon Phi Fork-Join,Composable withFlowGraph for (unsigned i = 0; i < nPaths; ++i) { double mV[nTimeSteps]; double mY[nTimeSteps]; ….. for (unsigned int t = 0; t < nTimeSteps; ++t){ double currState = mY[t] ; // Backward dependency …. double logSpotPrice = func(currState, …); mY[t+1] = logSpotPrice * A[t]; mV[t+1] = logSpotPrice * B[t] + C[t] * mV[t]; price[i][t] = logSpotPrice*D[t] +E[t] * mV[t]; } } SIMDLayer 20
  • 21. The Fork-Join & SIMD layers #pragma offload_attribute(push, target(mic)) tbb::parallel_for(blocked_range<int>(0, nPaths, 256), [&](const blocked_range<int>& r) { const int block_size = r.size(); double mV[nTimeSteps][block_size]; double mY[nTimeSteps][block_size]; … for (unsigned int t = 0; t < nTimeSteps; ++t){ #pragma omp simd for (unsigned p = 0; i < block_size; ++p) { double currState = mY[t][p] ; …. double logSpotPrice = func(currState, …); mY[t+1][p] = logSpotPrice * A[t]; mV[t+1][p] = logSpotPrice * B[t] + C[t] * mV[t][p]; price[t][r.begin()+p] = logSpotPrice*D[t] +E[t] * mV[t][p]; } } } #pragma offload_attribute(pop) Same code runs on Intel Xeon and Intel Xeon Phi Fork-Join,Composable withFlowGraph for (unsigned i = 0; i < nPaths; ++i) { double mV[nTimeSteps]; double mY[nTimeSteps]; ….. for (unsigned int t = 0; t < nTimeSteps; ++t){ double currState = mY[t] ; // Backward dependency …. double logSpotPrice = func(currState, …); mY[t+1] = logSpotPrice * A[t]; mV[t+1] = logSpotPrice * B[t] + C[t] * mV[t]; price[i][t] = logSpotPrice*D[t] +E[t] * mV[t]; } } SIMDLayer 21
  • 22. Heterogeneous code sample from STAC-A2 #pragma offload_attribute(push, target(mic)) typedef execution_node < tbb::flow::tuple<std::shared_ptr<GreekResults>, device_token_t >, double> execution_node_theta_t; … void CreateGraph(…) { … theta_node = std::make_shared<execution_node_theta_t>(_g, [arena, pWS, randoms](const std::shared_ptr<GreekResults>&, const device_token_t& t) -> double { double pv = 0.; std::shared_ptr<ArrayContainer<double>> unCorrRandomNumbers; randoms->try_get(unCorrRandomNumbers); const double deltaT = 1.0 / 100.0; pv = f_scenario_adj<false>(pWS->r, …, pWS->A, unCorrRandomNumbers); return pv; } , true)); … } #pragma offload_attribute(pop) Same code executed on Xeon and Xeon Phi, Enabled by Intel® Compiler 22
  • 23. STAC A2: Increments in HW architecture and programmability Intel Xeon processor E5 2697-V2 Intel Xeon processor E5 2697-V2 Intel Xeon E5 2697-V2 + Xeon Phi Intel Xeon E5 2697-V3 Intel Xeon E5 2697-V3+ Xeon Phi Intel Xeon E5 2697-V3+ 2*Xeon Phi Intel Xeon Phi 7220 2013 2014 2014 2014 2014 2015 2016 cores 24 24 24+61 36 36+61 36+122 68 Threads 48 48 48+244 72 72+244 72+488 272 vectors 256 256 256+512 256 256+512 256+2*512 512 Parallelization OpenMP TBB TBB TBB TBB TBB TBB Vectorization #SIMD OpenMP OpenMP OpenMP OpenMP OpenMP OpenMP Heterogeneity N/A N/A OpenMP N/A OpenMP TBB N/A Greek time 4.8 1.0 0.63 0.81 0.53 0.216 0.22 Intel Xeon Phi 7220 Cluster 2016 68 x ? 488 x ? 2*512 TBB OpenMP TBB ??? 1st Heterogeneous Implementation Dynamic Load Balancing between 3 devices Same user developed code 23
  • 24. Summary Developing applications in an environment with distributed/heterogeneous hardware and fragmented software ecosystem is challenging  3 levels of parallelism – task , fork-join & SIMD  Intel TBB flow graph coordination layer allows task distribution & dynamic load balancing. Same user code base: – flexibility in mix of Xeon and Xeon Phi, just change tokens – TBB for fork-join is portable across Xeon and Xeon Phi – OpenMP 4.0 vectorization is portable across Xeon and Xeon Phi 24
  • 25. Next Steps Call For Action TBB distributed flow graph is still evolving We are inviting collaborators for: applications & communication layers evgeny.fiksman@intel.com sergey.vinogradov@intel.com michaelj.voss@intel.com 25
  • 27. Special Intel TBB 10th Anniversary issue of Intel’s The Parallel Universe Magazine https://software.intel.com/en-us/intel-parallel-universe-magazine 27
  • 28. Legal Disclaimer & Optimization Notice INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS”. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT. Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. Copyright © 2016, Intel Corporation. All rights reserved. Intel, Pentium, Xeon, Xeon Phi, Core, VTune, Cilk, and the Intel logo are trademarks of Intel Corporation in the U.S. and other countries. Optimization Notice Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. Notice revision #20110804 28